程式扎記: [ InAction Note ] Ch5. Advanced search techniques

Preface:
Filtering is a mechanism for narrowing the search space, allowing only a subset of the documents to be considered as possible hits. Filters can be used to implement search-within-search features that successively search within a previous set of results, or to constrain the document search space. A security filter, for example, allows users to see only search results for documents they “own,” even if their query technically matches other documents that are off limits; we provide an example of a security filter in section 5.6.7.

You can filter any Lucene search using the overloaded search methods that accept a Filter instance. There are numerous built-in filter implementations:
- TermRangeFilter

TermRangeFilter matches only documents containing terms within a specified range of terms. It’s exactly the same as TermRangeQuery, without scoring.

- NumericRangeFilter

NumericRangeFilter matches only documents containing numeric values within a specified range for a specified field. It’s exactly the same as NumericRangeQuery, without scoring.

- FieldCacheRangeFilter

FieldCacheRangeFilter matches documents in a certain term or numeric range, using the FieldCache (see section 5.1) for better performance.

- FieldCacheTermsFilter

FieldCacheTermsFilter matches documents containing specific terms, using the field cache for better performance.

- QueryWrapperFilter

QueryWrapperFilter turns any Query instance into a Filter instance, by using only the matching documents from the Query as the filtered space, discarding the document scores.

- SpanQueryFilter

SpanQueryFilter turns a SpanQuery into a SpanFilter, which subclasses the base Filter class and adds an additional method, providing access to the positional spans for each matching document. This is just like QueryWrapperFilter but is applied to SpanQuery classes instead.

- PrefixFilter

PrefixFilter matches only documents containing terms in a specific field with a specific prefix. It’s exactly the same as PrefixQuery, without scoring.

- CachingWrapperFilter

CachingWrapperFilter is a decorator over another filter, caching its results to increase performance when used again.

- CachingSpanFilter

CachingSpanFilter does the same thing as CachingWrapperFilter, but it caches a SpanFilter.

- FilteredDocIdSet

FilteredDocIdSet allows you to filter a filter, one document at a time. In order to use it, you must first subclass it and define the match method in your subclass.

Before you get concerned about mentions of caching results, rest assured that it’s done with a tiny data structure (a DocIdBitSet) where each bit position represents a document. Consider also the alternative to using a filter: aggregating required clauses in a BooleanQuery. In this section, we’ll discuss each of the built-in filters as well as the BooleanQuery alternative.
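To make the bit-per-document idea concrete, here is a minimal sketch in plain Java that uses java.util.BitSet rather than Lucene's own classes. The names BitSetFilterSketch, DocFilter, and CachingDocFilter are hypothetical; they only mimic the roles of a filter and of a caching decorator, and Lucene's real implementations differ (for example, a real Filter computes its document set against an IndexReader).

```java
import java.util.BitSet;

final class BitSetFilterSketch {

    /** Hypothetical stand-in for a filter: yields one bit per document ID. */
    interface DocFilter {
        BitSet bits(int maxDoc);
    }

    /** Hypothetical caching decorator: computes the wrapped filter's bits once, then reuses them. */
    static final class CachingDocFilter implements DocFilter {
        private final DocFilter delegate;
        private BitSet cached;          // null until first use
        private int cachedMaxDoc = -1;
        int computeCount = 0;           // exposed only so the sketch can show the cache working

        CachingDocFilter(DocFilter delegate) {
            this.delegate = delegate;
        }

        @Override
        public BitSet bits(int maxDoc) {
            if (cached == null || cachedMaxDoc != maxDoc) {
                cached = delegate.bits(maxDoc);
                cachedMaxDoc = maxDoc;
                computeCount++;
            }
            return (BitSet) cached.clone(); // hand out a copy so callers cannot corrupt the cache
        }
    }

    /** Keeps only the query hits whose bit is also set in the filter. */
    static BitSet applyFilter(BitSet queryHits, BitSet filterBits) {
        BitSet result = (BitSet) queryHits.clone();
        result.and(filterBits);
        return result;
    }

    public static void main(String[] args) {
        // Filter that accepts even-numbered documents only.
        DocFilter evenDocs = maxDoc -> {
            BitSet b = new BitSet(maxDoc);
            for (int doc = 0; doc < maxDoc; doc += 2) {
                b.set(doc);
            }
            return b;
        };
        CachingDocFilter cachedFilter = new CachingDocFilter(evenDocs);

        BitSet hits = new BitSet(8);
        hits.set(1);
        hits.set(2);
        hits.set(4);
        hits.set(7);

        System.out.println(applyFilter(hits, cachedFilter.bits(8))); // documents 2 and 4 survive
        System.out.println(applyFilter(hits, cachedFilter.bits(8))); // second call reuses the cached bits
        System.out.println("computed " + cachedFilter.computeCount + " time(s)");
    }
}
```

Even for millions of documents, one bit per document stays small, which is why caching a filter's result is cheap compared with re-evaluating it on every search.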

TermRangeFilter:
TermRangeFilter filters on a range of terms in a specific field, just like TermRangeQuery minus the scoring. If the field is numeric, you should use NumericRangeFilter (described next) instead. TermRangeFilter applies to textual fields.
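The reason for that advice is that a term range compares terms as strings. A minimal plain-Java sketch (no Lucene involved; the class name LexicographicVsNumeric is only for illustration) shows how lexicographic order can disagree with numeric order when numbers are stored as unpadded text:

```java
final class LexicographicVsNumeric {

    /** Inclusive range test using String ordering, the way a term range treats terms. */
    static boolean inTermRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    /** Inclusive range test using numeric ordering. */
    static boolean inNumericRange(int value, int lower, int upper) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        // As text, "9" sorts after "10" because '9' > '1', so it falls outside ["1", "10"].
        System.out.println(inTermRange("9", "1", "10"));  // false
        // As a number, 9 is inside [1, 10].
        System.out.println(inNumericRange(9, 1, 10));     // true
    }
}
```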

Let’s look at title filtering as an example, shown in listing 5.12. We use the MatchAllDocsQuery as our query, and then apply a title filter to it.

- Listing 5.12 Using TermRangeFilter to filter by title

package ch5;  
  
import john.utils.TestUtil;  
import junit.framework.TestCase;  
  
import org.apache.lucene.analysis.SimpleAnalyzer;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.document.Field;  
import org.apache.lucene.index.IndexReader;  
import org.apache.lucene.index.IndexWriter;  
import org.apache.lucene.index.IndexWriterConfig;  
import org.apache.lucene.search.Filter;  
import org.apache.lucene.search.IndexSearcher;  
import org.apache.lucene.search.MatchAllDocsQuery;  
import org.apache.lucene.search.Query;  
import org.apache.lucene.search.TermRangeFilter;  
import org.apache.lucene.store.Directory;  
import org.apache.lucene.store.RAMDirectory;  
import org.apache.lucene.util.Version;  
  
public class FilterTest extends TestCase {  
    private Query allBooks;  
    private IndexSearcher searcher;  
    public static Version LUCENE_VERSION = Version.LUCENE_30;  
    private SimpleAnalyzer analyzer = null;  
  
    @Override  
    protected void setUp() throws Exception {  
        Directory directory = new RAMDirectory();  
        analyzer = new SimpleAnalyzer(LUCENE_VERSION);  
        IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION,  
                analyzer);  
        IndexWriter writer = new IndexWriter(directory, iwConfig);  
  
        Document doc = new Document();  
        doc.add(new Field("title2", "the quick brown fox jumps over the lazy dog",  
                Field.Store.YES, Field.Index.ANALYZED));  
        writer.addDocument(doc);  
        doc = new Document();  
        doc.add(new Field("title2", "the quick red fox jumps over the sleepy cat",  
                Field.Store.YES, Field.Index.ANALYZED));  
        writer.addDocument(doc);  
        doc = new Document();  
        doc.add(new Field("title2", "abc tutorial",  
                Field.Store.YES, Field.Index.ANALYZED));  
        writer.addDocument(doc);  
        writer.close();  
        IndexReader reader = IndexReader.open(directory);  
        searcher = new IndexSearcher(reader);  
        allBooks = new MatchAllDocsQuery();  
    }  
  
    @Override  
    protected void tearDown() throws Exception {  
        searcher.close();  
    }  
  
    public void testTermRangeFilter() throws Exception {  
        Filter filter = new TermRangeFilter("title2", "d", "j", true, true);  
        assertEquals(2, TestUtil.hitCount(searcher, allBooks, filter));  
    }  
}  

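In the setUp() method above, we add three documents. Only two of them match in testTermRangeFilter(). The document "abc tutorial" doesn't match because none of its terms fall within the range 'd' to 'j' (a document matches as long as at least one of its terms is in the range).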
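The matching rule can be reproduced without Lucene. The sketch below is plain Java under simplifying assumptions: it lowercases each title and splits it on whitespace as a rough stand-in for SimpleAnalyzer, and TermRangeSketch with its methods are hypothetical names, not Lucene API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class TermRangeSketch {

    /** True if at least one term of the text lies in the inclusive range [lower, upper]. */
    static boolean matches(String text, String lower, String upper) {
        // Rough stand-in for analysis: lowercase, then split on whitespace.
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0) {
                return true; // one term inside the range is enough
            }
        }
        return false;
    }

    /** Counts the documents that have at least one term in the range. */
    static long hitCount(List<String> docs, String lower, String upper) {
        return docs.stream().filter(d -> matches(d, lower, upper)).count();
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the quick brown fox jumps over the lazy dog",  // "fox" and "dog" are in range
                "the quick red fox jumps over the sleepy cat",  // "fox" is in range
                "abc tutorial");                                 // no term in range
        // "jumps" is not a match: it sorts after the upper bound "j".
        System.out.println(hitCount(docs, "d", "j")); // 2
    }
}
```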

OPEN-ENDED RANGE FILTERING
TermRangeFilter also supports open-ended ranges. To filter on ranges with one end of the range specified and the other end open, just pass null for whichever end should be open:

filter = new TermRangeFilter("modified", null, jan31, false, true);  
filter = new TermRangeFilter("modified", jan1, null, true, false);  

TermRangeFilter provides two static convenience methods to achieve the same thing:

filter = TermRangeFilter.Less("modified", jan31);  
filter = TermRangeFilter.More("modified", jan1);  
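As a rough picture of how a null bound and the two inclusive flags could be interpreted, here is a plain-Java sketch. OpenRange and inRange are hypothetical names used only for illustration, and the date strings assume the field holds sortable values such as yyyyMMdd, which the snippets above do not specify.

```java
final class OpenRange {

    /**
     * Range test where a null bound means "open" on that side.
     * In this sketch the inclusive flag for an open side has no effect,
     * since there is no bound to compare against.
     */
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        if (lower != null) {
            int cmp = term.compareTo(lower);
            if (cmp < 0 || (cmp == 0 && !includeLower)) {
                return false;
            }
        }
        if (upper != null) {
            int cmp = term.compareTo(upper);
            if (cmp > 0 || (cmp == 0 && !includeUpper)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String jan1 = "20040101";   // assumed yyyyMMdd encoding, for illustration only
        String jan31 = "20040131";

        // No lower bound: everything up to and including jan31.
        System.out.println(inRange("19990615", null, jan31, false, true)); // true
        // No upper bound: everything from jan1 onward.
        System.out.println(inRange("20050101", jan1, null, true, false));  // true
        // An exclusive lower bound rejects the bound itself.
        System.out.println(inRange(jan1, jan1, null, false, false));       // false
    }
}
```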

NumericRangeFilter:
NumericRangeFilter filters by numeric value. This is just like NumericRangeQuery, minus the constant scoring:

...  
    @Override  
    protected void setUp() throws Exception {  
        NumericField nf = new NumericField("pubmonth");       
        Directory directory = new RAMDirectory();  
        analyzer = new SimpleAnalyzer(LUCENE_VERSION);  
        IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION,  
                analyzer);  
        IndexWriter writer = new IndexWriter(directory, iwConfig);  
        writer.deleteAll();  
        Document doc = new Document();  
        doc.add(new Field("title2", "the quick brown fox jumps over the lazy dog",  
                Field.Store.YES, Field.Index.ANALYZED));  
        nf.setIntValue(201003);  
        doc.add(nf);  
        writer.addDocument(doc);  
        doc = new Document();  
        doc.add(new Field("title2", "the quick red fox jumps over the sleepy cat",  
                Field.Store.YES, Field.Index.ANALYZED));  
        nf.setIntValue(201105);  
        doc.add(nf);  
        writer.addDocument(doc);  
        doc = new Document();  
        doc.add(new Field("title2", "abc tutorial",  
                Field.Store.YES, Field.Index.ANALYZED));  
        nf.setIntValue(201006);  
        doc.add(nf);  
        writer.addDocument(doc);  
        writer.close();  
        IndexReader reader = IndexReader.open(directory);  
        searcher = new IndexSearcher(reader);  
        allBooks = new MatchAllDocsQuery();  
    }  
...  
    public void testNumericDateFilter() throws Exception {  
        Filter filter = NumericRangeFilter.newIntRange("pubmonth", 201001, 201006, true, true);  
        assertEquals(2, TestUtil.hitCount(searcher, allBooks, filter));  
    }  

A document above is selected only if its "pubmonth" field falls within the range [201001, 201006]. The same caveats as NumericRangeQuery apply here; for example, if you specify a precisionStep different from the default, it must match the precisionStep used during indexing.

FieldCacheRangeFilter:
FieldCacheRangeFilter is another option for range filtering. It achieves exactly the same filtering as both TermRangeFilter and NumericRangeFilter, but does so by using Lucene’s field cache. This may result in faster performance in certain situations, since all values are preloaded into memory. But the usual caveats with field cache apply, as described in section 5.1.

This filter exposes a different API to achieve range filtering. Here’s how to do the same filtering on title2 that we did with TermRangeFilter:

Filter filter = FieldCacheRangeFilter.newStringRange("title2", "d", "j", true, true);  
assertEquals(2, TestUtil.hitCount(searcher, allBooks, filter));  

To achieve the same filtering that we did with NumericRangeFilter:

filter = FieldCacheRangeFilter.newIntRange("pubmonth",  
                                           201001,  
                                           201006,  
                                           true,  
                                           true);  
assertEquals(2, TestUtil.hitCount(searcher, allBooks, filter));  

Let’s see how to filter by an arbitrary set of terms.

- Filtering by specific terms
Sometimes you’d simply like to select specific terms to include in your filter. For example, perhaps your documents have Country as a field, and your search interface presents a checkbox allowing the user to pick and choose which countries to include in the search. There are two ways to achieve this.

The first approach is FieldCacheTermsFilter, which uses field cache under the hood. (Be sure to read section 5.1 for the trade-offs of the field cache.) Simply instantiate it with the field (String) and an array of String:

@Override  
protected void setUp() throws Exception {  
    NumericField nf = new NumericField("pubmonth");       
    Directory directory = new RAMDirectory();  
    analyzer = new SimpleAnalyzer(LUCENE_VERSION);  
    IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION,  
            analyzer);  
    IndexWriter writer = new IndexWriter(directory, iwConfig);  
    writer.deleteAll();  
    Document doc = new Document();  
    doc.add(new Field("title2", "the quick brown fox jumps over the lazy dog",  
            Field.Store.YES, Field.Index.ANALYZED));  
    doc.add(new Field("category", "dog", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));  
    nf.setIntValue(201003);  
    doc.add(nf);  
    writer.addDocument(doc);  
    doc = new Document();  
    doc.add(new Field("title2", "the quick red fox jumps over the sleepy cat",  
            Field.Store.YES, Field.Index.ANALYZED));  
    doc.add(new Field("category", "cat", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));  
    nf.setIntValue(201105);  
    doc.add(nf);  
    writer.addDocument(doc);  
    doc = new Document();  
    doc.add(new Field("title2", "abc tutorial",  
            Field.Store.YES, Field.Index.ANALYZED));  
    doc.add(new Field("category", "tutorial", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));  
    nf.setIntValue(201006);  
    doc.add(nf);  
    writer.addDocument(doc);  
    writer.close();  
    IndexReader reader = IndexReader.open(directory);  
    searcher = new IndexSearcher(reader);  
    allBooks = new MatchAllDocsQuery();  
}  
  
public void testFieldCacheTermsFilter() throws Exception {  
    Filter filter = new FieldCacheTermsFilter("category", new String[] {  
            "cat",  
            "dog" });  
    assertEquals("expected 2 hits", 2,  
            TestUtil.hitCount(searcher, allBooks, filter));  
}  

All documents that have any of the terms in the specified field will be accepted. Note that the documents must have a single term value for each field. Under the hood, this filter loads all terms for all documents into the field cache the first time it’s used during searching for a given field. This means the first search will be slower, but subsequent searches, which reuse the cache, will be very fast. The field cache is reused even if you change which specific terms are included in the filter.
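The caching behavior described above can be pictured with a small plain-Java sketch. It is not Lucene's implementation: FieldCacheSketch, its per-field array of values, and the loader function are hypothetical, and they only model the idea that values are loaded once per field and then reused for any set of terms.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class FieldCacheSketch {
    private final Map<String, String[]> cache = new HashMap<>();
    private final Function<String, String[]> loader;
    int loads = 0; // counts the expensive loads, for demonstration only

    FieldCacheSketch(Function<String, String[]> loader) {
        this.loader = loader;
    }

    /** Returns the single value of the field for every document, loading it on first use. */
    private String[] values(String field) {
        return cache.computeIfAbsent(field, f -> {
            loads++;
            return loader.apply(f);
        });
    }

    /** Sets a bit for every document whose field value is one of the given terms. */
    BitSet termsFilter(String field, String... terms) {
        Set<String> wanted = new HashSet<>(Arrays.asList(terms));
        String[] byDoc = values(field);
        BitSet bits = new BitSet(byDoc.length);
        for (int doc = 0; doc < byDoc.length; doc++) {
            if (wanted.contains(byDoc[doc])) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        // One category value per document, mirroring the three documents indexed above.
        FieldCacheSketch fc = new FieldCacheSketch(field -> new String[] {"dog", "cat", "tutorial"});
        System.out.println(fc.termsFilter("category", "cat", "dog").cardinality()); // 2
        System.out.println(fc.termsFilter("category", "tutorial").cardinality());   // 1
        System.out.println(fc.loads); // 1: the second filter reused the cached values
    }
}
```

The array holds exactly one value per document, which is also why this style of filter needs single-valued fields.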

The second approach for filtering by terms is TermsFilter, which is included in Lucene’s contrib modules and is described in more detail in section 8.6.4. TermsFilter doesn’t do any internal caching, and it allows filtering on fields that have more than one term; otherwise, TermsFilter and FieldCacheTermsFilter are functionally identical. It’s best to test both approaches for your application to see if there are any significant performance differences.


Sunday, June 2, 2013

[ InAction Note ] Ch5. Advanced search techniques - Filtering a search (1)
