Filtering is a mechanism for narrowing the search space, allowing only a subset of the documents to be considered as possible hits. Filters can be used to implement search-within-search features, which successively search within a previous set of results, or to constrain the document search space. A security filter allows users to see only search results for documents they “own,” even if their query technically matches other documents that are off limits; we provide an example of a security filter in section 5.6.7.
You can filter any Lucene search using the overloaded search methods that accept a Filter instance. There are numerous built-in filter implementations:
- TermRangeFilter
- NumericRangeFilter
- FieldCacheRangeFilter
- FieldCacheTermsFilter
- QueryWrapperFilter
- SpanQueryFilter
- PrefixFilter
- CachingWrapperFilter
- CachingSpanFilter
- FilteredDocIdSet
Before you get concerned about mentions of caching results, rest assured that it’s done with a tiny data structure (a DocIdBitSet) where each bit position represents a document. Consider also the alternative to using a filter: aggregating required clauses in a BooleanQuery. In this section, we’ll discuss each of the built-in filters as well as the BooleanQuery alternative.
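To make the "one bit per document" idea concrete, here is a minimal sketch in plain Java. It is a conceptual model only, not Lucene's implementation: the class name `BitSetFilterSketch` and the method `applyFilter` are made up for illustration, and `java.util.BitSet` stands in for the bit set a filter would produce.

```java
import java.util.BitSet;

// Conceptual model only: one bit per document id, set when the document passes the filter.
final class BitSetFilterSketch {

    // Returns the ids of documents that both match the query and pass the filter.
    static BitSet applyFilter(BitSet queryMatches, BitSet filterBits) {
        BitSet result = (BitSet) queryMatches.clone(); // leave the inputs untouched
        result.and(filterBits);                        // keep only bits set in both
        return result;
    }

    public static void main(String[] args) {
        BitSet queryMatches = new BitSet(8);
        queryMatches.set(0, 8);          // pretend the query matches docs 0..7

        BitSet ownedByUser = new BitSet(8);
        ownedByUser.set(1);              // pretend the user "owns" docs 1, 4 and 6
        ownedByUser.set(4);
        ownedByUser.set(6);

        BitSet visible = applyFilter(queryMatches, ownedByUser);
        System.out.println(visible.cardinality() + " visible documents: " + visible);
    }
}
```

Because the result is itself a bit set, it could be fed back in as the filter for a later search, which is roughly how a search-within-search feature narrows results step by step.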
TermRangeFilter:
TermRangeFilter filters on a range of terms in a specific field, just like TermRangeQuery minus the scoring. It applies to textual fields; if the field is numeric, use NumericRangeFilter (described next) instead.
Let’s look at title filtering as an example, shown in listing 5.12. We use the MatchAllDocsQuery as our query, and then apply a title filter to it.
Listing 5.12 Using TermRangeFilter to filter by title
- package ch5;
- import john.utils.TestUtil;
- import junit.framework.TestCase;
- import org.apache.lucene.analysis.SimpleAnalyzer;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.document.Field;
- import org.apache.lucene.index.IndexReader;
- import org.apache.lucene.index.IndexWriter;
- import org.apache.lucene.index.IndexWriterConfig;
- import org.apache.lucene.search.Filter;
- import org.apache.lucene.search.IndexSearcher;
- import org.apache.lucene.search.MatchAllDocsQuery;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.TermRangeFilter;
- import org.apache.lucene.store.Directory;
- import org.apache.lucene.store.RAMDirectory;
- import org.apache.lucene.util.Version;
- public class FilterTest extends TestCase {
- private Query allBooks;
- private IndexSearcher searcher;
- public static Version LUCENE_VERSION = Version.LUCENE_30;
- private SimpleAnalyzer analyzer = null;
- @Override
- protected void setUp() throws Exception {
- Directory directory = new RAMDirectory();
- analyzer = new SimpleAnalyzer(LUCENE_VERSION);
- IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION,
- analyzer);
- IndexWriter writer = new IndexWriter(directory, iwConfig);
- Document doc = new Document();
- doc.add(new Field("title2", "the quick brown fox jumps over the lazy dog",
- Field.Store.YES, Field.Index.ANALYZED));
- writer.addDocument(doc);
- doc = new Document();
- doc.add(new Field("title2", "the quick red fox jumps over the sleepy cat",
- Field.Store.YES, Field.Index.ANALYZED));
- writer.addDocument(doc);
- doc = new Document();
- doc.add(new Field("title2", "abc tutorial",
- Field.Store.YES, Field.Index.ANALYZED));
- writer.addDocument(doc);
- writer.close();
- IndexReader reader = IndexReader.open(directory);
- searcher = new IndexSearcher(reader);
- allBooks = new MatchAllDocsQuery();
- }
- @Override
- protected void tearDown() throws Exception {
- searcher.close();
- }
- public void testTermRangeFilter() throws Exception {
- Filter filter = new TermRangeFilter("title2", "d", "j", true, true);
- assertEquals(2, TestUtil.hitCount(searcher, allBooks, filter));
- }
- }
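It may not be obvious why the test above expects two hits. The following plain-Java sketch models the check under two assumptions: terms are compared with ordinary `String` ordering, and the analyzer is approximated by lowercasing and splitting on non-letters. The class `TermRangeSketch` and its methods are hypothetical helpers, not part of Lucene.

```java
import java.util.Arrays;
import java.util.List;

// Conceptual model of an inclusive term range check over tokenized titles.
final class TermRangeSketch {

    // True when lower <= term <= upper in plain String ordering (both ends inclusive).
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Counts documents that contain at least one token inside the range.
    static int countMatching(List<String> titles, String lower, String upper) {
        int hits = 0;
        for (String title : titles) {
            // Crude stand-in for the analyzer: lowercase, then split on non-letters.
            for (String token : title.toLowerCase().split("[^a-z]+")) {
                if (!token.isEmpty() && inRange(token, lower, upper)) {
                    hits++;
                    break; // one matching token is enough for the document
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "the quick brown fox jumps over the lazy dog",
                "the quick red fox jumps over the sleepy cat",
                "abc tutorial");
        System.out.println(countMatching(titles, "d", "j")); // prints 2
    }
}
```

The first two titles match through "fox" (and "dog" in the first), while "jumps" sorts after "j" and so falls outside the range. Neither "abc" nor "tutorial" lies between "d" and "j", so the third title is excluded.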
OPEN-ENDED RANGE FILTERING
TermRangeFilter also supports open-ended ranges. To filter on ranges with one end of the range specified and the other end open, just pass null for whichever end should be open. The static convenience methods Less and More are shorthand for the same two filters:
- filter = new TermRangeFilter("modified", null, jan31, false, true);
- filter = new TermRangeFilter("modified", jan1, null, true, false);
- filter = TermRangeFilter.Less("modified", jan31);
- filter = TermRangeFilter.More("modified", jan1);
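The null-means-open convention can be sketched in plain Java as follows. This is an illustrative model rather than Lucene code: `OpenRangeSketch` is a made-up class, and the `jan1` and `jan31` values are hypothetical date strings, since the snippet above does not show how they are defined.

```java
// Conceptual model of a range check where a null bound leaves that side open.
final class OpenRangeSketch {

    // A null bound means "no limit on this side"; its inclusive flag is then ignored.
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        if (lower != null) {
            int c = term.compareTo(lower);
            if (c < 0 || (c == 0 && !includeLower)) return false;
        }
        if (upper != null) {
            int c = term.compareTo(upper);
            if (c > 0 || (c == 0 && !includeUpper)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String jan1 = "20100101";   // hypothetical value, assumed format yyyyMMdd
        String jan31 = "20100131";  // hypothetical value, assumed format yyyyMMdd

        // Open lower end: everything up to and including jan31.
        System.out.println(inRange("20091215", null, jan31, false, true)); // true
        System.out.println(inRange("20100201", null, jan31, false, true)); // false

        // Open upper end: everything from jan1 onward.
        System.out.println(inRange("20100115", jan1, null, true, false));  // true
    }
}
```

Fixed-width date strings such as these sort the same way as the dates they represent, which is what makes a textual range meaningful for them.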
NumericRangeFilter:
NumericRangeFilter filters by numeric value. This is just like NumericRangeQuery, minus the constant scoring:
- ...
- @Override
- protected void setUp() throws Exception {
- NumericField nf = new NumericField("pubmonth");
- Directory directory = new RAMDirectory();
- analyzer = new SimpleAnalyzer(LUCENE_VERSION);
- IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION,
- analyzer);
- IndexWriter writer = new IndexWriter(directory, iwConfig);
- writer.deleteAll();
- Document doc = new Document();
- doc.add(new Field("title2", "the quick brown fox jumps over the lazy dog",
- Field.Store.YES, Field.Index.ANALYZED));
- nf.setIntValue(201003);
- doc.add(nf);
- writer.addDocument(doc);
- doc = new Document();
- doc.add(new Field("title2", "the quick red fox jumps over the sleepy cat",
- Field.Store.YES, Field.Index.ANALYZED));
- nf.setIntValue(201105);
- doc.add(nf);
- writer.addDocument(doc);
- doc = new Document();
- doc.add(new Field("title2", "abc tutorial",
- Field.Store.YES, Field.Index.ANALYZED));
- nf.setIntValue(201006);
- doc.add(nf);
- writer.addDocument(doc);
- writer.close();
- IndexReader reader = IndexReader.open(directory);
- searcher = new IndexSearcher(reader);
- allBooks = new MatchAllDocsQuery();
- }
- ...
- public void testNumericDateFilter() throws Exception {
- Filter filter = NumericRangeFilter.newIntRange("pubmonth", 201001, 201006, true, true);
- assertEquals(2, TestUtil.hitCount(searcher, allBooks, filter));
- }
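The sketch below is a plain-Java illustration (not Lucene code) of the inclusive numeric check applied to the three pubmonth values from the test setup. It also shows why numbers are better compared as numbers than as text. The class `NumericRangeSketch` is a hypothetical helper.

```java
// Conceptual model of an inclusive numeric range check.
final class NumericRangeSketch {

    // Counts values v with min <= v <= max (both ends inclusive).
    static int countInRange(int[] values, int min, int max) {
        int hits = 0;
        for (int v : values) {
            if (v >= min && v <= max) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] pubmonths = {201003, 201105, 201006};
        System.out.println(countInRange(pubmonths, 201001, 201006)); // prints 2

        // Numeric order and textual order can disagree:
        System.out.println(Integer.compare(9, 10) < 0); // true: 9 < 10 as numbers
        System.out.println("9".compareTo("10") < 0);    // false: "9" sorts after "10" as text
    }
}
```

The values 201003 and 201006 fall inside the range and 201105 does not, which matches the two hits the test expects.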
FieldCacheRangeFilter:
FieldCacheRangeFilter is another option for range filtering. It achieves exactly the same filtering as both TermRangeFilter and NumericRangeFilter, but does so by using Lucene’s field cache. This may result in faster performance in certain situations, since all values are preloaded into memory. But the usual caveats with field cache apply, as described in section 5.1.
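This filter exposes a different API to achieve range filtering. Here’s how to do the same filtering on title2 and pubmonth that we did with TermRangeFilter and NumericRangeFilter. Because the field cache holds a single value per document, this assumes each filtered field has exactly one term per document; a title2 field tokenized by an analyzer, as in listing 5.12, would not meet that requirement: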
- Filter filter = FieldCacheRangeFilter.newStringRange("title2", "d", "j", true, true);
- assertEquals(2, TestUtil.hitCount(searcher, allBooks, filter));
- filter = FieldCacheRangeFilter.newIntRange("pubmonth",
- 201001,
- 201006,
- true,
- true);
- assertEquals(2, TestUtil.hitCount(searcher, allBooks, filter));
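As a rough mental model of the field cache approach, picture an array indexed by document ID that is loaded once and then scanned. The sketch below is plain Java written for illustration, not Lucene's implementation. `FieldCacheRangeSketch` is a made-up class, and the `int[]` stands in for the cached values.

```java
import java.util.BitSet;

// Conceptual model: range filtering over values preloaded into an array by doc id.
final class FieldCacheRangeSketch {

    // cache[docId] holds the single field value of that document, loaded up front.
    static BitSet rangeFilter(int[] cache, int min, int max) {
        BitSet bits = new BitSet(cache.length);
        for (int docId = 0; docId < cache.length; docId++) {
            int value = cache[docId];
            if (value >= min && value <= max) {
                bits.set(docId); // document passes the filter
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] pubmonthCache = {201003, 201105, 201006}; // one entry per document
        BitSet accepted = rangeFilter(pubmonthCache, 201001, 201006);
        System.out.println(accepted.cardinality()); // prints 2
    }
}
```

The model shows the trade-off. Lookups are simple array reads, but the array needs one slot for every document in the index, whether or not the filter ever matches it.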
Filtering by specific terms:
Sometimes you’d simply like to select specific terms to include in your filter. For example, perhaps your documents have Country as a field, and your search interface presents a checkbox allowing the user to pick and choose which countries to include in the search. There are two ways to achieve this.
The first approach is FieldCacheTermsFilter, which uses field cache under the hood. (Be sure to read section 5.1 for the trade-offs of the field cache.) Simply instantiate it with the field (String) and an array of String:
- @Override
- protected void setUp() throws Exception {
- NumericField nf = new NumericField("pubmonth");
- Directory directory = new RAMDirectory();
- analyzer = new SimpleAnalyzer(LUCENE_VERSION);
- IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION,
- analyzer);
- IndexWriter writer = new IndexWriter(directory, iwConfig);
- writer.deleteAll();
- Document doc = new Document();
- doc.add(new Field("title2", "the quick brown fox jumps over the lazy dog",
- Field.Store.YES, Field.Index.ANALYZED));
- doc.add(new Field("category", "dog", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
- nf.setIntValue(201003);
- doc.add(nf);
- writer.addDocument(doc);
- doc = new Document();
- doc.add(new Field("title2", "the quick red fox jumps over the sleepy cat",
- Field.Store.YES, Field.Index.ANALYZED));
- doc.add(new Field("category", "cat", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
- nf.setIntValue(201105);
- doc.add(nf);
- writer.addDocument(doc);
- doc = new Document();
- doc.add(new Field("title2", "abc tutorial",
- Field.Store.YES, Field.Index.ANALYZED));
- doc.add(new Field("category", "tutorial", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
- nf.setIntValue(201006);
- doc.add(nf);
- writer.addDocument(doc);
- writer.close();
- IndexReader reader = IndexReader.open(directory);
- searcher = new IndexSearcher(reader);
- allBooks = new MatchAllDocsQuery();
- }
- public void testFieldCacheTermsFilter() throws Exception {
- Filter filter = new FieldCacheTermsFilter("category", new String[] {
- "cat",
- "dog" });
- assertEquals("expected 2 hits", 2,
- TestUtil.hitCount(searcher, allBooks, filter));
- }
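Conceptually, this kind of filter is a set-membership test against one cached value per document. The following plain-Java sketch illustrates that idea using the category values from the setup above. It is not Lucene's code, and `TermsFilterSketch` and `termsFilter` are hypothetical names.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

// Conceptual model: accept documents whose single cached value is in a set of terms.
final class TermsFilterSketch {

    // cache[docId] is the document's only term for the field, or null if it has none.
    static BitSet termsFilter(String[] cache, String... accepted) {
        Set<String> wanted = new HashSet<>(Arrays.asList(accepted));
        BitSet bits = new BitSet(cache.length);
        for (int docId = 0; docId < cache.length; docId++) {
            if (cache[docId] != null && wanted.contains(cache[docId])) {
                bits.set(docId);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        String[] categoryCache = {"dog", "cat", "tutorial"}; // one term per document
        BitSet accepted = termsFilter(categoryCache, "cat", "dog");
        System.out.println(accepted.cardinality()); // prints 2
    }
}
```

Because the array holds exactly one value per document, this model cannot represent a field with several terms in the same document.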
The second approach for filtering by terms is TermsFilter, which is included in Lucene’s contrib modules and is described in more detail in section 8.6.4. TermsFilter doesn’t do any internal caching, and it allows filtering on fields that have more than one term per document; otherwise, TermsFilter and FieldCacheTermsFilter are functionally identical. It’s best to test both approaches in your application to see whether there are any significant performance differences.