程式扎記: [ InAction Note ] Ch6. Extending search

Preface:
If all the information needed to perform filtering is in the index, there’s no need to write your own filter because the QueryWrapperFilter can handle it, as described in section 5.6.5. But there are good reasons to factor external information into a custom filter. In this section we tackle the following example: using our book example data and pretending we’re running an online bookstore, we want users to be able to search within our special hot deals of the day.

You might be tempted to simply store the specials flag as an indexed field, but keeping this up-to-date might prove too costly. Rather than reindex entire documents when specials change, we’ll implement a custom filter that keeps the specials flagged in our (hypothetical) relational database. Then we’ll see how to apply our filter during searching, and finally we’ll explore an alternative option for applying the filter.

Implementing a custom filter
We start with abstracting away the source of our specials by defining this interface:

package demo.ch6;  
  
public interface SpecialsAccessor {  
    String[] isbns();  
}  

The isbns() method returns the ISBNs of the books that are currently on special. Because we won’t have an enormous number of specials at one time, returning all of them will suffice. Now that we have a retrieval interface, we can create our custom filter, SpecialsFilter. Filters extend the org.apache.lucene.search.Filter class and must implement the getDocIdSet() method, returning a DocIdSet. In Lucene 3.x the signature is getDocIdSet(IndexReader reader); the listing in this note uses the Lucene 4.x form, getDocIdSet(AtomicReaderContext context, Bits acceptDocs), which Lucene calls once per index segment. Bit positions match the document numbers. Enabled bits mean the document at that position is available to be searched against the query, and unset bits mean the document won’t be considered in the search. Figure 6.2 illustrates an example SpecialsFilter that sets bits for books on special (see listing 6.14).
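
To make the bit semantics concrete, here is a minimal sketch in plain Java that uses java.util.BitSet as a stand-in for a DocIdSet. The class name FilterBitsSketch and the document numbers are made up for illustration; it models the idea only and is not how Lucene implements filtering internally.

```java
import java.util.BitSet;

public class FilterBitsSketch {
    /** Returns the documents that match the query AND are enabled in the filter. */
    public static BitSet applyFilter(BitSet queryMatches, BitSet filterBits) {
        BitSet result = (BitSet) queryMatches.clone(); // leave the inputs untouched
        result.and(filterBits);                        // an unset filter bit drops the document
        return result;
    }

    public static void main(String[] args) {
        int maxDoc = 8; // hypothetical segment holding documents 0..7

        BitSet filterBits = new BitSet(maxDoc);
        filterBits.set(2); // document 2 is on special
        filterBits.set(5); // document 5 is on special

        BitSet queryMatches = new BitSet(maxDoc);
        queryMatches.set(1);
        queryMatches.set(2);
        queryMatches.set(5);
        queryMatches.set(7);

        System.out.println(applyFilter(queryMatches, filterBits)); // prints {2, 5}
    }
}
```

Documents 1 and 7 match the query but their filter bits are unset, so they never reach the result.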

- Listing 6.14 Retrieving filter information from external source with SpecialsFilter

import java.io.IOException;  
  
import org.apache.lucene.index.AtomicReader;  
import org.apache.lucene.index.AtomicReaderContext;  
import org.apache.lucene.index.DocsEnum;  
import org.apache.lucene.index.Term;  
import org.apache.lucene.search.DocIdSet;  
import org.apache.lucene.search.DocIdSetIterator;  
import org.apache.lucene.search.Filter;  
import org.apache.lucene.util.Bits;  
import org.apache.lucene.util.OpenBitSet;  
  
public class SpecialsFilter extends Filter {  
    private SpecialsAccessor accessor;  
  
    public SpecialsFilter(SpecialsAccessor accessor) {  
        this.accessor = accessor;  
    }  
  
    @Override  
    public DocIdSet getDocIdSet(AtomicReaderContext ctx, Bits bits)  
            throws IOException {  
        AtomicReader reader = ctx.reader();  
        OpenBitSet oBits = new OpenBitSet(reader.maxDoc());  
        String[] isbns = accessor.isbns();  
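        for (String isbn : isbns)   
        {  
            DocsEnum docEnum = reader.termDocsEnum(new Term("isbn", isbn));  
            // termDocsEnum() returns null when this segment has no such field or term  
            if (docEnum == null) continue;  
            // Mark every document in this segment that carries the ISBN  
            while(docEnum.nextDoc()!= DocIdSetIterator.NO_MORE_DOCS)  
            {  
                if(docEnum.freq()>0)  
                {  
                    oBits.set(docEnum.docID());  
                }  
            }  
        }  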
        return oBits;  
    }  
}  

The filter is quite straightforward. First we fetch the ISBNs of the current specials. Next, we interact with the AtomicReader API to iterate over all documents matching each ISBN; in each case it should be a single document per ISBN because this is a unique field. The isbn field was indexed with Field.Index.NOT_ANALYZED, so the indexed term is the exact ISBN string and we can look the document up directly by ISBN. Finally, we record each matching document in an OpenBitSet, which we return to Lucene. Let’s test our filter during searching.

Using our custom filter during searching
To test that our filter is working, we created a simple TestSpecialsAccessor to return a specified set of ISBNs, giving our test case control over the set of specials:

public class TestSpecialsAccessor implements SpecialsAccessor {  
    private String[] isbns;  
  
    public TestSpecialsAccessor(String[] isbns) {  
        this.isbns = isbns;  
    }  
  
    public String[] isbns() {  
        return isbns;  
    }  
}  

Here’s how we test our SpecialsFilter, using the same setUp() that the other filter tests used:

public void testCustomFilter() throws Exception {  
    Query allBooks = new TermQuery(new Term("contents", "manning"));  
    String[] isbns = new String[] { "1933988940", "9781935182023" };  
    SpecialsAccessor accessor = new TestSpecialsAccessor(isbns);  
    Filter filter = new SpecialsFilter(accessor);  
    TopDocs hits = searcher.search(allBooks, filter, 10);  
    assertEquals("the specials", isbns.length, hits.totalHits);  
}  

Note that we made an important implementation decision not to cache the DocIdSet in SpecialsFilter. Decorating SpecialsFilter with a CachingWrapperFilter frees us from that aspect. Let’s see an alternative means of applying a filter during searching.
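
In Lucene the decoration is a one-liner, new CachingWrapperFilter(new SpecialsFilter(accessor)). The sketch below models what such a caching decorator does, in plain Java without any Lucene classes: compute the bits once per segment and reuse them afterwards. The names CachingFilterSketch and CachingBits, and the use of a String as the segment key, are assumptions made for illustration only.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class CachingFilterSketch {
    /** Decorator: computes the bits once per segment key, then reuses them. */
    public static class CachingBits implements Function<String, BitSet> {
        private final Function<String, BitSet> delegate;
        private final Map<String, BitSet> cache = new ConcurrentHashMap<>();

        public CachingBits(Function<String, BitSet> delegate) {
            this.delegate = delegate;
        }

        @Override
        public BitSet apply(String segmentKey) {
            // Only calls the delegate when no bits are cached for this segment yet
            return cache.computeIfAbsent(segmentKey, delegate);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0}; // counts how often the expensive lookup actually runs
        Function<String, BitSet> expensive = segment -> {
            calls[0]++;
            BitSet bits = new BitSet();
            bits.set(3); // hypothetical document on special
            return bits;
        };

        Function<String, BitSet> cached = new CachingBits(expensive);
        cached.apply("segment_0");
        cached.apply("segment_0"); // served from the cache
        cached.apply("segment_1");
        System.out.println(calls[0]); // prints 2
    }
}
```

The trade-off is staleness: cached bits keep reflecting the specials as they were when first computed, so a cached filter only fits when the specials stay fixed for the lifetime of the cache.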

An alternative: FilteredQuery
To add to the filter terminology overload, one final option is FilteredQuery. FilteredQuery inverts the situation that searching with a filter presents. Using a filter, an IndexSearcher’s search method applies a single filter during querying. Using the FilteredQuery, though, you can turn any filter into a query, which opens up neat possibilities, such as adding a filter as a clause to a BooleanQuery.

Let’s take the SpecialsFilter as an example again. This time, we want a more sophisticated query: books in an education category on special, or books on Logo. We couldn’t accomplish this with a direct query using the techniques shown thus far, but FilteredQuery makes this possible. Had our search been only for books in the education category on special, we could’ve used the technique shown in the previous code snippet instead.

Our test case, in listing 6.15, demonstrates the described query using a BooleanQuery with a nested TermQuery and FilteredQuery.

- Listing 6.15 Using a FilteredQuery

public void testFilteredQuery() throws Exception {  
    // 1)  
    String[] isbns = new String[] { "9781935182023" };        
      
    // 2)  
    SpecialsAccessor accessor = new TestSpecialsAccessor(isbns);          
    Filter filter = new SpecialsFilter(accessor);  
    WildcardQuery educationBooks = new WildcardQuery(new Term("category", "*education*"));  
    FilteredQuery edBooksOnSpecial = new FilteredQuery(educationBooks, filter);  
      
    // 3)  
    TermQuery logoBooks = new TermQuery(new Term("subject", "logo"));  
      
    // 4)  
    BooleanQuery logoOrEdBooks = new BooleanQuery();  
    logoOrEdBooks.add(logoBooks, BooleanClause.Occur.SHOULD);  
    logoOrEdBooks.add(edBooksOnSpecial, BooleanClause.Occur.SHOULD);  
    TopDocs hits = searcher.search(logoOrEdBooks, 10);  
    System.out.println(logoOrEdBooks.toString());  
    assertEquals("Papert and Steiner", 2, hits.totalHits);  
}  

1) This is the ISBN of the one book on special, which the filter will let through.
2) We construct a query for education books on special.
3) We construct a query for all books with logo in the subject.
4) The two queries are combined in an OR fashion.
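
The matching logic of the combined query, (education AND on special) OR logo, can be modelled with plain bit sets. This is only a sketch: the class FilteredQuerySketch, its helper methods, and the document numbers are hypothetical, and it ignores scoring, which a real BooleanQuery also computes.

```java
import java.util.BitSet;

public class FilteredQuerySketch {
    /** Builds a BitSet with the given document numbers set. */
    public static BitSet docs(int... ids) {
        BitSet bits = new BitSet();
        for (int id : ids) {
            bits.set(id);
        }
        return bits;
    }

    /** (query AND filter) OR other: a filtered clause and a plain clause, both SHOULD. */
    public static BitSet filteredOr(BitSet query, BitSet filter, BitSet other) {
        BitSet result = (BitSet) query.clone();
        result.and(filter); // the filtered clause: the query restricted by the filter
        result.or(other);   // two SHOULD clauses match the union of their documents
        return result;
    }

    public static void main(String[] args) {
        BitSet education = docs(0, 4, 6); // hypothetical education books
        BitSet specials  = docs(4, 9);    // hypothetical books on special
        BitSet logo      = docs(2);       // hypothetical books about Logo

        System.out.println(filteredOr(education, specials, logo)); // prints {2, 4}
    }
}
```

Document 9 is on special but is not an education book, and documents 0 and 6 are education books that are not on special, so none of them match. A single filter passed to the search method could not express this, because it would also restrict the logo clause.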

The getDocIdSet() method of the nested Filter is called each time a FilteredQuery is used in a search, so we recommend that you use a caching filter if the query is to be used repeatedly and the results of a filter don’t change.

Filtering is a powerful means of overriding which documents a query may match, and in this section you’ve seen how to create custom filters and use them during searching, as well as how to wrap a filter as a query so that it may be used wherever a query may be used. Filters give you a lot of flexibility for advanced searching.

Wednesday, July 23, 2014

[ InAction Note ] Ch6. Extending search - Custom filters
