程式扎記: [ InAction Note ] Ch6. Extending search - Developing a custom Collector


Monday, July 14, 2014

[ InAction Note ] Ch6. Extending search - Developing a custom Collector

Preface:
In most applications with full-text search, users are looking for the top documents when sorting by either relevance or field values. The most common usage pattern is such that only these ScoreDocs are visited. In some scenarios, though, users want more control over precisely which documents should be retained during searching.

Lucene allows full customization of what you do with each matching document if you create your own subclass of the abstract Collector base class. For example, perhaps you wish to gather every single document ID that matched the query. Or perhaps with each matched document you’d like to consult its contents or an external resource to collate additional information. We’ll cover both of these examples in this section.

Developing a custom Collector
You might be tempted to run a normal search with a very large numHits and then postprocess the results. This strategy will work, but it’s exceptionally inefficient: the standard search methods spend considerable CPU computing scores and sorting results, and you may need neither. Using a custom Collector class avoids these costs.

We begin by delving into the methods that make up the custom Collector API (see table 6.1).


The Collector base class
Collector is an abstract base class that defines the API Lucene interacts with during searching. As with the FieldComparator API for custom sorting, Collector’s API is more complex than you’d expect, in order to enable high-performance hit collection. Table 6.1 shows the four methods with a brief summary.

All of Lucene’s core search methods use a Collector subclass under the hood to do their collection. For example, when sorting by relevance, TopScoreDocCollector is used. When sorting by field, it’s TopFieldCollector. Both of these are public classes in the org.apache.lucene.search package, and you can instantiate them yourself if needed. Likely your application can use one of these classes, or subclass TopDocsCollector, instead of implementing Collector directly.

During searching, when Lucene finds a matching document, it calls the Collector’s collect(int docID) method. Lucene couldn’t care less what’s done with the document; it’s up to the Collector to record the match, if it wants. This is the hot spot of searching, so make sure your collect method does only the bare minimum work required.

Lucene drives searching one segment at a time, for higher performance, and notifies you of each segment transition by calling setNextReader(AtomicReaderContext context). The provided IndexReader (context.reader()) is specific to the segment, so it will be a different instance for each segment. It’s important for the Collector to record the docBase (context.docBase) at this point, because the docID passed to the collect method is relative to the current segment. To get the absolute, or global, docID, you must add docBase to it. This method is also the place to do any segment-specific initialization required by your collector. For example, you could use the FieldCache API, described in section 5.1, to retrieve values corresponding to the provided IndexReader.
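The docBase arithmetic can be illustrated without Lucene at all. The sketch below is a simplified model, not Lucene’s API: SimpleCollector, GlobalIdCollector, and SegmentDriver are hypothetical stand-ins invented for this note, and each segment is represented only by its document count and an array of matching segment-relative docIDs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the two Collector callbacks discussed above.
interface SimpleCollector {
    void setNextSegment(int docBase);  // models setNextReader(context) and context.docBase
    void collect(int segmentDocId);    // models collect(int docID)
}

// Records global docIDs by adding the current segment's docBase.
class GlobalIdCollector implements SimpleCollector {
    private final List<Integer> globalIds = new ArrayList<Integer>();
    private int docBase;

    @Override
    public void setNextSegment(int docBase) {
        this.docBase = docBase;
    }

    @Override
    public void collect(int segmentDocId) {
        globalIds.add(docBase + segmentDocId);
    }

    List<Integer> getGlobalIds() {
        return globalIds;
    }
}

// Models the search loop: one segment at a time, where each segment's
// docBase is the total number of documents in all earlier segments.
class SegmentDriver {
    static void run(int[] segmentSizes, int[][] matchesPerSegment, SimpleCollector collector) {
        int docBase = 0;
        for (int seg = 0; seg < segmentSizes.length; seg++) {
            collector.setNextSegment(docBase);
            for (int localId : matchesPerSegment[seg]) {
                collector.collect(localId);
            }
            docBase += segmentSizes[seg];
        }
    }
}
```

With three segments of 5, 3, and 4 documents and segment-relative matches {0, 4}, {1}, and {0, 3}, the collector ends up holding the global IDs 0, 4, 6, 8, and 11.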

Note that the relevance score isn’t passed to the collect method. This saves wasted CPU for Collectors that don’t require it. Instead, Lucene calls the setScorer(Scorer) method on the Collector, once per segment in the index, to provide a Scorer instance. You should hold onto this Scorer, if needed, and then retrieve the relevance score of the currently matched document by calling Scorer.score(). That method must be called from within the collect method because it holds volatile data specific to the current docID being collected. Note that Scorer.score() will recompute the score every time, so if your collect method needs the score more than once, call it once and reuse the returned value. Alternatively, Lucene provides the ScoreCachingWrapperScorer, which is a Scorer implementation that caches the score per document. Note also that Scorer is a rich and advanced API in and of itself, but in this context you should only use the score method.
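To make the caching idea concrete, here is a minimal sketch in plain Java. It is not the code of ScoreCachingWrapperScorer; ScoreSource, CountingScoreSource, and CachingScoreSource are hypothetical names used only to show the pattern of computing a score once per document and reusing it.

```java
// Hypothetical stand-in for the single Scorer method a collector should use.
interface ScoreSource {
    float score();
}

// Counts how often the (potentially expensive) score computation runs.
class CountingScoreSource implements ScoreSource {
    private final float value;
    int calls = 0;

    CountingScoreSource(float value) {
        this.value = value;
    }

    @Override
    public float score() {
        calls++;
        return value;
    }
}

// Computes the score at most once per document and reuses it afterwards.
class CachingScoreSource implements ScoreSource {
    private final ScoreSource delegate;
    private boolean cached = false;
    private float cachedScore;

    CachingScoreSource(ScoreSource delegate) {
        this.delegate = delegate;
    }

    // Call when moving on to the next matching document; drops the cached value.
    void nextDoc() {
        cached = false;
    }

    @Override
    public float score() {
        if (!cached) {
            cachedScore = delegate.score();
            cached = true;
        }
        return cachedScore;
    }
}
```

Calling score() twice on the wrapper for the same document triggers only one call on the underlying source; after nextDoc() the next score() call recomputes.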

The final method, acceptsDocsOutOfOrder(), which returns a boolean, is invoked by Lucene to see whether your Collector can tolerate docIDs that arrive out of sorted order. Many collectors can, but some collectors either can’t accept docIDs out of order, or would have to do too much extra work. If possible, you should return true, because certain BooleanQuery instances can use a faster scorer under the hood if given this freedom.

Let’s look at two example custom Collectors: BookLinkCollector and AllDocCollector.

Custom collector: BookLinkCollector
We’ve developed a custom Collector, called BookLinkCollector, which builds a map of all unique URLs and the corresponding book titles matching a query. BookLinkCollector is shown in listing 6.4.
- Listing 6.4 Custom Collector: collects all book links
    package demo.ch6;

    import java.io.IOException;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    public class BookLinkCollector extends Collector {
        // Maps each matching book's URL to its title.
        private final Map<String, String> documents = new HashMap<String, String>();
        private Scorer scorer;
        private IndexReader reader;  // reader for the current segment

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }

        @Override
        public void setScorer(Scorer scorer) {
            this.scorer = scorer;
        }

        @Override
        public void setNextReader(AtomicReaderContext context) throws IOException {
            // docBase is not needed: documents are loaded through the per-segment reader.
            reader = context.reader();
        }

        @Override
        public void collect(int docID) throws IOException {
            // docID is relative to the current segment, which is what this reader expects.
            Document doc = reader.document(docID);
            String url = doc.get("url");
            String title = doc.get("title2");
            documents.put(url, title);
            // Demo output only; avoid printing in collect() in real code, since it is the hot spot.
            System.out.println(title + ":" + scorer.score());
        }

        public Map<String, String> getLinks() {
            return Collections.unmodifiableMap(documents);
        }
    }
The collector differs from Lucene’s normal search result collection in that it does not retain the matching document IDs. Instead, for each matching document, it adds a mapping of URL to title into its private map, then makes that map available after the search completes. For this reason, even though we are passed the docBase in setNextReader, there’s no need to add it to the docID: the url and title are loaded through the segment’s own reader, which expects the per-segment document ID. Using our custom Collector requires the variant of IndexSearcher’s search method that accepts a Collector, as shown in listing 6.5.
- Listing 6.5 Testing the BookLinkCollector 
    public void testCollecting() throws Exception {
        BookLinkCollector collector = new BookLinkCollector();
        searcher.search(query, collector);
        Map<String, String> linkMap = collector.getLinks();
        assertEquals("AntInAction",
                linkMap.get("http://www.manning.com/loughran/"));
    }
During the search, Lucene delivers each matching docID to our collector; after the search finishes, we confirm that the link map created by the collector contains the right mapping for “ant in action.”

AllDocCollector
Sometimes you’d like to simply record every single matching document for a search, and you know the number of matches won’t be very large. Listing 6.6 shows a simple class, AllDocCollector, to do just that.
- Listing 6.6 A collector that gathers all matching documents and scores into a List
    package demo.ch6;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.Scorer;

    public class AllDocCollector extends Collector {
        // Every matching document, stored with its global docID and score.
        private final List<ScoreDoc> docs = new ArrayList<ScoreDoc>();
        private Scorer scorer;
        private int docBase;

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true;
        }

        @Override
        public void setScorer(Scorer scorer) {
            this.scorer = scorer;
        }

        @Override
        public void setNextReader(AtomicReaderContext context) {
            // Remember the segment's offset so collect() can compute global docIDs.
            this.docBase = context.docBase;
        }

        @Override
        public void collect(int doc) throws IOException {
            // doc is segment-relative; adding docBase makes it global.
            docs.add(new ScoreDoc(doc + docBase, scorer.score()));
        }

        public void reset() {
            docs.clear();
        }

        public List<ScoreDoc> getHits() {
            return docs;
        }
    }
You simply instantiate it, pass it to the search, and use the getHits() method to retrieve all hits. But note that the resulting docIDs might be out of sorted order because acceptsDocsOutOfOrder() returns true. If this is a problem, just change it to return false.
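If you would rather keep accepting out-of-order docIDs during the search, another option is to sort the collected hits afterward. The following is a sketch under assumptions, not part of the original listing: Hit is a hypothetical stand-in that mirrors the public doc and score fields of Lucene’s ScoreDoc so the example runs without the Lucene jar, and HitSorter is an illustrative helper name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in mirroring ScoreDoc's public doc and score fields.
class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

class HitSorter {
    // Returns a new list ordered by ascending docID; the input list is left untouched.
    static List<Hit> sortedByDoc(List<Hit> hits) {
        List<Hit> copy = new ArrayList<Hit>(hits);
        Collections.sort(copy, new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                return Integer.compare(a.doc, b.doc);
            }
        });
        return copy;
    }
}
```

Sorting once after the search keeps the collect method minimal, at the cost of an extra pass over the results.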

As you’ve seen, creating a custom Collector is quite simple. Lucene passes you the docIDs that match and you’re free to do what you want with them. We created one collector that populates a map, discarding the matching document IDs, and another that gathers all matching documents. The possibilities are endless!

Supplement
Lucene 4.x Change Log
LUCENE-2380: The String-based FieldCache methods (getStrings, getStringIndex) have been replaced with BytesRef-based equivalents (getTerms, getTermsIndex). Also, the sort values (returned in FieldDoc.fields) when sorting by SortField.STRING or SortField.STRING_VAL are now BytesRef instances.

