程式扎記: [ InAction Note ] Ch6. Extending search - Using a custom sort method

標籤

2014年7月13日 星期日

[ InAction Note ] Ch6. Extending search - Using a custom sort method

Preface:
If sorting by score, ID, or field values is insufficient for your needs, Lucene lets you implement a custom sorting mechanism by providing your own subclass of theFieldComparatorSource abstract base class. Custom sorting implementations are most useful in situations when the sort criteria can’t be determined during indexing.

For this section we’ll create a custom sort that orders search results based on geographic distance from a given location. The given location is only known at search time, and could, for example, be the geographic location of the user doing the search if the user is searching from a mobile device with an embedded global positioning service (GPS). First we show the required steps at indexing time. Next we’ll describe how to implement the custom sort during searching. Finally, you’ll learn how to access field values involved in the sorting for presentation purposes.

Indexing documents for geographic sorting
We’ve created a simplified demonstration of this concept using the important question, “What Mexican food restaurant is nearest to me?” Figure 6.1 shows a sample of restaurants and their fictitious grid coordinates on a sample 10 x 10 grid. Note that Lucene now includes the “spatial” package in the contrib modules, described in section 9.7, for filtering and sorting according to geographic distance in general.


The test data is indexed as shown in listing 6.1, with each place given a name, location in X and Y coordinates, and a type. The type field allows our data to accommodate other types of businesses and could allow us to filter search results to specific types of places.
- Listing 6.1 Indexing geographic data
  1. package demo.ch6;  
  2.   
  3. import java.io.IOException;  
  4.   
  5. import junit.framework.TestCase;  
  6.   
  7. import org.apache.lucene.analysis.standard.StandardAnalyzer;  
  8. import org.apache.lucene.document.Document;  
  9. import org.apache.lucene.document.Field;  
  10. import org.apache.lucene.document.FieldType;  
  11. import org.apache.lucene.index.DirectoryReader;  
  12. import org.apache.lucene.index.IndexReader;  
  13. import org.apache.lucene.index.IndexWriter;  
  14. import org.apache.lucene.index.IndexWriterConfig;  
  15. import org.apache.lucene.index.Term;  
  16. import org.apache.lucene.search.IndexSearcher;  
  17. import org.apache.lucene.search.Query;  
  18. import org.apache.lucene.search.TermQuery;  
  19. import org.apache.lucene.store.RAMDirectory;  
  20. import org.apache.lucene.util.Version;  
  21.   
  22. public class DistanceSortingTest extends TestCase {  
  23.     private RAMDirectory directory;  
  24.     private IndexSearcher searcher;  
  25.     private Query query;  
  26.     private Version VER = Version.LUCENE_4_9;  
  27.     private FieldType ntFieldType;  
  28.   
  29.     @Override  
  30.     protected void setUp() throws Exception {  
  31.         directory = new RAMDirectory();  
  32.         ntFieldType = new FieldType();  
  33.         ntFieldType.setIndexed(true);  
  34.         ntFieldType.setStored(true);  
  35.         IndexWriterConfig iwConfig = new IndexWriterConfig(VER, new StandardAnalyzer(VER));  
  36.         IndexWriter writer = new IndexWriter(directory, iwConfig);  
  37.         addPoint(writer, "El Charro""restaurant"12);  
  38.         addPoint(writer, "Cafe Poca Cosa""restaurant"59);  
  39.         addPoint(writer, "Los Betos""restaurant"96);  
  40.         addPoint(writer, "Nico's Taco Shop""restaurant"38);  
  41.         writer.close();  
  42.         IndexReader reader = DirectoryReader.open(directory);  
  43.         searcher = new IndexSearcher(reader);  
  44.         query = new TermQuery(new Term("type""restaurant"));            
  45.     }  
  46.   
  47.     private void addPoint(IndexWriter writer, String name, String type, int x, int y) throws IOException {  
  48.         Document doc = new Document();  
  49.         //System.out.printf("%s\n", ntFieldType);  
  50.         doc.add(new Field("name", name, ntFieldType));  
  51.         doc.add(new Field("type", type, ntFieldType));  
  52.         doc.add(new IntField("x", x, Field.Store.YES));  
  53.         doc.add(new IntField("y", y, Field.Store.YES));  
  54.         doc.add(new Field("location", x + "," + y, ntFieldType));  
  55.         writer.addDocument(doc);  
  56.     }  
  57. }  
The location could be encoded in numerous ways, but we opted for the simplest approach for this example.


Implementing custom geographic sort
Before we delve into the class that performs our custom sort, let’s look at the test case that we’re using to confirm that it’s working correctly:
  1. public void testNearestRestaurantToHome() throws Exception {  
  2.       Sort sort = new Sort(new SortField("location"new DistanceComparatorSource(00)));  
  3.       TopDocs hits = searcher.search(query, null10, sort);  
  4.       //TopDocs hits = searcher.search(query, 10);  
  5.       System.out.printf("\t[Test] Hit %d...\n", hits.totalHits);  
  6.       assertTrue(hits.totalHits>0);  
  7.       assertEquals("closest",  
  8.                    "El Charro",  
  9.                    searcher.doc(hits.scoreDocs[0].doc).get("name"));  
  10.       assertEquals("furthest",  
  11.                    "Los Betos",  
  12.                    searcher.doc(hits.scoreDocs[3].doc).get("name"));  
  13. }  
Home is at coordinates (0,0). Our test has shown that the first and last documents in the returned results are the ones closest and furthest from home. Muy bien! Had we not used a sort, the documents would’ve been returned in insertion order, because the score of each hit is equivalent for the restaurant-type query. The distance computation, using the basic distance formula, is done under our custom DistanceComparatorSource, shown in listing 6.2.
- Listing 6.2 DistanceComparatorSource
  1. package demo.ch6;  
  2.   
  3. import java.io.IOException;  
  4.   
  5. import org.apache.lucene.index.AtomicReader;  
  6. import org.apache.lucene.index.AtomicReaderContext;  
  7. import org.apache.lucene.search.FieldCache;  
  8. import org.apache.lucene.search.FieldComparator;  
  9. import org.apache.lucene.search.FieldComparatorSource;  
  10.   
  11. // https://code.google.com/p/fattomato/source/browse/trunk/+fattomato/lucene3/test/com/lotus/lucene/distance/correct/DistanceComparatorSource.java?spec=svn140&r=140  
  12. public class DistanceComparatorSource extends FieldComparatorSource { // #1  
  13.     private int x;  
  14.     private int y;  
  15.   
  16.     public DistanceComparatorSource(int x, int y) { // #2  
  17.         this.x = x;  
  18.         this.y = y;  
  19.     }  
  20.   
  21.     @Override  
  22.     public FieldComparator<Float> newComparator(java.lang.String fieldName, int numHits, int sortPos, boolean reversed) throws IOException {  
  23.         return new DistanceScoreDocLookupComparator(fieldName, numHits);  
  24.     }  
  25.   
  26.     private class DistanceScoreDocLookupComparator extends FieldComparator<Float> {         
  27.         private FieldCache.Ints xDocs;  
  28.         private FieldCache.Ints yDocs;  
  29.         private float[] values; // #6  
  30.         private float bottom; // #7  
  31.         private float topVal;  
  32.         private String fieldName;  
  33.   
  34.         public DistanceScoreDocLookupComparator(String fieldName, int numHits) throws IOException {  
  35.             values = new float[numHits];  
  36.             this.fieldName = fieldName;  
  37.             System.out.printf("\t[Test] FieldName=%s; numHits=%d\n", fieldName, numHits);  
  38.         }  
  39.   
  40.         private float getDistance(int doc) { // #9                
  41.             int deltax = xDocs.get(doc) - x; // #9            
  42.             int deltay = yDocs.get(doc) - y; // #9            
  43.             return (float) Math.sqrt(deltax * deltax + deltay * deltay); // #9  
  44.         }  
  45.           
  46.         @Override  
  47.         public int compare(int slot1, int slot2) { // #10  
  48.             // Compare a hit at 'slot1' with hit 'slot2'.  
  49.             return Float.valueOf(values[slot1]).compareTo(values[slot2]);             
  50.         }  
  51.   
  52.         @Override  
  53.         public void setBottom(int slot) { // #11              
  54.             bottom = values[slot];  
  55.         }  
  56.   
  57.         @Override  
  58.         public int compareBottom(int doc) { // #12  
  59.             // Compare a new hit (docID) against the "weakest" (bottom) entry in the queue.  
  60.             float docDistance = getDistance(doc);  
  61.             if (bottom < docDistance)  
  62.                 return -1;  
  63.             if (bottom > docDistance)  
  64.                 return 1;  
  65.             return 0;  
  66.         }  
  67.   
  68.         @Override  
  69.         public void copy(int slot, int doc) {  
  70.             // Installs a new hit into the priority queue.   
  71.             // The FieldValueHitQueue calls this method when a new hit is competitive.  
  72.             values[slot] = getDistance(doc);              
  73.         }  
  74.   
  75.         @Override  
  76.         public Float value(int slot) { // #14                     
  77.             return values[slot]; // #14  
  78.         }  
  79.   
  80.         @Override  
  81.         public int compareTop(int doc) throws IOException {  
  82.             // Compare a new hit (docID) against the top value previously set by  
  83.             // a call to setTopValue(T).  
  84.             float docDistance = getDistance(doc);  
  85.             if (topVal < docDistance)  
  86.                 return -1;  
  87.             if (topVal > docDistance)  
  88.                 return 1;  
  89.             return 0;  
  90.         }  
  91.   
  92.         @Override  
  93.         public FieldComparator<Float> setNextReader(AtomicReaderContext aRC) throws IOException {           
  94.             // Invoked when the search is switching to the next segment. You may need to update internal state of the comparator,   
  95.             // for example retrieving new values from the FieldCache.  
  96.             try  
  97.             {  
  98.                 AtomicReader ar = aRC.reader();   
  99.                 System.out.printf("\t[Test] setNextReader...\n");  
  100.                 xDocs = FieldCache.DEFAULT.getInts(ar, "x"false);  
  101.                 yDocs = FieldCache.DEFAULT.getInts(ar, "y"false);  
  102.             }  
  103.             catch(Exception e){e.printStackTrace();}  
  104.             return this;  
  105.         }  
  106.   
  107.         @Override  
  108.         public void setTopValue(Float top) {                  
  109.             topVal=top;  
  110.         }  
  111.     }  
  112.   
  113.     public String toString() {  
  114.         return "Distance from (" + x + "," + y + ")";  
  115.     }  
  116. }  
The sorting infrastructure within Lucene interacts with the FieldComparatorSource and FieldComparator in order to sort matching documents. For performance reasons, this API is more complex than you’d otherwise expect. In particular, the comparator is made aware of the size of the queue (passed as the numHits argument tonewComparator) being tracked within Lucene. In addition, the comparator is notified every time a new segment is searched (with the setNextReader method).

Sorting by runtime information such as a user’s location is an incredibly powerful feature. At this point, though, we still have a missing piece: what’s the distance from each of the restaurants to our current location? When using the TopDocs-returning search methods, we can’t get to the distance computed. But a lower-level API lets us access the values used for sorting.

Accessing values used in custom sorting
The IndexSearcher.search method you use when sorting, covered in section 5.2, returns more information than the top documents:
  1. public TopFieldDocs search(Query query, Filter filter, int nDocs, Sort sort)  
TopFieldDocs is a subclass of TopDocs that adds the values used for sorting each hit. The values are available via each FieldDoc, which subclasses ScoreDoc, contained in the array of returned results. FieldDoc encapsulates the computed raw score, document ID, and an array of Comparables with the value used for each ScoreDoc. Rather than concerning ourselves with the details of the API, which you can get from Lucene’s Javadocs or the source code, let’s see how to use it.

Listing 6.3’s test case demonstrates the use of TopFieldDocs and FieldDoc to retrieve the distance computed during sorting.
- Listing 6.3 Accessing custom sorting values for search results
  1. public void testNeareastRestaurantToWork() throws Exception {  
  2.     Sort sort = new Sort(new SortField("location"new DistanceComparatorSource(1010)));  
  3.     TopFieldDocs docs = searcher.search(query, null3, sort);  // 1)  
  4.     assertEquals(4, docs.totalHits);                            // 2)  
  5.     assertEquals(3, docs.scoreDocs.length);                     // 3)  
  6.     FieldDoc fieldDoc = (FieldDoc) docs.scoreDocs[0];           // 4)  
  7.           
  8.     assertEquals("(10,10) -> (9,6) = sqrt(17)",   
  9.                  new Float(Math.sqrt(17)),   
  10.                  fieldDoc.fields[0]);                           // 5)   
  11.     Document document = searcher.doc(fieldDoc.doc);             // 6)  
  12.     assertEquals("Los Betos", document.get("name"));  
  13. }  
(1) This lower-level API requires that we specify the maximum number of hits returned.
(2) The total number of hits is still provided because all hits need to be determined to find the three best ones.
(3) The total number of documents (up to the maximum specified) are returned.
(4) docs.scoreDocs(0) returns a ScoreDoc and must be cast to FieldDoc to get sorting values.
(5) The value of the first (and only, in this example) SortField computation is available in the first fields slot.
(6) Getting the actual Document requires another call.

This message was edited 20 times. Last update was at 13/07/2014 19:43:22

沒有留言:

張貼留言

網誌存檔

關於我自己

我的相片
Where there is a will, there is a way!