Preface:
If sorting by score, ID, or field values is insufficient for your needs, Lucene lets you implement a custom sorting mechanism by providing your own subclass of the FieldComparatorSource abstract base class. Custom sorting implementations are most useful when the sort criteria can’t be determined during indexing.
In this section we’ll create a custom sort that orders search results by geographic distance from a given location. That location is known only at search time; it could, for example, be the position of a user who is searching from a mobile device with an embedded Global Positioning System (GPS) receiver. First we show the required steps at indexing time. Next we’ll describe how to implement the custom sort during searching. Finally, you’ll learn how to access the field values involved in the sorting for presentation purposes.
Indexing documents for geographic sorting
We’ve created a simplified demonstration of this concept using the important question, “What Mexican food restaurant is nearest to me?” Figure 6.1 shows a sample of restaurants and their fictitious coordinates on a 10 x 10 grid. Note that Lucene now includes the “spatial” package in the contrib modules, described in section 9.7, for filtering and sorting according to geographic distance in general.
The test data is indexed as shown in listing 6.1, with each place given a name, location in X and Y coordinates, and a type. The type field allows our data to accommodate other types of businesses and could allow us to filter search results to specific types of places.
- Listing 6.1 Indexing geographic data
- package demo.ch6;
- import java.io.IOException;
- import junit.framework.TestCase;
- import org.apache.lucene.analysis.standard.StandardAnalyzer;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.document.Field;
- import org.apache.lucene.document.FieldType;
- import org.apache.lucene.document.IntField;
- import org.apache.lucene.index.DirectoryReader;
- import org.apache.lucene.index.IndexReader;
- import org.apache.lucene.index.IndexWriter;
- import org.apache.lucene.index.IndexWriterConfig;
- import org.apache.lucene.index.Term;
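- import org.apache.lucene.search.FieldDoc;
- import org.apache.lucene.search.IndexSearcher;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.Sort;
- import org.apache.lucene.search.SortField;
- import org.apache.lucene.search.TermQuery;
- import org.apache.lucene.search.TopDocs;
- import org.apache.lucene.search.TopFieldDocs;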
- import org.apache.lucene.store.RAMDirectory;
- import org.apache.lucene.util.Version;
- public class DistanceSortingTest extends TestCase {
- private RAMDirectory directory;
- private IndexSearcher searcher;
- private Query query;
- private Version VER = Version.LUCENE_4_9;
- private FieldType ntFieldType;
- @Override
- protected void setUp() throws Exception {
- directory = new RAMDirectory();
- ntFieldType = new FieldType();
- ntFieldType.setIndexed(true);
- ntFieldType.setStored(true);
- IndexWriterConfig iwConfig = new IndexWriterConfig(VER, new StandardAnalyzer(VER));
- IndexWriter writer = new IndexWriter(directory, iwConfig);
- addPoint(writer, "El Charro", "restaurant", 1, 2);
- addPoint(writer, "Cafe Poca Cosa", "restaurant", 5, 9);
- addPoint(writer, "Los Betos", "restaurant", 9, 6);
- addPoint(writer, "Nico's Taco Shop", "restaurant", 3, 8);
- writer.close();
- IndexReader reader = DirectoryReader.open(directory);
- searcher = new IndexSearcher(reader);
- query = new TermQuery(new Term("type", "restaurant"));
- }
- private void addPoint(IndexWriter writer, String name, String type, int x, int y) throws IOException {
- Document doc = new Document();
- // name, type, and location are indexed and stored as text
- doc.add(new Field("name", name, ntFieldType));
- doc.add(new Field("type", type, ntFieldType));
- // x and y are indexed as ints so the comparator can load them through the FieldCache
- doc.add(new IntField("x", x, Field.Store.YES));
- doc.add(new IntField("y", y, Field.Store.YES));
- doc.add(new Field("location", x + "," + y, ntFieldType));
- writer.addDocument(doc);
- }
- }
The location could be encoded in numerous ways, but we opted for the simplest approach for this example.
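The listing stores each location as the plain string x + "," + y. As a minimal sketch of how that string could be read back for display, the hypothetical GridLocation helper below (not part of the original listings, and using no Lucene classes) round-trips the encoding. The class name and the strictness of the parsing are assumptions made for illustration.

```java
// Hypothetical helper, not part of the listings: round-trips the "x,y"
// string that the indexing code stores in the "location" field.
final class GridLocation {
    final int x;
    final int y;

    GridLocation(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Same encoding the indexing code uses: x + "," + y
    String encode() {
        return x + "," + y;
    }

    // Parses "x,y" back into coordinates and rejects anything else.
    static GridLocation decode(String stored) {
        // limit -1 keeps trailing empty parts, so "9,6," is rejected too
        String[] parts = stored.split(",", -1);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected \"x,y\" but got: " + stored);
        }
        return new GridLocation(Integer.parseInt(parts[0].trim()),
                                Integer.parseInt(parts[1].trim()));
    }

    public static void main(String[] args) {
        GridLocation losBetos = GridLocation.decode("9,6");
        System.out.println(losBetos.x + " " + losBetos.y);
        System.out.println(new GridLocation(1, 2).encode());
    }
}
```

The custom sort itself does not need this string: it reads the numeric x and y fields, which is why the indexing code stores both forms.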
Implementing custom geographic sort
Before we delve into the class that performs our custom sort, let’s look at the test case that we’re using to confirm that it’s working correctly:
- public void testNearestRestaurantToHome() throws Exception {
- Sort sort = new Sort(new SortField("location", new DistanceComparatorSource(0, 0)));
- TopDocs hits = searcher.search(query, null, 10, sort);
- System.out.printf("\t[Test] Hit %d...\n", hits.totalHits);
- assertTrue(hits.totalHits > 0);
- assertEquals("closest",
- "El Charro",
- searcher.doc(hits.scoreDocs[0].doc).get("name"));
- assertEquals("furthest",
- "Los Betos",
- searcher.doc(hits.scoreDocs[3].doc).get("name"));
- }
Home is at coordinates (0,0). Our test has shown that the first and last documents in the returned results are the ones closest to and furthest from home. Muy bien! Had we not used a sort, the documents would’ve been returned in insertion order, because the score of each hit is equivalent for the restaurant-type query. The distance computation, using the basic distance formula, is performed by our custom DistanceComparatorSource, shown in listing 6.2.
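To check by hand the order the test expects, the following plain-Java sketch computes the Euclidean distance from a given point to each of the four restaurants and sorts the names by it. It uses no Lucene classes; the DistanceCheck class and its method names are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java check of the expected order; no Lucene classes are involved.
final class DistanceCheck {
    static final String[] NAMES = {"El Charro", "Cafe Poca Cosa", "Los Betos", "Nico's Taco Shop"};
    static final int[][] POINTS = {{1, 2}, {5, 9}, {9, 6}, {3, 8}};

    // Euclidean distance between (x1, y1) and (x2, y2).
    static double distance(int x1, int y1, int x2, int y2) {
        int dx = x1 - x2;
        int dy = y1 - y2;
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Restaurant names ordered from nearest to furthest from (homeX, homeY).
    static List<String> sortedByDistance(int homeX, int homeY) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < NAMES.length; i++) {
            order.add(i);
        }
        order.sort(Comparator.comparingDouble(
                i -> distance(POINTS[i][0], POINTS[i][1], homeX, homeY)));
        List<String> names = new ArrayList<>();
        for (int i : order) {
            names.add(NAMES[i]);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(sortedByDistance(0, 0));
    }
}
```

From (0,0) the squared distances are 5 for El Charro, 73 for Nico's Taco Shop, 106 for Cafe Poca Cosa, and 117 for Los Betos, which matches the closest and furthest names asserted in the test.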
- Listing 6.2 DistanceComparatorSource
- package demo.ch6;
- import java.io.IOException;
- import org.apache.lucene.index.AtomicReader;
- import org.apache.lucene.index.AtomicReaderContext;
- import org.apache.lucene.search.FieldCache;
- import org.apache.lucene.search.FieldComparator;
- import org.apache.lucene.search.FieldComparatorSource;
- // https://code.google.com/p/fattomato/source/browse/trunk/+fattomato/lucene3/test/com/lotus/lucene/distance/correct/DistanceComparatorSource.java?spec=svn140&r=140
- public class DistanceComparatorSource extends FieldComparatorSource { // #1
- private int x;
- private int y;
- public DistanceComparatorSource(int x, int y) { // #2
- this.x = x;
- this.y = y;
- }
- @Override
- public FieldComparator<Float> newComparator(String fieldName, int numHits, int sortPos, boolean reversed) throws IOException {
- return new DistanceScoreDocLookupComparator(fieldName, numHits);
- }
- private class DistanceScoreDocLookupComparator extends FieldComparator<Float> {
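- private FieldCache.Ints xDocs; // x values for the current segment
- private FieldCache.Ints yDocs; // y values for the current segment
- private float[] values; // distance cached for each queue slot
- private float bottom; // distance of the weakest entry in the queue
- private float topVal; // value supplied through setTopValue
- private String fieldName;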
- public DistanceScoreDocLookupComparator(String fieldName, int numHits) throws IOException {
- values = new float[numHits];
- this.fieldName = fieldName;
- System.out.printf("\t[Test] FieldName=%s; numHits=%d\n", fieldName, numHits);
- }
- private float getDistance(int doc) {
- // Euclidean distance from (x, y) to the coordinates indexed for this document
- int deltax = xDocs.get(doc) - x;
- int deltay = yDocs.get(doc) - y;
- return (float) Math.sqrt(deltax * deltax + deltay * deltay);
- }
- @Override
- public int compare(int slot1, int slot2) { // #10
- // Compare the hit in slot1 with the hit in slot2 by cached distance.
- return Float.compare(values[slot1], values[slot2]);
- }
- @Override
- public void setBottom(int slot) { // #11
- bottom = values[slot];
- }
- @Override
- public int compareBottom(int doc) { // #12
- // Compare the weakest (bottom) entry in the queue against a new hit.
- // Float.compare keeps the result consistent with compare(slot1, slot2).
- return Float.compare(bottom, getDistance(doc));
- }
- @Override
- public void copy(int slot, int doc) {
- // Installs a new hit into the priority queue.
- // The FieldValueHitQueue calls this method when a new hit is competitive.
- values[slot] = getDistance(doc);
- }
- @Override
- public Float value(int slot) { // #14
- return values[slot]; // #14
- }
- @Override
- public int compareTop(int doc) throws IOException {
- // Compare the top value previously set by setTopValue(T) against a new hit.
- return Float.compare(topVal, getDistance(doc));
- }
- @Override
- public FieldComparator<Float> setNextReader(AtomicReaderContext context) throws IOException {
- // Invoked when the search switches to the next segment: reload the
- // per-segment x and y values from the FieldCache. An IOException is
- // propagated to the caller instead of being swallowed.
- AtomicReader reader = context.reader();
- xDocs = FieldCache.DEFAULT.getInts(reader, "x", false);
- yDocs = FieldCache.DEFAULT.getInts(reader, "y", false);
- return this;
- }
- @Override
- public void setTopValue(Float top) {
- topVal=top;
- }
- }
- @Override
- public String toString() {
- return "Distance from (" + x + "," + y + ")";
- }
- }
The sorting infrastructure within Lucene interacts with the FieldComparatorSource and FieldComparator in order to sort matching documents. For performance reasons, this API is more complex than you’d otherwise expect. In particular, the comparator is made aware of the size of the queue (passed as the numHits argument to newComparator) being tracked within Lucene. In addition, the comparator is notified every time a new segment is searched (with the setNextReader method).
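To make the roles of copy, setBottom, compareBottom, and compare in listing 6.2 more concrete, here is a simplified model of how a caller might drive a slot-based comparator while keeping only the numHits best hits. It is a sketch under stated assumptions, not Lucene's collector or priority queue: the SlotProtocolSketch and DistanceSlots names are hypothetical, a plain float array stands in for the FieldCache, and the weakest slot is found with a linear scan.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of the slot protocol. It is NOT Lucene's
// collector or priority queue: the weakest slot is found by a linear scan.
final class SlotProtocolSketch {

    // Same method shapes as the comparator in listing 6.2, minus the Lucene types.
    static final class DistanceSlots {
        private final float[] values;        // one cached distance per queue slot
        private final float[] docDistances;  // stands in for the per-segment FieldCache lookup
        private float bottom;                // distance held by the weakest slot

        DistanceSlots(int numHits, float[] docDistances) {
            this.values = new float[numHits];
            this.docDistances = docDistances;
        }

        int compare(int slot1, int slot2) { return Float.compare(values[slot1], values[slot2]); }
        void setBottom(int slot) { bottom = values[slot]; }
        int compareBottom(int doc) { return Float.compare(bottom, docDistances[doc]); }
        void copy(int slot, int doc) { values[slot] = docDistances[doc]; }
        float value(int slot) { return values[slot]; }
    }

    // Returns the numHits smallest distances in ascending order.
    static float[] topDistances(float[] docDistances, int numHits) {
        if (numHits < 1) {
            throw new IllegalArgumentException("numHits must be at least 1");
        }
        DistanceSlots cmp = new DistanceSlots(numHits, docDistances);
        int filled = 0;
        int bottomSlot = -1;
        for (int doc = 0; doc < docDistances.length; doc++) {
            if (filled < numHits) {
                cmp.copy(filled++, doc);       // queue not full: use the next free slot
            } else if (cmp.compareBottom(doc) > 0) {
                cmp.copy(bottomSlot, doc);     // closer than the weakest entry: reuse its slot
            } else {
                continue;                      // not competitive: no slot is touched
            }
            if (filled == numHits) {           // queue is full: find the new weakest slot
                bottomSlot = 0;
                for (int slot = 1; slot < filled; slot++) {
                    if (cmp.compare(slot, bottomSlot) > 0) {
                        bottomSlot = slot;
                    }
                }
                cmp.setBottom(bottomSlot);
            }
        }
        List<Integer> slots = new ArrayList<>();
        for (int slot = 0; slot < filled; slot++) {
            slots.add(slot);
        }
        slots.sort(cmp::compare);              // order the surviving slots by distance
        float[] result = new float[filled];
        for (int i = 0; i < filled; i++) {
            result[i] = cmp.value(slots.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Approximate distances of the four restaurants from (10,10), in insertion order.
        float[] distances = {12.04f, 5.10f, 4.12f, 7.28f};
        System.out.println(Arrays.toString(topDistances(distances, 3)));
    }
}
```

The point of the design is that a non-competitive document costs a single compareBottom call and never touches the values array, which is why the comparator is told the queue size up front.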
Sorting by runtime information such as a user’s location is an incredibly powerful feature. At this point, though, we still have a missing piece: what’s the distance from each of the restaurants to our current location? When using the TopDocs-returning search methods, we can’t get to the distance computed. But a lower-level API lets us access the values used for sorting.
Accessing values used in custom sorting
The IndexSearcher.search method you use when sorting, covered in section 5.2, returns more information than the top documents:
- public TopFieldDocs search(Query query, Filter filter, int nDocs, Sort sort)
TopFieldDocs is a subclass of TopDocs that adds the values used for sorting each hit. The values are available via each FieldDoc, which subclasses ScoreDoc, contained in the array of returned results. FieldDoc encapsulates the computed raw score, the document ID, and an array with the values used to sort that hit. Rather than concerning ourselves with the details of the API, which you can get from Lucene’s Javadocs or the source code, let’s see how to use it.
Listing 6.3’s test case demonstrates the use of TopFieldDocs and FieldDoc to retrieve the distance computed during sorting.
- Listing 6.3 Accessing custom sorting values for search results
- public void testNearestRestaurantToWork() throws Exception {
- Sort sort = new Sort(new SortField("location", new DistanceComparatorSource(10, 10)));
- TopFieldDocs docs = searcher.search(query, null, 3, sort); // request at most 3 hits
- assertEquals(4, docs.totalHits); // all 4 restaurants match the query
- assertEquals(3, docs.scoreDocs.length); // but only the top 3 are returned
- FieldDoc fieldDoc = (FieldDoc) docs.scoreDocs[0]; // each hit is a FieldDoc carrying its sort values
- assertEquals("(10,10) -> (9,6) = sqrt(17)",
- Float.valueOf((float) Math.sqrt(17)),
- fieldDoc.fields[0]); // the distance computed by the comparator
- Document document = searcher.doc(fieldDoc.doc); // load the stored fields of the closest hit
- assertEquals("Los Betos", document.get("name"));
- }
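The idea behind FieldDoc, a hit that carries the values it was sorted by, can be shown without Lucene. The sketch below is only an illustration: the HitWithSortValues class is hypothetical, and its fields array merely mimics the shape of FieldDoc.fields so that the distance can be shown to the user alongside the name.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for the FieldDoc idea: a hit that keeps its sort values.
final class HitWithSortValues {
    final String name;
    final Object[] fields;   // values used to sort this hit, one per sort criterion

    HitWithSortValues(String name, Object[] fields) {
        this.name = name;
        this.fields = fields;
    }

    // Returns up to n hits nearest to (fromX, fromY), each carrying its distance.
    // Assumes n is not negative.
    static List<HitWithSortValues> nearest(String[] names, int[][] points,
                                           int fromX, int fromY, int n) {
        List<HitWithSortValues> hits = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            int dx = points[i][0] - fromX;
            int dy = points[i][1] - fromY;
            Float distance = (float) Math.sqrt(dx * dx + dy * dy);
            hits.add(new HitWithSortValues(names[i], new Object[] {distance}));
        }
        hits.sort(Comparator.comparing(hit -> (Float) hit.fields[0]));
        return new ArrayList<>(hits.subList(0, Math.min(n, hits.size())));
    }

    public static void main(String[] args) {
        String[] names = {"El Charro", "Cafe Poca Cosa", "Los Betos", "Nico's Taco Shop"};
        int[][] points = {{1, 2}, {5, 9}, {9, 6}, {3, 8}};
        for (HitWithSortValues hit : nearest(names, points, 10, 10, 3)) {
            System.out.printf("%s is %.2f units away%n", hit.name, (Float) hit.fields[0]);
        }
    }
}
```

Because the distance travels with the hit, the presentation layer does not have to recompute it after the search returns.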
This message was edited 20 times. Last update was at 13/07/2014 19:43:22