程式扎記: [ InAction Note ] Ch5. Advanced search techniques - Custom scoring using function queries

Saturday, June 22, 2013

[ InAction Note ] Ch5. Advanced search techniques - Custom scoring using function queries

Preface: 
Lucene’s relevance scoring formula, which we discussed in chapter 3, does a great job of assigning relevance to each document based on how well it matches the query. But what if you’d like to modify or override how this scoring is done? In section 5.2 you saw how you can change the default relevance sorting to sort instead by one or more fields, but what if you need even more flexibility? This is where function queries come in. 

Function queries give you the freedom to programmatically assign scores to matching documents using your own logic. All classes are from the org.apache.lucene.search.function package. In this section we first introduce the main classes used by function queries, and then walk through a real-world example of using function queries to boost recently modified documents. 

Function query classes: 
The base class for all function queries is ValueSourceQuery. This is a query that matches all documents but sets the score of each document according to a ValueSource provided during construction. The function package provides FieldCacheSource, and its subclasses, to derive values from the field cache. You can also create your own ValueSource, for example to derive scores from an external database. But probably the simplest approach is to use FieldScoreQuery, which subclasses ValueSourceQuery and derives each document’s score statically from a specific indexed field. The field should be a number, indexed without norms and with a single token per document. Typically you’d use Field.Index.NOT_ANALYZED_NO_NORMS. Let’s look at a simple example. First, include the field “score” in your documents: 
    doc.add(new Field("score",
                      "42",
                      Field.Store.NO,
                      Field.Index.NOT_ANALYZED_NO_NORMS));
Then, create this function query: 
    Query q = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
That query matches all documents, assigning each a score according to the contents of its “score” field. You can also use the SHORT, INT, or FLOAT constants. Under the hood, this function query uses the field cache, so the important trade-offs described in section 5.1 apply. 

Our example is somewhat contrived; you could simply sort by the score field, descending, to achieve the same results. But function queries get more interesting when you combine them using the second type of function query, CustomScoreQuery. This query class lets you combine a normal Lucene query with one or more other function queries. We can now use the FieldScoreQuery we created earlier and a CustomScoreQuery to compute our own score: 
    package ch5;

    import junit.framework.TestCase;

    ...

    public class CustomizedScoreTest extends TestCase {
        private IndexSearcher searcher;
        public static Version LUCENE_VERSION = Version.LUCENE_30;

        @Override
        protected void setUp() throws Exception {
            Calendar calendar = Calendar.getInstance();
            Directory directory = new RAMDirectory();
            IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION, new WhitespaceAnalyzer(LUCENE_VERSION));
            IndexWriter writer = new IndexWriter(directory, iwConfig);

            // doc1: static score 42, published ten days ago
            Document doc1 = new Document();
            doc1.add(new Field("content",
                    "the quick brown fox jumped over the lazy dog",
                    Field.Store.YES, Field.Index.ANALYZED));
            doc1.add(new Field("title", "Test1", Field.Store.YES, Field.Index.ANALYZED));
            NumericField numField = new NumericField("score", Field.Store.NO, true);
            numField.setIntValue(42);
            doc1.add(numField);
            calendar.set(Calendar.DAY_OF_YEAR, calendar.get(Calendar.DAY_OF_YEAR) - 10); // move back ten days
            doc1.add(new NumericField("pubmonthAsDay", Field.Store.YES, true).setIntValue((int) (calendar.getTime().getTime() / (1000 * 3600 * 24))));
            writer.addDocument(doc1);

            // doc2: static score 100, published one day ago
            Document doc2 = new Document();
            doc2.add(new Field("content", "the fast fox hopped over the hound",
                    Field.Store.YES, Field.Index.ANALYZED));
            doc2.add(new Field("title", "Test2", Field.Store.YES, Field.Index.ANALYZED));
            numField = new NumericField("score", Field.Store.NO, true);
            numField.setIntValue(100);
            doc2.add(numField);
            calendar.set(Calendar.DAY_OF_YEAR, calendar.get(Calendar.DAY_OF_YEAR) + 9); // forward nine days from there, i.e. one day ago
            doc2.add(new NumericField("pubmonthAsDay", Field.Store.YES, true).setIntValue((int) (calendar.getTime().getTime() / (1000 * 3600 * 24))));
            writer.addDocument(doc2);
            writer.close();
            IndexReader reader = IndexReader.open(directory);
            searcher = new IndexSearcher(reader);
        }

        @Override
        protected void tearDown() throws Exception {
            searcher.close();
        }

        public void testCustomScore() throws Exception {
            Query query = new QueryParser(Version.LUCENE_30, "content",
                    new StandardAnalyzer(Version.LUCENE_30)).parse("the fast fox");
            // Per-document static score read from the "score" field via the field cache
            FieldScoreQuery qf = new FieldScoreQuery("score", FieldScoreQuery.Type.INT);
            CustomScoreQuery customQ = new CustomScoreQuery(query, qf) {
                public CustomScoreProvider getCustomScoreProvider(IndexReader r) {
                    return new CustomScoreProvider(r) {
                        public float customScore(int doc, float subQueryScore,
                                float valSrcScore) {
                            // Combine the text-relevance score with the static field score
                            return (float) (Math.sqrt(subQueryScore) * valSrcScore);
                        }
                    };
                }
            };

            TopDocs hits = searcher.search(customQ, 10);
            for (ScoreDoc doc : hits.scoreDocs) {
                System.out.printf("DocID=%d; Score=%.02f...\n", doc.doc, doc.score);
            }
        }
    }
In this case we create a normal query, query, by parsing the user’s search text. We next create the same kind of FieldScoreQuery we used earlier (this time with Type.INT) to assign a static score to documents according to the score field. Finally, we create a CustomScoreQuery, overriding the getCustomScoreProvider method to return a class containing the customScore method, which computes our score for each matching document. In this contrived case, we take the square root of the incoming query score and then multiply it by the static score provided by the FieldScoreQuery. You can use arbitrary logic to create your scores. 
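To make the arithmetic concrete, here is a minimal plain-Java sketch of the same combination rule, with no Lucene classes involved. The class name CustomScoreSketch and the sub-query score 0.25 are hypothetical, chosen only for illustration; the field values 42 and 100 are the ones indexed in the test above.

```java
/** Plain-Java model of the customScore logic; not part of Lucene. */
final class CustomScoreSketch {

    /** Square root of the text-query score, multiplied by the per-document field value. */
    static float customScore(float subQueryScore, float valSrcScore) {
        return (float) (Math.sqrt(subQueryScore) * valSrcScore);
    }

    public static void main(String[] args) {
        // 0.25 is a made-up sub-query score; 42 and 100 are the indexed "score" values.
        System.out.println(customScore(0.25f, 42f));   // prints 21.0
        System.out.println(customScore(0.25f, 100f));  // prints 50.0
    }
}
```

With equal text relevance, the document carrying the larger field value wins; the square root simply dampens how much the text score contributes.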

Boosting recently modified documents using function queries: 
A real-world use of CustomScoreQuery is to perform document boosting. You can boost according to any custom criteria, but for our example, shown in listing 5.15, we boost recently modified documents using a new custom query class, RecencyBoostingQuery. In applications where documents have a clear timestamp, such as searching a newsfeed or press releases, boosting by recency can be useful. The class requires you to specify the name of a numeric field that contains the timestamp of each document that you’d like to use for boosting. 
Listing 5.15 Using recency to boost search results 
    package ch5;

    import java.io.IOException;
    import java.util.Date;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.function.CustomScoreProvider;
    import org.apache.lucene.search.function.CustomScoreQuery;

    public class RecencyBoostingQuery extends CustomScoreQuery {
        double multiplier;   // maximum extra boost, applied to a document published today
        int today;           // current date as days since the epoch
        int maxDaysAgo;      // documents this old or older get no boost
        String dayField;     // numeric field holding each document's publish day
        static final int MSEC_PER_DAY = 1000 * 3600 * 24;

        public RecencyBoostingQuery(Query q, double multiplier, int maxDaysAgo,
                String dayField) {
            super(q);
            today = (int) (new Date().getTime() / MSEC_PER_DAY);
            this.multiplier = multiplier;
            this.maxDaysAgo = maxDaysAgo;
            this.dayField = dayField;
        }

        private class RecencyBooster extends CustomScoreProvider {
            final int[] publishDay;

            public RecencyBooster(IndexReader r) throws IOException {
                super(r);
                // Load every document's publish day from the field cache
                publishDay = FieldCache.DEFAULT.getInts(r, dayField);
            }

            public float customScore(int doc, float subQueryScore, float valSrcScore) {
                int daysAgo = today - publishDay[doc];
                if (daysAgo < maxDaysAgo) {
                    // Boost falls off linearly from multiplier (today) to 0 (maxDaysAgo)
                    float boost = (float) (multiplier * (maxDaysAgo - daysAgo) / maxDaysAgo);
                    return (float) (subQueryScore * (1.0 + boost));
                } else {
                    // Too old: keep the original relevance score
                    return subQueryScore;
                }
            }
        }

        public CustomScoreProvider getCustomScoreProvider(IndexReader r)
                throws IOException {
            return new RecencyBooster(r);
        }
    }
In our case, we previously indexed the pubmonthAsDay field, like this: 
    doc.add(new NumericField("pubmonthAsDay")
                 .setIntValue((int) (d.getTime()/(1000*3600*24))));
See section 2.6.2 for options when indexing dates and times. 
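As a rough illustration of what ends up in pubmonthAsDay, the sketch below repeats the same milliseconds-to-days arithmetic in plain Java and converts a stored day count back into a calendar date. The class EpochDaySketch and its method names are hypothetical helpers, not part of Lucene; dates are interpreted in UTC.

```java
import java.time.LocalDate;
import java.util.Date;

/** Hypothetical helper showing the days-since-epoch encoding; not part of Lucene. */
final class EpochDaySketch {
    static final long MSEC_PER_DAY = 1000L * 3600 * 24;

    /** Same arithmetic as the indexing snippet: epoch milliseconds truncated to whole days. */
    static int toEpochDay(Date d) {
        return (int) (d.getTime() / MSEC_PER_DAY);
    }

    /** Turns a stored day count back into a calendar date (UTC). */
    static LocalDate fromEpochDay(int day) {
        return LocalDate.ofEpochDay(day);
    }

    public static void main(String[] args) {
        Date d = new Date(1371859200000L);      // 2013-06-22T00:00:00Z
        int day = toEpochDay(d);
        System.out.println(day);                // prints 15878
        System.out.println(fromEpochDay(day));  // prints 2013-06-22
    }
}
```

Storing whole days rather than milliseconds keeps the value within an int, which is what FieldCache.getInts expects in listing 5.15.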

Once the index is set up, using RecencyBoostingQuery is straightforward, as shown in listing 5.16. 
Listing 5.16 Testing recency boosting 
    public void testRecency() throws Throwable {
        // Track scores even though the search sorts by fields
        searcher.setDefaultFieldSortScoring(true, true);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", new StandardAnalyzer(Version.LUCENE_30));
        Query q = parser.parse("the quick brown fox");
        // Boost by up to 100.0 for documents published within the past 5 days
        Query q2 = new RecencyBoostingQuery(q, 100.0, 5, "pubmonthAsDay");
        // Sort by relevance score first, then by title
        Sort sort = new Sort(new SortField[] { SortField.FIELD_SCORE, new SortField("title", SortField.STRING) });
        TopDocs hits = searcher.search(q2, null, 5, sort);
        for (int i = 0; i < hits.scoreDocs.length; i++) {
            Document doc = searcher.doc(hits.scoreDocs[i].doc);
            System.out.println(hits.scoreDocs[i].doc + ": " + doc.get("title")
                    + ": pubmonth=" + doc.get("pubmonthAsDay") + " score="
                    + hits.scoreDocs[i].score);
        }
    }
We first create a normal query by parsing the search string "the quick brown fox", and then instantiate the RecencyBoostingQuery, giving a boost factor of up to 100.0 for any document published within the past 5 days. Then we run the search, sorting first by relevance score and second by title. Listing 5.16 searches with the boosted query q2; running the same search with the unboosted query q produces this result: 
0: Test1: pubmonth=15868 score=0.4794072
1: Test2: pubmonth=15877 score=0.028801177

If instead you run the search with q2, which boosts each result by recency, you’ll see this: 
1: Test2: pubmonth=15877 score=2.3328953
0: Test1: pubmonth=15868 score=0.4794072

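You can see that in the unboosted query, Test1 ranked well ahead of Test2 on relevance alone. After factoring in recency boosting, the sort order flipped: Test2, published one day ago, had its score multiplied by 1 + 100.0 * (5 - 1) / 5 = 81, whereas Test1, published ten days ago, fell outside the 5-day window and kept its original score. 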
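The boosted score can be reproduced by hand. The following plain-Java sketch copies the formula from RecencyBooster.customScore and feeds it the unboosted scores printed above. The class name RecencyBoostSketch is hypothetical, and the value of today (15878) is an assumption inferred from the test setup, which indexes Test2 one day before the run.

```java
/** Plain-Java model of the recency boost formula; not part of Lucene. */
final class RecencyBoostSketch {

    /** Linear boost: full multiplier for today, tapering to no boost at maxDaysAgo. */
    static float boostedScore(float subQueryScore, int daysAgo, int maxDaysAgo, double multiplier) {
        if (daysAgo < maxDaysAgo) {
            float boost = (float) (multiplier * (maxDaysAgo - daysAgo) / maxDaysAgo);
            return (float) (subQueryScore * (1.0 + boost));
        }
        return subQueryScore; // too old: score is left unchanged
    }

    public static void main(String[] args) {
        int today = 15878; // assumed: one day after Test2's pubmonth value 15877

        // Test2: 1 day old, so its score is multiplied by 81 (about 2.3329)
        System.out.println(boostedScore(0.028801177f, today - 15877, 5, 100.0));

        // Test1: 10 days old, outside the 5-day window, so its score is unchanged
        System.out.println(boostedScore(0.4794072f, today - 15868, 5, 100.0));
    }
}
```

Because the boost is a multiplier on the relevance score, a recent document with a weak text match can overtake an older, better match once the multiplier is large enough, as happens here.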

This wraps up our coverage of function queries. Although we focused on one compelling example, boosting relevance scoring according to recency, function queries open up a whole universe of possibilities. You’re completely free to implement whatever scoring you’d like.

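2 comments:

  1. Hello, I have a question. There are two ways to implement custom sorting: extending FieldComparator or extending CustomScoreProvider. What is the difference between them?

  2. Answering my own question: custom sorting implementations are most useful in situations when the sort criteria can't be determined during indexing. The difference is whether you can rely on information available at index time.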
