Lucene’s relevance scoring formula, which we discussed in chapter 3, does a great job of assigning relevance to each document based on how well it matches the query. But what if you’d like to modify or override how this scoring is done? In section 5.2 you saw how you can change the default relevance sorting to sort instead by one or more fields, but what if you need even more flexibility? This is where function queries come in.
Function queries give you the freedom to programmatically assign scores to matching documents using your own logic. All classes are from the org.apache.lucene.search.function package. In this section we first introduce the main classes used by function queries, and then walk through a real-world example of using function queries to boost recently modified documents.
Function query classes:
The base class for all function queries is ValueSourceQuery. This is a query that matches all documents but sets the score of each document according to a ValueSource provided during construction. The function package provides FieldCacheSource, and its subclasses, to derive values from the field cache. You can also create your own ValueSource, for example to derive scores from an external database. But probably the simplest approach is to use FieldScoreQuery, which subclasses ValueSourceQuery and derives each document’s score statically from a specific indexed field. The field should be a number, indexed without norms and with a single token per document. Typically you’d use Field.Index.NOT_ANALYZED_NO_NORMS. Let’s look at a simple example. First, include the field “score” in your documents:
- doc.add(new Field("score",
- "42",
- Field.Store.NO,
- Field.Index.NOT_ANALYZED_NO_NORMS));
- // Then create a function query that scores each document by the value of its "score" field
- Query q = new FieldScoreQuery("score", FieldScoreQuery.Type.BYTE);
Our example is somewhat contrived; you could simply sort by the score field, descending, to achieve the same results. But function queries get more interesting when you combine them using the second type of function query, CustomScoreQuery. This query class lets you combine a normal Lucene query with one or more other function queries. We can now use the FieldScoreQuery we created earlier and a CustomScoreQuery to compute our own score:
- package ch5;
- import junit.framework.TestCase;
- ...
- public class CustomizedScoreTest extends TestCase{
- private IndexSearcher searcher;
- public static Version LUCENE_VERSION = Version.LUCENE_30;
- @Override
- protected void setUp() throws Exception {
- Calendar calendar = Calendar.getInstance();
- Directory directory = new RAMDirectory();
- IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION, new WhitespaceAnalyzer(LUCENE_VERSION));
- IndexWriter writer = new IndexWriter(directory, iwConfig);
- Document doc1 = new Document();
- doc1.add(new Field("content",
- "the quick brown fox jumped over the lazy dog",
- Field.Store.YES, Field.Index.ANALYZED));
- doc1.add(new Field("title", "Test1", Field.Store.YES, Field.Index.ANALYZED));
- NumericField numField = new NumericField("score", Field.Store.NO, true);
- numField.setIntValue(42);
- doc1.add(numField);
- calendar.set(Calendar.DAY_OF_YEAR, calendar.get(Calendar.DAY_OF_YEAR)-10); // move back ten days
- doc1.add(new NumericField("pubmonthAsDay",Field.Store.YES,true).setIntValue((int) (calendar.getTime().getTime()/(1000*3600*24))));
- writer.addDocument(doc1);
- Document doc2 = new Document();
- doc2.add(new Field("content", "the fast fox hopped over the hound",
- Field.Store.YES, Field.Index.ANALYZED));
- doc2.add(new Field("title", "Test2", Field.Store.YES, Field.Index.ANALYZED));
- numField = new NumericField("score", Field.Store.NO, true);
- numField.setIntValue(100);
- doc2.add(numField);
- calendar.set(Calendar.DAY_OF_YEAR, calendar.get(Calendar.DAY_OF_YEAR)+9); // move forward nine days, i.e. one day before today
- doc2.add(new NumericField("pubmonthAsDay",Field.Store.YES,true).setIntValue((int) (calendar.getTime().getTime()/(1000*3600*24))));
- writer.addDocument(doc2);
- writer.close();
- IndexReader reader = IndexReader.open(directory);
- searcher = new IndexSearcher(reader);
- }
- @Override
- protected void tearDown() throws Exception
- {
- searcher.close();
- }
- public void testCustomScore() throws Exception{
- Query query = new QueryParser(Version.LUCENE_30, "content",
- new StandardAnalyzer(Version.LUCENE_30)).parse("the fast fox");
- FieldScoreQuery qf = new FieldScoreQuery("score", FieldScoreQuery.Type.INT);
- CustomScoreQuery customQ = new CustomScoreQuery(query, qf) {
- public CustomScoreProvider getCustomScoreProvider(IndexReader r) {
- return new CustomScoreProvider(r) {
- public float customScore(int doc, float subQueryScore,
- float valSrcScore) {
- return (float) (Math.sqrt(subQueryScore) * valSrcScore);
- }
- };
- }
- };
- TopDocs hits = searcher.search(customQ, 10);
- for(ScoreDoc doc:hits.scoreDocs)
- {
- System.out.printf("DocID=%d; Score=%.02f...\n", doc.doc, doc.score);
- }
- }
- }
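To make the arithmetic inside customScore concrete, here is a minimal sketch in plain Java with no Lucene dependency. The class name CustomScoreSketch and the sub-query scores fed into it are hypothetical; only the formula (the square root of the sub-query score multiplied by the value-source score) and the field values 42 and 100 come from the test above.

```java
// Hypothetical stand-alone illustration; not part of Lucene.
final class CustomScoreSketch {

    // Same formula as customScore in the test above:
    // dampen the text relevance with a square root, then scale by the field value.
    static float customScore(float subQueryScore, float valSrcScore) {
        return (float) (Math.sqrt(subQueryScore) * valSrcScore);
    }

    public static void main(String[] args) {
        // The sub-query score 0.25 is a made-up input, not real Lucene output.
        float doc1 = customScore(0.25f, 42f);   // "score" field = 42
        float doc2 = customScore(0.25f, 100f);  // "score" field = 100
        System.out.printf("doc1=%.2f doc2=%.2f%n", doc1, doc2);
    }
}
```

When two documents match the text query equally well, the one with the larger score field ends up ranked first, because the field value scales the result linearly while the text relevance is only taken at its square root.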
Boosting recently modified documents using function queries:
A real-world use of CustomScoreQuery is to perform document boosting. You can boost according to any custom criteria, but for our example, shown in listing 5.15, we boost recently modified documents using a new custom query class, RecencyBoostingQuery. In applications where documents have a clear timestamp, such as searching a newsfeed or press releases, boosting by recency can be useful. The class requires you to specify the name of a numeric field that contains the timestamp of each document that you’d like to use for boosting.
Listing 5.15 Using recency to boost search results
- package ch5;
- import java.io.IOException;
- import java.util.Date;
- import org.apache.lucene.index.IndexReader;
- import org.apache.lucene.search.FieldCache;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.function.CustomScoreProvider;
- import org.apache.lucene.search.function.CustomScoreQuery;
- public class RecencyBoostingQuery extends CustomScoreQuery {
- double multiplier;
- int today;
- int maxDaysAgo;
- String dayField;
- static int MSEC_PER_DAY = 1000 * 3600 * 24;
- public RecencyBoostingQuery(Query q, double multiplier, int maxDaysAgo,
- String dayField) {
- super(q);
- today = (int) (new Date().getTime() / MSEC_PER_DAY);
- this.multiplier = multiplier;
- this.maxDaysAgo = maxDaysAgo;
- this.dayField = dayField;
- }
- private class RecencyBooster extends CustomScoreProvider {
- final int[] publishDay;
- public RecencyBooster(IndexReader r) throws IOException {
- super(r);
- publishDay = FieldCache.DEFAULT.getInts(r, dayField);
- }
- public float customScore(int doc, float subQueryScore, float valSrcScore) {
- int daysAgo = today - publishDay[doc];
- if (daysAgo < maxDaysAgo) {
- float boost = (float) (multiplier * (maxDaysAgo - daysAgo) / maxDaysAgo);
- return (float) (subQueryScore * (1.0 + boost));
- } else {
- return subQueryScore;
- }
- }
- }
- public CustomScoreProvider getCustomScoreProvider(IndexReader r)
- throws IOException {
- return new RecencyBooster(r);
- }
- }
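The day field must be indexed as a NumericField holding the number of days since the epoch. Here d is assumed to be the java.util.Date timestamp of the document:
- doc.add(new NumericField("pubmonthAsDay")
- .setIntValue((int) (d.getTime()/(1000*3600*24))));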
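The boost arithmetic in listing 5.15 can be tried in isolation. The following is a minimal sketch in plain Java, without Lucene, that reproduces the days-since-epoch conversion and the linear boost. The class name RecencyBoostSketch, the method names, and the settings multiplier = 2.0 and maxDaysAgo = 5 are illustrative assumptions, not part of the original listing.

```java
import java.util.Date;

// Hypothetical stand-alone illustration; not part of Lucene.
final class RecencyBoostSketch {
    static final long MSEC_PER_DAY = 1000L * 3600 * 24;

    // Whole days since the Unix epoch, the unit stored in the day field.
    static int toEpochDay(Date d) {
        return (int) (d.getTime() / MSEC_PER_DAY);
    }

    // Same arithmetic as RecencyBooster.customScore in listing 5.15.
    static float boostedScore(float subQueryScore, int daysAgo,
                              double multiplier, int maxDaysAgo) {
        if (daysAgo < maxDaysAgo) {
            float boost = (float) (multiplier * (maxDaysAgo - daysAgo) / maxDaysAgo);
            return (float) (subQueryScore * (1.0 + boost));
        }
        return subQueryScore; // too old: score is left unchanged
    }

    public static void main(String[] args) {
        System.out.println("today as epoch day: " + toEpochDay(new Date()));
        // multiplier = 2.0 and maxDaysAgo = 5 are example settings only.
        for (int daysAgo = 0; daysAgo <= 6; daysAgo++) {
            System.out.println(daysAgo + " days ago -> "
                    + boostedScore(1.0f, daysAgo, 2.0, 5));
        }
    }
}
```

The boost falls linearly from the full multiplier for a document published today down to zero at maxDaysAgo; anything older keeps its original relevance score.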
Once the index is set up, using RecencyBoostingQuery is straightforward, as shown in listing 5.16.
Listing 5.16 Testing recency boosting
- public void testRecency() throws Throwable {
- searcher.setDefaultFieldSortScoring(true, true);
- QueryParser parser = new QueryParser(Version.LUCENE_30, "content", new StandardAnalyzer(Version.LUCENE_30));
- Query q = parser.parse("fox");
- Query q2 = new RecencyBoostingQuery(q, 100.0, 5, "pubmonthAsDay");
- Sort sort = new Sort(new SortField[] { SortField.FIELD_SCORE, new SortField("title", SortField.STRING) });
- TopDocs hits = searcher.search(q2, null, 5, sort);
- for (int i = 0; i < hits.scoreDocs.length; i++) {
- Document doc = searcher.doc(hits.scoreDocs[i].doc);
- System.out.println(hits.scoreDocs[i].doc + ": " + doc.get("title")
- + ": pubmonth=" + doc.get("pubmonthAsDay") + " score="
- + hits.scoreDocs[i].score);
- }
- }
If instead you run the search with q2, which boosts each result by recency, you’ll see this:
You can see that in the unboosted query, the top two results were tied based on relevance. But after factoring in recency boosting, the scores were different and the sort order changed.
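The effect of breaking a relevance tie can be sketched without an index. The example below is plain Java with made-up data: two hypothetical hits, "A" and "B", share the same relevance score but differ in age, and are ordered the way the Sort in listing 5.16 orders them (score descending, then title). The class TieBreakSketch, the Hit type, the scores, and the boost settings are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-alone illustration; not part of Lucene.
final class TieBreakSketch {

    static final class Hit {
        final String title;
        final float score;
        Hit(String title, float score) { this.title = title; this.score = score; }
    }

    // Mirrors the Sort in listing 5.16: score descending, then title ascending.
    static final Comparator<Hit> BY_SCORE_THEN_TITLE = new Comparator<Hit>() {
        public int compare(Hit a, Hit b) {
            int byScore = Float.compare(b.score, a.score);
            return byScore != 0 ? byScore : a.title.compareTo(b.title);
        }
    };

    // Linear recency boost with the same shape as in listing 5.15.
    static float boost(float score, int daysAgo, double multiplier, int maxDaysAgo) {
        if (daysAgo >= maxDaysAgo) return score;
        return (float) (score * (1.0 + multiplier * (maxDaysAgo - daysAgo) / maxDaysAgo));
    }

    // Returns the titles in ranked order without modifying the input list.
    static List<String> rank(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<Hit>(hits);
        sorted.sort(BY_SCORE_THEN_TITLE);
        List<String> titles = new ArrayList<String>();
        for (Hit h : sorted) titles.add(h.title);
        return titles;
    }

    public static void main(String[] args) {
        // Made-up data: equal relevance; "A" is 30 days old, "B" is 1 day old.
        List<Hit> unboosted = new ArrayList<Hit>();
        unboosted.add(new Hit("A", 0.5f));
        unboosted.add(new Hit("B", 0.5f));
        List<Hit> boosted = new ArrayList<Hit>();
        boosted.add(new Hit("A", boost(0.5f, 30, 2.0, 5)));
        boosted.add(new Hit("B", boost(0.5f, 1, 2.0, 5)));
        System.out.println("unboosted: " + rank(unboosted));
        System.out.println("boosted:   " + rank(boosted));
    }
}
```

Without the boost the tie is resolved by title alone; with it, the more recent hit receives a higher score and moves ahead regardless of its title.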
This wraps up our coverage of function queries. Although we focused on one compelling example, boosting relevance scoring according to recency, function queries open up a whole universe of possibilities. You’re completely free to implement whatever scoring you’d like.
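Hello, I'd like to ask a question. There are two ways to implement custom sorting: extending FieldComparator or extending CustomScoreProvider. What is the difference between them?
Answering my own question: custom sorting implementations are most useful in situations when the sort criteria can't be determined during indexing. The difference is whether information available at index time can be used.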