程式扎記: [ InAction Note ] Ch3. Adding search - Understanding Lucene scoring

Thursday, November 8, 2012


Preface:
Every time a document matches during search, it’s assigned a score that reflects how good the match is. This score computes how similar the document is to the query, with higher scores reflecting stronger similarity and thus stronger matches. We chose to discuss this complex topic early in this chapter so you’ll have a general sense of the various factors that go into Lucene scoring as you continue to read. We’ll start with details on Lucene’s scoring formula, and then show how you can see the full explanation of how a certain document arrived at its score.

How Lucene scores:
Without further ado, meet Lucene’s similarity scoring formula, shown in figure 3.3. It’s called the similarity scoring formula because its purpose is to measure the similarity between a query and each document that matches the query. The score is computed for each document (d) matching each term (t) in a query (q):
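Written out in LaTeX, using the factor names from this section and the form documented in the Similarity Javadocs for Lucene 3.x (so the exact notation is an assumption about what figure 3.3 shows), the formula reads roughly as follows:

```latex
\mathrm{score}(q,d) =
  \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big(
    \mathrm{tf}(t \in d) \cdot
    \mathrm{idf}(t)^{2} \cdot
    \mathrm{boost}(t.\mathrm{field} \in d) \cdot
    \mathrm{lengthNorm}(t.\mathrm{field} \in d)
  \Big)
```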


This score is the raw score, which is a floating-point number >= 0.0. Typically, if an application presents the score to the end user, it’s best to first normalize the scores by dividing all scores by the maximum score for the query. The larger the similarity score, the better the match of the document to the query. By default Lucene returns documents reverse-sorted by this score, meaning the top documents are the best matching ones. Table 3.5 describes each of the factors in the scoring formula.
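The normalization step described above can be sketched in a few lines. ScoreNormalizer is a hypothetical helper, not part of Lucene, and the raw scores are made-up values; in a real application they would come from the score field of each ScoreDoc.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Lucene): rescales raw scores into [0, 1]. */
class ScoreNormalizer {

    /** Divides every raw score by the largest one and returns a new array. */
    static float[] normalize(float[] rawScores) {
        float max = 0.0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float[] normalized = new float[rawScores.length];
        if (max == 0.0f) {
            return normalized; // no hits, or all scores are zero: avoid dividing by zero
        }
        for (int i = 0; i < rawScores.length; i++) {
            normalized[i] = rawScores[i] / max;
        }
        return normalized;
    }

    public static void main(String[] args) {
        float[] raw = {2.0f, 1.0f, 0.5f}; // made-up raw scores for illustration
        System.out.println(Arrays.toString(normalize(raw))); // prints [1.0, 0.5, 0.25]
    }
}
```

Normalized values are only comparable within one result set, because the maximum score changes from query to query.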


Boost factors are built into the equation to let you affect a query or field’s influence on score. Field boosts come in explicitly in the equation as the boost(t.field in d) factor, set at indexing time. The default value of field boosts, logically, is 1.0. During indexing, a document can be assigned a boost, too. A document boost factor implicitly sets the starting field boost of all fields to the specified value. Field-specific boosts are multiplied by the starting value, giving the final value of the field boost factor. It’s possible to add the same named field to a document multiple times, and in such situations the field boost is computed as all the boosts specified for that field and document multiplied together.
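The multiplication rule for boosts can be illustrated with plain arithmetic. FieldBoostSketch and its method are hypothetical names used only for this note; in Lucene 3.x the boosts themselves are set with Document.setBoost(float) and Field.setBoost(float), and the values below are made up.

```java
/** Hypothetical illustration of how boosts combine; not Lucene's own code. */
class FieldBoostSketch {

    /**
     * Effective boost of one field name in one document: the document boost
     * is the starting value, multiplied by the boost of every instance of
     * that field added to the document.
     */
    static float effectiveBoost(float documentBoost, float... fieldInstanceBoosts) {
        float boost = documentBoost;
        for (float b : fieldInstanceBoosts) {
            boost *= b;
        }
        return boost;
    }

    public static void main(String[] args) {
        // Document boost 1.5, field "subject" added twice with boosts 2.0 and 3.0.
        System.out.println(effectiveBoost(1.5f, 2.0f, 3.0f)); // prints 9.0
        // No explicit boosts anywhere: everything stays at the default 1.0.
        System.out.println(effectiveBoost(1.0f)); // prints 1.0
    }
}
```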

In addition to the explicit factors in this equation, other factors can be computed on a per-query basis as part of the queryNorm factor. Queries themselves can have an impact on the document score. Boosting a Query instance is sensible only in a multiple-clause query; if only a single term is used for searching, changing its boost would impact all matched documents equally. In a multiple-clause Boolean query, some documents may match one clause but not another, enabling the boost factor to discriminate between matching documents. Queries also default to a 1.0 boost factor.
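To see why a boost only matters when clauses compete, consider a deliberately simplified toy model in which a document's score is the sum of boost times per-clause score. This is not Lucene's scorer (it ignores coord and queryNorm), and the class name and numbers are invented for illustration; in Lucene 3.x the boost itself would be set with Query.setBoost(float).

```java
/** Toy model (not Lucene's scorer): score = sum of clauseBoost * clauseScore. */
class QueryBoostSketch {

    static float score(float[] clauseBoosts, float[] clauseScores) {
        float total = 0.0f;
        for (int i = 0; i < clauseBoosts.length; i++) {
            total += clauseBoosts[i] * clauseScores[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up per-clause scores for the clauses [ant, junit]; 0 means no match.
        float[] docA = {0.6f, 0.0f}; // matches only "ant"
        float[] docB = {0.0f, 0.5f}; // matches only "junit"

        float[] noBoost = {1.0f, 1.0f};
        float[] junitBoosted = {1.0f, 2.0f};

        // Without boosts docA ranks first; boosting "junit" puts docB first.
        System.out.println(score(noBoost, docA) > score(noBoost, docB));           // prints true
        System.out.println(score(junitBoosted, docB) > score(junitBoosted, docA)); // prints true
    }
}
```

With a single clause, the same boost multiplies every matching document's score, so the ranking cannot change.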

Most of these scoring formula factors are controlled and implemented as a subclass of the abstract Similarity class. DefaultSimilarity is the implementation used unless otherwise specified. More computations are performed under the covers of DefaultSimilarity; for example, the term frequency factor is the square root of the actual frequency. In practice, it’s extremely rare to need a change in these factors. Should you need to change them, please refer to Similarity’s Javadocs, and be prepared with a solid understanding of these factors and the effect your changes will have.
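As a rough illustration of what happens "under the covers", the sketch below restates three DefaultSimilarity formulas as they are documented for Lucene 3.x. It is a stand-alone re-statement for this note, not the library class; DefaultSimilaritySketch is a hypothetical name, and the real implementation additionally stores the length norm in a lossy one-byte encoding, which this sketch does not model.

```java
/** Stand-alone re-statement of three DefaultSimilarity formulas (Lucene 3.x docs). */
class DefaultSimilaritySketch {

    /** Term frequency factor: square root of the raw frequency. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** Inverse document frequency: ln(numDocs / (docFreq + 1)) + 1. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** Length normalization: 1 / sqrt(number of terms in the field). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println(tf(4));         // prints 2.0
        System.out.println(idf(1, 2));     // prints 1.0
        System.out.println(lengthNorm(4)); // prints 0.5
    }
}
```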

Using explain() to understand hit scoring:
Whew! The scoring formula seems daunting, and it is. We’re talking about factors that rank one document higher than another based on a query; that in and of itself deserves the sophistication going on. If you want to see how all these factors play out, Lucene provides a helpful feature called Explanation. IndexSearcher has an explain method, which requires a Query and a document ID and returns an Explanation object.

The Explanation object internally contains all the gory details that factor into the score calculation. Each detail can be accessed individually if you like; but generally, dumping out the explanation in its entirety is desired. The .toString() method dumps a nicely formatted text representation of the Explanations. We wrote a simple program to dump Explanations, shown in listing 3.4.
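The shape of such an explanation is a small tree: each node has a value, a description, and nested details, and the text dump indents each level. The sketch below models that idea with ScoreNode, a hypothetical stand-in written for this note; it is not Lucene's Explanation class (which exposes getValue(), getDescription(), and getDetails()), and the descriptions are shortened placeholders.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the shape of an explanation tree; not Lucene's class. */
class ScoreNode {
    private final float value;
    private final String description;
    private final List<ScoreNode> details = new ArrayList<>();

    ScoreNode(float value, String description) {
        this.value = value;
        this.description = description;
    }

    /** Adds a nested detail and returns this node so calls can be chained. */
    ScoreNode addDetail(ScoreNode detail) {
        details.add(detail);
        return this;
    }

    float getValue() {
        return value;
    }

    List<ScoreNode> getDetails() {
        return details;
    }

    /** Appends this node, then every nested detail indented one level deeper. */
    private void render(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(value).append(" = ").append(description).append('\n');
        for (ScoreNode detail : details) {
            detail.render(out, depth + 1);
        }
    }

    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        render(out, 0);
        return out.toString();
    }

    public static void main(String[] args) {
        ScoreNode root = new ScoreNode(0.5f, "fieldWeight, product of:")
                .addDetail(new ScoreNode(1.0f, "tf"))
                .addDetail(new ScoreNode(1.0f, "idf"))
                .addDetail(new ScoreNode(0.5f, "fieldNorm"));
        System.out.print(root); // whole tree, one indented line per node

        // Each detail can also be read individually.
        float product = 1.0f;
        for (ScoreNode detail : root.getDetails()) {
            product *= detail.getValue();
        }
        System.out.println(product == root.getValue()); // prints true
    }
}
```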

- Listing 3.4 The explain() method
  package ch3;

  import java.io.File;
  import java.io.IOException;

  import junit.framework.TestCase;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.SimpleAnalyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class ListExams extends TestCase {
      public static Version LUCENE_VERSION = Version.LUCENE_30;
      public Directory directory = null;
      public IndexSearcher searcher = null;
      public File indexpath = new File("./test");
      protected String[] ids = { "1", "2" };
      protected String[] unindexed = { "Ant in Action", "Junit in Action" };
      protected String[] unstored = { "Amsterdam has lots of bridges",
              "Venice has lots of canals" };
      protected String[] subject = { "Ant in Action with Junit", "JUnit in Action, Second Edition" };

      @Override
      protected void tearDown() throws Exception
      {
          //System.out.printf("\t[Test] tearDown...\n");
          // Close the searcher only if a test actually opened one.
          if (searcher != null) {
              searcher.close();
              searcher = null;
          }
          directory.close();
          /*Thread.sleep(1000);
          File fs[] = indexpath.listFiles();
          for(File f:fs)
          {
              System.out.printf("\t[Test] Delete %s...\n", f.getAbsolutePath());
              f.delete();
          }
          Thread.sleep(1000);*/
      }

      @Override
      protected void setUp() throws Exception {
          //System.out.printf("\t[Test] setUp...\n");
          // 1) Run before every test
          directory = FSDirectory.open(indexpath);
          buildIndex();
      }

      protected void buildIndex() throws Exception
      {
          // 2) Create IndexWriter
          IndexWriter writer = getWriter();

          // 3) Add documents
          for (int i = 0; i < ids.length; i++) {
              Document doc = new Document();
              doc.add(new Field("id", ids[i], Field.Store.YES,
                      Field.Index.NOT_ANALYZED));
              doc.add(new Field("title", unindexed[i], Field.Store.YES,
                      Field.Index.NO));
              doc.add(new Field("contents", unstored[i], Field.Store.NO,
                      Field.Index.ANALYZED));
              doc.add(new Field("subject", subject[i], Field.Store.YES,
                      Field.Index.ANALYZED));
              writer.addDocument(doc);
          }
          writer.commit();
          writer.close();
      }

      /**
       * BD: The StandardAnalyzer applies a LowerCaseFilter, which makes search case-insensitive.
       * Reference:
       *      - How to make lucene be case-insensitive
       *        http://stackoverflow.com/questions/5512803/how-to-make-lucene-be-case-insensitive
       * @return an IndexWriter that recreates the index in {@link #directory}
       * @throws IOException if the index cannot be opened for writing
       */
      private IndexWriter getWriter() throws IOException {
          Analyzer alyz = new StandardAnalyzer(LUCENE_VERSION);
          IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION, alyz);
          iwConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
          return new IndexWriter(directory, iwConfig);
      }

      /** Lazily opens a searcher on the index built in setUp(). */
      private IndexSearcher getSearcher() throws IOException
      {
          if(searcher==null)
          {
              IndexReader idxReader = IndexReader.open(directory);
              searcher = new IndexSearcher(idxReader);
          }
          return searcher;
      }

      /**
       * BD: List 3.1
       * @throws Exception
       */
      public void testTerm() throws Exception {
          // 1) Create IndexSearcher -> directory is built during setUp()
          IndexSearcher searcher = getSearcher();

          // 2) Build single-term query
          Term t = new Term("subject", "ant");
          Query query = new TermQuery(t);

          // 3) Search
          TopDocs docs = searcher.search(query, 10);

          // 4) Confirm one hit for the 'ant' query.
          assertEquals("Ant in Action", 1, docs.totalHits);

          // 5) Search again
          t = new Term("subject", "junit");
          docs = searcher.search(new TermQuery(t), 10);

          // 6) Confirm two hits for the 'junit' query.
          assertEquals("Ant in Action, " + "JUnit in Action, Second Edition",
                  2, docs.totalHits);

          // 7) Searcher and directory are closed in tearDown()
      }

      /**
       * BD: List 3.2 - QueryParser, which makes it trivial to translate search text into a Query
       * @throws Exception
       */
      public void testQueryParser() throws Exception {
          // 1) Create IndexSearcher -> directory is built during setUp()
          IndexSearcher searcher = getSearcher();

          // 2) Create QueryParser
          QueryParser parser = new QueryParser(LUCENE_VERSION, "subject",
                  new SimpleAnalyzer(LUCENE_VERSION));

          // 3) Query subject to have "JUNIT" and "ANT" but not "MOCK"
          Query query = parser.parse("+JUNIT +ANT -MOCK");
          TopDocs docs = searcher.search(query, 10);

          // 4) Assert one hit.
          assertEquals(1, docs.totalHits);

          // 5) Fetch the top document from the search result.
          Document d = searcher.doc(docs.scoreDocs[0].doc);

          // 6) Assert its title is "Ant in Action".
          assertEquals("Ant in Action", d.get("title"));

          // 7) Query again for "mock" or "junit"
          query = parser.parse("mock OR junit");
          docs = searcher.search(query, 10);

          // 8) Assert two hits.
          assertEquals("Ant in Action, " + "JUnit in Action, Second Edition", 2,
                  docs.totalHits);

          // 9) Searcher and directory are closed in tearDown()
      }
  }
We can then test it with the sample code below (run the example from "Ch3. Adding search - Implementing a simple search feature" first to build the index):
  File indexpath = new File("./test");  /* Index folder */
  String[] argSet = {indexpath.getAbsolutePath(), "ant"};  /* Query term = 'ant' */
  Explainer.main(argSet);
The output looks like:
Query: ant
----------
Ant in Action
0.5 = (MATCH) fieldWeight(subject:ant in 0), product of:
  1.0 = tf(termFreq(subject:ant)=1)
  1.0 = idf(docFreq=1, maxDocs=2)
  0.5 = fieldNorm(field=subject, doc=0)
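
As a check, the factors in this output can be recomputed by hand, assuming the DefaultSimilarity formulas documented for Lucene 3.x (square-root term frequency and a logarithmic idf). The fieldNorm value is taken as printed rather than derived:

```latex
\begin{aligned}
\mathrm{tf} &= \sqrt{\mathrm{termFreq}} = \sqrt{1} = 1.0 \\
\mathrm{idf} &= 1 + \ln\frac{\mathrm{maxDocs}}{\mathrm{docFreq} + 1} = 1 + \ln\frac{2}{1 + 1} = 1.0 \\
\mathrm{fieldWeight} &= \mathrm{tf} \cdot \mathrm{idf} \cdot \mathrm{fieldNorm} = 1.0 \cdot 1.0 \cdot 0.5 = 0.5
\end{aligned}
```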

