程式扎記: [ InAction Note ] Ch5. Advanced search techniques - Querying on multiple fields at once


Monday, May 13, 2013


Preface: 
In our book data, several fields were indexed separately to hold the title, category, author, subject, and so forth. But when searching, a user would typically like to search across all fields at once. You could require users to spell out each field name, but except for specialized cases, that demands far too much work from them. Users much prefer to search all fields by default unless a specific field is requested. We cover three possible approaches here. 




First approach: 
The first approach is to create a multivalued catchall field to index the text from all fields, as we’ve done for the contents field in our book test index. Be sure to increase the position increment gap across field values, as described in section 4.7.1, to avoid incorrectly matching across two field values. You then perform all searching against the catchall field. This approach has some downsides: you can’t directly control per-field boosting, and disk space is wasted, assuming you also index each field separately.
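The effect of the position increment gap can be seen in a small toy model. The sketch below is plain Java, not Lucene code: the class `CatchAllGapSketch`, its `gap` parameter, and the two sample values are all made up for illustration. It assigns a position to every token of a multivalued field and checks a two-word phrase by requiring adjacent positions. In Lucene itself the gap normally comes from the analyzer's `getPositionIncrementGap(String fieldName)` method.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a multivalued catchall field; not Lucene code.
class CatchAllGapSketch {
    private final List<String> tokens = new ArrayList<>();
    private final List<Integer> positions = new ArrayList<>();

    // Tokenizes each value on whitespace and leaves `gap` unused
    // positions between consecutive values.
    CatchAllGapSketch(int gap, String... values) {
        int next = 0;
        for (String value : values) {
            for (String token : value.toLowerCase().split("\\s+")) {
                tokens.add(token);
                positions.add(next++);
            }
            next += gap;
        }
    }

    // True if `second` occurs exactly one position after `first`.
    boolean matchesPhrase(String first, String second) {
        for (int i = 0; i < tokens.size(); i++) {
            for (int j = 0; j < tokens.size(); j++) {
                if (tokens.get(i).equals(first) && tokens.get(j).equals(second)
                        && positions.get(j) == positions.get(i) + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] values = { "Lucene in Action", "Erik Hatcher" };
        CatchAllGapSketch noGap = new CatchAllGapSketch(0, values);
        CatchAllGapSketch withGap = new CatchAllGapSketch(100, values);
        // Phrase spanning two values: matches only when there is no gap.
        System.out.println(noGap.matchesPhrase("action", "erik"));   // true
        System.out.println(withGap.matchesPhrase("action", "erik")); // false
        // Phrase inside one value: unaffected by the gap.
        System.out.println(withGap.matchesPhrase("lucene", "in"));   // true
    }
}
```

Without a gap, the last word of one value and the first word of the next sit on adjacent positions, so the phrase "action erik" matches even though no single field value contains it.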

Second approach: 
The second approach is to use MultiFieldQueryParser, which subclasses QueryParser. Under the covers, it instantiates a QueryParser, parses the query expression for each field, then combines the resulting queries using a BooleanQuery. The default operator OR is used in the simplest parse method when adding the clauses to the BooleanQuery. For finer control, the operator can be specified for each field as required (BooleanClause.Occur.MUST), prohibited (BooleanClause.Occur.MUST_NOT), or normal (BooleanClause.Occur.SHOULD), using the constants from BooleanClause.

Listing 5.7 shows this heavier QueryParser variant in use. The testDefaultOperator() method first parses the query "development" using both the title and subject fields. The test shows that documents match based on either of those fields. The second test, testSpecifiedOperator(), sets the parsing to mandate that documents must match the expression in all specified fields and searches using the query "lucene". 
- Listing 5.7 MultiFieldQueryParser, which searches on multiple fields at once 
    package ch5;

    import junit.framework.TestCase;

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    // Written against the Lucene 3.x API (IndexWriterConfig requires 3.1 or later).
    public class MultiFieldQueryParserTest extends TestCase {
        public static final Version LUCENE_VERSION = Version.LUCENE_30;
        private static final String[] FIELDS = { "title", "subject" };

        private IndexReader reader;
        private IndexSearcher searcher;

        @Override
        protected void setUp() throws Exception {
            // Build a small in-memory index with four sample documents.
            Directory directory = new RAMDirectory();
            IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION, new SimpleAnalyzer(LUCENE_VERSION));
            IndexWriter writer = new IndexWriter(directory, iwConfig);
            addDoc(writer, "Ant in Action", "This helps you to use Ant in development.");
            addDoc(writer, "Java development guide", "This document helps you to learn java programming");
            addDoc(writer, "Lucene in action", "This document covers how to use IR system.");
            addDoc(writer, "Lucene tutorial", "This document teaches you how to use lucene.");
            writer.close();
            reader = IndexReader.open(directory);
            searcher = new IndexSearcher(reader);
        }

        // Indexes one document with an analyzed, stored title and subject.
        private static void addDoc(IndexWriter writer, String title, String subject) throws Exception {
            Document doc = new Document();
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("subject", subject, Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }

        @Override
        protected void tearDown() throws Exception {
            searcher.close();
            // Closing the searcher does not close a reader that was passed in.
            reader.close();
        }

        // Default operator (OR): a match in either field is enough.
        public void testDefaultOperator() throws Exception {
            Query query = new MultiFieldQueryParser(LUCENE_VERSION,
                    FIELDS, new SimpleAnalyzer(LUCENE_VERSION))
                    .parse("development");
            TopDocs hits = searcher.search(query, 10);
            assertEquals(2, hits.totalHits);
        }

        // MUST on both fields: the term has to occur in title and subject.
        public void testSpecifiedOperator() throws Exception {
            Query query = MultiFieldQueryParser.parse(LUCENE_VERSION, "lucene",
                    FIELDS, new BooleanClause.Occur[] {
                            BooleanClause.Occur.MUST, BooleanClause.Occur.MUST },
                    new SimpleAnalyzer(LUCENE_VERSION));
            TopDocs hits = searcher.search(query, 10);

            assertEquals("one and only one", 1, hits.scoreDocs.length);
            assertEquals("title check", "Lucene tutorial", searcher.doc(hits.scoreDocs[0].doc).get("title"));
        }
    }
In testDefaultOperator() above, the documents titled "Ant in Action" and "Java development guide" are found, because either their title or their subject contains the term "development". In testSpecifiedOperator(), only the document titled "Lucene tutorial" is found, because we require both the title and the subject to contain the term "lucene".
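To see why the two tests return two hits and one hit, it can help to model the clause logic without Lucene at all. The sketch below is a plain-Java toy: `MultiFieldMatchSketch`, its `Occur` enum, and the helper methods are hypothetical names, and the tokenizer only approximates SimpleAnalyzer by lowercasing and splitting on non-letters. It reuses the four sample documents from Listing 5.7 and applies the rule that every MUST clause has to match, while with SHOULD clauses alone a single match is enough.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of how per-field clauses combine; not Lucene code.
class MultiFieldMatchSketch {
    enum Occur { SHOULD, MUST }

    // Lowercases and splits on non-letters, roughly like SimpleAnalyzer.
    static boolean fieldContains(String text, String term) {
        return Arrays.asList(text.toLowerCase().split("[^a-z]+")).contains(term);
    }

    // A document matches when every MUST clause matches and, if there
    // are no MUST clauses, at least one SHOULD clause matches.
    static boolean matches(Map<String, String> doc, String term, Map<String, Occur> clauses) {
        boolean anyMust = false;
        boolean anyShouldHit = false;
        for (Map.Entry<String, Occur> clause : clauses.entrySet()) {
            boolean hit = fieldContains(doc.getOrDefault(clause.getKey(), ""), term);
            if (clause.getValue() == Occur.MUST) {
                anyMust = true;
                if (!hit) {
                    return false;
                }
            } else if (hit) {
                anyShouldHit = true;
            }
        }
        return anyMust || anyShouldHit;
    }

    static long countMatches(List<Map<String, String>> docs, String term, Map<String, Occur> clauses) {
        return docs.stream().filter(d -> matches(d, term, clauses)).count();
    }

    static Map<String, String> doc(String title, String subject) {
        Map<String, String> d = new LinkedHashMap<>();
        d.put("title", title);
        d.put("subject", subject);
        return d;
    }

    // The same operator applied to both the title and the subject field.
    static Map<String, Occur> clauses(Occur occur) {
        Map<String, Occur> c = new LinkedHashMap<>();
        c.put("title", occur);
        c.put("subject", occur);
        return c;
    }

    static List<Map<String, String>> sampleDocs() {
        return Arrays.asList(
                doc("Ant in Action", "This helps you to use Ant in development."),
                doc("Java development guide", "This document helps you to learn java programming"),
                doc("Lucene in action", "This document covers how to use IR system."),
                doc("Lucene tutorial", "This document teaches you how to use lucene."));
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = sampleDocs();
        System.out.println(countMatches(docs, "development", clauses(Occur.SHOULD))); // 2
        System.out.println(countMatches(docs, "lucene", clauses(Occur.MUST)));        // 1
    }
}
```

Switching the "lucene" query from MUST to SHOULD in this model raises the count to two, since "Lucene in action" matches on its title alone.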

Third approach: 
The third approach for automatically querying across multiple fields is the advanced DisjunctionMaxQuery, which wraps one or more arbitrary queries, OR’ing together the documents they match. You could do this with BooleanQuery, as MultiFieldQueryParser does, but what makes DisjunctionMaxQuery interesting is how it scores each hit: when a document matches more than one query, it computes the score as the maximum score across all the queries that matched, whereas BooleanQuery sums the scores of all matching queries. This can produce better end-user relevance.
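The difference between summing and taking the maximum is easiest to see with numbers. The following plain-Java sketch is not Lucene code: `SumVsMaxSketch` and the per-field scores are invented for illustration, and it ignores the other factors Lucene applies to a final score, such as the coordination factor and normalization. It compares a document with a weak match in two fields against a document with one strong match in a single field.

```java
import java.util.Arrays;

// Toy comparison of two ways to combine per-field scores; not Lucene code.
class SumVsMaxSketch {
    // BooleanQuery-style combination: add up the scores of all matching clauses.
    static double sumScore(double... fieldScores) {
        return Arrays.stream(fieldScores).sum();
    }

    // DisjunctionMaxQuery-style combination: keep only the best clause score.
    static double maxScore(double... fieldScores) {
        return Arrays.stream(fieldScores).max().orElse(0.0);
    }

    public static void main(String[] args) {
        double[] weakInBoth = { 0.5, 0.5 }; // weak match in title and in subject
        double[] strongInOne = { 0.9 };     // strong match in title only

        // Summing ranks the document with two weak matches first.
        System.out.println(sumScore(weakInBoth) > sumScore(strongInOne)); // true
        // Taking the maximum ranks the single strong match first.
        System.out.println(maxScore(weakInBoth) < maxScore(strongInOne)); // true
    }
}
```

Under the sum, repeating the same text in several fields inflates the score. Under the maximum, a document is judged by its best field.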

DisjunctionMaxQuery also includes an optional tie-breaker multiplier so that, all things being equal, a document matching more queries will receive a higher score than a document matching fewer queries. To use DisjunctionMaxQuery to query across multiple fields, you create a new field-specific Query for each field you’d like to include, and then use DisjunctionMaxQuery’s add method to include that Query.
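As a rough illustration of the tie-breaker, the plain-Java sketch below scores a document as its best sub-score plus the multiplier times the sum of the remaining sub-scores. This formula is my reading of the behavior described above, not code taken from Lucene; the class `TieBreakerSketch`, the method `disMaxScore`, and the scores are all hypothetical, and scores are assumed to be non-negative.

```java
// Toy model of max-plus-tie-breaker scoring; not Lucene code.
class TieBreakerSketch {
    // Best sub-score, plus tieBreaker times the sum of the other sub-scores.
    // Assumes non-negative scores; returns 0 when nothing matched.
    static double disMaxScore(double tieBreaker, double... fieldScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double score : fieldScores) {
            max = Math.max(max, score);
            sum += score;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        double[] oneField = { 0.9 };       // matches in title only
        double[] twoFields = { 0.9, 0.3 }; // same title score, plus a subject match

        // With a multiplier of 0 the extra match is ignored, so the two tie.
        System.out.println(disMaxScore(0.0, oneField) == disMaxScore(0.0, twoFields)); // true
        // A small multiplier lets the document matching more fields win.
        System.out.println(disMaxScore(0.1, twoFields) > disMaxScore(0.1, oneField));  // true
    }
}
```

A multiplier of 1 would turn this into a plain sum, so small values keep the ranking driven by the best field while still rewarding additional matches.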

Conclusion: 
Which approach makes sense for your application? The answer is “It depends,” because there are important trade-offs. The catchall field is a simple index-time-only solution but results in simplistic scoring and may waste disk space by indexing the same text twice. Yet it likely yields the best searching performance. MultiFieldQueryParser produces BooleanQuerys that sum the scores (whereas DisjunctionMaxQuery takes the maximum score) for all queries that match each document, then properly implements per-field boosting. You should test all three approaches, taking into account both search performance and search relevance, to find the best fit for your application.
