程式扎記: [ InAction Note ] Ch5. Advanced search techniques

Preface:
In our book data, several fields were indexed to separately hold the title, category, author, subject, and so forth. But when searching, a user would typically like to search across all fields at once. You could require users to spell out each field name, but except for specialized cases, that requires far too much work from your users. Users much prefer to search all fields by default, unless a specific field is requested. We cover three possible approaches here.

First approach:
The first approach is to create a multivalued catchall field to index the text from all fields, as we’ve done for the contents field in our book test index. Be sure to increase the position increment gap across field values, as described in section 4.7.1, to avoid incorrectly matching across two field values. You then perform all searching against the catchall field. This approach has some downsides: you can’t directly control per-field boosting, and disk space is wasted, assuming you also index each field separately.
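To see why the position increment gap matters, here is a minimal plain-Java sketch. It is not Lucene code: the class `CatchAllGapSketch` and its methods are hypothetical names made up for illustration, and the whitespace tokenization is a simplification. It only models the idea that every value of a multivalued field shares one position axis, so a phrase can straddle two values unless a gap is inserted between them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of token positions in a multivalued field (not Lucene code).
class CatchAllGapSketch {

    // Places the tokens of every value on one shared position axis and skips
    // `gap` extra positions after each value.
    static Map<Integer, String> index(int gap, String... values) {
        Map<Integer, String> positions = new LinkedHashMap<>();
        int pos = 0;
        for (String value : values) {
            for (String token : value.toLowerCase().split("\\s+")) {
                positions.put(pos++, token);
            }
            pos += gap;
        }
        return positions;
    }

    // True if `first` is directly followed by `second`, i.e. an exact two-word phrase.
    static boolean phraseMatches(Map<Integer, String> positions, String first, String second) {
        for (Map.Entry<Integer, String> entry : positions.entrySet()) {
            if (entry.getValue().equals(first)
                    && second.equals(positions.get(entry.getKey() + 1))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two values copied into the same catchall field.
        String[] values = { "Ant in Action", "Java development quide" };
        System.out.println(phraseMatches(index(0, values), "action", "java"));        // true
        System.out.println(phraseMatches(index(100, values), "action", "java"));      // false
        System.out.println(phraseMatches(index(100, values), "java", "development")); // true
    }
}
```

With a gap of 0, the phrase "action java" matches even though the two words come from different values. With a gap of 100 (an arbitrary value chosen for this sketch), the false match disappears while phrases inside a single value still match.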

Second approach:
The second approach is to use MultiFieldQueryParser, which subclasses QueryParser. Under the covers, it instantiates a QueryParser, parses the query expression for each field, then combines the resulting queries using a BooleanQuery. The default operator OR is used in the simplest parse method when adding the clauses to the BooleanQuery. For finer control, the operator can be specified for each field as required (BooleanClause.Occur.MUST), prohibited (BooleanClause.Occur.MUST_NOT), or normal (BooleanClause.Occur.SHOULD), using the constants from BooleanClause.

Listing 5.7 shows this heavier QueryParser variant in use. The testDefaultOperator() method first parses the query "development" using both the title and subject fields. The test shows that documents match based on either of those fields. The second test, testSpecifiedOperator(), sets the parsing to mandate that documents must match the expression in all specified fields and searches using the query "lucene".
- Listing 5.7 MultiFieldQueryParser, which searches on multiple fields at once

package ch5;  
  
import junit.framework.TestCase;  
  
import org.apache.lucene.analysis.SimpleAnalyzer;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.document.Field;  
import org.apache.lucene.index.IndexReader;  
import org.apache.lucene.index.IndexWriter;  
import org.apache.lucene.index.IndexWriterConfig;  
import org.apache.lucene.queryParser.MultiFieldQueryParser;  
import org.apache.lucene.search.BooleanClause;  
import org.apache.lucene.search.IndexSearcher;  
import org.apache.lucene.search.Query;  
import org.apache.lucene.search.TopDocs;  
import org.apache.lucene.store.Directory;  
import org.apache.lucene.store.RAMDirectory;  
import org.apache.lucene.util.Version;  
  
public class MultiFieldQueryParserTest extends TestCase {  
    private IndexSearcher searcher;  
    public static Version LUCENE_VERSION = Version.LUCENE_30;  
  
    @Override  
    protected void setUp() throws Exception {  
        Directory directory = new RAMDirectory();  
        IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION, new SimpleAnalyzer(LUCENE_VERSION));  
        IndexWriter writer = new IndexWriter(directory, iwConfig);  
        Document doc = new Document();  
        doc.add(new Field("title",  
                "Ant in Action",  
                Field.Store.YES, Field.Index.ANALYZED));  
        doc.add(new Field("subject", "This help you to use Ant in development.", Field.Store.YES, Field.Index.ANALYZED));  
        writer.addDocument(doc);  
        doc = new Document();  
        doc.add(new Field("title", "Java development quide",  
                Field.Store.YES, Field.Index.ANALYZED));  
        doc.add(new Field("subject", "This document help you to learn java programming",  
                Field.Store.YES, Field.Index.ANALYZED));  
        writer.addDocument(doc);  
        doc = new Document();  
        doc.add(new Field("title", "Lucene in action",  
                Field.Store.YES, Field.Index.ANALYZED));  
        doc.add(new Field("subject", "This document covers how to use IR system.",  
                Field.Store.YES, Field.Index.ANALYZED));  
        writer.addDocument(doc);  
        doc = new Document();  
        doc.add(new Field("title", "Lucene tutorial",  
                Field.Store.YES, Field.Index.ANALYZED));  
        doc.add(new Field("subject", "This document teaches you how to use lucene.",  
                Field.Store.YES, Field.Index.ANALYZED));  
        writer.addDocument(doc);  
        writer.close();  
        IndexReader reader = IndexReader.open(directory);  
        searcher = new IndexSearcher(reader);  
    }  
      
    @Override  
    protected void tearDown() throws Exception  
    {  
        searcher.close();  
    }  
      
    public void testDefaultOperator() throws Exception   
    {  
        Query query = new MultiFieldQueryParser(LUCENE_VERSION,  
                new String[] { "title", "subject" }, new SimpleAnalyzer(LUCENE_VERSION))  
                .parse("development");        
        TopDocs hits = searcher.search(query, 10);  
        assertEquals("two documents match in either field", 2, hits.totalHits);  
    }  
      
    public void testSpecifiedOperator() throws Exception {  
        Query query = MultiFieldQueryParser.parse(LUCENE_VERSION, "lucene",  
                new String[] { "title", "subject" }, new BooleanClause.Occur[] {  
                        BooleanClause.Occur.MUST, BooleanClause.Occur.MUST },  
                new SimpleAnalyzer(LUCENE_VERSION));  
        TopDocs hits = searcher.search(query, 10);  
  
        assertEquals("one and only one", 1, hits.scoreDocs.length);  
        // JUnit expects (message, expected, actual)  
        assertEquals("title check", "Lucene tutorial", searcher.doc(hits.scoreDocs[0].doc).get("title"));  
    }  
}  

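In the testDefaultOperator() test method above, the documents titled "Ant in Action" and "Java development quide" are found, because either their title or their subject contains the string "development". In the testSpecifiedOperator() test method, only the document titled "Lucene tutorial" is found, because we required that both the title and the subject contain the string "lucene".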
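The difference between the two tests comes down to how the per-field clauses are combined. The following plain-Java sketch models that combination without any Lucene dependency. The class `MultiFieldMatchSketch`, its `Occur` enum, and the crude tokenizer standing in for SimpleAnalyzer are all assumptions made for illustration, not the library's implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of how per-field clauses combine (not Lucene code).
class MultiFieldMatchSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // Builds a two-field document like the ones indexed in Listing 5.7.
    static Map<String, String> doc(String title, String subject) {
        Map<String, String> d = new HashMap<>();
        d.put("title", title);
        d.put("subject", subject);
        return d;
    }

    // Rough stand-in for SimpleAnalyzer: lower-case, then split on non-letters.
    static boolean fieldContains(Map<String, String> doc, String field, String term) {
        String text = doc.get(field);
        if (text == null) {
            return false;
        }
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.equals(term)) {
                return true;
            }
        }
        return false;
    }

    // Applies the same term to every field and combines the per-field results.
    static boolean matches(Map<String, String> doc, String term, String[] fields, Occur[] flags) {
        boolean hasMust = false;
        boolean anyShould = false;
        for (int i = 0; i < fields.length; i++) {
            boolean hit = fieldContains(doc, fields[i], term);
            switch (flags[i]) {
                case MUST:
                    if (!hit) return false;   // a required clause failed
                    hasMust = true;
                    break;
                case MUST_NOT:
                    if (hit) return false;    // a prohibited clause matched
                    break;
                default:                      // SHOULD
                    anyShould |= hit;
            }
        }
        // With no required clause, at least one optional clause must match.
        return hasMust || anyShould;
    }

    public static void main(String[] args) {
        String[] fields = { "title", "subject" };
        Occur[] should = { Occur.SHOULD, Occur.SHOULD };
        Occur[] must = { Occur.MUST, Occur.MUST };
        Map<String, String> inAction = doc("Lucene in action", "This document covers how to use IR system.");
        Map<String, String> tutorial = doc("Lucene tutorial", "This document teaches you how to use lucene.");
        System.out.println(matches(inAction, "lucene", fields, should)); // true
        System.out.println(matches(inAction, "lucene", fields, must));   // false
        System.out.println(matches(tutorial, "lucene", fields, must));   // true
    }
}
```

"Lucene in action" mentions lucene only in its title, so it passes when both clauses are SHOULD but fails when both are MUST, which is why testSpecifiedOperator() returns a single hit.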

Third approach:
The third approach for automatically querying across multiple fields is the advanced DisjunctionMaxQuery, which wraps one or more arbitrary queries, OR’ing together the documents they match. You could do this with BooleanQuery, as MultiFieldQueryParser does, but what makes DisjunctionMaxQuery interesting is how it scores each hit: when a document matches more than one query, it computes the score as the maximum score across all the queries that matched, compared to BooleanQuery, which sums the scores of all matching queries. This can produce better end-user relevance.

DisjunctionMaxQuery also includes an optional tie-breaker multiplier so that, all things being equal, a document matching more queries will receive a higher score than a document matching fewer queries. To use DisjunctionMaxQuery to query across multiple fields, you create a new field-specific Query, for each field you’d like to include, and then use DisjunctionMaxQuery’s add method to include that Query.
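The scoring difference can be illustrated with a small plain-Java sketch. It is not Lucene code: `DisMaxScoreSketch` and the per-clause scores are made up for illustration, scores are assumed to be non-negative, and real Lucene scoring involves further factors (coord, norms, boosts) that are ignored here. The sketch assumes the tie-breaker is applied as the best clause score plus the multiplier times the sum of the remaining clause scores.

```java
// Plain-Java model of sum-based versus max-based score combination (not Lucene code).
class DisMaxScoreSketch {

    // BooleanQuery-style combination: add up the scores of all matching clauses.
    static double sumScore(double... clauseScores) {
        double sum = 0.0;
        for (double s : clauseScores) {
            sum += s;
        }
        return sum;
    }

    // DisjunctionMaxQuery-style combination: the best clause score, plus
    // tieBreaker times the scores of the other matching clauses.
    static double disMaxScore(double tieBreaker, double... clauseScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            sum += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Hypothetical per-field scores: doc A matches title (0.8) and subject (0.3),
        // doc B matches only the title, but more strongly (0.9).
        double[] docA = { 0.8, 0.3 };
        double[] docB = { 0.9 };

        // Summing favors the document that matches in more fields.
        System.out.println(sumScore(docA) > sumScore(docB));                 // true
        // Taking the maximum favors the single strongest match.
        System.out.println(disMaxScore(0.0, docA) < disMaxScore(0.0, docB)); // true

        // Tie-breaker: doc C matches only the title with the same 0.8 as doc A.
        double[] docC = { 0.8 };
        System.out.println(disMaxScore(0.0, docA) == disMaxScore(0.0, docC)); // true
        System.out.println(disMaxScore(0.1, docA) > disMaxScore(0.1, docC));  // true
    }
}
```

With a tie-breaker of 0, documents A and C score the same. A small positive tie-breaker lets the document matching more fields win without letting the extra matches dominate the way a plain sum does.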

Conclusion:
Which approach makes sense for your application? The answer is “It depends,” because there are important trade-offs. The catchall field is a simple index time–only solution but results in simplistic scoring and may waste disk space by indexing the same text twice. Yet it likely yields the best searching performance. MultiFieldQueryParser produces BooleanQuerys that sum the scores of all queries that match each document, whereas DisjunctionMaxQuery takes the maximum score; both approaches, unlike the catchall field, allow proper per-field boosting. You should test all three approaches, taking into account both search performance and search relevance, to find the best.

Monday, May 13, 2013

[ InAction Note ] Ch5. Advanced search techniques - Querying on multiple fields at once
