程式扎記: [ InAction Note ] Ch5. Advanced search techniques - Span queries (1)

Monday, May 13, 2013

Preface:
Lucene includes a whole family of queries based on SpanQuery, loosely mirroring the normal Lucene Query classes. A span in this context is a starting and ending token position in a field. Recall from section 4.2.1 that tokens emitted during the analysis process include a position increment relative to the previous token. This position information, in conjunction with the new SpanQuery subclasses, allows for even more query discrimination and sophistication, such as finding all documents where “President Obama” is near “health care reform.”
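To make the idea of token positions concrete, here is a minimal plain-Java sketch (no Lucene classes) that turns per-token position increments into absolute positions. The class name PositionSketch and the starting value of -1 are assumptions of this sketch, chosen so that a first token with increment 1 lands at position 0.

```java
import java.util.Arrays;

/** Plain-Java illustration only; PositionSketch is a hypothetical name, not a Lucene class. */
final class PositionSketch {

    // Each token carries an increment relative to the previous token.
    // Starting at -1 (an assumption of this sketch) puts a first token
    // with increment 1 at position 0.
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        // "the quick brown fox": every token advances by one position.
        System.out.println(Arrays.toString(positions(new int[] {1, 1, 1, 1}))); // [0, 1, 2, 3]
        // An increment of 2 leaves a gap, for example where a token was dropped.
        System.out.println(Arrays.toString(positions(new int[] {1, 2, 1})));    // [0, 2, 3]
    }
}
```

A span is then just a pair of such positions: where the match starts and where it ends.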

Using the query types we’ve discussed thus far, it isn’t possible to formulate such a position-aware query. You could get close with something like "president obama" AND "health care reform", but these phrases may be too distant from one another within the document to be relevant for our searching purposes. In typical applications, SpanQuerys are used to provide richer, more expressive position-aware functionality than PhraseQuery. They’re also commonly used in conjunction with payloads, covered in section 6.5, to enable access to the payloads created during indexing.
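The difference between "both phrases occur somewhere" and "the phrases occur near each other" can be sketched in plain Java. This is not how Lucene implements it; ProximitySketch, phraseStarts, near, and the maxGap parameter are hypothetical names for illustration, and the sketch only considers the first phrase appearing before the second.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration only; none of these names come from Lucene. */
final class ProximitySketch {

    // Positions at which the phrase occurs as consecutive tokens.
    static List<Integer> phraseStarts(List<String> tokens, List<String> phrase) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                starts.add(i);
            }
        }
        return starts;
    }

    // True if some occurrence of `first` is followed by an occurrence of
    // `second` with at most maxGap tokens in between.
    static boolean near(List<String> tokens, List<String> first,
                        List<String> second, int maxGap) {
        for (int a : phraseStarts(tokens, first)) {
            int endOfFirst = a + first.size(); // exclusive end position
            for (int b : phraseStarts(tokens, second)) {
                int gap = b - endOfFirst;
                if (gap >= 0 && gap <= maxGap) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "president obama signed the health care reform bill".split(" "));
        List<String> first = Arrays.asList("president", "obama");
        List<String> second = Arrays.asList("health", "care", "reform");
        System.out.println(near(doc, first, second, 2)); // true: two tokens in between
        System.out.println(near(doc, first, second, 1)); // false: gap is too large
    }
}
```

A plain AND of the two phrases would accept the document in both cases; only the position-aware check can reject it when the phrases are too far apart.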

While searching, span queries track more than the documents that match: the individual spans, perhaps more than one per field, are also tracked. In contrast to TermQuery, which simply matches documents, SpanTermQuery matches exactly the same documents but also keeps track of the position of every term occurrence that matches. Generally this is more compute-intensive. For example, when TermQuery finds a document containing its term, it records that document as a match and immediately moves on, whereas SpanTermQuery must enumerate all the occurrences of that term within the document.
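The cost difference described above can be illustrated with a small plain-Java sketch. It models a document as a list of tokens and does not use Lucene; TermVersusSpanSketch, matchesDocument, and termSpans are hypothetical names. Spans are recorded as {start, end} with an exclusive end, matching the way the dumpSpans listing later in this note compares i + 1 against spans.end().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration only; not Lucene's implementation. */
final class TermVersusSpanSketch {

    // Document-level matching: stop at the first occurrence.
    static boolean matchesDocument(List<String> tokens, String term) {
        for (String token : tokens) {
            if (token.equals(term)) {
                return true;
            }
        }
        return false;
    }

    // Span-level matching: visit every token and record each occurrence
    // as {start, end}, where end is exclusive.
    static List<int[]> termSpans(List<String> tokens, String term) {
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(term)) {
                spans.add(new int[] {i, i + 1});
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "the quick brown fox jumps over the lazy dog".split(" "));
        System.out.println(matchesDocument(doc, "the")); // true
        for (int[] span : termSpans(doc, "the")) {
            System.out.println(Arrays.toString(span));   // [0, 1] then [6, 7]
        }
    }
}
```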

There are six subclasses of the base SpanQuery, shown in table 5.1. We’ll discuss these SpanQuery types with a simple example, shown in listing 5.8: we’ll index two documents, one with the phrase “the quick brown fox jumps over the lazy dog” and the other with the similar phrase “the quick red fox jumps over the sleepy cat.” We’ll create a separate SpanTermQuery for each of the terms in these documents, as well as three helper assert methods. Finally, we’ll create the different types of span queries to illustrate their functions.


- Listing 5.8 SpanQuery demonstration infrastructure
    package ch5;

    import junit.framework.TestCase;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class SpanQueryTest extends TestCase {
        private IndexSearcher searcher;
        private SimpleAnalyzer analyzer = null;
        public static Version LUCENE_VERSION = Version.LUCENE_30;

        // One SpanTermQuery per term of interest, shared by the tests.
        private SpanTermQuery quick;
        private SpanTermQuery brown;
        private SpanTermQuery red;
        private SpanTermQuery fox;
        private SpanTermQuery lazy;
        private SpanTermQuery sleepy;
        private SpanTermQuery dog;
        private SpanTermQuery cat;

        @Override
        protected void setUp() throws Exception {
            // Index two one-field documents into an in-memory directory.
            Directory directory = new RAMDirectory();
            analyzer = new SimpleAnalyzer(LUCENE_VERSION);
            IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION, analyzer);
            IndexWriter writer = new IndexWriter(directory, iwConfig);

            Document doc = new Document();
            doc.add(new Field("f", "the quick brown fox jumps over the lazy dog",
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            doc = new Document();
            doc.add(new Field("f", "the quick red fox jumps over the sleepy cat",
                    Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            IndexReader reader = IndexReader.open(directory);
            searcher = new IndexSearcher(reader);

            quick = new SpanTermQuery(new Term("f", "quick"));
            brown = new SpanTermQuery(new Term("f", "brown"));
            red = new SpanTermQuery(new Term("f", "red"));
            fox = new SpanTermQuery(new Term("f", "fox"));
            lazy = new SpanTermQuery(new Term("f", "lazy"));
            sleepy = new SpanTermQuery(new Term("f", "sleepy"));
            dog = new SpanTermQuery(new Term("f", "dog"));
            cat = new SpanTermQuery(new Term("f", "cat"));
        }

        @Override
        protected void tearDown() throws Exception {
            searcher.close();
        }

        // Expects exactly one hit: document 0, the "brown fox" document.
        private void assertOnlyBrownFox(Query query) throws Exception {
            TopDocs hits = searcher.search(query, 10);
            assertEquals(1, hits.totalHits);
            assertEquals("wrong doc", 0, hits.scoreDocs[0].doc);
        }

        // Expects both documents to match.
        private void assertBothFoxes(Query query) throws Exception {
            TopDocs hits = searcher.search(query, 10);
            assertEquals(2, hits.totalHits);
        }

        // Expects no document to match.
        private void assertNoMatches(Query query) throws Exception {
            TopDocs hits = searcher.search(query, 10);
            assertEquals(0, hits.totalHits);
        }
    }

Building block of spanning, SpanTermQuery:
Span queries need an initial leverage point, and SpanTermQuery is just that. Internally, a SpanQuery keeps track of its matches: a series of start/end positions for each matching document. By itself, a SpanTermQuery matches documents just like TermQuery does, but it also keeps track of the position of every occurrence of the term within each document. Generally you’d never use this query by itself (you’d use TermQuery instead); you use it only as input to the other SpanQuery classes. Figure 5.1 illustrates the SpanTermQuery matches for the code below:
    public void testSpanTermQuery() throws Exception {
        assertOnlyBrownFox(brown);
        dumpSpans(brown);
    }


The brown SpanTermQuery was created in setUp() because it will be used in other tests that follow. We developed a method, dumpSpans, to visualize spans. The dumpSpans method uses lower-level SpanQuery APIs to navigate the spans; this lower-level API probably isn’t of much interest to you other than for diagnostic purposes, so we don’t elaborate further. Each SpanQuery subclass sports a useful toString() for diagnostic purposes, which dumpSpans uses, as seen in listing 5.9.
- Listing 5.9 dumpSpans method, used to see all spans matched by any SpanQuery
    // Needs these imports in addition to those in listing 5.8:
    //   java.io.IOException, java.io.StringReader,
    //   org.apache.lucene.analysis.TokenStream,
    //   org.apache.lucene.analysis.tokenattributes.TermAttribute,
    //   org.apache.lucene.search.ScoreDoc,
    //   org.apache.lucene.search.spans.SpanQuery,
    //   org.apache.lucene.search.spans.Spans
    private void dumpSpans(SpanQuery query) throws IOException {
        Spans spans = query.getSpans(searcher.getIndexReader());
        System.out.println(query + ":");
        int numSpans = 0;

        // Look up the score of each matching document (the index holds two docs).
        TopDocs hits = searcher.search(query, 10);
        float[] scores = new float[2];
        for (ScoreDoc sd : hits.scoreDocs) {
            scores[sd.doc] = sd.score;
        }

        // Visit every span and print its document with the span in <...>.
        while (spans.next()) {
            numSpans++;
            int id = spans.doc();
            Document doc = searcher.getIndexReader().document(id);

            // Re-analyze the stored text to recover the token at each position.
            TokenStream stream = analyzer.tokenStream("f",
                    new StringReader(doc.get("f")));
            TermAttribute term = stream.addAttribute(TermAttribute.class);
            StringBuilder buffer = new StringBuilder();
            buffer.append("   ");
            int i = 0;
            while (stream.incrementToken()) {
                if (i == spans.start()) {
                    buffer.append("<");
                }
                buffer.append(term.term());
                if (i + 1 == spans.end()) {   // end position is exclusive
                    buffer.append(">");
                }
                buffer.append(" ");
                i++;
            }
            buffer.append("(").append(scores[id]).append(") ");
            System.out.println(buffer);
        }
        if (numSpans == 0) {
            System.out.println("   No spans");
        }
        System.out.println();
    }
The output of dumpSpans(brown) is:
f:brown:
   the quick <brown> fox jumps over the lazy dog (0.22097087)
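The bracket placement in that output follows directly from a span's start and exclusive end positions. The following plain-Java sketch isolates just that step, without Lucene; SpanMarkerSketch and mark are hypothetical names, and unlike dumpSpans it joins tokens with single spaces and prints no score.

```java
/** Plain-Java illustration only; mirrors the bracket logic of dumpSpans. */
final class SpanMarkerSketch {

    // Wraps tokens[start] .. tokens[end - 1] in angle brackets.
    static String mark(String[] tokens, int start, int end) {
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                buffer.append(' ');
            }
            if (i == start) {
                buffer.append('<');
            }
            buffer.append(tokens[i]);
            if (i + 1 == end) { // end is one past the last token of the span
                buffer.append('>');
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String[] tokens = "the quick brown fox".split(" ");
        System.out.println(mark(tokens, 2, 3)); // the quick <brown> fox
        System.out.println(mark(tokens, 1, 3)); // the <quick brown> fox
    }
}
```

A single-term span always covers one position, so start and end differ by exactly one; spans produced by the composite span queries can cover several tokens, as in the second call.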

More interesting is the dumpSpans output from a SpanTermQuery for "the":
f:the:
   <the> quick brown fox jumps over the lazy dog (0.18579213)
   the quick brown fox jumps over <the> lazy dog (0.18579213)
   <the> quick red fox jumps over the sleepy cat (0.18579213)
   the quick red fox jumps over <the> sleepy cat (0.18579213)

Not only were both documents matched, but each document also had two span matches, highlighted by the brackets. The basic SpanTermQuery is used as a building block of the other SpanQuery types. Let’s see how to match only documents where the terms of interest occur at the beginning of the field.

Finding spans at the beginning of a field:
To query for spans that occur within a specified number of positions from the start of a field, use SpanFirstQuery. Figure 5.2 illustrates the SpanFirstQuery case below:
    public void testSpanFirstQuery() throws Exception {
        SpanFirstQuery sfq = new SpanFirstQuery(brown, 2);
        assertNoMatches(sfq);
        dumpSpans(sfq);
        sfq = new SpanFirstQuery(brown, 3);
        dumpSpans(sfq);
        assertOnlyBrownFox(sfq);
    }


No matches are found in the first query because the range of 2 is too short to find brown, but the range of 3 is just long enough to cause a match in the second query (see figure 5.2). Any SpanQuery can be used within a SpanFirstQuery, with matches for spans that have an ending position in the first specified number (2 and 3 in this case) of positions. The output of testSpanFirstQuery() is:
spanFirst(f:brown, 2):
   No spans

spanFirst(f:brown, 3):
   the quick <brown> fox jumps over the lazy dog (0.22097087)
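Why 2 fails and 3 succeeds comes down to the span's end position. As a minimal sketch in plain Java (not Lucene's code; SpanFirstSketch and spanFirst are hypothetical names), the filter keeps only spans whose exclusive end is no greater than the limit:

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration only; spans are {start, end} with end exclusive. */
final class SpanFirstSketch {

    // Keep only spans that end within the first maxEnd positions.
    static List<int[]> spanFirst(List<int[]> spans, int maxEnd) {
        List<int[]> kept = new ArrayList<>();
        for (int[] span : spans) {
            if (span[1] <= maxEnd) {
                kept.add(span);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "brown" is the third token of "the quick brown fox ...":
        // start position 2, exclusive end position 3.
        List<int[]> brown = new ArrayList<>();
        brown.add(new int[] {2, 3});
        System.out.println(spanFirst(brown, 2).size()); // 0: the span ends past the limit
        System.out.println(spanFirst(brown, 3).size()); // 1: the span just fits
    }
}
```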

Supplement:
Ch5. Advanced search techniques - Span queries (1)
- Building block of spanning, SpanTermQuery
- Finding spans at the beginning of a field

Ch5. Advanced search techniques - Span queries (2)
- Spans near one another - SpanNearQuery
- Excluding span overlap from matches - SpanNotQuery
- Aggregates an array of SpanQuery - SpanOrQuery

This message was edited 17 times. Last update was at 14/05/2013 10:26:48
