程式扎記: [ InAction Note ] Ch5. Advanced search techniques - Span queries (2)


Monday, May 13, 2013



Spans near one another:
PhraseQuery (see section 3.4.6) matches documents that have terms near one another, with a slop factor to allow for intermediate or reversed terms. SpanNearQuery operates similarly to PhraseQuery, with some important differences. SpanNearQuery matches spans that are within a certain number of positions of one another, with a separate flag indicating whether the spans must be in the order specified or can be reversed. Each resulting match runs from the start position of the first span to the end position of the last span. An example of a SpanNearQuery given three SpanTermQuery objects is shown in figure 5.3.


Using SpanTermQuery objects as the SpanQuerys in a SpanNearQuery is much like using a PhraseQuery. The SpanNearQuery slop factor is a bit less confusing than the PhraseQuery slop factor because it doesn’t require at least two additional positions to account for a reversed span. To reverse a SpanNearQuery, set the inOrder flag (third argument to the constructor) to false. Listing 5.10 demonstrates a few variations of SpanNearQuery and shows it in relation to PhraseQuery.
- Listing 5.10 Finding matches near one another using SpanNearQuery
    public void testSpanNearQuery() throws Exception {
        // (1)
        SpanQuery[] quick_brown_dog = new SpanQuery[] { quick, brown, dog };
        SpanNearQuery snq = new SpanNearQuery(quick_brown_dog, 0, true);
        assertNoMatches(snq);
        dumpSpans(snq);

        // (2)
        snq = new SpanNearQuery(quick_brown_dog, 4, true);
        assertNoMatches(snq);
        dumpSpans(snq);

        // (3)
        snq = new SpanNearQuery(quick_brown_dog, 5, true);
        assertOnlyBrownFox(snq);
        dumpSpans(snq);

        // (4)
        // interesting - even a sloppy phrase query would require
        // more slop to match
        snq = new SpanNearQuery(new SpanQuery[] { lazy, fox }, 3, false);
        assertOnlyBrownFox(snq);
        dumpSpans(snq);

        // (5)
        PhraseQuery pq = new PhraseQuery();
        pq.add(new Term("f", "lazy"));
        pq.add(new Term("f", "fox"));
        pq.setSlop(4);
        assertNoMatches(pq);

        // (6)
        pq.setSlop(5);
        assertOnlyBrownFox(pq);
    }
(1) Querying for these three terms in successive positions doesn’t match either document.
(2) Using the same terms with a slop of 4 positions still doesn’t result in a match.
(3) With a slop of 5, the SpanNearQuery has a match.
(4) The nested SpanTermQuery objects are in reverse order, so the inOrder flag is set to false. A slop of only 3 is needed for a match.
(5) Here we use a comparable PhraseQuery, although a slop of 4 still doesn’t match.
(6) A slop of 5 is needed for a PhraseQuery to match.
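The slop values in the listing can be checked by hand. The sketch below is a simplified plain-Java model of the arithmetic, not Lucene code: the class and method names are hypothetical, each term is assumed to occur exactly once, and the positions assume the indexed text is "the quick brown fox jumps over the lazy dog" with one zero-based position per word (quick=1, brown=2, fox=3, lazy=7, dog=8).

```java
// Simplified model of slop arithmetic for single-occurrence terms.
// Hypothetical helper for illustration only; this is NOT how Lucene is implemented.
final class SlopModel {

    /** In-order span-near slop: sum of the gaps between consecutive term spans, or -1 if out of order. */
    static int orderedSpanSlop(int... positions) {
        int slop = 0;
        for (int i = 1; i < positions.length; i++) {
            // A term at position p is modeled as the half-open span [p, p + 1).
            int gap = positions[i] - (positions[i - 1] + 1);
            if (gap < 0) {
                return -1; // the spans are not in the requested order
            }
            slop += gap;
        }
        return slop;
    }

    /** Any-order span-near slop: width of the covering span minus the positions the terms occupy. */
    static int unorderedSpanSlop(int... positions) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int p : positions) {
            min = Math.min(min, p);
            max = Math.max(max, p);
        }
        return (max + 1 - min) - positions.length;
    }

    /** Sloppy-phrase slop: shift each position back by its index in the phrase, then take max - min. */
    static int phraseSlop(int... positions) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < positions.length; i++) {
            int shifted = positions[i] - i;
            min = Math.min(min, shifted);
            max = Math.max(max, shifted);
        }
        return max - min;
    }
}
```

Under these assumptions, `SlopModel.orderedSpanSlop(1, 2, 8)` gives 5 for quick, brown, dog, `SlopModel.unorderedSpanSlop(7, 3)` gives 3 for lazy, fox, and `SlopModel.phraseSlop(7, 3)` gives 5 for the phrase "lazy fox", which lines up with cases (3), (4), and (6). Two adjacent but reversed terms, as in `phraseSlop(2, 1)`, already cost 2 in the phrase model, which is the extra cost the span version avoids.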

We’ve only shown SpanNearQuery with nested SpanTermQuerys, but SpanNearQuery allows for any SpanQuery type. A more sophisticated SpanNearQuery example is demonstrated later in listing 5.11 in conjunction with SpanOrQuery. Next we visit SpanNotQuery.

Excluding span overlap from matches:
The SpanNotQuery excludes matches where one SpanQuery overlaps another. The following code demonstrates:
    public void testSpanNotQuery() throws Exception {
        SpanNearQuery quick_fox = new SpanNearQuery(new SpanQuery[] { quick, fox }, 1, true);
        assertBothFoxes(quick_fox);
        dumpSpans(quick_fox);

        SpanNotQuery quick_fox_dog = new SpanNotQuery(quick_fox, dog);
        assertBothFoxes(quick_fox_dog);
        dumpSpans(quick_fox_dog);

        SpanNotQuery no_quick_red_fox = new SpanNotQuery(quick_fox, red);
        assertOnlyBrownFox(no_quick_red_fox);
        dumpSpans(no_quick_red_fox);
    }
The first argument to the SpanNotQuery constructor is a span to include, and the second argument is a span to exclude. Below is the output:
spanNear([f:quick, f:fox], 1, true):
the <quick brown fox> jumps over the lazy dog (0.18579213)
the <quick red fox> jumps over the sleepy cat (0.18579213)

spanNot(spanNear([f:quick, f:fox], 1, true), f:dog):
the <quick brown fox> jumps over the lazy dog (0.18579213)
the <quick red fox> jumps over the sleepy cat (0.18579213)

spanNot(spanNear([f:quick, f:fox], 1, true), f:red):
the <quick brown fox> jumps over the lazy dog (0.18579213)
The SpanNearQuery matched both documents because both have quick and fox within one position of each other. The first SpanNotQuery, quick_fox_dog, continues to match both documents because there’s no overlap between the quick_fox span and dog. The second SpanNotQuery, no_quick_red_fox, excludes the second document because red overlaps with the quick_fox span. Notice that the resulting span matches are the original included spans. The excluded span is only used to determine whether there’s an overlap and doesn’t factor into the resulting span matches.
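The overlap rule can be sketched in plain Java. This is a simplified model and not Lucene's implementation: the class and method names are hypothetical, spans are modeled as half-open `{start, end}` pairs, and the positions assume one zero-based position per word, so that the quick_fox span is `{1, 4}` in both documents, dog sits at `{8, 9}` in the first, and red at `{2, 3}` in the second.

```java
// Simplified model of the include/exclude overlap check.
// Hypothetical helper for illustration only; this is NOT Lucene code.
final class SpanNotModel {

    /** Two half-open spans {start, end} overlap when each one starts before the other ends. */
    static boolean overlaps(int[] a, int[] b) {
        return a[0] < b[1] && b[0] < a[1];
    }

    /**
     * Keeps every include span that overlaps no exclude span.
     * Exclude spans only act as a filter and never appear in the result.
     */
    static java.util.List<int[]> spanNot(int[][] include, int[][] exclude) {
        java.util.List<int[]> kept = new java.util.ArrayList<>();
        for (int[] inc : include) {
            boolean clash = false;
            for (int[] exc : exclude) {
                if (overlaps(inc, exc)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                kept.add(inc);
            }
        }
        return kept;
    }
}
```

With these assumed positions, excluding dog from the first document keeps the `{1, 4}` span because `{8, 9}` lies outside it, while excluding red from the second document drops the span because `{2, 3}` falls inside `{1, 4}`.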

SpanOrQuery:
Finally, let’s talk about SpanOrQuery, which aggregates an array of SpanQuerys. Our example query, in English, is all documents that have “quick fox” near “lazy dog” or that have “quick fox” near “sleepy cat.” The first clause of this query is shown in figure 5.4. This single clause is a SpanNearQuery nesting two SpanNearQuerys, each consisting of two SpanTermQuerys.


Our test case becomes a bit lengthier due to all the sub-SpanQuerys being built on:
- Listing 5.11 Taking the union of two span queries using SpanOrQuery
    public void testSpanOrQuery() throws Exception {
        SpanNearQuery quick_fox = new SpanNearQuery(new SpanQuery[] { quick, fox }, 1, true);
        SpanNearQuery lazy_dog = new SpanNearQuery(new SpanQuery[] { lazy, dog }, 0, true);
        SpanNearQuery sleepy_cat = new SpanNearQuery(new SpanQuery[] { sleepy, cat }, 0, true);

        SpanNearQuery qf_near_ld = new SpanNearQuery(new SpanQuery[] { quick_fox, lazy_dog }, 3, true);
        assertOnlyBrownFox(qf_near_ld);
        dumpSpans(qf_near_ld);

        SpanNearQuery qf_near_sc = new SpanNearQuery(new SpanQuery[] { quick_fox, sleepy_cat }, 3, true);
        dumpSpans(qf_near_sc);

        SpanOrQuery or = new SpanOrQuery(new SpanQuery[] { qf_near_ld, qf_near_sc });
        assertBothFoxes(or);
        dumpSpans(or);
    }
Here’s the output, followed by our analysis of it:
spanNear([spanNear([f:quick, f:fox], 1, true), spanNear([f:lazy, f:dog], 0, true)], 3, true):
the <quick brown fox jumps over the lazy dog> (0.3321948)

spanNear([spanNear([f:quick, f:fox], 1, true), spanNear([f:sleepy, f:cat], 0, true)], 3, true):
the <quick red fox jumps over the sleepy cat> (0.3321948)

spanOr([spanNear([spanNear([f:quick, f:fox], 1, true), spanNear([f:lazy, f:dog], 0, true)], 3, true), spanNear([spanNear([f:quick, f:fox], 1, true), spanNear([f:sleepy, f:cat], 0, true)], 3, true)]):
the <quick brown fox jumps over the lazy dog> (0.5405281)
the <quick red fox jumps over the sleepy cat> (0.5405281)

Two SpanNearQuerys are created to match “quick fox” near “lazy dog” (qf_near_ld) and “quick fox” near “sleepy cat” (qf_near_sc) using nested SpanNearQuerys made up of SpanTermQuerys at the lowest level. Finally, these two SpanNearQuery instances are combined within a SpanOrQuery, which aggregates all matching spans.
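To see how the nested clauses combine within a single document, here is a simplified plain-Java model; it is not Lucene code. The names `SpanOrModel`, `nearOrdered`, and `spanOr` are hypothetical, a span is a half-open `{start, end}` pair, `null` stands for "no match in this document", and the positions assume one zero-based position per word (quick=1, fox=3, lazy or sleepy=7, dog or cat=8).

```java
// Simplified per-document model of nested ordered span-near plus span-or.
// Hypothetical helper for illustration only; this is NOT Lucene code.
final class SpanOrModel {

    /**
     * In-order near of two spans: matches when b starts at or after the end of a
     * and the gap is at most slop. The match covers a's start through b's end.
     */
    static int[] nearOrdered(int[] a, int[] b, int slop) {
        if (a == null || b == null) {
            return null; // a missing sub-span means the clause cannot match
        }
        int gap = b[0] - a[1];
        return (gap >= 0 && gap <= slop) ? new int[] { a[0], b[1] } : null;
    }

    /** Union of clause results: every clause that produced a span contributes it. */
    static java.util.List<int[]> spanOr(int[]... clauseMatches) {
        java.util.List<int[]> all = new java.util.ArrayList<>();
        for (int[] match : clauseMatches) {
            if (match != null) {
                all.add(match);
            }
        }
        return all;
    }
}
```

For the brown-fox document under these assumptions, `nearOrdered(new int[] {1, 2}, new int[] {3, 4}, 1)` yields `{1, 4}` for quick_fox, lazy_dog is `{7, 9}`, and nesting them with a slop of 3 yields `{1, 9}`, the span from quick through dog. The sleepy_cat clause is `null` there, so the union holds just that one span; the red-fox document behaves the same way with the clauses swapped.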

SpanQuery and QueryParser:
QueryParser doesn’t currently support any of the SpanQuery types, but the surround QueryParser in Lucene’s contrib modules does. We cover the surround parser in section 9.6.

Recall from section 3.4.6 that PhraseQuery is impartial to term order when enough slop is specified. Interestingly, you can easily extend QueryParser to use a SpanNearQuery with SpanTermQuery clauses instead, and force phrase queries to only match fields with the terms in the same order as specified. We demonstrate this technique in section 6.3.5.

Supplement:
Ch5. Advanced search techniques - Span queries (1)
- Building block of spanning, SpanTermQuery
- Finding spans at the beginning of a field

Ch5. Advanced search techniques - Span queries (2)
- Spans near one another - SpanNearQuery
- Excluding span overlap from matches - SpanNotQuery
- Aggregates an array of SpanQuery - SpanOrQuery

This message was edited 20 times. Last update was at 14/05/2013 10:26:19
