程式扎記: [ InAction Note ] Ch6. Extending search - Extending QueryParser

Wednesday, July 16, 2014

Preface: 
In section 3.5, we introduced QueryParser and showed that it has a few settings to control its behavior, such as setting the locale for date parsing and controlling the default phrase slop. QueryParser is also extensible, allowing subclassing to override parts of the query-creation process. In this section, we demonstrate subclassing QueryParser to disallow inefficient wildcard and fuzzy queries, to customize date-range handling, and to morph phrase queries into SpanNearQuerys instead of PhraseQuerys.

Customizing QueryParser’s behavior 
Although QueryParser has some quirks, such as the interactions with an analyzer, it does have extensibility points that allow for customization. Table 6.2 details the methods designed for overriding and why you may want to do so. 
 

All of the methods listed return a Query, making it possible to construct something other than the current subclass type used by the original implementations of these methods. Also, each of these methods may throw a ParseException, allowing for error handling. 

QueryParser also has extensibility points for instantiating each query type. These differ from the points listed in table 6.2 in that they create the requested query type and return it. Overriding them is useful if you only want to change which Query class is used for each type of query without altering the logic of what query is constructed. These methods are newBooleanQuery, newTermQuery, newPhraseQuery, newMultiPhraseQuery, newPrefixQuery, newFuzzyQuery, newRangeQuery, newMatchAllDocsQuery, and newWildcardQuery. For example, if whenever a TermQuery is created by QueryParser you’d like to instantiate your own subclass of TermQuery, simply override newTermQuery.

Prohibiting fuzzy and wildcard queries 
The subclass in listing 6.7 demonstrates a custom query parser subclass that disables fuzzy and wildcard queries by taking advantage of the ParseException option. 
- Listing 6.7 Disallowing wildcard and fuzzy queries 
  package demo.ch6;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.queryparser.classic.ParseException;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.util.Version;

  /** QueryParser subclass that rejects wildcard and fuzzy queries. */
  public class CustomQueryParser extends QueryParser {
      public CustomQueryParser(Version matchVersion, String field, Analyzer analyzer) {
          super(matchVersion, field, analyzer);
      }

      // Reject any wildcard term (for example "a?t") at parse time.
      @Override
      protected final Query getWildcardQuery(String field, String termStr) throws ParseException {
          throw new ParseException("Wildcard not allowed");
      }

      // Reject any fuzzy term (for example "xunit~") at parse time.
      @Override
      protected Query getFuzzyQuery(String field, String term, float minSimilarity) throws ParseException {
          throw new ParseException("Fuzzy queries not allowed");
      }
  }
To use this custom parser and prevent users from executing wildcard and fuzzy queries, construct an instance of CustomQueryParser and use it exactly as you would QueryParser, as shown in listing 6.8.
- Listing 6.8 Using a custom QueryParser 
  // VER and analyzer are fields of the enclosing test class.
  public void testCustomQueryParser() {
      CustomQueryParser parser = new CustomQueryParser(VER, "field", analyzer);
      try {
          parser.parse("a?t");
          fail("Wildcard queries should not be allowed");
      } catch (ParseException expected) {
          // expected: the parser rejects wildcard queries
      }
      try {
          parser.parse("xunit~");
          fail("Fuzzy queries should not be allowed");
      } catch (ParseException expected) {
          // expected: the parser rejects fuzzy queries
      }
  }
With this implementation, both of these expensive query types are forbidden, giving you peace of mind in terms of performance and errors that may arise from these queries expanding into too many terms. Our next QueryParser extension enables creation of NumericRangeQuery instances.

Handling numeric field-range queries 
As you learned in chapter 2, Lucene can handily index numeric and date values. Unfortunately, QueryParser is unable to produce the corresponding NumericRangeQuery instances at search time. Fortunately, it’s simple to subclass QueryParser to do so, as shown in listing 6.9.
- Listing 6.9 Extending QueryParser to properly handle numeric fields 
  package demo.ch6;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.queryparser.classic.ParseException;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.NumericRangeQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermRangeQuery;
  import org.apache.lucene.util.BytesRef;
  import org.apache.lucene.util.Version;

  public class NumericRangeQueryParser extends QueryParser {
      public NumericRangeQueryParser(Version matchVersion, String field, Analyzer a) {
          super(matchVersion, field, a);
      }

      @Override
      public Query getRangeQuery(String field, String part1, String part2, boolean sInclusive, boolean eInclusive) throws ParseException {
          // Let QueryParser build its default textual range query first.
          Query parsed = super.getRangeQuery(field, part1, part2, sInclusive, eInclusive);
          if (!"price".equals(field) || !(parsed instanceof TermRangeQuery)) {
              return parsed;
          }
          TermRangeQuery query = (TermRangeQuery) parsed;
          try {
              // Re-read the textual bounds as doubles; a null bound is an open-ended side.
              return NumericRangeQuery.newDoubleRange("price",
                      toDouble(query.getLowerTerm()), toDouble(query.getUpperTerm()),
                      query.includesLower(), query.includesUpper());
          } catch (NumberFormatException e) {
              // Surface malformed numbers as a parse error instead of an unchecked exception.
              throw new ParseException("Invalid price bound: " + e.getMessage());
          }
      }

      private static Double toDouble(BytesRef term) {
          return term == null ? null : Double.valueOf(term.utf8ToString());
      }
  }
Using this approach, you rely on QueryParser to first create the TermRangeQuery, and from that you construct the NumericRangeQuery as needed. Testing our NumericRangeQueryParser, like this:
  public void testNumericRangeQuery() throws Exception {
      String expression = "price:[50 TO 90]";
      QueryParser parser = new NumericRangeQueryParser(VER, "subject", analyzer);
      Query query = parser.parse(expression);
      System.out.println(expression + " parsed to " + query);
  }
yields the expected output (note that the 50 and 90 have been turned into floating point values): 
price:[50 TO 90] parsed to price:[50.0 TO 90.0]
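
The conversion step in listing 6.9 boils down to re-reading each textual bound as a double. The sketch below isolates that step using only the JDK. The class name `PriceBounds` and the helper `parseBound` are hypothetical and not part of Lucene; treating a null bound as the open-ended side of a range is an assumption of this sketch.

```java
/** Hypothetical helper (not part of Lucene) for reading one bound of a numeric range. */
final class PriceBounds {
    private PriceBounds() {}

    /**
     * Returns the bound as a Double, or null when the bound is absent,
     * which is how this sketch represents an open-ended side of a range.
     */
    static Double parseBound(String text) {
        if (text == null) {
            return null; // open-ended side of the range
        }
        try {
            return Double.valueOf(text.trim());
        } catch (NumberFormatException e) {
            // Report malformed input with a descriptive message.
            throw new IllegalArgumentException("Not a numeric bound: " + text, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseBound("50"));  // integer text becomes a double
        System.out.println(parseBound(null));  // absent bound stays null
    }
}
```

This is also why the parsed query prints 50.0 and 90.0: the bounds are held as doubles once they leave the textual range query.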

As you’ve seen, extending QueryParser to handle numeric fields was straightforward. Let’s do the same for date fields next. 

Handling date ranges 
QueryParser has built-in logic to detect date ranges: if the terms are valid dates, according to DateFormat.SHORT and lenient parsing within the default or specified locale, the dates are converted to their internal textual representation. By default, this conversion will use the older DateField.dateToString method, which renders each date with millisecond precision; this is likely not what you want. If you invoke QueryParser’s setDateResolution methods to state which DateTools.Resolution your field(s) were indexed with, then QueryParser will use the newer DateTools.dateToString method to translate the dates into strings with the appropriate resolution. If either term fails to parse as a valid date, they’re both used as is for a textual range. 
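
To make the detection rule concrete, here is a JDK-only sketch of SHORT-style, lenient date parsing with the "both bounds or neither" fallback described above. It mirrors the description in this section, not Lucene's source; the class `ShortDateProbe` and its methods are hypothetical names chosen for illustration.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.Date;
import java.util.Locale;

/** Hypothetical JDK-only illustration of SHORT-style, lenient date detection. */
final class ShortDateProbe {
    private ShortDateProbe() {}

    /** Parses text as a SHORT-format date in the given locale, or returns null if it is not a date. */
    static LocalDate tryParse(String text, Locale locale) {
        DateFormat format = DateFormat.getDateInstance(DateFormat.SHORT, locale);
        format.setLenient(true); // out-of-range fields roll over instead of failing
        try {
            Date date = format.parse(text);
            // The format parsed in the JVM default zone, so convert back in the same zone.
            return date.toInstant().atZone(ZoneId.systemDefault()).toLocalDate();
        } catch (ParseException e) {
            return null; // not a date: the caller keeps the raw text
        }
    }

    /** True only when both bounds are dates; otherwise the range stays textual. */
    static boolean isDateRange(String lower, String upper, Locale locale) {
        return tryParse(lower, locale) != null && tryParse(upper, locale) != null;
    }

    public static void main(String[] args) {
        System.out.println(tryParse("01/01/2010", Locale.US));
        System.out.println(tryParse("02/30/2010", Locale.US)); // lenient: rolls into March
        System.out.println(isDateRange("01/01/2010", "apple", Locale.US));
    }
}
```

Lenient parsing is forgiving by design: an impossible date such as 02/30/2010 is accepted and rolled forward rather than rejected, so a range bound can silently mean a different day than the user typed.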

But despite these two built-in approaches for handling dates, QueryParser’s date handling hasn’t been updated to handle date fields indexed as NumericField, which is the recommended approach for dates, as described in section 2.6.2. Let’s see how we can once again override getRangeQuery, this time to translate our date-based range searches into the corresponding NumericRangeQuery, shown in listing 6.10.
- Listing 6.10 Extending QueryParser to handle date fields 
  package demo.ch6;

  import java.text.DateFormat;
  import java.text.SimpleDateFormat;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.document.DateTools.Resolution;
  import org.apache.lucene.queryparser.classic.ParseException;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.NumericRangeQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermRangeQuery;
  import org.apache.lucene.util.Version;

  public class NumericDateRangeQueryParser extends QueryParser {
      public NumericDateRangeQueryParser(Version matchVersion, String field, Analyzer a) {
          super(matchVersion, field, a);
      }

      @Override
      public Query getRangeQuery(String field, String part1, String part2, boolean sInc, boolean eInc) throws ParseException {
          // QueryParser detects the dates and rewrites them as DateTools strings.
          TermRangeQuery query = (TermRangeQuery) super.getRangeQuery(field, part1, part2, sInc, eInc);
          if ("pubmonth".equals(field)) {
              // Matches MONTH-resolution terms such as "201006". SimpleDateFormat parses in the
              // JVM default time zone, whereas DateTools renders its terms in GMT.
              DateFormat dateformat = new SimpleDateFormat("yyyyMM");
              Resolution res = this.getDateResolution("pubmonth");
              // Debug output: the resolution and the textual bounds produced by QueryParser.
              System.out.printf("\t[Test] Resolution=%s\n", res);
              System.out.printf("\t[Test] Lower Term: %s\n", query.getLowerTerm().utf8ToString());
              System.out.printf("\t[Test] Upper Term: %s\n", query.getUpperTerm().utf8ToString());
              try {
                  // Convert each bound to epoch milliseconds for the numeric range.
                  return NumericRangeQuery.newLongRange("pubmonth",
                          dateformat.parse(query.getLowerTerm().utf8ToString()).getTime(),
                          dateformat.parse(query.getUpperTerm().utf8ToString()).getTime(),
                          query.includesLower(), query.includesUpper());
              } catch (Exception e) {
                  throw new ParseException("Invalid pubmonth range: " + e.getMessage());
              }
          } else {
              return query;
          }
      }
  }
In this case it’s still helpful to use QueryParser’s built-in logic for detecting and parsing dates. You simply build on that logic in your subclass by taking the further step to convert the query into a NumericRangeQuery. Note that in order to use this subclass you must call QueryParser.setDateResolution, so that the resulting text terms are created with DateTools, as shown in listing 6.11. 
  public void testDateRangeQuery() throws Exception {
      String expression = "pubmonth:[01/01/2010 TO 06/01/2010]";
      QueryParser parser = new NumericDateRangeQueryParser(VER, "subject", analyzer);
      parser.setDateResolution("pubmonth", DateTools.Resolution.MONTH);
      parser.setLocale(Locale.US);
      Query query = parser.parse(expression);
      System.out.println(expression + " parsed to " + query);
      TopDocs matches = searcher.search(query, 10);
      assertTrue("expecting at least one result !", matches.totalHits > 0);
  }
This test produces the following output: 
pubmonth:[01/01/2010 TO 06/01/2010] parsed to pubmonth:[1259596800000 TO 1275321600000]
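
The two numbers in that output are epoch milliseconds. As a quick sanity check, the following JDK-only sketch decodes them in UTC and at a fixed UTC+8 offset. The class and method names are hypothetical, and the offset is an assumption: it was chosen because both values land exactly on local midnight there, which suggests, but does not prove, that the test ran on a JVM whose default time zone is eight hours ahead of UTC.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

/** Hypothetical helper for reading the epoch-millisecond bounds printed by the test. */
final class EpochDecoder {
    private EpochDecoder() {}

    /** Converts epoch milliseconds to a wall-clock time at the given fixed UTC offset (in hours). */
    static LocalDateTime decode(long epochMillis, int offsetHours) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.ofHours(offsetHours));
    }

    public static void main(String[] args) {
        long lower = 1259596800000L; // lower bound from the output above
        long upper = 1275321600000L; // upper bound from the output above
        System.out.println(decode(lower, 0) + " UTC, " + decode(lower, 8) + " at UTC+8");
        System.out.println(decode(upper, 0) + " UTC, " + decode(upper, 8) + " at UTC+8");
    }
}
```

At UTC+8 the bounds decode to midnight on 1 December 2009 and midnight on 1 June 2010, so the lower bound falls in December 2009 even though the query asked for 01/01/2010. A plausible explanation is that DateTools renders terms in GMT while SimpleDateFormat parses them in the JVM default time zone; setting the formatter's time zone to GMT would be one way to make the two agree.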


CONTROLLING THE DATE-PARSING LOCALE 
To change the locale used for date parsing, construct a QueryParser instance and call setLocale(). Typically the client’s locale would be determined and used instead of the default locale. For example, in a web application the HttpServletRequest object contains the locale set by the client browser. You can use this locale to control the locale used by date parsing in QueryParser, as shown in listing 6.12.
- Listing 6.12 Using the client locale in a web application 
  public class SearchServletFragment extends HttpServlet {
    // searcher is assumed to be an IndexSearcher field initialized elsewhere.
    protected void doGet(HttpServletRequest request,
                         HttpServletResponse response)
        throws ServletException, IOException {
      QueryParser parser = new NumericDateRangeQueryParser(
                               Version.LUCENE_30,
                               "contents",
                               new StandardAnalyzer(Version.LUCENE_30));
      parser.setLocale(request.getLocale());  // parse dates in the browser's locale
      parser.setDateResolution(DateTools.Resolution.DAY);
      Query query;
      try {
        query = parser.parse(request.getParameter("q"));
      } catch (ParseException e) {
        // Never search with a null query; report the bad input instead.
        response.sendError(HttpServletResponse.SC_BAD_REQUEST, e.getMessage());
        return;
      }
      TopDocs docs = searcher.search(query, 10);
    }
  }
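
To see why the locale matters for date ranges, the JDK-only sketch below parses the same text with the SHORT date style of two locales. The class `LocaleDateDemo` and its method are hypothetical names; the sketch only exercises `java.text.DateFormat`, not QueryParser itself.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.time.LocalDate;
import java.time.ZoneId;
import java.util.Locale;

/** Hypothetical JDK-only demo of locale-dependent SHORT date parsing. */
final class LocaleDateDemo {
    private LocaleDateDemo() {}

    /** Parses text with the SHORT date style of the given locale. */
    static LocalDate parseShort(String text, Locale locale) {
        DateFormat format = DateFormat.getDateInstance(DateFormat.SHORT, locale);
        try {
            // Parse and convert back in the same (JVM default) time zone.
            return format.parse(text).toInstant().atZone(ZoneId.systemDefault()).toLocalDate();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a date in " + locale + ": " + text, e);
        }
    }

    public static void main(String[] args) {
        String text = "01/06/2010";
        System.out.println("US: " + parseShort(text, Locale.US)); // month first
        System.out.println("UK: " + parseShort(text, Locale.UK)); // day first
    }
}
```

The same characters name 6 January under the US convention and 1 June under the UK one, so a parser fixed to the server's default locale could search the wrong range for a visitor from elsewhere.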
QueryParser’s setLocale is one way in which Lucene facilitates internationalization (often abbreviated as I18N) concerns. Text analysis is another, more important, place where such concerns are handled. Further I18N issues are discussed in section 4.8. Our final QueryParser customization shows how to replace the default PhraseQuery with SpanNearQuery.

Allowing ordered phrase queries 
When QueryParser parses a single term, or terms within double quotes, it delegates the construction of the Query to a getFieldQuery method. Parsing an unquoted term calls the getFieldQuery method without the slop signature (slop makes sense only on a multiterm phrase query); parsing a quoted phrase calls the getFieldQuery signature with the slop factor, which internally delegates to the nonslop signature to build the query and then sets the slop appropriately. The Query returned is either a TermQuery or a PhraseQuery, by default, depending on whether one or more tokens are returned from the analyzer. Given enough slop, PhraseQuery will match terms out of order in the original text. There’s no way to force a PhraseQuery to match in order (except with slop of 0 or 1). However, SpanNearQuery does allow in-order matching. A straightforward override of getFieldQuery allows us to replace a PhraseQuery with an ordered SpanNearQuery, shown in listing 6.13.
- Listing 6.13 Translating PhraseQuery to SpanNearQuery 
  protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
      Query orig = super.getFieldQuery(field, queryText, slop); // 1)
      if (!(orig instanceof PhraseQuery)) {
          return orig;  // 2)
      }
      PhraseQuery pq = (PhraseQuery) orig;
      Term[] terms = pq.getTerms();  // 3)
      SpanTermQuery[] clauses = new SpanTermQuery[terms.length];
      for (int i = 0; i < terms.length; i++) {
          clauses[i] = new SpanTermQuery(terms[i]);
      }
      SpanNearQuery query = new SpanNearQuery(clauses, slop, true); // 4)
      return query;
  }
1) We delegate to QueryParser's implementation for analysis and determination of query type.
2) We convert only PhraseQuery instances and return anything else right away.
3) We pull all terms from the original PhraseQuery.
4) We create a SpanNearQuery with all the terms from the original PhraseQuery.

Our test case shows that our custom getFieldQuery is effective in creating a SpanNearQuery:
  public void testPhraseQuery() throws Exception {
      CQPWithSpanQuery parser = new CQPWithSpanQuery(VER, "contents", analyzer);
      Query query = parser.parse("singleTerm");
      assertTrue("TermQuery", query instanceof TermQuery);
      query = parser.parse("\"phrase test\"");
      System.out.printf("\t[Test] PhraseQuery: %s\n", query.getClass().getName());
      assertTrue("SpanNearQuery", query instanceof SpanNearQuery);
  }
Another possible enhancement would be to add a toggle switch to the custom query parser, allowing the in-order flag to be controlled by the user of the API. 

Supplement: 
Ch5. Advanced search techniques - Span queries (1)
