In section 3.5, we introduced QueryParser and showed that it has a few settings to control its behavior, such as setting the locale for date parsing and controlling the default phrase slop. QueryParser is also extensible, allowing subclassing to override parts of the query-creation process. In this section, we demonstrate subclassing QueryParser to disallow inefficient wildcard and fuzzy queries, to customize numeric and date-range handling, and to morph phrase queries into SpanNearQuerys instead of PhraseQuerys.
Customizing QueryParser’s behavior
Although QueryParser has some quirks, such as the interactions with an analyzer, it does have extensibility points that allow for customization. Table 6.2 details the methods designed for overriding and why you may want to do so.
All of the methods listed return a Query, making it possible to construct something other than the current subclass type used by the original implementations of these methods. Also, each of these methods may throw a ParseException, allowing for error handling.
QueryParser also has extensibility points for instantiating each query type. These differ from the points listed in table 6.2 in that they create the requested query type and return it. Overriding them is useful if you only want to change which Query class is used for each type of query without altering the logic of what query is constructed. These methods are newBooleanQuery, newTermQuery, newPhraseQuery, newMultiPhraseQuery, newPrefixQuery, newFuzzyQuery, newRangeQuery, newMatchAllDocsQuery, and newWildcardQuery. For example, if you’d like QueryParser to instantiate your own subclass of TermQuery whenever it creates a TermQuery, simply override newTermQuery.
Prohibiting fuzzy and wildcard queries
Listing 6.7 shows a custom query parser subclass that disables fuzzy and wildcard queries by throwing a ParseException from the methods that would otherwise build them.
- Listing 6.7 Disallowing wildcard and fuzzy queries
- package demo.ch6;
- import org.apache.lucene.analysis.Analyzer;
- import org.apache.lucene.queryparser.classic.ParseException;
- import org.apache.lucene.queryparser.classic.QueryParser;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.util.Version;
- public class CustomQueryParser extends QueryParser {
- public CustomQueryParser(Version matchVersion, String field, Analyzer analyzer) {
- super(matchVersion, field, analyzer);
- }
- @Override
- protected final Query getWildcardQuery(String field, String termStr) throws ParseException {
- throw new ParseException("Wildcard not allowed");
- }
- @Override
- protected Query getFuzzyQuery(String field, String term, float minSimilarity) throws ParseException {
- throw new ParseException("Fuzzy queries not allowed");
- }
- }
- Listing 6.8 Using a custom QueryParser
- public void testCustomQueryParser() {
- CustomQueryParser parser = new CustomQueryParser(VER, "field", analyzer);
- try {
- parser.parse("a?t");
- fail("Wildcard queries should not be allowed");
- } catch (ParseException expected) {
- }
- try {
- parser.parse("xunit~");
- fail("Fuzzy queries should not be allowed");
- } catch (ParseException expected) {
- }
- }
Handling numeric field-range queries
As you learned in chapter 2, Lucene can handily index numeric and date values. Unfortunately, QueryParser is unable to produce the corresponding NumericRangeQuery instances at search time. Fortunately, it’s simple to subclass QueryParser to do so, as shown in listing 6.9.
- Listing 6.9 Extending QueryParser to properly handle numeric fields
- package demo.ch6;
- import org.apache.lucene.analysis.Analyzer;
- import org.apache.lucene.queryparser.classic.ParseException;
- import org.apache.lucene.queryparser.classic.QueryParser;
- import org.apache.lucene.search.NumericRangeQuery;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.TermRangeQuery;
- import org.apache.lucene.util.Version;
- public class NumericRangeQueryParser extends QueryParser {
- public NumericRangeQueryParser(Version matchVersion, String field, Analyzer a) {
- super(matchVersion, field, a);
- }
- @Override
- public Query getRangeQuery(String field, String part1, String part2, boolean sInclusive, boolean eInclusive) throws ParseException {
- TermRangeQuery query = (TermRangeQuery) super.getRangeQuery(field, part1, part2, sInclusive, eInclusive);
- if ("price".equals(field))
- {
- return NumericRangeQuery.newDoubleRange("price",
- Double.parseDouble(query.getLowerTerm().utf8ToString()),
- Double.parseDouble(query.getUpperTerm().utf8ToString()),
- query.includesLower(), query.includesUpper());
- }
- else
- {
- return query;
- }
- }
- }
- public void testNumericRangeQuery() throws Exception {
- String expression = "price:[50 TO 90]";
- QueryParser parser = new NumericRangeQueryParser(VER, "subject", analyzer);
- Query query = parser.parse(expression);
- System.out.println(expression + " parsed to " + query);
- }
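To see why the subclass hands price ranges to a numeric query instead of leaving them as a text range, the following JDK-only sketch compares the two kinds of ordering. It is an illustration under assumptions: the price values are made up, the class and method names are hypothetical, and plain Java strings stand in for indexed terms, so it doesn't reproduce how Lucene actually encodes numeric fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical demo class: compares textual and numeric range filtering. */
final class TextVersusNumericRange {

    /** Keeps values v with lower <= v <= upper under String (lexicographic) ordering. */
    static List<String> textualRange(List<String> values, String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (String v : values) {
            if (v.compareTo(lower) >= 0 && v.compareTo(upper) <= 0) {
                hits.add(v);
            }
        }
        return hits;
    }

    /** Keeps values v with lower <= v <= upper after parsing each value as a double. */
    static List<String> numericRange(List<String> values, double lower, double upper) {
        List<String> hits = new ArrayList<>();
        for (String v : values) {
            double d = Double.parseDouble(v);
            if (d >= lower && d <= upper) {
                hits.add(v);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up prices, purely for illustration.
        List<String> prices = Arrays.asList("6", "55", "75.5", "500", "100");
        System.out.println("textual [50 TO 90]: " + textualRange(prices, "50", "90"));
        System.out.println("numeric [50 TO 90]: " + numericRange(prices, 50, 90));
    }
}
```

With string ordering, "6" and "500" fall inside [50 TO 90] because only the leading characters decide the comparison; the numeric comparison keeps just 55 and 75.5.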
As you’ve seen, extending QueryParser to handle numeric fields was straightforward. Let’s do the same for date fields next.
Handling date ranges
QueryParser has built-in logic to detect date ranges: if the terms are valid dates, according to DateFormat.SHORT and lenient parsing within the default or specified locale, the dates are converted to their internal textual representation. By default, this conversion will use the older DateField.dateToString method, which renders each date with millisecond precision; this is likely not what you want. If you invoke QueryParser’s setDateResolution methods to state which DateTools.Resolution your field(s) were indexed with, then QueryParser will use the newer DateTools.dateToString method to translate the dates into strings with the appropriate resolution. If either term fails to parse as a valid date, they’re both used as is for a textual range.
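The detection step described above can be sketched with the JDK alone. This is not QueryParser’s actual code: the class and method names are hypothetical, the yyyyMMdd output is only a stand-in for a day-resolution string, and the time zone is pinned to UTC (an assumption) so that the result doesn't depend on the machine running it.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical JDK-only stand-in for the date detection described in the text. */
final class ShortDateRangeSketch {

    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    /**
     * Returns both endpoints as yyyyMMdd strings if both parse as SHORT-style
     * dates in the given locale; otherwise returns both endpoints unchanged.
     */
    static String[] toRangeTerms(String part1, String part2, Locale locale) {
        DateFormat shortFormat = DateFormat.getDateInstance(DateFormat.SHORT, locale);
        shortFormat.setLenient(true);
        shortFormat.setTimeZone(UTC); // assumption: fixed zone for reproducible output

        // Assumed day-resolution text form; a real index may use another resolution.
        SimpleDateFormat dayResolution = new SimpleDateFormat("yyyyMMdd", Locale.ROOT);
        dayResolution.setTimeZone(UTC);

        try {
            Date lower = shortFormat.parse(part1);
            Date upper = shortFormat.parse(part2);
            return new String[] { dayResolution.format(lower), dayResolution.format(upper) };
        } catch (ParseException e) {
            // At least one endpoint is not a date: keep both for a textual range.
            return new String[] { part1, part2 };
        }
    }

    public static void main(String[] args) {
        String[] dates = toRangeTerms("01/01/2010", "06/01/2010", Locale.US);
        System.out.println(dates[0] + " TO " + dates[1]);
        String[] mixed = toRangeTerms("01/01/2010", "banana", Locale.US);
        System.out.println(mixed[0] + " TO " + mixed[1]);
    }
}
```

Under Locale.US the first call yields 20100101 and 20100601, while the second call returns its inputs untouched because one endpoint isn't a date.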
But despite these two built-in approaches for handling dates, QueryParser’s date handling hasn’t been updated to handle date fields indexed as NumericField, which is the recommended approach for dates, as described in section 2.6.2. Let’s see how we can once again override getRangeQuery, this time to translate our date-based range searches into the corresponding NumericRangeQuery, shown in listing 6.10.
- Listing 6.10 Extending QueryParser to handle date fields
- package demo.ch6;
- import java.text.DateFormat;
- import java.text.SimpleDateFormat;
- import org.apache.lucene.analysis.Analyzer;
- import org.apache.lucene.document.DateTools.Resolution;
- import org.apache.lucene.queryparser.classic.ParseException;
- import org.apache.lucene.queryparser.classic.QueryParser;
- import org.apache.lucene.search.NumericRangeQuery;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.TermRangeQuery;
- import org.apache.lucene.util.Version;
- public class NumericDateRangeQueryParser extends QueryParser {
- public NumericDateRangeQueryParser(Version matchVersion, String field, Analyzer a) {
- super(matchVersion, field, a);
- }
- @Override
- public Query getRangeQuery(String field, String part1, String part2, boolean sInc, boolean eInc) throws ParseException {
- TermRangeQuery query = (TermRangeQuery) super.getRangeQuery(field, part1, part2, sInc, eInc);
- if ("pubmonth".equals(field))
- {
- DateFormat dateformat = new SimpleDateFormat("yyyyMM");
- Resolution res = this.getDateResolution("pubmonth");
- System.out.printf("\t[Test] Resolution=%s\n", res);
- System.out.printf("\t[Test] Lower Term: %s\n", query.getLowerTerm().utf8ToString());
- System.out.printf("\t[Test] Upper Term: %s\n", query.getUpperTerm().utf8ToString());
- try
- {
- return NumericRangeQuery.newLongRange("pubmonth",
- dateformat.parse(query.getLowerTerm().utf8ToString()).getTime(),
- dateformat.parse(query.getUpperTerm().utf8ToString()).getTime(),
- query.includesLower(), query.includesUpper());
- }
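- catch (Exception e)
- {
- // Report the failure as a query-parsing error, keeping the underlying reason
- throw new ParseException("Invalid date range for field pubmonth: " + e.getMessage());
- }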
- } else {
- return query;
- }
- }
- }
- public void testDateRangeQuery() throws Exception {
- String expression = "pubmonth:[01/01/2010 TO 06/01/2010]";
- QueryParser parser = new NumericDateRangeQueryParser(VER, "subject", analyzer);
- parser.setDateResolution("pubmonth", DateTools.Resolution.MONTH);
- parser.setLocale(Locale.US);
- Query query = parser.parse(expression);
- System.out.println(expression + " parsed to " + query);
- TopDocs matches = searcher.search(query, 10);
- assertTrue("expecting at least one result !", matches.totalHits > 0);
- }
CONTROLLING THE DATE-PARSING LOCALE
To change the locale used for date parsing, construct a QueryParser instance and call setLocale(). Typically the client’s locale would be determined and used instead of the default locale. For example, in a web application the HttpServletRequest object contains the locale set by the client browser. You can use this locale to control the locale used by date parsing in QueryParser, as shown in listing 6.12.
- Listing 6.12 Using the client locale in a web application
- public class SearchServletFragment extends HttpServlet {
- protected void doGet(HttpServletRequest request,
- HttpServletResponse response)
- throws ServletException, IOException {
- QueryParser parser = new NumericDateRangeQueryParser(
- Version.LUCENE_30,
- "contents",
- new StandardAnalyzer(Version.LUCENE_30));
- parser.setLocale(request.getLocale());
- parser.setDateResolution(DateTools.Resolution.DAY);
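- Query query;
- try {
- query = parser.parse(request.getParameter("q"));
- } catch (ParseException e) {
- // Report the bad query to the client instead of searching with a null Query
- response.sendError(HttpServletResponse.SC_BAD_REQUEST, e.getMessage());
- return;
- }
- TopDocs docs = searcher.search(query, 10);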
- }
- }
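The effect of the locale on date parsing can be seen without a servlet container. The sketch below uses only the JDK; the class and method names are hypothetical, and the time zone is pinned to UTC as an assumption to keep the result stable. It parses the same text under two locales and reports which month each one reads.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.util.Calendar;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical demo class: same date text, different locales. */
final class LocaleDateSketch {

    /** Parses text as a SHORT-style date in the locale and returns the zero-based month. */
    static int monthOf(String text, Locale locale) {
        TimeZone utc = TimeZone.getTimeZone("UTC"); // assumption: fixed zone
        DateFormat format = DateFormat.getDateInstance(DateFormat.SHORT, locale);
        format.setTimeZone(utc);
        Calendar calendar = Calendar.getInstance(utc, Locale.ROOT);
        try {
            calendar.setTime(format.parse(text));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a date in " + locale + ": " + text, e);
        }
        return calendar.get(Calendar.MONTH);
    }

    public static void main(String[] args) {
        String text = "01/06/2010";
        System.out.println("US month index: " + monthOf(text, Locale.US));
        System.out.println("UK month index: " + monthOf(text, Locale.UK));
    }
}
```

Locale.US reads the text as month/day (January), whereas Locale.UK reads it as day/month (June), so the same range expression can select different documents depending on the locale passed to setLocale().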
Allowing ordered phrase queries
When QueryParser parses a single term, or terms within double quotes, it delegates the construction of the Query to a getFieldQuery method. Parsing an unquoted term calls the getFieldQuery signature without slop (slop makes sense only for a multiterm phrase query); parsing a quoted phrase calls the getFieldQuery signature with the slop factor, which internally delegates to the nonslop signature to build the query and then sets the slop appropriately. By default, the Query returned is either a TermQuery or a PhraseQuery, depending on whether one or more tokens are returned from the analyzer. Given enough slop, PhraseQuery will match terms out of order in the original text. There’s no way to force a PhraseQuery to match in order (except with slop of 0 or 1). However, SpanNearQuery does allow in-order matching. A straightforward override of getFieldQuery allows us to replace a PhraseQuery with an ordered SpanNearQuery, shown in listing 6.13.
- Listing 6.13 Translating PhraseQuery to SpanNearQuery
- @Override
- protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
- // Let QueryParser analyze the text and build its default query
- Query orig = super.getFieldQuery(field, queryText, slop);
- if (!(orig instanceof PhraseQuery)) {
- // Only phrase queries are rewritten; anything else is returned unchanged
- return orig;
- }
- PhraseQuery pq = (PhraseQuery) orig;
- // Pull the terms out of the phrase, in order
- Term[] terms = pq.getTerms();
- SpanTermQuery[] clauses = new SpanTermQuery[terms.length];
- for (int i = 0; i < terms.length; i++) {
- clauses[i] = new SpanTermQuery(terms[i]);
- }
- // Build an in-order SpanNearQuery with the same slop
- SpanNearQuery query = new SpanNearQuery(clauses, slop, true);
- return query;
- }
Our test case shows that our custom getFieldQuery is effective in creating a SpanNearQuery:
- public void testPhraseQuery() throws Exception {
- CQPWithSpanQuery parser = new CQPWithSpanQuery(VER, "contents", analyzer);
- Query query = parser.parse("singleTerm");
- assertTrue("TermQuery", query instanceof TermQuery);
- query = parser.parse("\"phrase test\"");
- System.out.printf("\t[Test] PhraseQuery: %s\n", query.getClass().getName());
- assertTrue("SpanNearQuery", query instanceof SpanNearQuery);
- }
Supplement:
* Ch5. Advanced search techniques - Span queries (1)