In our book data, several fields were indexed to separately hold the title, category, author, subject, and so forth. But when searching a user would typically like to search across all fields at once. You could require users to spell out each field name, but except for specialized cases, that’s requiring far too much work on your users’ part. Users much prefer to search all fields, by default, unless a specific field is requested. We cover three possible approaches here.
The first approach is to create a multivalued catchall field to index the text from all fields, as we’ve done for the contents field in our book test index. Be sure to increase the position increment gap across field values, as described in section 4.7.1, to avoid incorrectly matching across two field values. You then perform all searching against the catchall field. This approach has some downsides: you can’t directly control per-field boosting 1 , and disk space is wasted, assuming you also index each field separately.
The second approach is to use MultiFieldQueryParser, which subclasses QueryParser. Under the covers, it instantiates a QueryParser, parses the query expression for each field, then combines the resulting queries using a BooleanQuery. The default operator OR is used in the simplest parse method when adding the clauses to theBooleanQuery. For finer control, the operator can be specified for each field as required (BooleanClause.Occur.MUST), prohibited (BooleanClause.Occur.MUST_NOT), or normal (BooleanClause.Occur.SHOULD), using the constants from BooleanClause.
Listing 5.7 shows this heavier QueryParser variant in use. The testDefaultOperator() method first parses the query "development" using both the title and subject fields. The test shows that documents match based on either of those fields. The second test, testSpecifiedOperator(), sets the parsing to mandate that documents must match the expression in all specified fields and searches using the query "lucene".
- Listing 5.7 MultiFieldQueryParser, which searches on multiple fields at once
The third approach for automatically querying across multiple fields is the advanced DisjunctionMaxQuery, which wraps one or more arbitrary queries, OR’ing together the documents they match. You could do this with BooleanQuery, as MultiFieldQueryParser does, but what makes DisjunctionMaxQuery interesting is how it scores each hit: when a document matches more than one query, it computes the score as the maximum score across all the queries that matched, compared to BooleanQuery, which sums the scores of all matching queries. This can produce better end-user relevance.
DisjunctionMaxQuery also includes an optional tie-breaker multiplier so that, all things being equal, a document matching more queries will receive a higher score than a document matching fewer queries. To use DisjunctionMaxQuery to query across multiple fields, you create a new field-specific Query, for each field you’d like to include, and then use DisjunctionMaxQuery’s add method to include that Query.
Which approach makes sense for your application? The answer is “It depends,” because there are important trade-offs. The catchall field is a simple index time–only solution but results in simplistic scoring and may waste disk space by indexing the same text twice. Yet it likely yields the best searching performance.MultiFieldQueryParser produces BooleanQuerys that sum the scores (whereas DisjunctionMaxQuery takes the maximum score) for all queries that match each document, then properly implements per-field boosting. You should test all three approaches, taking into account both search performance and search relevance, to find the best.