Searching with Lucene is a surprisingly simple affair. You first create an instance of IndexSearcher, which opens the search index, and then use the search methods on that class to perform all searching. The returned TopDocs class represents the top results, and you use that to present results to the user. Next we discuss how to handle pagination, and finally we show how to use Lucene’s new (as of version 2.9) near-realtime search capability for fast turnaround on recently indexed documents. Let’s begin with the creation of an IndexSearcher.
Creating an IndexSearcher:
The classes involved are shown in figure 3.2:
First, as with indexing, we’ll need a directory. Most often you’re searching an index in the file system:
Note that it’s IndexReader that does all the heavy lifting to open all index files and expose a low-level reader API, while IndexSearcher is a rather thin veneer. Because it’s costly to open an IndexReader, it’s best to reuse a single instance for all of your searches, and open a new one only when necessary.
IndexReader always searches a point-in-time snapshot of the index as it existed when the IndexReader was created. If you need to search changes to the index, you’ll have to open a new reader. Fortunately, the IndexReader.reopen method is a resource-efficient means of obtaining a new IndexReader that covers all changes to the index but shares resources with the current reader when possible. Use it like this:
Once you have an IndexSearcher, simply call one of its search methods to perform a search. Under the hood, the search method does a tremendous amount of work, very quickly. It visits every single document that’s a candidate for matching the search, only accepting the ones that pass every constraint on the query. Finally, it gathers the top results and returns them to you.
The main search methods available to an IndexSearcher instance are shown in table 3.3. Here we only make use of the search(Query, int) method because many applications won’t need to use the more advanced methods. The other search method signatures, including the filtering and sorting variants, are covered in chapter 5.Chapter 6 covers the customizable search methods that accept a Collector for gathering results.
Working with TopDocs:
Now that we’ve called search, we have a TopDocs object at our disposal that we can use for efficient access to the search results. Results are ordered by relevance—in other words, by how well each document matches the query (sorting results in other ways is discussed in section 5.2).
The TopDocs class exposes a small number of methods and attributes for retrieving the search results; they’re listed in table 3.4. The attribute TopDocs.totalHits returns the number of matching documents. The matches, by default, are sorted in decreasing score order. The TopDocs.scoreDocs attribute is an array containing the requested number of top matches. Each ScoreDoc instance has a float score, which is the relevance score, and an int doc, which is the document ID that can be used to retrieve the stored fields for that document by calling IndexSearcher.document(doc). Finally, TopDocs.getMaxScore() returns the best score across all matches; when you sort by relevance (the default), that will always be the score of the first result. But if you sort by other criteria and enable scoring for the search, as described in section 5.2, it will be the maximum score of all matching documents even when the best scoring document isn’t in the top results by your sort criteria.
Paging through results:
Presenting search results to end users most often involves displaying only the first 10 to 20 most relevant documents. Paging through ScoreDocs is a common requirement, although if you find users are frequently doing a lot of paging you should revisit your design: ideally the user almost always finds the result on the first page. That said, pagination is still typically needed. You can choose from a couple of implementation approaches:
* Keep search result:
* Don't keep search result:
Requerying is most often the better solution. Requerying eliminates the need to store per-user state, which in a web application can be costly, especially with a large number of users. Requerying at first glance seems a waste, but Lucene’s blazing speed more than compensates. Also, thanks to the I/O caching in modern operating systems, requerying will typically be fast because the necessary bits from disk will already be cached in RAM. Frequently users don’t click past the first page of results anyway.
One of the new features in Lucene’s 2.9 release is near-real-time search, which enables you to rapidly search changes made to the index with an open IndexWriter, without having to first close or commit changes to that writer. Many applications make ongoing changes with an always open IndexWriter and require that subsequent searches quickly reflect these changes. If that IndexWriter instance is in the same JVM that’s doing searching, you can use near-real-time search, as shown in listing 3.3.
This capability is referred to as near-real-time search, and not simply real-time search, because it’s not possible to make strict guarantees about the turnaround time, in the same sense as a "hard" real-time OS is able to do. Lucene’s near-real-time search is more like a "soft" real-time OS. For example, if Java decides to run a major garbage collection cycle, or if a large segment merge has just completed, or if your machine is struggling because there’s not enough RAM, the turnaround time of the near-real-time reader can be much longer. But in practice the turnaround time can be very fast (tens of milliseconds or less), depending on your indexing and searching throughput, and how frequently you obtain a new near-real-time reader.
In the past, without this feature, you’d have to call commit on the writer, and then reopen on your reader, but this can be time consuming since commit must sync all new files in the index, an operation that’s often costly on certain operating systems and file systems because it usually means the underlying I/O device must physically write all buffered bytes to stable storage. Near-real-time search enables you to search segments that are newly created but not yet committed. Section 11.1.3gives some tips for further reducing the index-to-search turnaround time.
- Listing 3.3 Near-real-time search