Some applications need to maintain separate Lucene indexes, yet want to allow a single search to return combined results from all the indexes. Sometimes, such separation is done for convenience or administrative reasons—for example, if different people or groups maintain the index for different collections of documents. Other times it may be done due to high volume. For example, a news site may make a new index for every month and then choose which months to search over.
Whatever the reason, Lucene provides two useful classes for searching across multiple indexes. We’ll first meet MultiSearcher, which uses a single thread to perform searching across multiple indexes. Then we’ll see ParallelMultiSearcher, which uses multiple threads to gain concurrency.
MultiSearcher searches all the indexes and merges the results in a specified order (descending score by default). Using MultiSearcher is comparable to using IndexSearcher, except that you hand it an array of IndexSearchers to search rather than a single directory. It’s effectively an application of the decorator pattern: MultiSearcher exposes the same search interface and delegates most of the work to the subsearchers.
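To make the delegate-then-merge idea concrete, here is a minimal sketch in plain Java. It is not Lucene code: the names `MergingSearcherSketch`, `SubSearcher`, `MergingSearcher`, and `Hit` are hypothetical stand-ins, and the sketch assumes each subsearcher returns its own scored hits for a query. It only shows the shape of the technique: one thread visits each subsearcher in turn, then the combined hits are sorted by descending score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Stdlib-only illustration; none of these types are Lucene classes.
final class MergingSearcherSketch {

    // A scored hit, tagged with the name of the index that produced it.
    static final class Hit {
        final String index;
        final String doc;
        final double score;

        Hit(String index, String doc, double score) {
            this.index = index;
            this.doc = doc;
            this.score = score;
        }
    }

    // Hypothetical stand-in for a searcher over a single index.
    interface SubSearcher {
        List<Hit> search(String query, int topN);
    }

    // Decorator: same search method as a SubSearcher, but backed by several of them.
    static final class MergingSearcher implements SubSearcher {
        private final List<SubSearcher> subSearchers;

        MergingSearcher(SubSearcher... subSearchers) {
            this.subSearchers = Arrays.asList(subSearchers);
        }

        @Override
        public List<Hit> search(String query, int topN) {
            List<Hit> merged = new ArrayList<>();
            // Single thread: ask each subsearcher in turn for its own top hits.
            for (SubSearcher sub : subSearchers) {
                merged.addAll(sub.search(query, topN));
            }
            // Default ordering assumed here: descending score.
            merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
            // Keep only the overall top N.
            return new ArrayList<>(merged.subList(0, Math.min(topN, merged.size())));
        }
    }

    public static void main(String[] args) {
        // Canned results stand in for real per-index searches.
        SubSearcher first = (q, n) -> Arrays.asList(
                new Hit("index1", "doc-a", 0.9), new Hit("index1", "doc-b", 0.2));
        SubSearcher second = (q, n) -> Arrays.asList(
                new Hit("index2", "doc-c", 0.5));

        for (Hit h : new MergingSearcher(first, second).search("any query", 10)) {
            System.out.println(h.index + " " + h.doc + " " + h.score);
        }
    }
}
```

Because `MergingSearcher` implements the same interface as the searchers it wraps, calling code does not need to know whether it is talking to one index or several, which is the point of the decorator arrangement.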
The listing below illustrates how to search two indexes that are split alphabetically by keyword. The indexed data is made up of animal names beginning with each letter of the alphabet. Half the names are in one index, and half are in the other. The search uses a range that spans both indexes, demonstrating that the results are merged together.
- Listing 5.17 Securing the search space with a filter
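As a rough, library-free illustration of the scenario just described, the following plain-Java sketch splits a handful of animal names across two sorted sets and runs one inclusive range over both. This is not Lucene code: a `TreeSet` stands in for each index, and the class name `SplitIndexRangeSketch`, the split point at the letter n, and the sample names are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Stdlib-only illustration; a TreeSet stands in for each index's sorted terms.
final class SplitIndexRangeSketch {

    // Routes each name to one of two "indexes": a-m in the first, n-z in the second.
    static List<TreeSet<String>> buildIndexes(String... names) {
        TreeSet<String> aToM = new TreeSet<>();
        TreeSet<String> nToZ = new TreeSet<>();
        for (String name : names) {
            (name.charAt(0) < 'n' ? aToM : nToZ).add(name);
        }
        List<TreeSet<String>> indexes = new ArrayList<>();
        indexes.add(aToM);
        indexes.add(nToZ);
        return indexes;
    }

    // Runs the same inclusive range [lower, upper] against every index
    // and concatenates the matches in index order.
    static List<String> rangeSearch(List<TreeSet<String>> indexes, String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (TreeSet<String> index : indexes) {
            hits.addAll(index.subSet(lower, true, upper, true));
        }
        return hits;
    }

    public static void main(String[] args) {
        List<TreeSet<String>> indexes = buildIndexes(
                "aardvark", "horse", "koala", "nightingale", "tarantula", "zebra");
        // The range h..t reaches into both sets. "tarantula" sorts after "t",
        // so an inclusive upper bound of "t" does not match it.
        System.out.println(rangeSearch(indexes, "h", "t"));
    }
}
```

The matches come from both sets even though the caller issued a single range, which is the behavior the paragraph above attributes to a search over the two split indexes.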
Multithreaded searching using ParallelMultiSearcher:
ParallelMultiSearcher is a multithreaded version of MultiSearcher. When the search method is invoked, it spawns a new thread for each Searchable and waits for them all to finish. The basic search and search with filter options are parallelized, but searching with a Collector hasn’t yet been parallelized. The exposed API is the same as MultiSearcher, so it’s a simple drop-in replacement.
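The fan-out-and-wait behavior can be sketched in plain Java as follows. This is an illustration of the general technique, not ParallelMultiSearcher’s implementation: it uses an `ExecutorService` with one pool thread per subsearch instead of spawning raw threads, and the names `ParallelMergeSketch`, `Hit`, and `searchAll` are hypothetical. Each subsearch is assumed to be an independent task that returns its own scored hits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Stdlib-only illustration; none of these types are Lucene classes.
final class ParallelMergeSketch {

    static final class Hit {
        final String doc;
        final double score;

        Hit(String doc, double score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Runs every subsearch on its own pool thread, waits for all of them,
    // then merges the hits by descending score.
    static List<Hit> searchAll(List<Callable<List<Hit>>> subSearches) {
        List<Hit> merged = new ArrayList<>();
        if (subSearches.isEmpty()) {
            return merged; // a fixed pool cannot be created with zero threads
        }
        ExecutorService pool = Executors.newFixedThreadPool(subSearches.size());
        try {
            // invokeAll blocks until every task has completed.
            for (Future<List<Hit>> result : pool.invokeAll(subSearches)) {
                merged.addAll(result.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while searching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a subsearch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        merged.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return merged;
    }

    public static void main(String[] args) {
        // Canned results stand in for real per-index searches.
        List<Callable<List<Hit>>> subSearches = new ArrayList<>();
        subSearches.add(() -> Arrays.asList(new Hit("doc-a", 0.3)));
        subSearches.add(() -> Arrays.asList(new Hit("doc-b", 0.8), new Hit("doc-c", 0.1)));

        for (Hit h : searchAll(subSearches)) {
            System.out.println(h.doc + " " + h.score);
        }
    }
}
```

The merge step is the same as in the single-threaded case; only the way the subsearches are driven changes, which is why the two approaches can share an API.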
Whether you’ll see performance gains using ParallelMultiSearcher depends on your architecture. If the indexes reside on different physical disks and your computer has CPU concurrency, you should see improved performance. But there hasn’t been much real-world testing to back this up, so be sure to test it for your application.
A cousin to ParallelMultiSearcher lives in Lucene’s contrib/remote directory, enabling you to remotely search multiple indexes in parallel. We’ll talk about term vectors next, a topic you’ve already seen on the indexing side in chapter 2.