程式扎記: [ InAction Note ] Ch5. Advanced search techniques - Searching across multiple Lucene indexes


Tuesday, June 25, 2013


Preface: 
Some applications need to maintain separate Lucene indexes, yet want to allow a single search to return combined results from all the indexes. Sometimes, such separation is done for convenience or administrative reasons—for example, if different people or groups maintain the index for different collections of documents. Other times it may be done due to high volume. For example, a news site may make a new index for every month and then choose which months to search over. 

Whatever the reason, Lucene provides two useful classes for searching across multiple indexes. We’ll first meet MultiSearcher, which uses a single thread to perform searching across multiple indexes. Then we’ll see ParallelMultiSearcher, which uses multiple threads to gain concurrency. 

Using MultiSearcher: 
With MultiSearcher, all indexes can be searched with the results merged in a specified (or descending-score, by default) order. Using MultiSearcher is comparable to using IndexSearcher, except that you hand it an array of IndexSearchers to search rather than a single directory (so it’s effectively a decorator pattern and delegates most of the work to the subsearchers). 
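The merge step can be pictured with a small Lucene-free sketch. Everything in it (the `MergeSketch` and `Hit` names, the `maxDocs` parameter) is hypothetical and only illustrates the idea: each sub-index returns its own hits, local document ids are shifted into one combined id space, and the combined list is sorted by descending score. It is not Lucene's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Conceptual sketch only: these are hypothetical types, not Lucene classes.
class MergeSketch {

    /** One hit: a document id and its relevance score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /**
     * Merges per-index hit lists into one list of at most n hits,
     * ordered by descending score (ties broken by ascending doc id).
     * maxDocs[i] is the number of documents in index i; it is used to
     * shift that index's local doc ids into one combined id space.
     */
    static List<Hit> merge(List<List<Hit>> perIndexHits, int[] maxDocs, int n) {
        List<Hit> merged = new ArrayList<Hit>();
        int docBase = 0;
        for (int i = 0; i < perIndexHits.size(); i++) {
            for (Hit hit : perIndexHits.get(i)) {
                merged.add(new Hit(docBase + hit.doc, hit.score));
            }
            docBase += maxDocs[i];
        }
        Collections.sort(merged, new Comparator<Hit>() {
            @Override
            public int compare(Hit a, Hit b) {
                int byScore = Float.compare(b.score, a.score);
                return byScore != 0 ? byScore : Integer.compare(a.doc, b.doc);
            }
        });
        return new ArrayList<Hit>(merged.subList(0, Math.min(n, merged.size())));
    }

    public static void main(String[] args) {
        List<List<Hit>> perIndex = new ArrayList<List<Hit>>();
        perIndex.add(Arrays.asList(new Hit(0, 0.9f), new Hit(2, 0.4f)));
        perIndex.add(Arrays.asList(new Hit(1, 0.7f)));
        for (Hit hit : merge(perIndex, new int[] {3, 5}, 2)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

In this sketch the second index's local document 1 becomes document 4 in the combined space, because the first index is assumed to hold 3 documents.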

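The listing below shows how to search two indexes that are split alphabetically by keyword. The documents are animal names, one for each letter of the alphabet: names starting with a through m go into one index, and names starting with n through z go into the other. A search is then performed with a range that spans both indexes, demonstrating that the results are merged together. Note that the code wraps the two IndexReaders in a MultiReader and hands that to a single IndexSearcher rather than constructing a MultiSearcher; MultiSearcher was deprecated in Lucene 3.1 in favor of this approach.
- Listing 5.17 Searching across two indexes with a single query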
    package ch5;

    import java.io.File;

    import junit.framework.TestCase;

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermRangeQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MultiSearcherTest extends TestCase {
        public static final Version LUCENE_VERSION = Version.LUCENE_30;

        private IndexReader[] readers;

        public void setUp() throws Exception {
            // One animal name per letter of the alphabet.
            String[] animals = { "aardvark", "beaver", "coati",
                                 "dog", "elephant", "frog", "gila monster",
                                 "horse", "iguana", "javelina", "kangaroo",
                                 "lemur", "moose", "nematode", "orca",
                                 "python", "quokka", "rat", "scorpion",
                                 "tarantula", "uromastyx", "vicuna",
                                 "walrus", "xiphias", "yak", "zebra" };
            Directory aTOmDirectory = FSDirectory.open(new File("indice/animal_a-m"));
            Directory nTOzDirectory = FSDirectory.open(new File("indice/animal_n-z"));

            // OpenMode.CREATE starts each run from an empty index. With the default
            // CREATE_OR_APPEND, rerunning the test would add the same documents
            // again and break the expected hit count.
            IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION, new WhitespaceAnalyzer(LUCENE_VERSION))
                    .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
            IndexWriter aTOmWriter = new IndexWriter(aTOmDirectory, iwConfig);

            // Each IndexWriter needs its own IndexWriterConfig instance.
            IndexWriterConfig iwConfig2 = new IndexWriterConfig(LUCENE_VERSION, new WhitespaceAnalyzer(LUCENE_VERSION))
                    .setOpenMode(IndexWriterConfig.OpenMode.CREATE);
            IndexWriter nTOzWriter = new IndexWriter(nTOzDirectory, iwConfig2);

            for (int i = animals.length - 1; i >= 0; i--) {
                Document doc = new Document();
                String animal = animals[i];
                doc.add(new Field("animal", animal,
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                // Names starting with a-m go to the first index, n-z to the second.
                if (animal.charAt(0) < 'n') {
                    aTOmWriter.addDocument(doc);
                } else {
                    nTOzWriter.addDocument(doc);
                }
            }
            aTOmWriter.close();
            nTOzWriter.close();

            readers = new IndexReader[2];
            readers[0] = IndexReader.open(aTOmDirectory);
            readers[1] = IndexReader.open(nTOzDirectory);
        }

        public void testMulti() throws Exception {
            // MultiReader presents both indexes to the searcher as one logical index.
            MultiReader readerX = new MultiReader(readers);
            IndexSearcher searcher = new IndexSearcher(readerX);
            TermRangeQuery query = new TermRangeQuery("animal",
                                                      "h",         // lower bound
                                                      "t",         // upper bound
                                                      true, true); // both bounds inclusive
            TopDocs hits = searcher.search(query, 10);
            // "tarantula" sorts after "t", so the range covers horse through scorpion.
            assertEquals("tarantula not included", 12, hits.totalHits);
            readerX.close(); // also closes the two sub-readers
        }
    }
The inclusive TermRangeQuery matches every term that sorts between h and t, with the matching documents coming from both indexes. Because terms are compared lexicographically, tarantula sorts after the bare term t and falls outside the range, so the 12 hits are horse through scorpion. A related class, ParallelMultiSearcher, achieves the same functionality as MultiSearcher but uses multiple threads to gain concurrency. 
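The hit count can be double-checked without Lucene. The sketch below is an illustration under one assumption: that the range check behaves like plain `String.compareTo` with both bounds inclusive, which holds for these lowercase ASCII terms. The `RangeSketch` class and `inRange` method are hypothetical names, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: mimics an inclusive term range with String.compareTo.
class RangeSketch {

    static final String[] ANIMALS = {
        "aardvark", "beaver", "coati", "dog", "elephant", "frog",
        "gila monster", "horse", "iguana", "javelina", "kangaroo",
        "lemur", "moose", "nematode", "orca", "python", "quokka",
        "rat", "scorpion", "tarantula", "uromastyx", "vicuna",
        "walrus", "xiphias", "yak", "zebra"
    };

    /** Returns the terms t with lower <= t <= upper in lexicographic order. */
    static List<String> inRange(String[] terms, String lower, String upper) {
        List<String> matches = new ArrayList<String>();
        for (String term : terms) {
            if (term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> matches = inRange(ANIMALS, "h", "t");
        System.out.println(matches.size() + " matches: " + matches);
    }
}
```

Any term longer than one character that starts with t compares greater than "t", which is why the upper bound behaves as if it stopped at the names beginning with s.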

Multithreaded searching using ParallelMultiSearcher: 
A multithreaded version of MultiSearcher, called ParallelMultiSearcher, spawns a new thread for each Searchable and waits for them all to finish when the search method is invoked. The basic search and search with filter options are parallelized, but searching with a Collector hasn’t yet been parallelized. The exposed API is the same as MultiSearcher, so it’s a simple drop-in. 
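The fan-out-and-wait pattern described here can be sketched with the standard `java.util.concurrent` classes. This is only an illustration of the threading idea, not ParallelMultiSearcher's code: the `ParallelFanOutSketch` class is hypothetical, each "index" is assumed to be a plain list of terms, and the "search" is a simple prefix count.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one task per "index", then wait for all of them.
class ParallelFanOutSketch {

    /** Counts terms starting with prefix across all indexes, one thread per index. */
    static int countAcross(List<List<String>> indexes, final String prefix) {
        if (indexes.isEmpty()) {
            return 0;
        }
        ExecutorService pool = Executors.newFixedThreadPool(indexes.size());
        try {
            List<Callable<Integer>> tasks = new ArrayList<Callable<Integer>>();
            for (final List<String> terms : indexes) {
                tasks.add(new Callable<Integer>() {
                    @Override
                    public Integer call() {
                        int count = 0;
                        for (String term : terms) {
                            if (term.startsWith(prefix)) {
                                count++;
                            }
                        }
                        return count;
                    }
                });
            }
            int total = 0;
            // invokeAll blocks until every sub-search has finished.
            for (Future<Integer> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while searching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a sub-search failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<String>> indexes = new ArrayList<List<String>>();
        indexes.add(Arrays.asList("horse", "iguana", "hare"));
        indexes.add(Arrays.asList("hyena", "zebra"));
        System.out.println(countAcross(indexes, "h"));
    }
}
```

The overall call still takes as long as the slowest sub-search, which is one reason the gains depend on where the indexes live and how many cores are available.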

Whether you’ll see performance gains using ParallelMultiSearcher depends on your architecture. If the indexes reside on different physical disks and your computer has CPU concurrency, you should see improved performance. But there hasn’t been much real-world testing to back this up, so be sure to test it for your application. 

A cousin to ParallelMultiSearcher lives in Lucene’s contrib/remote directory, enabling you to remotely search multiple indexes in parallel. We’ll talk about term vectors next, a topic you’ve already seen on the indexing side in chapter 2.
