程式扎記: [ InAction Note ] Ch3. Adding search

Preface:
Searching with Lucene is a surprisingly simple affair. You first create an instance of IndexSearcher, which opens the search index, and then use the search methods on that class to perform all searching. The returned TopDocs class represents the top results, and you use that to present results to the user. Next we discuss how to handle pagination, and finally we show how to use Lucene’s new (as of version 2.9) near-realtime search capability for fast turnaround on recently indexed documents. Let’s begin with the creation of an IndexSearcher.

Creating an IndexSearcher:
The classes involved are shown in figure 3.2:

First, as with indexing, we’ll need a directory. Most often you’re searching an index in the file system:

Directory dir = FSDirectory.open(new File("/path/to/index"));  

Section 2.10 describes alternate Directory implementations. Next we create an IndexReader:

IndexReader reader = IndexReader.open(dir);  

Finally, we create the IndexSearcher:

IndexSearcher searcher = new IndexSearcher(reader);  

Directory, which we’ve already seen in the context of indexing, provides the abstract file-like API. IndexReader uses that API to interact with the index files stored during indexing, and exposes the low-level API that IndexSearcher uses for searching. IndexSearcher’s APIs accept Query objects, for searching, and return TopDocs objects representing the results.

Note that it’s IndexReader that does all the heavy lifting to open all index files and expose a low-level reader API, while IndexSearcher is a rather thin veneer. Because it’s costly to open an IndexReader, it’s best to reuse a single instance for all of your searches, and open a new one only when necessary.

IndexReader always searches a point-in-time snapshot of the index as it existed when the IndexReader was created. If you need to search changes to the index, you’ll have to open a new reader. Fortunately, the IndexReader.reopen method is a resource-efficient means of obtaining a new IndexReader that covers all changes to the index but shares resources with the current reader when possible. Use it like this:

IndexReader newReader = reader.reopen();  
if (reader != newReader) {  
    reader.close();  
    reader = newReader;  
    searcher = new IndexSearcher(reader);  
}  

The reopen method returns a new reader only if the index has changed, in which case it's your responsibility to close the old reader and create a new IndexSearcher. In a real application, where multiple threads may still be searching with the old reader, you'll have to protect this code to make it thread safe. Section 11.2.2 provides a useful drop-in class that does this for you. Section 3.2.5 shows how to obtain a near-real-time IndexReader from an IndexWriter, which is even more resource efficient when you have access to the IndexWriter that is making changes to the index.
NOTE:

An IndexSearcher instance searches only the index as it existed at the time the IndexSearcher was instantiated. If indexing is occurring concurrently with searching, newer documents indexed won’t be visible to searches. In order to see the new documents, you should open a new reader.
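
The thread-safety concern raised for the reopen snippet above can be sketched with reference counting. This is only an illustration under stated assumptions, not the drop-in class from section 11.2.2: RefCountedHolder and Ref are hypothetical names, and a plain java.io.Closeable stands in for the reader/searcher pair so the sketch runs without Lucene on the classpath. The idea is that each search acquires the current resource and releases it when done, so a swap never closes a reader that another thread is still searching.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not a Lucene class): swaps a shared resource without closing it under active users. */
class RefCountedHolder<T extends Closeable> {

    /** A resource plus the number of parties currently using it. */
    static final class Ref<T extends Closeable> {
        final T resource;
        // Starts at 1 because the holder itself owns one reference.
        private final AtomicInteger count = new AtomicInteger(1);

        Ref(T resource) { this.resource = resource; }

        /** Returns false if the resource has already been closed. */
        boolean tryIncrement() {
            while (true) {
                int c = count.get();
                if (c == 0) return false;
                if (count.compareAndSet(c, c + 1)) return true;
            }
        }

        /** Closes the resource when the last reference is dropped. */
        void decrement() {
            if (count.decrementAndGet() == 0) {
                try {
                    resource.close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
    }

    private volatile Ref<T> current;

    RefCountedHolder(T initial) { current = new Ref<T>(initial); }

    /** Call before searching; pair every acquire with a release in a finally block. */
    Ref<T> acquire() {
        while (true) {
            Ref<T> ref = current;
            if (ref.tryIncrement()) return ref;
            // Lost a race with swap(): loop and pick up the new current.
        }
    }

    void release(Ref<T> ref) { ref.decrement(); }

    /** Installs a replacement; the old resource closes once its last user releases it. */
    synchronized void swap(T replacement) {
        Ref<T> old = current;
        current = new Ref<T>(replacement);
        old.decrement(); // drop the holder's own reference
    }

    public static void main(String[] args) {
        RefCountedHolder<Closeable> holder =
                new RefCountedHolder<Closeable>(() -> System.out.println("old closed"));
        Ref<Closeable> inFlight = holder.acquire();          // a search starts on the old resource
        holder.swap(() -> System.out.println("new closed")); // index changed: install the new one
        System.out.println("swap done");                     // old resource is still open here
        holder.release(inFlight);                            // search ends; old resource closes now
    }
}
```

In a real application T would wrap the IndexReader and its IndexSearcher, and swap would be called from the code that detects a changed index.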

Performing searches:
Once you have an IndexSearcher, simply call one of its search methods to perform a search. Under the hood, the search method does a tremendous amount of work, very quickly. It visits every single document that’s a candidate for matching the search, only accepting the ones that pass every constraint on the query. Finally, it gathers the top results and returns them to you.
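
To make "gathers the top results" concrete, here is a minimal sketch of keeping only the n best-scoring hits while still counting every match. It is an illustration of the general technique (a bounded min-heap), not Lucene's actual code; TopNCollector and Hit are hypothetical names, and the scores are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative only: retains the n highest-scoring hits seen so far. */
class TopNCollector {

    /** Hypothetical stand-in for a (document ID, score) pair. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int n;
    private int totalHits = 0;
    // Min-heap on score: the weakest of the current top n sits at the head.
    private final PriorityQueue<Hit> heap;

    TopNCollector(int n) {
        if (n <= 0) throw new IllegalArgumentException("n must be positive");
        this.n = n;
        this.heap = new PriorityQueue<Hit>(n, (a, b) -> Float.compare(a.score, b.score));
    }

    /** Called once per matching document. */
    void collect(int doc, float score) {
        totalHits++; // every match counts, even if it does not make the top n
        if (heap.size() < n) {
            heap.add(new Hit(doc, score));
        } else if (score > heap.peek().score) {
            heap.poll(); // evict the current weakest hit
            heap.add(new Hit(doc, score));
        }
    }

    int totalHits() { return totalHits; }

    /** Returns the retained hits, best score first. */
    List<Hit> topHits() {
        List<Hit> out = new ArrayList<Hit>(heap);
        out.sort((a, b) -> Float.compare(b.score, a.score));
        return out;
    }

    public static void main(String[] args) {
        TopNCollector collector = new TopNCollector(2);
        float[] scores = {0.2f, 0.9f, 0.5f, 0.7f}; // made-up scores for docs 0..3
        for (int doc = 0; doc < scores.length; doc++) {
            collector.collect(doc, scores[doc]);
        }
        System.out.println("total matches: " + collector.totalHits());
        for (Hit h : collector.topHits()) {
            System.out.println("doc=" + h.doc + " score=" + h.score);
        }
    }
}
```

This is also why a search that asks for a single top hit can still report a larger total number of matches: the count covers every match, while only n hits are retained.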

The main search methods available to an IndexSearcher instance are shown in table 3.3. Here we only make use of the search(Query, int) method because many applications won't need the more advanced methods. The other search method signatures, including the filtering and sorting variants, are covered in chapter 5. Chapter 6 covers the customizable search methods that accept a Collector for gathering results.

Working with TopDocs:
Now that we’ve called search, we have a TopDocs object at our disposal that we can use for efficient access to the search results. Results are ordered by relevance—in other words, by how well each document matches the query (sorting results in other ways is discussed in section 5.2).

The TopDocs class exposes a small number of methods and attributes for retrieving the search results; they’re listed in table 3.4. The attribute TopDocs.totalHits returns the number of matching documents. The matches, by default, are sorted in decreasing score order. The TopDocs.scoreDocs attribute is an array containing the requested number of top matches. Each ScoreDoc instance has a float score, which is the relevance score, and an int doc, which is the document ID that can be used to retrieve the stored fields for that document by calling IndexSearcher.document(doc). Finally, TopDocs.getMaxScore() returns the best score across all matches; when you sort by relevance (the default), that will always be the score of the first result. But if you sort by other criteria and enable scoring for the search, as described in section 5.2, it will be the maximum score of all matching documents even when the best scoring document isn’t in the top results by your sort criteria.

Paging through results:
Presenting search results to end users most often involves displaying only the first 10 to 20 most relevant documents. Paging through ScoreDocs is a common requirement, although if you find users are frequently doing a lot of paging you should revisit your design: ideally the user almost always finds the result on the first page. That said, pagination is still typically needed. You can choose from a couple of implementation approaches:
* Keep search result:

Keep the resulting ScoreDocs and IndexSearcher instances available while the user is navigating the search results.

* Don't keep search result:

Requery each time the user navigates to a new page.

Requerying is most often the better solution. Requerying eliminates the need to store per-user state, which in a web application can be costly, especially with a large number of users. Requerying at first glance seems a waste, but Lucene’s blazing speed more than compensates. Also, thanks to the I/O caching in modern operating systems, requerying will typically be fast because the necessary bits from disk will already be cached in RAM. Frequently users don’t click past the first page of results anyway.
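
The requery approach can be sketched as follows, assuming each page request asks for the top (page + 1) * pageSize hits and then keeps only the last slice. Pager, hitsNeeded, and pageOf are hypothetical helper names, and the int array stands in for the document IDs taken from TopDocs.scoreDocs so the sketch runs without Lucene.

```java
import java.util.Arrays;

/** Hypothetical paging helper for the requery-per-page approach. */
class Pager {

    /** Number of top hits to request so that the given zero-based page is fully covered. */
    static int hitsNeeded(int page, int pageSize) {
        if (page < 0 || pageSize <= 0) throw new IllegalArgumentException("bad page or pageSize");
        return (page + 1) * pageSize;
    }

    /** Slices one page out of the ranked document IDs; returns an empty array past the end. */
    static int[] pageOf(int[] rankedDocIds, int page, int pageSize) {
        if (page < 0 || pageSize <= 0) throw new IllegalArgumentException("bad page or pageSize");
        int from = Math.min(page * pageSize, rankedDocIds.length);
        int to = Math.min(from + pageSize, rankedDocIds.length);
        return Arrays.copyOfRange(rankedDocIds, from, to);
    }

    public static void main(String[] args) {
        int[] ranked = {5, 3, 8, 1, 9}; // made-up doc IDs, best match first
        int page = 1;                   // zero-based: the second page
        int pageSize = 2;
        System.out.println("request top " + hitsNeeded(page, pageSize) + " hits");
        System.out.println("page contents: " + Arrays.toString(pageOf(ranked, page, pageSize)));
    }
}
```

No per-user state survives between requests: the page number and page size in the request are enough to rebuild the page.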

Near-real-time search:
One of the new features in Lucene’s 2.9 release is near-real-time search, which enables you to rapidly search changes made to the index with an open IndexWriter, without having to first close or commit changes to that writer. Many applications make ongoing changes with an always open IndexWriter and require that subsequent searches quickly reflect these changes. If that IndexWriter instance is in the same JVM that’s doing searching, you can use near-real-time search, as shown in listing 3.3.

This capability is referred to as near-real-time search, and not simply real-time search, because it’s not possible to make strict guarantees about the turnaround time, in the same sense as a "hard" real-time OS is able to do. Lucene’s near-real-time search is more like a "soft" real-time OS. For example, if Java decides to run a major garbage collection cycle, or if a large segment merge has just completed, or if your machine is struggling because there’s not enough RAM, the turnaround time of the near-real-time reader can be much longer. But in practice the turnaround time can be very fast (tens of milliseconds or less), depending on your indexing and searching throughput, and how frequently you obtain a new near-real-time reader.

In the past, without this feature, you'd have to call commit on the writer and then reopen on your reader. This can be time consuming because commit must sync all new files in the index, an operation that's often costly on certain operating systems and file systems because it usually means the underlying I/O device must physically write all buffered bytes to stable storage. Near-real-time search enables you to search segments that are newly created but not yet committed. Section 11.1.3 gives some tips for further reducing the index-to-search turnaround time.
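- Listing 3.3 Near-real-time search (this version uses IndexReader.openIfChanged, which later Lucene 3.x releases provide in place of reopen; it returns null, not the same reader, when nothing has changed)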

package ch3;  
  
import java.io.File;  
  
import junit.framework.TestCase;  
  
import org.apache.lucene.analysis.Analyzer;  
import org.apache.lucene.analysis.standard.StandardAnalyzer;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.document.Field;  
import org.apache.lucene.index.IndexReader;  
import org.apache.lucene.index.IndexWriter;  
import org.apache.lucene.index.IndexWriterConfig;  
import org.apache.lucene.index.Term;  
import org.apache.lucene.search.IndexSearcher;  
import org.apache.lucene.search.Query;  
import org.apache.lucene.search.TermQuery;  
import org.apache.lucene.search.TopDocs;  
import org.apache.lucene.store.Directory;  
import org.apache.lucene.store.FSDirectory;  
import org.apache.lucene.store.RAMDirectory;  
import org.apache.lucene.util.Version;  
  
public class NearRealTimeTest extends TestCase {  
    public File indexpath = new File("./test");  
    public static Version LUCENE_VERSION = Version.LUCENE_30;  
      
    public void testNearRealTime() throws Exception {  
        Directory dir = FSDirectory.open(indexpath);  
          
        Analyzer alyz = new StandardAnalyzer(LUCENE_VERSION);         
        IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION, alyz);  
        iwConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);  
        IndexWriter writer = new IndexWriter(dir, iwConfig);          
          
        for (int i = 0; i < 10; i++) {  
            Document doc = new Document();  
            doc.add(new Field("id", "" + i, Field.Store.NO,  
                    Field.Index.NOT_ANALYZED_NO_NORMS));  
            doc.add(new Field("text", "aaa", Field.Store.NO,  
                    Field.Index.ANALYZED));  
            writer.addDocument(doc);  
        }  
        // 1) Create near-real-time reader  
        IndexReader  reader = IndexReader.open(writer, true);  
          
        // 2) Wrap the reader in a searcher  
        IndexSearcher searcher = new IndexSearcher(reader);  
        Query query = new TermQuery(new Term("text", "aaa"));  
        TopDocs docs = searcher.search(query, 1);  
          
        // 3) Searcher reports 10 hits: totalHits counts every match, not only the 1 requested.  
        assertEquals(10, docs.totalHits);  
          
        // 4) Delete one document  
        writer.deleteDocuments(new Term("id", "7"));  
          
        // 5) Add one document  
        Document doc = new Document();  
        doc.add(new Field("id", "11", Field.Store.NO,  
                Field.Index.NOT_ANALYZED_NO_NORMS));  
        doc.add(new Field("text", "bbb", Field.Store.NO, Field.Index.ANALYZED));  
        writer.addDocument(doc);  
          
        // 6) Reopen reader  
        IndexReader newReader = IndexReader.openIfChanged(reader);  
          
        // 7) Confirm reader is new  
        assertFalse(reader == newReader);  
          
        // 8) Close old reader  
        reader.close();  
          
        // 9) Create new searcher and search again.  
        searcher = new IndexSearcher(newReader);  
        TopDocs hits = searcher.search(query, 10);  
          
        // 10) Confirm only 9 hits  
        assertEquals(9, hits.totalHits);  
          
        // 11) Confirm the newly added term is searchable.  
        query = new TermQuery(new Term("text", "bbb"));  
        hits = searcher.search(query, 1);  
        assertEquals(1, hits.totalHits);  
          
        // 12) Close all resources.  
        newReader.close();  
        writer.close();  
    }  
}  

Wednesday, October 24, 2012
