程式扎記: [ InAction Note ] Ch2. Building a search index

Preface :
This chapter covers

* Performing basic index operations
* Boosting documents and fields during indexing
* Indexing dates, numbers, and sortable fields
* Advanced indexing topics

In chapter 1, you saw a simple indexing example. This chapter goes further and teaches you about index updates, parameters you can use to tune the indexing process, and more advanced indexing techniques that will help you get the most out of Lucene.

How Lucene models content :
Let’s first walk through its conceptual approach to modeling content. We’ll start with Lucene’s fundamental units of indexing and searching, documents and fields, then move on to important differences between Lucene and the more structured model of modern databases.

- Documents and fields
A document is Lucene’s atomic unit of indexing and searching. It’s a container that holds one or more fields, which in turn contain the “real” content. Each field has a name to identify it, a text or binary value, and a series of detailed options that describe what Lucene should do with the field’s value when you add the document to the index. To index your raw content sources, you must first translate them into Lucene’s documents and fields. Then, at search time, it’s the field values that are searched; for example, users could search for "title:lucene" to find all documents whose title field value contains the term lucene.

At a high level, there are three things Lucene can do with each field:

* The value may be indexed (or not). A field must be indexed if you intend to search on it. When a field is indexed, tokens are first derived from its text value, using a process called analysis, and then those tokens are enrolled into the index.
* If it’s indexed, the field may also optionally store term vectors, which are collectively a miniature inverted index for that one field, allowing you to retrieve all of its tokens. This enables certain advanced use cases, like searching for documents similar to an existing one.
* Separately, the field’s value may be stored, meaning a verbatim copy of the unanalyzed value is written away in the index so that it can later be retrieved.
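
To make the indexed/stored distinction concrete, here is a toy model in plain Java. It is only a sketch and does not use Lucene: the class ToyDocument and its methods are hypothetical names invented for illustration, and tokenization is assumed to be a simple whitespace split.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (not Lucene code): a document whose fields can be
// indexed, stored, both, or neither.
class ToyDocument {
    // field name -> tokens that can be searched
    private final Map<String, Set<String>> indexed = new HashMap<>();
    // field name -> verbatim value that can be retrieved
    private final Map<String, String> stored = new HashMap<>();

    void add(String name, String value, boolean store, boolean index) {
        if (index) {
            Set<String> tokens = indexed.computeIfAbsent(name, k -> new HashSet<>());
            for (String token : value.split("\\s+")) {
                tokens.add(token);
            }
        }
        if (store) {
            stored.put(name, value);
        }
    }

    // True if the field was indexed and contains the exact term.
    boolean matches(String field, String term) {
        Set<String> tokens = indexed.get(field);
        return tokens != null && tokens.contains(term);
    }

    // Returns the verbatim value, or null if the field was not stored.
    String get(String field) {
        return stored.get(field);
    }
}
```

In this model, a field added with store=false and index=true can be searched but not read back, while a field added with store=true and index=false can be read back but never matches a search.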

- Flexible schema
Unlike a database, Lucene has no notion of a fixed global schema. In other words, each document you add to the index is a blank slate and can be completely different from the document before it: it can have whatever fields you want, with any indexing and storing and term vector options. It need not have the same fields as the previous document you added.

Lucene’s flexible schema also means a single index can hold documents that represent different entities. For instance, you could have documents that represent retail products with fields such as name and price, and documents that represent people with fields such as name, age, and gender. You could also include unsearchable "meta" documents, which simply hold metadata about the index or your application (such as what time the index was last updated or which product catalog was indexed) but are never included in search results.

- Denormalization
One common challenge is resolving any “mismatch” between the structure of your documents and what Lucene can represent. For example, XML can describe a recursive document structure by nesting tags within one another. A database can have an arbitrary number of joins, via primary and secondary keys, relating tables to one another. Yet Lucene documents are flat. Such recursion and joins must be denormalized when creating your documents. Open source projects that build on Lucene, like Hibernate Search, Compass, LuSQL, DBSight, Browse Engine, and Oracle/Lucene integration, each have different and interesting approaches for handling this denormalization.
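
One simple way to denormalize a nested structure is to flatten it into dotted field names. The sketch below is plain Java with no Lucene dependency. The class Flattener and the dotted naming scheme are assumptions made for illustration, not something the projects above are known to do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy denormalizer: turns a nested map into flat name/value pairs.
class Flattener {
    // Recursively copies nested maps into one flat map, joining keys with '.'.
    static Map<String, String> flatten(Map<String, Object> nested) {
        Map<String, String> flat = new LinkedHashMap<>();
        flattenInto("", nested, flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    private static void flattenInto(String prefix, Map<String, Object> node,
                                    Map<String, String> out) {
        for (Map.Entry<String, Object> e : node.entrySet()) {
            String name = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                // Nested structure: recurse and extend the field name.
                flattenInto(name, (Map<String, Object>) e.getValue(), out);
            } else {
                // Leaf value: becomes one flat field.
                out.put(name, String.valueOf(e.getValue()));
            }
        }
    }
}
```

A nested record with a title and an author that has a name would come out as two flat fields, title and author.name, each of which could then become a field on a single flat document.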

Understanding the indexing process :
Only a few methods of Lucene’s public API need to be called in order to index a document. As a result, from the outside, indexing with Lucene looks like a deceptively simple and monolithic operation. But behind the simple API lies an interesting and relatively complex set of operations that we can break down into three major and functionally distinct groups, as described in the following sections and shown in figure 2.1.

Figure 2.1 Indexing with Lucene breaks down into three main operations: extracting text from source documents, analyzing it, and saving it to the index

During indexing, the text is first extracted from the original content and used to create an instance of Document, containing Field instances to hold the content. The text in the fields is then analyzed to produce a stream of tokens. Finally, those tokens are added to the index in a segmented architecture. Let’s talk about text extraction first.

- Extracting text and creating the document
To index data with Lucene, you must extract plain text from it, the format that Lucene can digest, and then create a Lucene document. Suppose you need to index a set of manuals in PDF format. To prepare these manuals for indexing, you must first find a way to extract the textual information from the PDF documents and use that extracted text to create Lucene documents and their fields. No methods would accept a PDF Java type, even if such a type existed. You face the same situation if you want to index Microsoft Word documents or any document format other than plain text.

The details of text extraction are in chapter 7 where we describe the Tika framework, which makes it almost too simple to extract text from documents in diverse formats. Once you have the text you’d like to index, and you’ve created a document with all fields you’d like to index, all text must then be analyzed.

- Analysis
Once you’ve created Lucene documents populated with fields, you can call IndexWriter’s addDocument method and hand your data off to Lucene to index. When you do that, Lucene first analyzes the text, a process that splits the textual data into a stream of tokens, and performs a number of optional operations on them. For instance, the tokens could be lowercased before indexing, to make searches case insensitive, using Lucene’s LowerCaseFilter. Typically it’s also desirable to remove all stop words, which are frequent but meaningless tokens, from the input (for example a, an, the, in, on, and so on, in English text) using StopFilter. Similarly, it’s common to process input tokens to reduce them to their roots, for example by using PorterStemFilter for English text (similar classes exist in Lucene’s contrib analysis module, for other languages). The combination of an original source of tokens, followed by the series of filters that modify the tokens produced by that source, makes up the analyzer. You are also free to build your own analyzer by chaining together Lucene’s token sources and filters, or your own, in customized ways.
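
The idea of a token source followed by a chain of filters can be sketched in a few lines of plain Java. This is not Lucene’s analyzer API: ToyAnalyzer is a hypothetical class, the stop word list is just the five English examples mentioned above, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: whitespace tokenizer -> lowercase filter -> stop filter.
class ToyAnalyzer {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {      // token source
            if (raw.isEmpty()) {
                continue;                                    // empty input yields no tokens
            }
            String lower = raw.toLowerCase(Locale.ROOT);     // lowercase filter
            if (!STOP_WORDS.contains(lower)) {               // stop filter
                tokens.add(lower);
            }
        }
        return tokens;
    }
}
```

Calling ToyAnalyzer.analyze("The Quick fox jumped on a log") returns the tokens quick, fox, jumped, log. The same chain has to be applied to query text, otherwise a search for "Quick" would not match the indexed token quick.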

The input to Lucene can be analyzed in so many interesting and useful ways that we cover this process in detail in chapter 4. The analysis process produces a stream of tokens that are then written into the files in the index.

- Adding to the index
After the input has been analyzed, it’s ready to be added to the index. Lucene stores the input in a data structure known as an inverted index. This data structure makes efficient use of disk space while allowing quick keyword lookups. What makes this structure inverted is that it uses tokens extracted from input documents as lookup keys instead of treating documents as the central entities, much like the index of this book references the page number(s) where a concept occurs. In other words, rather than trying to answer the question “What words are contained in this document?” this structure is optimized for providing quick answers to “Which documents contain word X?”
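
A minimal in-memory version of this structure, assuming whitespace tokenization and lowercasing, might look like the following. ToyInvertedIndex is a hypothetical class for illustration; Lucene’s real on-disk format is far more elaborate.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index: maps each token to the ids of the documents containing it.
class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Tokenizes the text and records the document id under every token.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Answers "Which documents contain word X?" with a single map lookup.
    SortedSet<Integer> docsContaining(String term) {
        SortedSet<Integer> docs = postings.get(term);
        if (docs == null) {
            return new TreeSet<>();
        }
        return docs;
    }
}
```

After adding document 0 with "Amsterdam has lots of bridges" and document 1 with "Venice has lots of canals", looking up lots returns both ids and looking up venice returns only 1, without scanning either document.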

If you think about your favorite web search engine and the format of your typical query, you’ll see that this is exactly the query that you want to be as quick as possible. The core of today’s web search engines are inverted indexes. Lucene’s index directory has a unique segmented architecture, which we describe next.

INDEX SEGMENTS
Lucene has a rich and detailed index file format that has been carefully optimized over time. Although you don’t need to know the details of this format in order to use Lucene, it’s still helpful to have some basic understanding at a high level.

Every Lucene index consists of one or more segments, as depicted in figure 2.2. Each segment is a standalone index, holding a subset of all indexed documents. A new segment is created whenever the writer flushes buffered added documents and pending deletions into the directory. At search time, each segment is visited separately and the results are combined.

Figure 2.2 Segmented structure of a Lucene inverted index

Each segment, in turn, consists of multiple files, of the form _X.<ext>, where X is the segment’s name and <ext> is the extension that identifies which part of the index that file corresponds to. There are separate files to hold the different parts of the index (term vectors, stored fields, inverted index, and so on). If you’re using the compound file format (which is enabled by default but you can change using IndexWriter.setUseCompoundFile), then most of these index files are collapsed into a single compound file: _X.cfs. This reduces the number of open file descriptors during searching, at a small cost of searching and indexing performance. Chapter 11 covers this trade-off in more detail.

There’s one special file, referred to as the segments file and named segments_<N>, that references all live segments. This file is important! Lucene first opens this file, and then opens each segment referenced by it. The value <N>, called “the generation,” is an integer that increases by one every time a change is committed to the index.

Naturally, over time the index will accumulate many segments, especially if you open and close your writer frequently. This is fine. Periodically, IndexWriter will select segments and coalesce them by merging them into a single new segment and then removing the old segments. The selection of segments to be merged is governed by a separate MergePolicy. Once merges are selected, their execution is done by the MergeScheduler.
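
The flush, search, and merge cycle described above can be modeled with a small sketch in plain Java. ToySegmentedIndex is a hypothetical class: it merges all segments at once, whereas the real IndexWriter lets a MergePolicy pick which segments to merge, and it makes documents searchable only after a flush.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy segmented index: every flush creates a new standalone segment.
class ToySegmentedIndex {
    // Each segment is its own small inverted index: term -> document ids.
    private final List<Map<String, Set<Integer>>> segments = new ArrayList<>();
    private Map<String, Set<Integer>> buffer = new HashMap<>();
    private int nextDocId = 0;

    // Buffers the document in memory until the next flush.
    void addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                buffer.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            }
        }
    }

    // Flushing turns the buffered documents into a new segment.
    void flush() {
        if (!buffer.isEmpty()) {
            segments.add(buffer);
            buffer = new HashMap<>();
        }
    }

    // Searching visits each segment separately and combines the hits.
    Set<Integer> search(String term) {
        Set<Integer> hits = new TreeSet<>();
        for (Map<String, Set<Integer>> segment : segments) {
            Set<Integer> docs = segment.get(term);
            if (docs != null) {
                hits.addAll(docs);
            }
        }
        return hits;
    }

    // Merging coalesces the segments into one new segment and drops the old ones.
    void mergeAll() {
        Map<String, Set<Integer>> merged = new HashMap<>();
        for (Map<String, Set<Integer>> segment : segments) {
            for (Map.Entry<String, Set<Integer>> e : segment.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
            }
        }
        segments.clear();
        if (!merged.isEmpty()) {
            segments.add(merged);
        }
    }

    int segmentCount() {
        return segments.size();
    }
}
```

Two add-then-flush rounds leave two segments; a search still sees documents from both, and mergeAll() reduces the count to one without changing the search results.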

Basic index operations :
Now it’s time to look at some real code, using Lucene’s APIs to add, remove, and update documents. We start with adding documents to an index since that’s the most frequent operation.

- Adding documents to an index
Let’s look at how to create a new index and add documents to it. There are two methods for adding documents:
* addDocument(Document)

Adds the document using the default analyzer, which you specified when creating the IndexWriter, for tokenization.

* addDocument(Document, Analyzer)

Adds the document using the provided analyzer for tokenization. But be careful! In order for searches to work correctly, you need the analyzer used at search time to “match” the tokens produced by the analyzers at indexing time.

Listing 2.1 shows all the steps necessary to create a new index and add two tiny documents.
- Listing 2.1 Adding documents to an index

package ch2;  
  
import java.io.IOException;  
  
import junit.framework.TestCase;  
  
import org.apache.lucene.analysis.WhitespaceAnalyzer;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.document.Field;  
import org.apache.lucene.index.IndexReader;  
import org.apache.lucene.index.IndexWriter;  
import org.apache.lucene.index.Term;  
import org.apache.lucene.search.IndexSearcher;  
import org.apache.lucene.search.Query;  
import org.apache.lucene.search.TermQuery;  
import org.apache.lucene.store.Directory;  
import org.apache.lucene.store.RAMDirectory;  
  
public class IndexingTest extends TestCase {  
    protected String[] ids = { "1", "2" };  
    protected String[] unindexed = { "Netherlands", "Italy" };  
    protected String[] unstored = { "Amsterdam has lots of bridges",  
            "Venice has lots of canals" };  
    protected String[] text = { "Amsterdam", "Venice" };  
    private Directory directory;  
  
    // 1) Runs before every test
    protected void setUp() throws Exception {
        directory = new RAMDirectory();

        // 2) Create the IndexWriter
        IndexWriter writer = getWriter();

        // 3) Add the documents
        for (int i = 0; i < ids.length; i++) {  
            Document doc = new Document();  
            doc.add(new Field("id", ids[i], Field.Store.YES,  
                    Field.Index.NOT_ANALYZED));  
            doc.add(new Field("country", unindexed[i], Field.Store.YES,  
                    Field.Index.NO));  
            doc.add(new Field("contents", unstored[i], Field.Store.NO,  
                    Field.Index.ANALYZED));  
            doc.add(new Field("city", text[i], Field.Store.YES,  
                    Field.Index.ANALYZED));  
            writer.addDocument(doc);  
        }  
        writer.close();  
    }  
  
    private IndexWriter getWriter() throws IOException {  
        // 2) Create IndexWriter  
        return new IndexWriter(directory, new WhitespaceAnalyzer(),  
                IndexWriter.MaxFieldLength.UNLIMITED);  
    }  
  
    protected int getHitCount(String fieldName, String searchString)  
            throws IOException {  
        // 4) Create new searcher  
        IndexSearcher searcher = new IndexSearcher(directory);  
          
        // 5) Build single-term query.  
        Term t = new Term(fieldName, searchString);  
        Query query = new TermQuery(t);  
          
        // 6) Get the number of hits.
        int hitCount = searcher.search(query, 10).totalHits;  
        searcher.close();  
        return hitCount;  
    }  
  
    public void testIndexWriter() throws IOException {  
        // 7) Verify writer document count.  
        IndexWriter writer = getWriter();  
        assertEquals(ids.length, writer.numDocs());  
        writer.close();  
    }  
  
    public void testIndexReader() throws IOException {  
        // 8) Verify reader document count.       
        IndexReader reader = IndexReader.open(directory);  
        assertEquals(ids.length, reader.maxDoc());  
        assertEquals(ids.length, reader.numDocs());  
        reader.close();  
    }  
}  

The index contains two documents, each representing a country and a city in that country, whose text is analyzed with WhitespaceAnalyzer. Because setUp() is called before each test is executed, each test runs against a freshly created index. In the getWriter method, we create the IndexWriter with three arguments:

* Directory, where the index is stored.
* The analyzer to use when indexing tokenized fields (analysis is covered in chapter 4).
* MaxFieldLength.UNLIMITED, a required argument that tells IndexWriter to index all tokens in the document.

IndexWriter will detect that there’s no prior index in this Directory and create a new one. If there were an existing index, IndexWriter would simply add to it. There are numerous IndexWriter constructors. Some explicitly take a create argument, allowing you to force a new index to be created over an existing one. More advanced constructors allow you to specify your own IndexDeletionPolicy or IndexCommit for expert use cases, as described in section 2.13.

- Deleting documents from an index
Although most applications are more concerned with getting documents into a Lucene index, some also need to remove them. IndexWriter provides various methods to remove documents from an index:
* deleteDocuments(Term)

Deletes all documents containing the provided term.

* deleteDocuments(Term[])

Deletes all documents containing any of the terms in the provided array.

* deleteDocuments(Query)

Deletes all documents matching the provided query.

* deleteDocuments(Query[])

Deletes all documents matching any of the queries in the provided array.

* deleteAll()

Deletes all documents in the index. This has the same effect as closing the writer and opening a new one with create=true, but without having to close your writer.

If you intend to delete a single document by Term, you must ensure you’ve indexed a Field on every document and that all field values are unique so that each document can be singled out for deletion. This is the same concept as a primary key column in a database table, but in no way is it enforced by Lucene. This field should be indexed as an unanalyzed field (see section 2.4.1) to ensure the analyzer doesn’t break it up into separate tokens. Then, use the field for document deletion like this:

writer.deleteDocuments(new Term("ID", documentID));  

Let’s look at listing 2.2 to see deleteDocuments in action:
- Listing 2.2 Deleting documents from an index

public void testDeleteBeforeOptimize() throws IOException {       
    IndexWriter writer = getWriter();  
    assertEquals(2, writer.numDocs()); // Verify 2 doc in index.  
    writer.deleteDocuments(new Term("id", "1")); // Delete the first doc  
    writer.commit();  
    // 1) Verify index has deletion.  
    assertTrue(writer.hasDeletions());  
      
    // 2) Verify that one document is marked as deleted but not yet purged.
    assertEquals(2, writer.maxDoc()); // maxDoc() still counts the deleted document until segments are merged.
    assertEquals(1, writer.numDocs());  
    writer.close();  
}  
  
public void testDeleteAfterOptimize() throws IOException {  
    System.out.printf("\t[Test] testDeleteAfterOptimize()...\n");  
    IndexWriter writer = getWriter();  
    assertEquals(2, writer.numDocs());  
    writer.deleteDocuments(new Term("id", "1"));  
      
    // 3) Optimize to compact deletion.  
    writer.optimize();    
    writer.commit();  
    assertFalse(writer.hasDeletions());  
    assertEquals(1, writer.maxDoc());  // The merge has purged the deleted document.
    assertEquals(1, writer.numDocs());  
    writer.close();  
}  

In the method testDeleteAfterOptimize(), we force Lucene to merge index segments, after deleting one document, by optimizing the index. Then, the maxDoc() method returns 1 rather than 2, because after a delete and optimize, Lucene truly removes the deleted document. Only one document remains in the index.
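
The difference between maxDoc() and numDocs() is easier to see in a toy model. The sketch below is plain Java, not Lucene code: ToyDeletions is a hypothetical class that marks documents as deleted and only reclaims their slots when compact() is called, which stands in for a segment merge.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of deletion marks: deleting only flags a document,
// and compacting (a stand-in for merging) actually removes it.
class ToyDeletions {
    private final List<String> ids = new ArrayList<>();       // one slot per document
    private final List<Boolean> deleted = new ArrayList<>();  // deletion mark per slot

    void add(String id) {
        ids.add(id);
        deleted.add(false);
    }

    // Marks every document with the given id as deleted.
    void deleteById(String id) {
        for (int i = 0; i < ids.size(); i++) {
            if (ids.get(i).equals(id)) {
                deleted.set(i, true);
            }
        }
    }

    // Counts all slots, including documents marked as deleted.
    int maxDoc() {
        return ids.size();
    }

    // Counts only live documents.
    int numDocs() {
        int live = 0;
        for (boolean d : deleted) {
            if (!d) {
                live++;
            }
        }
        return live;
    }

    boolean hasDeletions() {
        return numDocs() < maxDoc();
    }

    // Rewrites the slots, leaving out the documents marked as deleted.
    void compact() {
        List<String> liveIds = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            if (!deleted.get(i)) {
                liveIds.add(ids.get(i));
            }
        }
        ids.clear();
        deleted.clear();
        for (String id : liveIds) {
            add(id);
        }
    }
}
```

With two documents, deleting one gives maxDoc() == 2 and numDocs() == 1, matching testDeleteBeforeOptimize(). After compact(), both return 1 and hasDeletions() is false, matching testDeleteAfterOptimize().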

- Updating documents in the index
In some cases you may want to update only certain fields of the document. Perhaps the title changed but the body was unchanged. Unfortunately, Lucene can’t do that: instead, it deletes the entire previous document and then adds a new document to the index. This requires that the new document contain all fields, even unchanged ones, from the original document. IndexWriter provides two convenience methods to replace a document in the index:
* updateDocument(Term, Document)

First deletes all documents containing the provided term and then adds the new document using the writer’s default analyzer.

* updateDocument(Term, Document, Analyzer)

Does the same but uses the provided analyzer instead of the writer’s default analyzer.

The updateDocument methods are probably the most common way to handle deletion because they’re typically used to replace a single document in the index that has changed. Note that these methods are simply shorthand for first calling deleteDocuments(Term) and then addDocument. Use updateDocument like this:

writer.updateDocument(new Term("ID", documentId), newDocument);

Listing 2.3 shows an example:
- Listing 2.3 Updating indexed Documents

public void testUpdate() throws IOException {  
    assertEquals(1, getHitCount("city", "Amsterdam"));  
    IndexWriter writer = getWriter();  
    Document doc = new Document();  
    doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));  
    doc.add(new Field("country", "Netherlands", Field.Store.YES,  
            Field.Index.NO));  
    doc.add(new Field("contents", "Den Haag has a lot of museums",  
            Field.Store.NO, Field.Index.ANALYZED));  
    doc.add(new Field("city", "ABC", Field.Store.YES,  
            Field.Index.ANALYZED));  
    writer.updateDocument(new Term("id", "1"), doc);  
    writer.commit();          
    writer.close();       
    assertEquals(0, getHitCount("city", "Amsterdam"));  
    assertEquals(1, getHitCount("city", "ABC"));  
}  

We create a new document that will replace the original document with id=1. Then we call updateDocument to replace the original one. We have effectively updated one of the documents in the index.
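
Because an update is a delete followed by an add, any field left out of the replacement document is simply gone. The following plain Java sketch illustrates that behavior. ToyUpdatableIndex is a hypothetical class that matches on exact field values and is not part of Lucene.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy index where update = delete by field value, then add.
class ToyUpdatableIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new HashMap<>(doc));   // defensive copy of the fields
    }

    // Removes every document whose field has exactly the given value.
    void deleteDocuments(String field, String value) {
        docs.removeIf(d -> value.equals(d.get(field)));
    }

    // Shorthand for delete followed by add: the whole document is replaced.
    void updateDocument(String field, String value, Map<String, String> newDoc) {
        deleteDocuments(field, value);
        addDocument(newDoc);
    }

    // Counts documents whose field has exactly the given value.
    int hitCount(String field, String value) {
        int hits = 0;
        for (Map<String, String> d : docs) {
            if (value.equals(d.get(field))) {
                hits++;
            }
        }
        return hits;
    }
}
```

If the original document has id, city, and country fields and the replacement only sets id and city, then after updateDocument("id", "1", replacement) a lookup on the old country value finds nothing. The replacement has to carry every field you want to keep.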

Thursday, October 11, 2012

[ InAction Note ] Ch2. Building a search index - Basic index operations