程式扎記: [ InAction Note ] Ch2. Building a search index

Preface:
Field is perhaps the most important class when indexing documents: it’s the actual class that holds each value to be indexed. When you create a field, you can specify numerous options to control what Lucene should do with that field once you add the document to the index. We touched on these options at a high level earlier; this note goes through each of them in more detail.

Field options for indexing:
The options for indexing (Field.Index.*) control how the text in the field will be made searchable via the inverted index. Here are the choices:
* Index.ANALYZED

Use the analyzer to break the field’s value into a stream of separate tokens and make each token searchable. This option is useful for normal text fields (body, title, abstract, etc.).

* Index.NOT_ANALYZED

Do index the field, but don’t analyze the String value. Instead, treat the Field’s entire value as a single token and make that token searchable. This option is useful for fields that you’d like to search on but that shouldn’t be broken up, such as URLs, file system paths, dates, personal names, Social Security numbers, and telephone numbers. This option is especially useful for enabling "exact match" searching.

* Index.ANALYZED_NO_NORMS

A variant of Index.ANALYZED that doesn’t store norms information in the index. Norms record index-time boost information in the index but can be memory consuming when you’re searching. Section 2.5.3 describes norms in detail.

* Index.NOT_ANALYZED_NO_NORMS

Just like Index.NOT_ANALYZED, but also doesn’t store norms. This option is frequently used to save index space and memory usage during searching, because single-token fields don’t need the norms information unless they’re boosted.

* Index.NO

Don’t make this field’s value available for searching.
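To make the difference between the analyzed and not-analyzed options concrete, here is a minimal sketch in plain Java. It does not use Lucene: the `analyzed` method is a hypothetical stand-in for an analyzer (it simply splits on non-alphanumeric characters and lowercases), and `notAnalyzed` keeps the whole value as one token. Real analyzers are more sophisticated, so treat this only as an illustration of why a URL should be indexed as a single token when you want exact matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class IndexOptionSketch {
    // Hypothetical stand-in for Index.ANALYZED: break the value into
    // lowercase tokens at every run of non-alphanumeric characters.
    static List<String> analyzed(String value) {
        List<String> tokens = new ArrayList<>();
        for (String piece : value.split("[^\\p{Alnum}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Hypothetical stand-in for Index.NOT_ANALYZED: the entire value
    // becomes one token, so only an exact match can find it.
    static List<String> notAnalyzed(String value) {
        List<String> tokens = new ArrayList<>();
        tokens.add(value);
        return tokens;
    }

    public static void main(String[] args) {
        String url = "http://example.com/docs/index.html";
        System.out.println("analyzed:     " + analyzed(url));
        System.out.println("not analyzed: " + notAnalyzed(url));
    }
}
```

With the analyzed form, a search for the full URL would have to match several small tokens; with the not-analyzed form, the URL is the one and only searchable term.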

When Lucene builds the inverted index, by default it stores all necessary information to implement the Vector Space Model. This model requires the count of every term that occurred in the document, as well as the positions of each occurrence (needed, for example, by phrase searches). But sometimes you know the field will be used only for pure Boolean searching and need not contribute to the relevance score. Fields used only for filtering, such as entitlements or date filtering, are a common example.

In this case, you can tell Lucene to skip indexing the term frequency and positions by calling Field.setOmitTermFreqAndPositions(true). This approach will save some disk space in the index, and may also speed up searching and filtering, but will silently prevent searches that require positional information, such as PhraseQuery and SpanQuery, from working.

Field options for storing fields:
The options for stored fields (Field.Store.*) determine whether the field’s exact value should be stored away so that you can later retrieve it during searching:
* Store.YES

Stores the value. When the value is stored, the original String in its entirety is recorded in the index and may be retrieved by an IndexReader. This option is useful for fields that you’d like to use when displaying the search results (such as a URL, title, or database primary key). Try not to store very large fields, if index size is a concern, as stored fields consume space in the index.

* Store.NO

Doesn’t store the value. This option is often used along with Index.ANALYZED to index a large text field that doesn’t need to be retrieved in its original form, such as bodies of web pages, or any other type of text document.

Lucene includes a helpful utility class, CompressionTools, that exposes static methods to compress and decompress byte arrays. Under the hood it uses Java’s built-in java.util.zip classes. You can use CompressionTools to compress values before storing them in Lucene. Note that although doing so will save space in your index, depending on how compressible the content is, it will also slow down indexing and searching. You’re spending more CPU in exchange for less disk space used, which for many applications isn’t a good trade-off. If the field values are small, compression is rarely worthwhile.
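The trade-off is easy to see with the java.util.zip classes directly. The sketch below is not CompressionTools itself; `CompressionSketch` and its two methods are hypothetical helpers written for this note, using only the JDK's Deflater and Inflater. It round-trips a byte array and lets you compare sizes for a long, repetitive value and a very short one.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressionSketch {
    // Compress a byte array with the JDK's zlib-based Deflater.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Reverse of compress(); malformed input is reported as an unchecked exception.
    static byte[] decompress(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid compressed data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("lucene in action ");
        }
        byte[] large = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] small = "42".getBytes(StandardCharsets.UTF_8);
        System.out.println("large: " + large.length + " -> " + compress(large).length + " bytes");
        System.out.println("small: " + small.length + " -> " + compress(small).length + " bytes");
    }
}
```

The short value comes out larger than it went in, because the compressed format carries a fixed header and checksum. That is the reason small field values are rarely worth compressing.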

Field options for term vectors:
Sometimes when you index a document you’d like to retrieve all its unique terms at search time. One common use is to speed up highlighting the matched tokens in stored fields. (Highlighting is covered more in sections 8.3 and 8.4.) Another use is to enable a link, "Find similar documents," that when clicked runs a new search using the salient terms in an original document. Yet another example is automatic categorization of documents. Section 5.9 shows concrete examples of using term vectors once they’re in your index.

But what exactly are term vectors? Term vectors are a mix between an indexed field and a stored field. They’re similar to a stored field because you can quickly retrieve all term vector fields for a given document: term vectors are keyed first by document ID. But then, they’re keyed secondarily by term, meaning they store a miniature inverted index for that one document. Unlike a stored field, where the original String content is stored verbatim, term vectors store the actual separate terms that were produced by the analyzer, allowing you to retrieve all terms for each field, and the frequency of their occurrence within the document, sorted in lexicographic order. Because the tokens coming out of an analyzer also have position and offset information (see section 4.2.1), you can choose separately whether these details are also stored in your term vectors by passing these constants as the fourth argument to the Field constructor:
* TermVector.YES

Records the unique terms that occurred, and their counts, in each document, but doesn’t store any positions or offsets information

* TermVector.WITH_POSITIONS

Records the unique terms and their counts, and also the positions of each occurrence of every term, but no offsets

* TermVector.WITH_OFFSETS

Records the unique terms and their counts, with the offsets (start and end character position) of each occurrence of every term, but no positions

* TermVector.WITH_POSITIONS_OFFSETS

Stores unique terms and their counts, along with positions and offsets

* TermVector.NO

Doesn’t store any term vector information
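As a way to picture what a term vector holds, the following plain-Java sketch builds the "miniature inverted index" for a single field value. It is only an illustration under simplifying assumptions: `TermVectorSketch`, `TermInfo`, and `build` are hypothetical names, tokenization is a simple lowercase split on alphanumeric runs rather than a real analyzer, and nothing here reflects how Lucene lays the data out on disk. The term and its count correspond to TermVector.YES; the positions and offsets lists are the extra details the WITH_POSITIONS and WITH_OFFSETS variants add.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermVectorSketch {
    // Everything recorded for one term of one document field.
    static final class TermInfo {
        int freq;                                        // count of occurrences
        final List<Integer> positions = new ArrayList<>(); // token positions
        final List<int[]> offsets = new ArrayList<>();     // {start, end} character offsets
    }

    // Build term -> info for a single field value. A TreeMap keeps the
    // terms sorted in lexicographic order.
    static Map<String, TermInfo> build(String text) {
        Map<String, TermInfo> vector = new TreeMap<>();
        Matcher m = Pattern.compile("\\p{Alnum}+").matcher(text);
        int position = 0;
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            TermInfo info = vector.computeIfAbsent(term, t -> new TermInfo());
            info.freq++;
            info.positions.add(position++);
            info.offsets.add(new int[] {m.start(), m.end()});
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, TermInfo> vector = build("The quick fox jumps over the lazy fox");
        for (Map.Entry<String, TermInfo> e : vector.entrySet()) {
            System.out.println(e.getKey() + " freq=" + e.getValue().freq
                    + " positions=" + e.getValue().positions);
        }
    }
}
```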

Reader, TokenStream, and byte[] field values:
There are a few other constructors for the Field object that allow you to use values other than String:
- Field(String name, Reader value, TermVector termVector)

Uses a Reader instead of a String to represent the value. In this case, the value can’t be stored (the option is hardwired to Store.NO) and is always analyzed and indexed (Index.ANALYZED). This can be useful when holding the full String in memory might be too costly or inconvenient—for example, for very large values.

- Field(String name, Reader value)

Like the previous constructor, uses a Reader instead of a String to represent the value but defaults termVector to TermVector.NO.

- Field(String name, TokenStream tokenStream, TermVector termVector)

Allows you to preanalyze the field value into a TokenStream. Likewise, such fields aren’t stored and are always analyzed and indexed.

- Field(String name, TokenStream tokenStream)

Like the previous constructor, allows you to preanalyze the field value into a TokenStream but defaults termVector to TermVector.NO.

- Field(String name, byte[] value, Store store)

This is used to store a binary field. Such fields are never indexed (Index.NO) and have no term vectors (TermVector.NO). The store argument must be Store.YES.

- Field(String name, byte[] value, int offset, int length, Store store)

Like the previous constructor, stores a binary field but allows you to reference a subslice of the bytes starting at offset and running for length bytes.

Field option combinations:
You’ve now seen all the options for the three categories (indexing, storing, and term vectors) you can use to control how Lucene handles a field. These options can be set nearly independently of one another, resulting in a number of possible combinations. Table 2.1 lists commonly used options and their example usage, but remember you are free to set the options however you’d like.

Field options for sorting:
When returning documents that match a search, Lucene orders them by their score by default. Sometimes, you need to order results using other criteria. For instance, if you’re searching email messages, you may want to order results by sent or received date, or perhaps by message size or sender. Section 5.2 describes sorting in more detail, but in order to perform field sorting, you must first index the fields correctly.

If the field is numeric, use NumericField, covered in section 2.6.1, when adding it to the document, and sorting will work correctly. If the field is textual, such as the sender’s name in an email message, you must add it as a Field that’s indexed but not analyzed using Field.Index.NOT_ANALYZED. If you aren’t doing any boosting for the field, you should index it without norms, to save disk space and memory, using Field.Index.NOT_ANALYZED_NO_NORMS:

new Field("author", "Arthur C. Clarke", Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);

Multivalued fields:
Suppose your documents have an author field, but sometimes there’s more than one author for a document. One way to handle this would be to loop through all the authors, appending them into a single String, which you could then use to create a Lucene field. Another, perhaps more elegant way is to keep adding the same Field with a different value each time, like this:

Document doc = new Document();
for (String author : authors) {
    doc.add(new Field("author", author,
                      Field.Store.YES,
                      Field.Index.ANALYZED));
}

This is perfectly acceptable and encouraged, as it’s a natural way to represent a field that legitimately has multiple values. Internally, whenever multiple fields with the same name appear in one document, both the inverted index and term vectors will logically append the tokens of the field to one another, in the order the fields were added. You can use advanced options during analysis that control certain important details of this appending, notably how to prevent searches from matching across two different field values; see section 4.7.1 for details. But, unlike indexing, when the fields are stored they’re stored separately in order in the document, so that when you retrieve the document at search time you’ll see multiple Field instances.
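The appending behavior, and why a search can match across two values, can be sketched without Lucene. In the plain-Java example below, `MultiValuedFieldSketch`, `index`, and `phraseMatches` are hypothetical names, tokenization is a simple whitespace split, and the `gap` parameter is this sketch's stand-in for the analysis-time option mentioned above (in Lucene this is, as far as I know, the analyzer's position increment gap). The author names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MultiValuedFieldSketch {
    // One indexed token with its position in the combined field.
    static final class Token {
        final String term;
        final int position;
        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    // Append the tokens of each value in the order the values were added,
    // skipping `gap` extra positions between consecutive values.
    static List<Token> index(List<String> values, int gap) {
        List<Token> tokens = new ArrayList<>();
        int next = 0;
        for (int i = 0; i < values.size(); i++) {
            if (i > 0) {
                next += gap;
            }
            for (String word : values.get(i).toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!word.isEmpty()) {
                    tokens.add(new Token(word, next++));
                }
            }
        }
        return tokens;
    }

    // True if `first` is immediately followed by `second`: an exact two-word phrase.
    static boolean phraseMatches(List<Token> tokens, String first, String second) {
        for (Token a : tokens) {
            for (Token b : tokens) {
                if (a.term.equals(first) && b.term.equals(second)
                        && b.position == a.position + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> authors = Arrays.asList("Jane Smith", "John Doe");
        System.out.println(phraseMatches(index(authors, 0), "smith", "john"));
        System.out.println(phraseMatches(index(authors, 100), "smith", "john"));
        System.out.println(phraseMatches(index(authors, 100), "john", "doe"));
    }
}
```

With no gap, the last token of one value sits directly before the first token of the next, so the phrase "smith john" matches even though no single author has that name. A large gap keeps phrases inside one value matching while preventing matches that straddle two values.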

Thursday, October 18, 2012

[ InAction Note ] Ch2. Building a search index - Field options
