Field is perhaps the most important class when indexing documents: it’s the actual class that holds each value to be indexed. When you create a field, you can specify numerous options to control what Lucene should do with that field once you add the document to the index. We touched on these options at a high level here.
Field options for indexing:
The options for indexing (Field.Index.*) control how the text in the field will be made searchable via the inverted index. Here are the choices:
When Lucene builds the inverted index, by default it stores all necessary information to implement the Vector Space Model. This model requires the count of every term that occurred in the document, as well as the positions of each occurrence (needed, for example, by phrase searches). But sometimes you know the field will be used only for pure Boolean searching and need not contribute to the relevance score. Fields used only for filtering, such as entitlements or date filtering, are a common example.
In this case, you can tell Lucene to skip indexing the term frequency and positions by calling Field.setOmitTermFreqAndPositions(true). This approach will save some disk space in the index, and may also speed up searching and filtering, but will silently prevent searches that require positional information, such as PhraseQuery andSpanQuery, from working.
Field options for storing fields:
The options for stored fields (Field.Store.*) determine whether the field’s exact value should be stored away so that you can later retrieve it during searching:
Lucene includes a helpful utility class, CompressionTools, that exposes static methods to compress and decompress byte arrays. Under the hood it uses Java’s built-injava.util.Zip classes. You can use CompressionTools to compress values before storing them in Lucene. Note that although doing so will save space in your index, depending on how compressible the content is, it will also slow down indexing and searching. You’re spending more CPU in exchange for less disk space used, which for many applications isn’t a good trade-off. If the field values are small, compression is rarely worthwhile.
Field options for term vectors:
Sometimes when you index a document you’d like to retrieve all its unique terms at search time. One common use is to speed up highlighting the matched tokens in stored fields. (Highlighting is covered more in sections 8.3 and 8.4.) Another use is to enable a link, "Find similar documents," that when clicked runs a new search using the salient terms in an original document. Yet another example is automatic categorization of documents. Section 5.9 shows concrete examples of using term vectors once they’re in your index.
But what exactly are term vectors? Term vectors are a mix between an indexed field and a stored field. They’re similar to a stored field because you can quickly retrieve all term vector fields for a given document: term vectors are keyed first by document ID. But then, they’re keyed secondarily by term, meaning they store a miniature inverted index for that one document. Unlike a stored field, where the original String content is stored verbatim, term vectors store the actual separate terms that were produced by the analyzer, allowing you to retrieve all terms for each field, and the frequency of their occurrence within the document, sorted in lexicographic order. Because the tokens coming out of an analyzer also have position and offset information (see section 4.2.1), you can choose separately whether these details are also stored in your term vectors by passing these constants as the fourth argument to the Field constructor:
Reader, TokenStream, and byte field values:
There are a few other constructors for the Field object that allow you to use values other than String:
- Field(String name, Reader value, TermVector termVector)
- Field(String name, Reader value)
- Field(String name, TokenStream tokenStream, TermVector termVector)
- Field(String name, TokenStream tokenStream)
- Field(String name, byte value, Store store)
- Field(String name, byte value, int offset, int length, Store store)
Field option combinations:
You’ve now seen all the options for the three categories (indexing, storing, and term vectors) you can use to control how Lucene handles a field. These options can nearly be set independently, resulting in a number of possible combinations. Table 2.1 lists commonly used options and their example usage, but remember you are free to set the options however you’d like.
Field options for sorting:
When returning documents that match a search, Lucene orders them by their score by default. Sometimes, you need to order results using other criteria. For instance, if you’re searching email messages, you may want to order results by sent or received date, or perhaps by message size or sender. Section 5.2 describes sorting in more detail, but in order to perform field sorting, you must first index the fields correctly.
If the field is numeric, use NumericField, covered in section 2.6.1, when adding it to the document, and sorting will work correctly. If the field is textual, such as the sender’s name in an email message, you must add it as a Field that’s indexed but not analyzed using Field.Index.NOT_ANALYZED. If you aren’t doing any boosting for the field, you should index it without norms, to save disk space and memory, using Field.Index.NOT_ANALYZED_NO_NORMS:
Suppose your documents have an author field, but sometimes there’s more than one author for a document. One way to handle this would be to loop through all the authors, appending them into a single String, which you could then use to create a Lucene field. Another, perhaps more elegant way is to keep adding the same Field with different value, like this: