This chapter covers
In chapter 1, you saw a simple indexing example. This chapter goes further and teaches you about index updates, parameters you can use to tune the indexing process, and more advanced indexing techniques that will help you get the most out of Lucene.
How Lucene models content :
Let’s first walk through Lucene’s conceptual approach to modeling content. We’ll start with its fundamental units of indexing and searching, documents and fields, then move on to important differences between Lucene and the more structured model of modern databases.
- Documents and fields
A document is Lucene’s atomic unit of indexing and searching. It’s a container that holds one or more fields, which in turn contain the “real” content. Each field has a name to identify it, a text or binary value, and a series of detailed options that describe what Lucene should do with the field’s value when you add the document to the index. To index your raw content sources, you must first translate them into Lucene’s documents and fields. Then, at search time, it’s the field values that are searched; for example, users could search for "title:lucene" to find all documents whose title field value contains the term lucene.
At a high level, there are three things Lucene can do with each field:
- Flexible schema
Unlike a database, Lucene has no notion of a fixed global schema. In other words, each document you add to the index is a blank slate and can be completely different from the document before it: it can have whatever fields you want, with any indexing and storing and term vector options. It need not have the same fields as the previous document you added.
Lucene’s flexible schema also means a single index can hold documents that represent different entities. For instance, you could have documents that represent retail products with fields such as name and price, and documents that represent people with fields such as name, age, and gender. You could also include unsearchable "meta" documents, which simply hold metadata about the index or your application (such as what time the index was last updated or which product catalog was indexed) but are never included in search results.
One common challenge is resolving any “mismatch” between the structure of your documents and what Lucene can represent. For example, XML can describe a recursive document structure by nesting tags within one another. A database can have an arbitrary number of joins, via primary and secondary keys, relating tables to one another. Yet Lucene documents are flat. Such recursion and joins must be denormalized when creating your documents. Open source projects that build on Lucene, like Hibernate Search, Compass, LuSQL, DBSight, Browse Engine, and Oracle/Lucene integration, each have different and interesting approaches for handling this denormalization.
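As a rough illustration of denormalization, the following plain-Java sketch flattens a nested record into a flat set of named values, the shape a Lucene document requires. It does not use Lucene's API: the class name FlattenSketch and the dotted field-naming scheme are assumptions made for this example, and the projects listed above each handle this in their own way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: turns a nested map into flat "parent.child" field names.
class FlattenSketch {

    static Map<String, String> flatten(Map<String, Object> nested) {
        Map<String, String> flat = new LinkedHashMap<>();
        flattenInto("", nested, flat);
        return flat;
    }

    private static void flattenInto(String prefix,
                                    Map<String, Object> node,
                                    Map<String, String> out) {
        for (Map.Entry<String, Object> entry : node.entrySet()) {
            String name = prefix.isEmpty()
                    ? entry.getKey()
                    : prefix + "." + entry.getKey();
            Object value = entry.getValue();
            if (value instanceof Map) {
                // Recurse into nested structure; the nesting survives only in the name.
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                flattenInto(name, child, out);
            } else {
                out.put(name, String.valueOf(value));
            }
        }
    }
}
```

Each entry of the resulting map would then become one field of a single flat document.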
Understanding the indexing process :
Only a few methods of Lucene’s public API need to be called in order to index a document. As a result, from the outside, indexing with Lucene looks like a deceptively simple and monolithic operation. But behind the simple API lies an interesting and relatively complex set of operations that we can break down into three major and functionally distinct groups, as described in the following sections and shown in figure 2.1.
Figure 2.1 Indexing with Lucene breaks down into three main operations: extracting text from source documents, analyzing it, and saving it to the index
During indexing, the text is first extracted from the original content and used to create an instance of Document, containing Field instances to hold the content. The text in the fields is then analyzed to produce a stream of tokens. Finally, those tokens are added to the index in a segmented architecture. Let’s talk about text extraction first.
- Extracting text and creating the document
To index data with Lucene, you must extract plain text from it, the format that Lucene can digest, and then create a Lucene document. Suppose you need to index a set of manuals in PDF format. To prepare these manuals for indexing, you must first find a way to extract the textual information from the PDF documents and use that extracted text to create Lucene documents and their fields. No methods would accept a PDF Java type, even if such a type existed. You face the same situation if you want to index Microsoft Word documents or any document format other than plain text.
The details of text extraction are in chapter 7 where we describe the Tika framework, which makes it almost too simple to extract text from documents in diverse formats. Once you have the text you’d like to index, and you’ve created a document with all fields you’d like to index, all text must then be analyzed.
Once you’ve created Lucene documents populated with fields, you can call IndexWriter’s addDocument method and hand your data off to Lucene to index. When you do that, Lucene first analyzes the text, a process that splits the textual data into a stream of tokens, and performs a number of optional operations on them. For instance, the tokens could be lowercased before indexing, to make searches case insensitive, using Lucene’s LowerCaseFilter. Typically it’s also desirable to remove all stop words, which are frequent but meaningless tokens, from the input (for example a, an, the, in, on, and so on, in English text) using StopFilter. Similarly, it’s common to process input tokens to reduce them to their roots, for example by using PorterStemFilter for English text (similar classes exist in Lucene’s contrib analysis module, for other languages). The combination of an original source of tokens, followed by the series of filters that modify the tokens produced by that source, makes up the analyzer. You are also free to build your own analyzer by chaining together Lucene’s token sources and filters, or your own, in customized ways.
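The idea of a token source followed by a chain of filters can be sketched in plain Java. This is only a conceptual model and does not use LowerCaseFilter, StopFilter, or any other Lucene class: the class name AnalysisSketch, the splitting rule, and the five-word stop list are assumptions chosen for the example, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of an analysis chain: tokenize, lowercase, drop stop words.
class AnalysisSketch {

    // Illustrative stop list only; not Lucene's built-in set.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "in", "on");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Token source: split on anything that is not a letter or digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading delimiter yields an empty first piece
            }
            String token = raw.toLowerCase(Locale.ROOT); // lowercasing step
            if (STOP_WORDS.contains(token)) {
                continue;                                // stop-word step
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, jumped, lazy, dog]
        System.out.println(analyze("The quick brown Fox jumped on the Lazy dog"));
    }
}
```

Because lowercasing runs before the stop-word check, "The" and "the" are both removed, which is why the order of the steps in a chain matters.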
The input to Lucene can be analyzed in so many interesting and useful ways that we cover this process in detail in chapter 4. The analysis process produces a stream of tokens that are then written into the files in the index.
- Adding to the index
After the input has been analyzed, it’s ready to be added to the index. Lucene stores the input in a data structure known as an inverted index. This data structure makes efficient use of disk space while allowing quick keyword lookups. What makes this structure inverted is that it uses tokens extracted from input documents as lookup keys instead of treating documents as the central entities, much like the index of this book references the page number(s) where a concept occurs. In other words, rather than trying to answer the question “What words are contained in this document?” this structure is optimized for providing quick answers to “Which documents contain word X?”
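A minimal in-memory model makes the inversion concrete. The sketch below is plain Java, not Lucene's on-disk format: the class name InvertedIndexSketch, the whitespace tokenization, and the use of list positions as document IDs are all assumptions for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: maps each token to the IDs of the documents containing it.
class InvertedIndexSketch {

    static Map<String, SortedSet<Integer>> build(List<String> docs) {
        Map<String, SortedSet<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String token : docs.get(docId).toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                // The token is the lookup key; the document ID is the value.
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = build(List.of(
                "lucene in action",
                "lucene indexing",
                "search in depth"));
        // "Which documents contain word X?" is a single map lookup.
        System.out.println(index.get("lucene")); // prints [0, 1]
        System.out.println(index.get("in"));     // prints [0, 2]
    }
}
```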
If you think about your favorite web search engine and the format of your typical query, you’ll see that this is exactly the query that you want to be as quick as possible. At the core of today’s web search engines are inverted indexes. Lucene’s index directory has a unique segmented architecture, which we describe next.
Lucene has a rich and detailed index file format that has been carefully optimized over time. Although you don’t need to know the details of this format in order to use Lucene, it’s still helpful to have some basic understanding at a high level.
Every Lucene index consists of one or more segments, as depicted in figure 2.2. Each segment is a standalone index, holding a subset of all indexed documents. A new segment is created whenever the writer flushes buffered added documents and pending deletions into the directory. At search time, each segment is visited separately and the results are combined.
Figure 2.2 Segmented structure of a Lucene inverted index
Each segment, in turn, consists of multiple files, of the form _X.
There’s one special file, referred to as the segments file and named segments_
Naturally, over time the index will accumulate many segments, especially if you open and close your writer frequently. This is fine. Periodically, IndexWriter will select segments and coalesce them by merging them into a single new segment and then removing the old segments. The selection of segments to be merged is governed by a separate MergePolicy. Once merges are selected, their execution is done by the MergeScheduler.
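The life cycle of segments (flush, per-segment search, merge) can be modeled with a toy in-memory structure. This plain-Java sketch is an assumption-laden simplification and not how IndexWriter is implemented: the class SegmentSketch and its methods are hypothetical, and mergeAll coalesces every segment at once, whereas a real MergePolicy selects subsets of segments to merge.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch: an index made of independent segments.
class SegmentSketch {

    // Each segment maps a term to the IDs of the documents containing it.
    private final List<Map<String, SortedSet<Integer>>> segments = new ArrayList<>();
    private Map<String, SortedSet<Integer>> buffer = new HashMap<>();
    private int nextDocId = 0;

    // Added documents are buffered in memory until the next flush.
    void addDocument(String text) {
        int docId = nextDocId++;
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                buffer.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Flushing turns the buffered documents into a new standalone segment.
    void flush() {
        if (!buffer.isEmpty()) {
            segments.add(buffer);
            buffer = new HashMap<>();
        }
    }

    // Search visits each segment separately and combines the results.
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = new TreeSet<>();
        for (Map<String, SortedSet<Integer>> segment : segments) {
            hits.addAll(segment.getOrDefault(term, new TreeSet<>()));
        }
        return hits;
    }

    // Merging writes one new segment and then drops the old ones.
    void mergeAll() {
        Map<String, SortedSet<Integer>> merged = new HashMap<>();
        for (Map<String, SortedSet<Integer>> segment : segments) {
            segment.forEach((term, ids) ->
                    merged.computeIfAbsent(term, t -> new TreeSet<>()).addAll(ids));
        }
        segments.clear();
        if (!merged.isEmpty()) {
            segments.add(merged);
        }
    }

    int segmentCount() {
        return segments.size();
    }
}
```

In this model a search returns the same hits before and after mergeAll; only the number of segments visited changes.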
Basic index operations :
Now it’s time to look at some real code, using Lucene’s APIs to add, remove, and update documents. We start with adding documents to an index since that’s the most frequent operation.
- Adding documents to an index
Let’s look at how to create a new index and add documents to it. There are two methods for adding documents:
* addDocument(Document)
* addDocument(Document, Analyzer)
Listing 2.1 shows all the steps necessary to create a new index and add two tiny documents.
- Listing 2.1 Adding documents to an index
IndexWriter will detect that there’s no prior index in this Directory and create a new one. If there were an existing index, IndexWriter would simply add to it. There are numerous IndexWriter constructors. Some explicitly take a create argument, allowing you to force a new index to be created over an existing one. More advanced constructors allow you to specify your own IndexDeletionPolicy or IndexCommit for expert use cases, as described in section 2.13.
- Deleting documents from an index
Although most applications are more concerned with getting documents into a Lucene index, some also need to remove them. IndexWriter provides various methods to remove documents from an index:
If you intend to delete a single document by Term, you must ensure you’ve indexed a Field on every document and that all field values are unique so that each document can be singled out for deletion. This is the same concept as a primary key column in a database table, but in no way is it enforced by Lucene. This field should be indexed as an unanalyzed field (see section 2.4.1) to ensure the analyzer doesn’t break it up into separate tokens. Then, use the field for document deletion like this:
- Listing 2.2 Deleting documents from an index
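To see why the ID field should be left unanalyzed, consider the following plain-Java model of deletion by exact term. It is a sketch that does not call Lucene: the class DeleteByTermSketch, its analyzed flag, and the rule of splitting on non-alphanumeric characters are assumptions standing in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: deletion matches documents by an exact indexed term.
class DeleteByTermSketch {

    // For each document, the set of terms indexed for its "id" field.
    private final List<Set<String>> idTerms = new ArrayList<>();

    void addDocument(String id, boolean analyzed) {
        Set<String> terms = new HashSet<>();
        if (analyzed) {
            // An analyzer may break "doc-1" into the separate terms "doc" and "1".
            terms.addAll(Arrays.asList(id.toLowerCase().split("[^a-z0-9]+")));
        } else {
            // Unanalyzed: the whole value is indexed as a single term.
            terms.add(id);
        }
        idTerms.add(terms);
    }

    // Deletes every document whose id field contains the exact term;
    // returns how many documents were removed.
    int deleteDocuments(String term) {
        int before = idTerms.size();
        idTerms.removeIf(terms -> terms.contains(term));
        return before - idTerms.size();
    }

    int numDocs() {
        return idTerms.size();
    }
}
```

With unanalyzed IDs "doc-1" and "doc-2", deleting the term "doc-1" removes exactly one document. With analyzed IDs, the same call removes nothing, and deleting the term "doc" removes both.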
- Updating documents in the index
In some cases you may want to update only certain fields of a document. Perhaps the title changed but the body was unchanged. Unfortunately, Lucene can’t do that: instead, it deletes the entire previous document and then adds a new document to the index. This requires that the new document contains all fields, even unchanged ones, from the original document. IndexWriter provides two convenience methods to replace a document in the index:
* updateDocument(Term, Document)
* updateDocument(Term, Document, Analyzer)
The updateDocument methods are probably the most common way to handle deletion because they’re typically used to replace a single document in the index that has changed. Note that these methods are simply shorthand for first calling deleteDocuments(Term) and then addDocument. Use updateDocument like this:
- Listing 2.3 Updating indexed Documents
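The delete-then-add semantics can be modeled in a few lines of plain Java. This is a conceptual sketch, not Lucene's implementation: the class UpdateSketch, its map-based documents, and the findFirst helper are hypothetical, and only the method names deleteDocuments, addDocument, and updateDocument echo the text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an update is a delete followed by an add.
class UpdateSketch {

    private final List<Map<String, String>> docs = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        docs.add(new LinkedHashMap<>(doc));
    }

    // Removes every document whose field holds exactly this value.
    void deleteDocuments(String field, String value) {
        docs.removeIf(doc -> value.equals(doc.get(field)));
    }

    // The old document is dropped entirely; nothing is carried over.
    void updateDocument(String field, String value, Map<String, String> replacement) {
        deleteDocuments(field, value);
        addDocument(replacement);
    }

    // Returns the first document whose field equals the value, or null.
    Map<String, String> findFirst(String field, String value) {
        for (Map<String, String> doc : docs) {
            if (value.equals(doc.get(field))) {
                return doc;
            }
        }
        return null;
    }

    int numDocs() {
        return docs.size();
    }
}
```

If a document with id, title, and body fields is replaced by one that carries only id and title, the body is gone afterward, which is why the replacement must contain every field you want to keep.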