程式扎記: [ InAction Note ] Ch1. Meet Lucene - Understanding the core searching/indexing classes

Sunday, October 7, 2012


Understanding the core indexing classes :
As you saw in our Indexer class (A simple application), you need the following classes to perform the simplest indexing procedure:
IndexWriter
Directory
Analyzer
Document
Field


Figure 1.5 shows how these classes each participate in the indexing process. What follows is a brief overview of each of these classes, to give you a rough idea of their role in Lucene. We’ll use these classes throughout this book.


- IndexWriter
IndexWriter is the central component of the indexing process. This class creates a new index or opens an existing one, and adds, removes, or updates documents in the index. Think of IndexWriter as an object that gives you write access to the index but doesn’t let you read or search it. IndexWriter needs somewhere to store its index, and that’s what Directory is for.

- Directory
The Directory class represents the location of a Lucene index. It’s an abstract class that allows its subclasses to store the index as they see fit. In our Indexer example, we used FSDirectory.open to get a suitable concrete FSDirectory implementation that stores real files in a directory on the file system, and passed that in turn to IndexWriter’s constructor.

Lucene includes a number of interesting Directory implementations, covered in section 2.10. IndexWriter can’t index text unless it’s first been broken into separate words, using an analyzer.

- Analyzer
Before text is indexed, it’s passed through an analyzer. The analyzer, specified in the IndexWriter constructor, is in charge of extracting those tokens out of text that should be indexed and eliminating the rest. If the content to be indexed isn’t plain text, you should first extract plain text from it before indexing. Chapter 7 shows how to use Tika to extract text from the most common rich-media document formats. Analyzer is an abstract class, but Lucene comes with several implementations of it. Some of them deal with skipping stop words (frequently used words that don’t help distinguish one document from the other, such as a, an, the, in, and on); some deal with conversion of tokens to lowercase letters, so that searches aren’t case sensitive; and so on. Analyzers are an important part of Lucene and can be used for much more than simple input filtering. For a developer integrating Lucene into an application, the choice of analyzer(s) is a critical element of application design. You’ll learn much more about them in chapter 4.
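To make the stop-word and lowercasing steps concrete, here is a minimal sketch in plain Java. It is not Lucene's Analyzer API: the class name ToyAnalyzer, the method analyze, and the five-word stop list (taken from the examples in the paragraph above) are all assumptions made for illustration. Real analyzers are considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer in plain Java; NOT Lucene's Analyzer API.
final class ToyAnalyzer {
    // Hypothetical stop-word list, copied from the examples in the text.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "in", "on"));

    // Split on runs of non-letter/non-digit characters, lowercase each
    // token, and drop the tokens that are stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading element
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Fox jumped on a Log in Lucene"));
    }
}
```

Because both the indexed text and the query text would pass through the same steps, a search for "Lucene" and one for "lucene" end up looking for the same token.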

The analysis process requires a document, containing separate fields to be indexed.

- Document
The Document class represents a collection of fields. Think of it as a virtual document—a chunk of data, such as a web page, an email message, or a text file—that you want to make retrievable at a later time. Fields of a document represent the document or metadata associated with that document. The original source (such as a database record, a Microsoft Word document, a chapter from a book, and so on) of document data is irrelevant to Lucene. It’s the text that you extract from such binary documents, and add as a Field instance, that Lucene processes. The metadata (such as author, title, subject and date modified) is indexed and stored separately as fields of a document.

Lucene only deals with text and numbers. Lucene’s core doesn’t itself handle anything but java.lang.String, java.io.Reader, and native numeric types (such as int or float). Although various types of documents can be indexed and made searchable, processing them isn’t as straightforward as processing purely textual or numeric content. You’ll learn more about handling nontext documents in chapter 7.

In our Indexer, we’re concerned with indexing text files. So, for each text file we find, we create a new instance of the Document class, populate it with fields (described next), and add that document to the index, effectively indexing the file. Similarly, in your application, you must carefully design how a Lucene document and its fields will be constructed to match specific needs of your content sources and application.

A document is simply a container for multiple fields; Field is the class that holds the textual content to be indexed.

- Field
Each document in an index contains one or more named fields, embodied in a class called Field. Each field has a name and corresponding value, and a bunch of options, described in section 2.4, that control precisely how Lucene will index the field’s value. A document may have more than one field with the same name. In this case, the values of the fields are appended, during indexing, in the order they were added to the document. When searching, it’s exactly as if the text from all the fields were concatenated and treated as a single text field.
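The behavior of same-named fields can be sketched in plain Java as follows. This is a toy model, not Lucene's Document or Field classes: ToyDocument, ToyField, and indexedText are hypothetical names, and joining the values with a single space is an assumption made only to show the ordering.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model in plain Java; NOT Lucene's Document/Field classes.
final class ToyDocument {
    // One (name, value) pair.
    static final class ToyField {
        final String name;
        final String value;

        ToyField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    private final List<ToyField> fields = new ArrayList<>();

    // Fields are kept in the order they were added; names may repeat.
    ToyDocument add(String name, String value) {
        fields.add(new ToyField(name, value));
        return this;
    }

    // The text seen for a field name: the values of all same-named
    // fields, appended in insertion order.
    String indexedText(String name) {
        StringBuilder sb = new StringBuilder();
        for (ToyField f : fields) {
            if (f.name.equals(name)) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.value);
            }
        }
        return sb.toString();
    }
}
```

For example, adding an "author" field with value "alice", then a "title" field, then a second "author" field with value "bob" gives "alice bob" as the text for "author".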

You’ll apply this handful of classes most often when using Lucene for indexing. To implement basic search functionality, you need to be familiar with an equally small and simple set of Lucene search classes.

Understanding the core searching classes :
The basic search interface that Lucene provides is as straightforward as the one for indexing. Only a few classes are needed to perform the basic search operation:
IndexSearcher
Term
Query
TermQuery
TopDocs


The following sections provide a brief introduction to these classes. Chapter 3 will have more advanced topics on them.

- IndexSearcher
IndexSearcher is to searching what IndexWriter is to indexing: the central link to the index that exposes several search methods. You can think of IndexSearcher as a class that opens an index in a read-only mode. It requires a Directory instance, holding the previously created index, and then offers a number of search methods, some of which are implemented in its abstract parent class Searcher; the simplest takes a Query object and an int topN count as parameters and returns a TopDocs object. A typical use of this method looks like this:
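  // Open the previously created index for read-only searching
  Directory dir = FSDirectory.open(new File("/tmp/index"));
  IndexSearcher searcher = new IndexSearcher(dir);
  // Match documents whose "contents" field contains the term "lucene"
  Query q = new TermQuery(new Term("contents", "lucene"));
  // Ask for at most the 10 best-scoring hits
  TopDocs hits = searcher.search(q, 10);
  searcher.close();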
We cover the details of IndexSearcher in chapter 3, along with more advanced information in chapters 5 and 6. Now we’ll visit the fundamental unit of searching, Term.

- Term
Term is the basic unit for searching. Similar to the Field object, it consists of a pair of string elements: the name of the field and the word (text value) of that field. Note that Term objects are also involved in the indexing process. However, they’re created by Lucene’s internals, so you typically don’t need to think about them while indexing. During searching, you may construct Term objects and use them together with TermQuery:
  // A Term is a (field name, text) pair; TermQuery matches it exactly
  Query q = new TermQuery(new Term("contents", "lucene"));
  TopDocs hits = searcher.search(q, 10);
This code instructs Lucene to find the top 10 documents that contain the word lucene in a field named contents, sorting the documents by descending relevance.
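Why a (field, text) pair is a natural lookup key can be sketched with a toy inverted index in plain Java. This is not how Lucene stores its index, and ToyTermIndex, ToyTerm, add, and lookup are hypothetical names; the sketch only shows that the same word in a different field is a different term.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index in plain Java; NOT Lucene's Term/TermQuery classes.
final class ToyTermIndex {
    // A term is a (field name, text) pair, as described above.
    static final class ToyTerm {
        final String field;
        final String text;

        ToyTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof ToyTerm)) {
                return false;
            }
            ToyTerm t = (ToyTerm) o;
            return field.equals(t.field) && text.equals(t.text);
        }

        @Override
        public int hashCode() {
            return Objects.hash(field, text);
        }
    }

    // Maps each term to the sorted IDs of the documents containing it.
    private final Map<ToyTerm, SortedSet<Integer>> postings = new HashMap<>();

    // Record that document docId contains the given term.
    void add(int docId, String field, String text) {
        postings.computeIfAbsent(new ToyTerm(field, text), k -> new TreeSet<>()).add(docId);
    }

    // Exact-match lookup: IDs of the documents containing the term.
    SortedSet<Integer> lookup(ToyTerm term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySortedSet());
    }
}
```

In this sketch, looking up ("contents", "lucene") never returns a document that only has "lucene" in its "title" field, which mirrors the field-scoped matching described above.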

- Query
Lucene comes with a number of concrete Query subclasses. So far in this chapter we’ve mentioned only the most basic Lucene Query: TermQuery. Other Query types are BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, TermRangeQuery, NumericRangeQuery, FilteredQuery, and SpanQuery. All of these are covered in chapters 3 and 5. Query is the common, abstract parent class. It contains several utility methods, the most interesting of which is setBoost(float), which enables you to tell Lucene that certain subqueries should have a stronger contribution to the final relevance score than other subqueries. The setBoost() method is described in section 3.5.12. Next we cover TermQuery, which is the building block for most complex queries in Lucene.

- TermQuery
TermQuery is the most basic type of query supported by Lucene, and it’s one of the primitive query types. It’s used for matching documents that contain fields with specific values, as you’ve seen in the last few paragraphs. Finally, wrapping up our brief tour of the core classes used for searching, we touch on TopDocs, which represents the result set returned by searching.

- TopDocs
The TopDocs class is a simple container of pointers to the top N ranked search results, that is, the documents that match a given query. For each of the top N results, TopDocs records the int docID (which you can use to retrieve the document) as well as the float score. Chapter 3 describes TopDocs in more detail.
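The idea of a top N result set holding (docID, score) pairs can be sketched in plain Java like this. It is not Lucene's TopDocs class and does not reflect how Lucene collects hits internally; ToyTopDocs, Hit, and topN are hypothetical names, and breaking score ties by lower docID is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy top-N selection in plain Java; NOT Lucene's TopDocs class.
final class ToyTopDocs {
    // One hit: an int document ID plus its float relevance score.
    static final class Hit {
        final int docId;
        final float score;

        Hit(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Return at most n hits, highest score first.
    // Ties are broken by lower docId (an assumption of this sketch).
    static List<Hit> topN(Map<Integer, Float> scoresByDoc, int n) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<Integer, Float> e : scoresByDoc.entrySet()) {
            hits.add(new Hit(e.getKey(), e.getValue()));
        }
        hits.sort((a, b) -> {
            int byScore = Float.compare(b.score, a.score); // descending score
            return byScore != 0 ? byScore : Integer.compare(a.docId, b.docId);
        });
        int limit = Math.max(0, Math.min(n, hits.size()));
        return new ArrayList<>(hits.subList(0, limit));
    }
}
```

A caller would walk the returned list in order and use each docId to fetch the matching document, much as the docID in TopDocs is used to retrieve a document.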

This message was edited 27 times. Last update was at 08/10/2012 10:31:46
