This chapter covers :
Lucene is a powerful Java search library that lets you easily add search to any application. In recent years Lucene has become exceptionally popular and is now the most widely used information retrieval library: it powers the search features behind many websites and desktop applications.
In this chapter we cover the overall architecture of a typical search application and where Lucene fits. It’s crucial to recognize that Lucene is simply a search library, and you’ll need to handle the other components of a search application (crawling, document filtering, runtime server, user interface, administration, etc.) as your application requires. We show you how to perform basic indexing and searching with ready-to-use code examples.
Dealing with information explosion :
As you’ll soon discover, Lucene provides a simple yet powerful core API that requires minimal understanding of full-text indexing and searching. You need to learn about only a handful of its classes in order to start integrating Lucene into an application. Because Lucene is a Java library, it doesn’t make assumptions about what it indexes and searches, which gives it an advantage over a number of other search applications. Its design is compact and simple, allowing Lucene to be easily embedded into desktop applications.
Lucene’s website, at http://lucene.apache.org/java, is a great place to learn more about the current status of Lucene. There you’ll find the tutorial, Javadocs for Lucene’s API for all recent releases, an issue-tracking system, links for downloading releases, and Lucene’s wiki (http://wiki.apache.org/lucene-java), which contains many community-created and -maintained pages.
- What Lucene can do
Lucene allows you to add search capabilities to your application. Lucene can index and make searchable any data that you can extract text from. Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.
- History of Lucene
Lucene was written by Doug Cutting; it was initially available for download from its home at the SourceForge website. It joined the Apache Software Foundation’s Jakarta family of high-quality open source Java products in September 2001 and became its own top-level Apache project in February 2005.
Lucene and the components of a search application :
It’s important to grasp the big picture so that you have a clear understanding of which parts Lucene can handle and which parts your application must handle separately. A common misconception is that Lucene is an entire search application, when in fact it’s simply the core indexing and searching component.
A search application starts with an indexing chain, which in turn requires separate steps to retrieve the raw content; create documents from the content, possibly extracting text from binary documents; and index the documents. Once the index is built, the components required for searching are equally diverse, including a user interface, a means for building up a programmatic query, query execution (to retrieve matching documents), and results rendering. Search engines generally share a common overall architecture, as shown in the figure below:
Let’s walk through a search application one component at a time, pointing out which components Lucene can handle (the boxes with a green background in the figure above). We’ll then wrap up with a summary of Lucene’s role in your search application.
- Components for indexing
To search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index.
You can think of an index as a data structure that allows fast random access to words stored inside it. The concept behind it is analogous to an index at the end of a book, which lets you quickly locate pages that discuss certain topics. In the case of Lucene, an index is a specially designed data structure, typically stored on the file system as a set of index files.
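To make the "fast random access to words" idea concrete, here is a minimal sketch of an inverted index in plain Java. It is only an illustration of the concept and does not use Lucene or reflect Lucene's actual on-disk format; the class name `InvertedIndexSketch` and its methods are made up for this example, and splitting on non-word characters is an assumed simplification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each word to the IDs of the documents containing it.
// Hypothetical illustration only; this is not how Lucene stores its index.
class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    private final List<String> documents = new ArrayList<>();

    // Adds a document and returns its ID (its position in insertion order).
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        // Assumed simplification: lowercase, then split on runs of non-word characters.
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Looks a word up directly in the map instead of scanning every document.
    Set<Integer> search(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene is a search library");   // document 0
        index.add("An index makes search fast");   // document 1
        index.add("Books have an index too");      // document 2
        System.out.println(index.search("search")); // prints [0, 1]
        System.out.println(index.search("index"));  // prints [1, 2]
    }
}
```

The lookup cost depends on the number of distinct words rather than on the total amount of text, which is the property that makes indexing worthwhile compared with sequential scanning.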
Lucene, as a core search library, doesn’t provide any functionality to support acquiring content. This is entirely up to your application, or a separate piece of software. A number of open source crawlers are available, among them the following:
* Solr (http://lucene.apache.org/solr), a sister project under the Apache Lucene umbrella, has support for natively ingesting relational databases and XML feeds, as well as handling rich documents through Tika integration. (We cover Tika in chapter 7.)
* Nutch (http://lucene.apache.org/nutch), another sister project under the Apache Lucene umbrella, has a high-scale crawler that’s suitable for discovering content by crawling websites.
* Grub (http://www.grub.org) is a popular open source web crawler.
* Heritrix is Internet Archive’s open source crawler (http://crawler.archive.org).
* Droids, another subproject under the Apache Lucene umbrella, is currently under Apache incubation at http://incubator.apache.org/droids.
* Aperture (http://aperture.sourceforge.net) has support for crawling websites, file systems, and mail boxes and for extracting and indexing text.
* The Google Enterprise Connector Manager project (http://code.google.com/p/google-enterprise-connector-manager) provides connectors for a number of nonweb repositories.
Lucene provides an API for building fields and documents, but it doesn’t provide any logic to build a document because that’s entirely application specific. It also doesn’t provide any document filters, although Lucene has a sister project at Apache, Tika, which handles document filtering very well (see chapter 7).
Before text can be indexed it must be analyzed, that is, broken into tokens. Lucene provides an array of built-in analyzers that give you fine control over this process. It’s also straightforward to build your own analyzer, or create arbitrary analyzer chains combining Lucene’s tokenizers and token filters, to customize how tokens are created. The final step is to index the document.
We’ve now reviewed the typical indexing steps for a search application. Next we’ll visit the steps involved in searching.
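The idea of a tokenizer followed by a chain of token filters can be sketched in plain Java as below. This is not Lucene's analyzer API: the class `AnalyzerChainSketch`, its method names, and the stop word list are all hypothetical and chosen only to show how each stage transforms the token stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical analysis chain: tokenizer -> lowercase filter -> stop word filter.
// It mimics the concept only and does not use any Lucene classes.
class AnalyzerChainSketch {
    // Assumed stop word list, picked just for this example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of"));

    // Tokenizer: splits raw text on runs of non-word characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Token filter: normalizes every token to lowercase.
    static List<String> lowerCase(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // Token filter: drops tokens that appear in the stop word list.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    // The chain: each stage consumes the output of the previous one.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        // prints [quick, brown, fox, friend, lazy, dog]
        System.out.println(analyze("The Quick Brown Fox is a friend of the Lazy Dog"));
    }
}
```

Swapping, removing, or reordering the filter stages changes which tokens end up in the index, which is why the choice of analyzer matters for what a search can later match.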
- Components for searching
Searching is the process of looking up words in an index to find documents where they appear. The quality of a search is typically described using precision and recall metrics. Beyond these, you must consider a number of other factors when thinking about searching. We already mentioned speed and the ability to quickly search large quantities of text. Support for single and multiterm queries, phrase queries, wildcards, fuzzy queries, result ranking, and sorting are also important, as is a friendly syntax for entering those queries. Lucene offers a number of search features, bells, and whistles, so many that we had to spread our search coverage over three chapters (chapters 3, 5, and 6).
Let’s work through the typical components of a search engine, this time working from the top down in the previous figure, starting with the search user interface.
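As a small worked example of the two quality metrics, the sketch below uses the standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The class name `PrecisionRecallSketch` and the document IDs are hypothetical, and the relevance judgments are made-up sample data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Computes precision and recall for one query from two sets of document IDs.
class PrecisionRecallSketch {
    // Number of documents that are both retrieved and relevant.
    static int overlap(Set<String> retrieved, Set<String> relevant) {
        Set<String> both = new HashSet<>(retrieved);
        both.retainAll(relevant);
        return both.size();
    }

    // Fraction of the retrieved documents that are relevant.
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) overlap(retrieved, relevant) / retrieved.size();
    }

    // Fraction of the relevant documents that were retrieved.
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // convention chosen for this sketch
        }
        return (double) overlap(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up sample data: the engine returned four documents,
        // while three documents in the corpus are actually relevant.
        Set<String> retrieved = new HashSet<>(Arrays.asList("d1", "d2", "d3", "d4"));
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d3", "d5"));
        // 2 of the 4 retrieved documents are relevant.
        System.out.println("precision = " + precision(retrieved, relevant));
        // 2 of the 3 relevant documents were retrieved.
        System.out.println("recall = " + recall(retrieved, relevant));
    }
}
```

The two metrics pull in opposite directions: returning every document gives perfect recall with poor precision, while returning only one sure hit tends to give high precision with poor recall.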
SEARCH USER INTERFACE
Lucene doesn’t provide any default search UI; it’s entirely up to your application to build one. Once a user interacts with your search interface, she or he submits a search request, which first must be translated into an appropriate Query object for the search engine.
Query objects can be simple or complex. Lucene provides a powerful package, called QueryParser, to process the user’s text into a query object according to a common search syntax. We’ll cover it and its syntax in chapter 3, but it’s also fully described at http://lucene.apache.org/java/3_0_0/queryparsersyntax.html.
Query execution covers the complex inner workings of the search engine, and Lucene handles all of it for you. Lucene is also wonderfully extensible at this point, so if you’d like to customize how results are gathered, filtered, sorted, and so forth, it’s straightforward. See chapter 6 for details.
There are three common theoretical models of search:
* Pure Boolean model—Documents either match or don’t match the provided query, and no scoring is done. In this model there are no relevance scores associated with matching documents, and the matching documents are unordered; a query simply identifies a subset of the overall corpus as matching the query.
* Vector space model—Both queries and documents are modeled as vectors in a high dimensional space, where each unique term is a dimension. Relevance, or similarity, between a query and a document is computed by a vector distance measure between these vectors.
* Probabilistic model—In this model, you compute the probability that a document is a good match to a query using a full probabilistic approach.
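The difference between the first two models can be sketched in plain Java. The Boolean check below only answers "match or no match", while the vector space variant produces a score that can be used for ranking. This is an illustration under simplifying assumptions: it uses raw term counts as vector weights and cosine similarity as the measure, which is not Lucene's actual scoring formula, and the class `SearchModelsSketch` is hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Contrasts a pure Boolean match with a vector space (cosine) score.
// Assumption: term weights are raw counts; real engines use richer weighting.
class SearchModelsSketch {
    // Turns text into a sparse vector: one dimension per unique term.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Pure Boolean model (AND semantics): every query term must occur; no score.
    static boolean booleanMatch(String query, String document) {
        return termCounts(document).keySet().containsAll(termCounts(query).keySet());
    }

    static double sumOfSquares(Map<String, Integer> vector) {
        double sum = 0.0;
        for (int count : vector.values()) {
            sum += (double) count * count;
        }
        return sum;
    }

    // Vector space model: cosine of the angle between query and document vectors.
    static double cosine(String query, String document) {
        Map<String, Integer> q = termCounts(query);
        Map<String, Integer> d = termCounts(document);
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : q.entrySet()) {
            dot += entry.getValue() * d.getOrDefault(entry.getKey(), 0);
        }
        double norm = Math.sqrt(sumOfSquares(q)) * Math.sqrt(sumOfSquares(d));
        return norm == 0.0 ? 0.0 : dot / norm;
    }

    public static void main(String[] args) {
        String query = "lucene search";
        String docA = "Lucene is a search library";
        String docB = "Lucene was written by Doug Cutting";
        System.out.println(booleanMatch(query, docA)); // prints true
        System.out.println(booleanMatch(query, docB)); // prints false
        // docA shares both query terms, docB only one, so docA scores higher.
        System.out.println(cosine(query, docA) > cosine(query, docB)); // prints true
    }
}
```

Note that the Boolean model rejects the second document outright, whereas the vector space model still gives it a nonzero score because it shares one term with the query.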
Lucene’s approach combines the vector space and pure Boolean models, and offers you controls to decide which model you’d like to use on a search-by-search basis. Finally, Lucene returns documents that you next must render in a consumable way for your users.
We’ve finished reviewing the components of both the indexing and searching paths in a search application. For details on the remaining components in the previous figure, see the book "Lucene in Action".
- Where Lucene fits into your application
A modern search application can require many components, yet what a specific application needs from each of them varies greatly. Lucene covers many of these components well (the green boxes in the previous figure), but other components are best covered by complementary open source software or by your own custom application logic. It’s possible your application is specialized enough to not require certain components. You should at this point have a good sense of what we mean when we say Lucene is a search library, not a full application.
Now let’s see a concrete example of using Lucene for indexing and searching: A simple application tutorial