程式扎記: [ InAction Note ] Ch1. Meet Lucene

Preface :
This chapter covers :

* Learning about Lucene
* Understanding the typical search application architecture
* Using the basic indexing API
* Working with the search API

Lucene is a powerful Java search library that lets you easily add search to any application. In recent years Lucene has become exceptionally popular and is now the most widely used information retrieval library: it powers the search features behind many websites and desktop applications.

In this chapter we cover the overall architecture of a typical search application and where Lucene fits. It’s crucial to recognize that Lucene is simply a search library, and you’ll need to handle the other components of a search application (crawling, document filtering, runtime server, user interface, administration, etc.) as your application requires. We show you how to perform basic indexing and searching with ready-to-use code examples.
Note.

Lucene is an active open source project. By the time you read this, Lucene’s APIs and features will likely have changed. This book is based on the 3.0.1 release of Lucene, and thanks to Lucene’s backward compatibility policy, all code samples should compile and run fine for future 3.x releases. If you encounter a problem, send an email to java-user@lucene.apache.org and Lucene’s large, passionate, and responsive community will surely help.

Dealing with information explosion :
As you’ll soon discover, Lucene provides a simple yet powerful core API that requires minimal understanding of full-text indexing and searching. You need to learn about only a handful of its classes in order to start integrating Lucene into an application. Because Lucene is a Java library, it doesn’t make assumptions about what it indexes and searches, which gives it an advantage over a number of other search applications. Its design is compact and simple, allowing Lucene to be easily embedded into desktop applications.

Lucene’s website, at http://lucene.apache.org/java, is a great place to learn more about the current status of Lucene. There you’ll find the tutorial, Javadocs for Lucene’s API for all recent releases, an issue-tracking system, links for downloading releases, and Lucene’s wiki (http://wiki.apache.org/lucene-java), which contains many community-created and -maintained pages.

- What Lucene can do
Lucene allows you to add search capabilities to your application. Lucene can index and make searchable any data that you can extract text from. Lucene doesn’t care about the source of the data, its format, or even its language, as long as you can derive text from it.

- History of Lucene
Lucene was written by Doug Cutting; it was initially available for download from its home at the SourceForge website. It joined the Apache Software Foundation’s Jakarta family of high-quality open source Java products in September 2001 and became its own top-level Apache project in February 2005.

Lucene and the components of a search application :
It’s important to grasp the big picture so that you have a clear understanding of which parts Lucene can handle and which parts your application must separately handle. A common misconception is that Lucene is an entire search application, when in fact it’s simply the core indexing and searching component.

A search application starts with an indexing chain, which in turn requires separate steps to retrieve the raw content; create documents from the content, possibly extracting text from binary documents; and index the documents. Once the index is built, the components required for searching are equally diverse, including a user interface, a means for building up a programmatic query, query execution (to retrieve matching documents), and results rendering. Search engines generally share a common overall architecture, as shown in the figure below:

Let’s walk through a search application one component at a time, pointing out which components Lucene can handle (the boxes with a green background in the figure above). We’ll then wrap up with a summary of Lucene’s role in your search application.

- Components for indexing
To search large amounts of text quickly, you must first index that text and convert it into a format that will let you search it rapidly, eliminating the slow sequential scanning process. This conversion process is called indexing, and its output is called an index.

You can think of an index as a data structure that allows fast random access to words stored inside it. The concept behind it is analogous to an index at the end of a book, which lets you quickly locate pages that discuss certain topics. In the case of Lucene, an index is a specially designed data structure, typically stored on the file system as a set of index files.
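To make the "fast random access to words" idea concrete, here is a minimal sketch of an inverted index in plain Java. It is not Lucene's API or its on-disk format; the class name `TinyInvertedIndex` and its methods are made up for illustration, and the word-splitting rule is an assumption.

```java
import java.util.*;

// Toy inverted index: maps each word to the sorted set of document IDs containing it.
// Illustrative only; this is NOT how Lucene stores its index files.
class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Index one document: lowercase it, split on non-letter/non-digit runs, record each word.
    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look up a word with one hash lookup instead of scanning every document.
    SortedSet<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "A crawler gathers content");
        index.add(3, "Search the index, not the raw text");
        System.out.println(index.lookup("search"));  // prints [1, 3]
        System.out.println(index.lookup("crawler")); // prints [2]
    }
}
```

The point of the sketch is the lookup: finding the documents for a word costs one map access, no matter how many documents were indexed, which is what replaces the slow sequential scan.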

ACQUIRE CONTENT

The first step, at the far right of the figure above, is to acquire content. This process, which involves using a crawler or spider, gathers and scopes the content that needs to be indexed.

Lucene, as a core search library, doesn’t provide any functionality to support acquiring content. This is entirely up to your application, or a separate piece of software. A number of open source crawlers are available, among them the following:
* Solr (http://lucene.apache.org/solr), a sister project under the Apache Lucene umbrella, has support for natively ingesting relational databases and XML feeds, as well as handling rich documents through Tika integration. (We cover Tika in chapter 7.)
* Nutch (http://lucene.apache.org/nutch), another sister project under the Apache Lucene umbrella, has a high-scale crawler that’s suitable for discovering content by crawling websites.
* Grub (http://www.grub.org) is a popular open source web crawler.
* Heritrix is Internet Archive’s open source crawler (http://crawler.archive.org).
* Droids, another subproject under the Apache Lucene umbrella, is currently under Apache incubation at http://incubator.apache.org/droids.
* Aperture (http://aperture.sourceforge.net) has support for crawling websites, file systems, and mail boxes and for extracting and indexing text.
* The Google Enterprise Connector Manager project (http://code.google.com/p/google-enterprise-connector-manager) provides connectors for a number of nonweb repositories.

BUILD DOCUMENT

Once you have the raw content that needs to be indexed, you must translate the content into the units (usually called documents) used by the search engine. The document typically consists of several separately named fields with values, such as title, body, abstract, author, and url. You’ll have to carefully design how to divide the raw content into documents and fields as well as how to compute the value for each of those fields.

Lucene provides an API for building fields and documents, but it doesn’t provide any logic to build a document because that’s entirely application specific. It also doesn’t provide any document filters, although Lucene has a sister project at Apache, Tika, which handles document filtering very well (see chapter 7).
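As a rough illustration of the Build Document step, the sketch below turns a piece of raw text into a "document" of named fields. It uses a plain `Map` rather than Lucene's document and field classes, and the rule "first line is the title, the rest is the body" is purely an assumption; as noted above, that logic is always application specific.

```java
import java.util.*;

// Hypothetical helper: turns raw text into a "document" of named fields.
// The field names and the first-line-is-title rule are assumptions for illustration.
class DocumentBuilderSketch {
    static Map<String, String> buildDocument(String url, String rawText) {
        // Split into the first line and everything after it.
        String[] parts = rawText.trim().split("\\R", 2);
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        doc.put("title", parts[0].trim());
        doc.put("body", parts.length > 1 ? parts[1].trim() : "");
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> doc = buildDocument(
                "http://example.com/meet-lucene",
                "Meet Lucene\nLucene is a search library, not a full application.");
        // Print each field name with its value, in insertion order.
        doc.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```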

ANALYZE DOCUMENT

No search engine indexes text directly: rather, the text must be broken into a series of individual atomic elements called tokens. This is what happens during the Analyze Document step. Each token corresponds roughly to a “word” in the language, and this step determines how the textual fields in the document are divided into a series of tokens.

Lucene provides an array of built-in analyzers that give you fine control over this process. It’s also straightforward to build your own analyzer, or create arbitrary analyzer chains combining Lucene’s tokenizers and token filters, to customize how tokens are created. The final step is to index the document.
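The analysis step can be pictured as a tokenizer followed by a chain of token filters. The following is a minimal sketch in plain Java, not one of Lucene's analyzers: the class name `AnalyzerSketch`, the splitting rule, and the tiny stop-word list are all assumptions chosen for illustration.

```java
import java.util.*;

// Toy analysis chain: tokenizer -> lowercase filter -> stop-word filter.
// Not a Lucene analyzer; the stop-word list is a made-up minimal example.
class AnalyzerSketch {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Tokenizer: split on runs of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            // Lowercase filter: so "Quick" and "quick" become the same token.
            String token = raw.toLowerCase(Locale.ROOT);
            // Stop filter: drop empty strings and very common words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick Brown Fox is a friend of the Lazy Dog"));
        // prints [quick, brown, fox, friend, lazy, dog]
    }
}
```

Whatever chain is used at indexing time generally needs to be mirrored at search time; otherwise a query for "Quick" would never find the token "quick".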

INDEX DOCUMENT

During the indexing step, the document is added to the index. Lucene provides everything necessary for this step, and works quite a bit of magic under a surprisingly simple API. Chapter 2 takes you through all the nitty-gritty steps for performing indexing.

We’re done reviewing the typical indexing steps for a search application and now we will visit the steps involved in searching.

- Components for searching
Searching is the process of looking up words in an index to find documents where they appear. The quality of a search is typically described using precision and recall metrics. Besides them, you must consider a number of other factors when thinking about searching. We already mentioned speed and the ability to quickly search large quantities of text. Support for single and multiterm queries, phrase queries, wildcards, fuzzy queries, result ranking, and sorting are also important, as is a friendly syntax for entering those queries. Lucene offers a number of search features, bells, and whistles, so many that we had to spread our search coverage over three chapters (chapters 3, 5, and 6).
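For reference, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A small sketch of both, assuming documents are identified by integer IDs (the class and method names are made up, and returning 0.0 for an empty set is just a convention picked here):

```java
import java.util.*;

// Precision and recall over sets of document IDs.
class SearchQuality {
    // Number of documents that appear in both sets.
    private static int overlap(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Of the documents the search returned, what fraction is relevant?
    static double precision(Set<Integer> retrieved, Set<Integer> relevant) {
        return retrieved.isEmpty() ? 0.0 : (double) overlap(retrieved, relevant) / retrieved.size();
    }

    // Of the relevant documents, what fraction did the search return?
    static double recall(Set<Integer> retrieved, Set<Integer> relevant) {
        return relevant.isEmpty() ? 0.0 : (double) overlap(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<Integer> retrieved = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> relevant = new HashSet<>(Arrays.asList(2, 4, 5));
        // Two of the four retrieved documents are relevant; two of the three relevant ones were found.
        System.out.println(String.format(Locale.ROOT, "precision=%.2f recall=%.2f",
                precision(retrieved, relevant), recall(retrieved, relevant)));
        // prints precision=0.50 recall=0.67
    }
}
```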

Let’s work through the typical components of a search engine, this time working top down in the previous figure, starting with the search user interface.

SEARCH USER INTERFACE

The user interface is what users actually see, in the web browser, desktop application, or mobile device, when they interact with your search application.

Lucene doesn’t provide any default search UI; it’s entirely up to your application to build one. Once a user interacts with your search interface, she or he submits a search request, which first must be translated into an appropriate Query object for the search engine.

BUILD QUERY

When you manage to entice a user to use your search application, she or he issues a search request, often as the result of an HTML form or Ajax request submitted by a browser to your server. You must then translate the request into the search engine’s Query object. We call this the Build Query step.

Query objects can be simple or complex. Lucene provides a powerful package, called QueryParser, to process the user’s text into a query object according to a common search syntax. We’ll cover it and its syntax in chapter 3, but it’s also fully described at http://lucene.apache.org/java/3_0_0/queryparsersyntax.html.
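To give a feel for what "processing the user's text into a query object" means, here is a deliberately tiny sketch in plain Java. It is not Lucene's QueryParser; it only understands the `+` (required) and `-` (prohibited) prefixes, which do appear in Lucene's query syntax, and ignores everything else in that syntax. The class `TinyQuery` and its fields are hypothetical.

```java
import java.util.*;

// Toy query object built from user text. Handles only "+term" and "-term" prefixes.
class TinyQuery {
    final List<String> required = new ArrayList<>();   // terms that must appear
    final List<String> prohibited = new ArrayList<>(); // terms that must not appear
    final List<String> optional = new ArrayList<>();   // terms that may appear

    static TinyQuery parse(String userInput) {
        TinyQuery query = new TinyQuery();
        for (String part : userInput.trim().split("\\s+")) {
            List<String> target = query.optional;
            if (part.startsWith("+")) {
                target = query.required;
                part = part.substring(1);
            } else if (part.startsWith("-")) {
                target = query.prohibited;
                part = part.substring(1);
            }
            // Skip empty pieces such as a lone "+" or blank input.
            if (!part.isEmpty()) {
                target.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return query;
    }

    public static void main(String[] args) {
        TinyQuery q = parse("+lucene -solr indexing Search");
        System.out.println("required=" + q.required);     // prints required=[lucene]
        System.out.println("prohibited=" + q.prohibited); // prints prohibited=[solr]
        System.out.println("optional=" + q.optional);     // prints optional=[indexing, search]
    }
}
```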

SEARCH QUERY

Search Query is the process of consulting the search index and retrieving the documents matching the Query, sorted in the requested sort order.

This component covers the complex inner workings of the search engine, and Lucene handles all of it for you. Lucene is also wonderfully extensible at this point, so if you’d like to customize how results are gathered, filtered, sorted, and so forth, it’s straightforward. See chapter 6 for details.

There are three common theoretical models of search:
* Pure Boolean model—Documents either match or don’t match the provided query, and no scoring is done. In this model there are no relevance scores associated with matching documents, and the matching documents are unordered; a query simply identifies a subset of the overall corpus as matching the query.
* Vector space model—Both queries and documents are modeled as vectors in a high dimensional space, where each unique term is a dimension. Relevance, or similarity, between a query and a document is computed by a vector distance measure between these vectors.
* Probabilistic model—In this model, you compute the probability that a document is a good match to a query using a full probabilistic approach.

Lucene’s approach combines the vector space and pure Boolean models, and offers you controls to decide which model you’d like to use on a search-by-search basis. Finally, Lucene returns documents that you next must render in a consumable way for your users.
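The difference between the pure Boolean and vector space models can be sketched in a few lines of plain Java. This is only an illustration under simple assumptions: term vectors hold raw word counts, the Boolean match uses AND semantics, and cosine similarity is picked as the vector measure. Lucene's real scoring formula is more involved, and none of the names below come from its API.

```java
import java.util.*;

// Contrast of two search models on the same query and documents.
class SearchModelsSketch {
    // Count how often each lowercase word occurs in the text.
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Pure Boolean model (AND semantics): a document matches or it doesn't; no score.
    static boolean booleanMatch(String query, String doc) {
        return termCounts(doc).keySet().containsAll(termCounts(query).keySet());
    }

    // Vector space model: cosine of the angle between the two term-count vectors.
    static double cosine(String query, String doc) {
        Map<String, Integer> q = termCounts(query);
        Map<String, Integer> d = termCounts(doc);
        double dot = 0, qNorm = 0, dNorm = 0;
        for (Map.Entry<String, Integer> e : q.entrySet()) {
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0);
            qNorm += e.getValue() * e.getValue();
        }
        for (int count : d.values()) {
            dNorm += count * count;
        }
        return (qNorm == 0 || dNorm == 0) ? 0.0 : dot / Math.sqrt(qNorm * dNorm);
    }

    public static void main(String[] args) {
        String query = "lucene search";
        String[] docs = {
            "Lucene is a search library",
            "Search engines crawl the web",
            "Cooking with garlic"
        };
        for (String doc : docs) {
            System.out.println(String.format(Locale.ROOT, "%-30s boolean=%-5b cosine=%.3f",
                    doc, booleanMatch(query, doc), cosine(query, doc)));
        }
    }
}
```

In this sketch the second document fails the Boolean AND match but still gets a nonzero cosine score, which is the kind of graded relevance the Boolean model cannot express.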

RENDER RESULTS

Once you have the raw set of documents that match the query, sorted in the right order, you then render them to the user in an intuitive, consumable manner.

We’ve finished reviewing the components of both the indexing and searching paths in a search application. For the remaining components in the previous figure, see the book "Lucene in Action" for details.

- Where Lucene fits into your application
A modern search application can require many components. Yet the needs of a specific application from each of these components vary greatly. Lucene covers many of these components (the green boxes in the previous figure) well, but other components are best covered by complementary open source software or by your own custom application logic. It’s possible your application is specialized enough to not require certain components. You should at this point have a good sense of what we mean when we say Lucene is a search library, not a full application.

Now let’s see a concrete example of using Lucene for indexing and searching: A simple application tutorial

Supplement :
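* Lucene: An introduction to the Java-based full-text search engine

About Lucene's author: Lucene's contributor, Doug Cutting, is a veteran full-text indexing and search expert. He was a principal developer of the V-Twin search engine (one of the achievements of Apple's Copland operating system), later served as a senior system architect at Excite, and currently works on research into low-level Internet infrastructure. The goal of the Lucene he contributed is to add full-text search to all kinds of small and medium-sized applications...

* Building a Chinese search application with Lucene
* Chinese Word Segmenter for Lucene - appears to be a program that adds Chinese word segmentation
* Lucene study notes (1), (2), (3)
* Lucene Tutorial
* Basic concepts in Lucene
* Lucene API Document
* A deep dive into Lucene's indexing mechanism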
* A Short Introduction to Lucene


Sunday, October 7, 2012
