Lucene in action: a sample application
To show you Lucene’s indexing and searching capabilities, we’ll use a pair of command-line applications: Indexer and Searcher. First we’ll index files in a directory; then we’ll search the created index. Before we can search with Lucene, we need to build an index, so we start with our Indexer application.
- Creating an index
In this section you’ll see a simple class called Indexer, which indexes all files in a directory ending with the .txt extension. When Indexer completes execution, it leaves behind a Lucene index for its sibling, Searcher (presented next in section 1.4.2). After the annotated code listing, we show you how to use Indexer; if it helps you to learn how Indexer is used before you see how it’s coded, go directly to the usage discussion that follows the code.
USING INDEXER TO INDEX TEXT FILES
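Listing 1.1 shows the Indexer command-line program, originally written for Erik’s introductory Lucene article on java.net. It takes two arguments: the directory in which to create the Lucene index, and the directory containing the .txt files to index.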
Listing 1.1 Indexer, which indexes .txt files
package ch1;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    private IndexWriter writer;

    // Not private, so that other classes (such as IndexerEx1) can use it.
    public static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            // 6) Index .txt only.
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        // 3) Create Lucene IndexWriter.
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        // 4) Close IndexWriter
        writer.close();
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        if (files == null) { // listFiles() returns null if dataDir is not a directory
            throw new IOException(dataDir + " is not a readable directory");
        }
        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead()
                    && (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }
        return writer.numDocs(); // 5) Return the number of indexed docs.
    }

    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f))); // 7) Index file content.
        doc.add(new Field("filename", f.getName(),
                Field.Store.YES, Field.Index.NOT_ANALYZED)); // 8) Index filename
        doc.add(new Field("fullpath", f.getCanonicalPath(),
                Field.Store.YES, Field.Index.NOT_ANALYZED)); // 9) Index full path
        return doc;
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        writer.addDocument(doc); // 10) Add doc to Lucene index
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java "
                    + Indexer.class.getName() + " <index dir> <data dir>");
        }
        String indexDir = args[0]; // 1) Create index in this directory
        String dataDir = args[1]; // 2) Index *.txt from this directory
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }
}
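To see which files the filter in Listing 1.1 would pass on to Lucene, the selection logic can be tried on its own with nothing but the JDK. The following is a minimal sketch, not part of the book's code: the class name TextFileSelectionSketch, the method selectFiles, and the sample file names are all made up for illustration. It applies the same rule (regular, readable, non-hidden files whose names end in .txt, ignoring case) to a temporary directory.

```java
import java.io.File;
import java.io.FileFilter;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class: mirrors the file-selection rule of Indexer
// without using Lucene at all.
final class TextFileSelectionSketch {

    // Same rule as TextFilesFilter: accept names ending in .txt, ignoring case.
    static final FileFilter TXT_ONLY =
            path -> path.getName().toLowerCase().endsWith(".txt");

    // Returns the names of the regular, readable, non-hidden files in dataDir
    // that the filter accepts, sorted so the result is stable.
    static List<String> selectFiles(File dataDir, FileFilter filter) {
        List<String> selected = new ArrayList<>();
        File[] files = dataDir.listFiles();
        if (files == null) { // dataDir is missing or is not a directory
            return selected;
        }
        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.canRead()
                    && (filter == null || filter.accept(f))) {
                selected.add(f.getName());
            }
        }
        Collections.sort(selected);
        return selected;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with two text files and one other file.
        File dir = Files.createTempDirectory("indexer-demo").toFile();
        new File(dir, "doc1.txt").createNewFile();
        new File(dir, "NOTES.TXT").createNewFile();
        new File(dir, "image.png").createNewFile();
        System.out.println(selectFiles(dir, TXT_ONLY));
    }
}
```

Only the two text files are selected; image.png is skipped, just as Indexer would skip it.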
VERSION PARAMETER
Several Lucene classes, including the StandardAnalyzer and QueryParser used in these listings, accept a parameter of type Version (from the org.apache.lucene.util package) at construction. This class defines enum constants, such as LUCENE_24 and LUCENE_29, referencing Lucene’s minor releases; the listings here use LUCENE_30. When you pass one of these values, it instructs Lucene to match the settings and behavior of that particular release. Lucene will also emulate bugs present in that release and fixed in later releases, if the Lucene developers felt that fixing the bug would break backward compatibility of existing indexes. For each class that accepts a Version parameter, you’ll have to consult the Javadocs to see what settings and bugs are changed across versions. This shows how seriously the Lucene developers take backward compatibility.
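The idea of version-matched behavior can be illustrated with a toy example. The sketch below is not Lucene code: the CompatVersion enum, the normalize method, and the "old releases did not lowercase tokens" quirk are all invented for illustration. It only shows the general pattern of a component choosing its behavior based on a version constant passed in by the caller.

```java
import java.util.Locale;

// Toy illustration only: none of these names exist in Lucene.
final class VersionSketch {

    // Hypothetical release constants, declared oldest to newest.
    enum CompatVersion {
        V_24, V_29, V_30;

        // True if this version is the same as, or newer than, other.
        boolean onOrAfter(CompatVersion other) {
            return compareTo(other) >= 0;
        }
    }

    // Invented quirk: pretend releases before V_29 did not lowercase tokens.
    // Passing an old version reproduces the old behavior, so data written
    // under the old rules keeps matching.
    static String normalize(String token, CompatVersion matchVersion) {
        if (matchVersion.onOrAfter(CompatVersion.V_29)) {
            return token.toLowerCase(Locale.ROOT);
        }
        return token; // emulate the older behavior unchanged
    }

    public static void main(String[] args) {
        System.out.println(normalize("Lucene", CompatVersion.V_24));
        System.out.println(normalize("Lucene", CompatVersion.V_30));
    }
}
```

The caller decides which release to match, and the component keeps the old code path alive for as long as that constant is supported.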
Let’s use Indexer to build our first Lucene search index!
RUNNING INDEXER
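Suppose the current directory contains a directory ./data that you want to index (holding the files doc1.txt and doc2.txt), and you want to store the resulting index in ./index. You can run the indexing with the Indexer class using the following code: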
package ch1;

import ch1.Indexer.TextFilesFilter;

public class IndexerEx1 {
    public static void main(String[] args) throws Exception {
        String indexDir = "./index"; // 1) Create index in this directory
        String dataDir = "./data"; // 2) Index *.txt from this directory
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }
}
In our example, each of the indexed files was small, but roughly 0.8 seconds to index a handful of text files is reasonably impressive. Indexing throughput is clearly important, and we cover it extensively in chapter 11. But generally, searching is far more important since an index is built once but searched many times.
- Searching an index
Searching in Lucene is as fast and simple as indexing; the power of this functionality is astonishing, as chapters 3, 5, and 6 will show you. For now, let’s look at Searcher, a command-line program that we’ll use to search the index created by Indexer.
USING SEARCHER TO IMPLEMENT A SEARCH
The Searcher program, originally written for Erik’s introductory Lucene article on java.net, complements Indexer and provides command-line searching capability. Listing 1.2 shows Searcher in its entirety. It takes two command-line arguments: the path to the index created by Indexer, and the query to run against that index.
Listing 1.2 Searcher, which searches a Lucene index
package ch1;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {
    public static void search(String indexDir, String q) throws IOException, ParseException {
        // 3) Open index
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexSearcher is = new IndexSearcher(dir);
        // 4) Parse query
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                new StandardAnalyzer(Version.LUCENE_30));
        Query query = parser.parse(q);
        // 5) Search index
        long start = System.currentTimeMillis();
        TopDocs hits = is.search(query, 10);
        long end = System.currentTimeMillis();
        // 6) Write search stats
        System.err.println("Found " + hits.totalHits + " document(s) (in "
                + (end - start) + " milliseconds) that matched query '" + q
                + "':");
        // 7) Retrieve matching docs
        for (ScoreDoc scoreDoc : hits.scoreDocs) {
            Document doc = is.doc(scoreDoc.doc);
            System.out.println(doc.get("fullpath"));
        }
        // 8) Close IndexSearcher
        is.close();
    }

    public static void main(String[] args) throws IllegalArgumentException,
            IOException, ParseException {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java "
                    + Searcher.class.getName() + " <index dir> <query>");
        }
        String indexDir = args[0]; // 1) Parse provided index directory
        String q = args[1]; // 2) Parse provided query string
        search(indexDir, q);
    }
}
Next, we can query the index we just built (stored in ./index). Suppose the documents we are looking for contain the keyword "John"; the following code runs that search:
package ch1;

public class SearcherEx1 {
    public static void main(String[] args) throws Exception {
        Searcher.search("./index", "John");
    }
}
You can use more sophisticated queries, such as 'patent AND freedom' or 'patent AND NOT apache' or '+copyright +developers', and so on. Chapters 3, 5, and 6 cover various aspects of searching, including Lucene’s query syntax.
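What queries like 'patent AND freedom' and 'patent AND NOT apache' ask for can be illustrated with plain set operations over a tiny in-memory term-to-documents map. This is only a toy sketch of the Boolean semantics, not how Lucene parses or executes queries; the class BooleanQuerySketch, its methods, and the sample documents are made up for illustration. A query such as '+copyright +developers', where + marks a required term, corresponds to the same intersection as AND.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: maps each lowercased term to the documents containing it.
final class BooleanQuerySketch {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Splits text on non-word characters and records docName under every term.
    void add(String docName, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    private Set<String> docsFor(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySet());
    }

    // "a AND b": documents containing both terms (set intersection).
    Set<String> and(String a, String b) {
        Set<String> result = new TreeSet<>(docsFor(a));
        result.retainAll(docsFor(b));
        return result;
    }

    // "a AND NOT b": documents containing a but not b (set difference).
    Set<String> andNot(String a, String b) {
        Set<String> result = new TreeSet<>(docsFor(a));
        result.removeAll(docsFor(b));
        return result;
    }

    public static void main(String[] args) {
        BooleanQuerySketch index = new BooleanQuerySketch();
        index.add("doc1.txt", "Patent law and software freedom");
        index.add("doc2.txt", "Apache patent grant");
        System.out.println(index.and("patent", "freedom"));
        System.out.println(index.andNot("patent", "apache"));
    }
}
```

With these two sample documents, both queries select only doc1.txt: it is the only one mentioning both "patent" and "freedom", and the only "patent" document that does not mention "apache".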
Most of the code in these two examples has little to do with Lucene itself: it is Indexer’s parsing of command-line arguments and directory listings to look for text files, and Searcher’s code that prints matched filenames to standard output. But don’t let this fact, or the conciseness of the examples, tempt you into complacence: there’s a lot going on under the covers of Lucene. To effectively leverage Lucene, you must understand how it works and how to extend it when the need arises. The remainder of this book is dedicated to giving you these missing pieces. Next we’ll drill down into the core classes Lucene exposes for indexing and searching, in "Understanding the core searching/indexing classes".