程式扎記: [ InAction Note ] Ch1. Meet Lucene - A simple application

Sunday, October 7, 2012


Lucene in action: a sample application
To show you Lucene’s indexing and searching capabilities, we’ll use a pair of command-line applications: Indexer and Searcher. First we’ll index files in a directory; then we’ll search the created index. Before we can search with Lucene, we need to build an index, so we start with our Indexer application. 

- Creating an index 
We start with a simple class called Indexer, which indexes all files in a directory whose names end with the .txt extension. When Indexer completes execution, it leaves behind a Lucene index for its sibling, Searcher (presented next in section 1.4.2). After the annotated code listing, we show you how to use Indexer; if it helps you to learn how Indexer is used before you see how it’s coded, go directly to the usage discussion that follows the code.

USING INDEXER TO INDEX TEXT FILES 
Listing 1.1 shows the Indexer command-line program, originally written for Erik’s introductory Lucene article on java.net. It takes two arguments: 
* A path to a directory where we store the Lucene index
* A path to a directory that contains the files we want to index

Listing 1.1 Indexer, which indexes .txt files 
package ch1;

import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    private IndexWriter writer;

    // Package-private (not private) so that other classes in package ch1,
    // such as IndexerEx1, can reuse the filter.
    static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            // 6) Index .txt only (the comparison ignores case).
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        // 3) Create Lucene IndexWriter; 'true' means create a new index,
        //    replacing any index that already exists in indexDir.
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        // 4) Close IndexWriter
        writer.close();
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        if (files == null) {
            // listFiles() returns null when dataDir is not a readable directory.
            throw new IOException("Cannot list files in " + dataDir);
        }
        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead()
                    && (filter == null || filter.accept(f))) {
                indexFile(f);
            }
        }
        return writer.numDocs(); // 5) Return the number of indexed docs.
    }

    protected Document getDocument(File f) throws Exception {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f))); // 7) Index file content.
        doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED)); // 8) Index filename
        doc.add(new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED)); // 9) Index full path
        return doc;
    }

    private void indexFile(File f) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f);
        writer.addDocument(doc); // 10) Add doc to Lucene index
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java "
                    + Indexer.class.getName() + " <index dir> <data dir>");
        }
        String indexDir = args[0];  // 1) Create index in this directory
        String dataDir = args[1];   // 2) Index *.txt from this directory
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close(); // Always release the index, even if indexing fails.
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }
}
This example intentionally focuses on plain text files with .txt extensions to keep things simple, while demonstrating Lucene’s usage and power. In chapter 7, we’ll show you how to index other common document types, such as Microsoft Word or Adobe PDF, using the Tika framework. Before seeing how to run Indexer, let’s talk a bit about the Version parameter you see as the first argument to StandardAnalyzer.

VERSION PARAMETER 
The Version class defines enum constants, such as LUCENE_24 and LUCENE_29, referencing Lucene’s minor releases. When you pass one of these values, it instructs Lucene to match the settings and behavior of that particular release. Lucene will also emulate bugs that were present in that release and fixed in later ones, if the Lucene developers felt that fixing the bug would break backward compatibility of existing indexes. For each class that accepts a Version parameter, you’ll have to consult the Javadocs to see what settings and bugs are changed across versions. This shows how seriously the Lucene developers take backward compatibility.
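
To make the idea of version-gated behavior concrete, here is a toy sketch in plain Java. Everything in it is hypothetical: CompatVersion, tokenize, and the "empty token" bug are invented for illustration and are not part of Lucene. It only shows the pattern of keeping an old behavior alive unless the caller asks for a newer release.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: none of these names exist in Lucene.
final class VersionSketch {
    // Stand-in for a compatibility enum; constants are declared oldest first.
    enum CompatVersion {
        V_24, V_29, V_30;

        boolean onOrAfter(CompatVersion other) {
            return compareTo(other) >= 0;
        }
    }

    // Hypothetical tokenizer. Pretend that releases before V_29 had a bug that
    // kept empty tokens, and that the fix is applied only for V_29 or later.
    static List<String> tokenize(CompatVersion matchVersion, String text) {
        List<String> tokens = new ArrayList<String>();
        for (String token : text.split(" ")) {
            if (token.isEmpty() && matchVersion.onOrAfter(CompatVersion.V_29)) {
                continue; // fixed behavior: drop empty tokens
            }
            tokens.add(token.toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two spaces between the words produce one empty token.
        System.out.println(tokenize(CompatVersion.V_24, "Meet  Lucene")); // old behavior kept
        System.out.println(tokenize(CompatVersion.V_30, "Meet  Lucene")); // fixed behavior
    }
}
```

A caller that built its data with the old behavior keeps passing V_24 and sees no change, while new code opts in to the fix by passing a later constant.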

Let’s use Indexer to build our first Lucene search index! 

RUNNING INDEXER 
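Suppose the current directory contains a directory ./data that you want to index (holding the files doc1.txt and doc2.txt), and you want to store the resulting index in ./index. You can use the following code to run the indexing with the Indexer class: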
package ch1;

// TextFilesFilter must be visible from this class (that is, not declared
// private inside Indexer) for this import to compile.
import ch1.Indexer.TextFilesFilter;

public class IndexerEx1 {
    public static void main(String[] args) throws Exception {
        String indexDir = "./index";    // 1) Create index in this directory
        String dataDir = "./data";      // 2) Index *.txt from this directory
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close(); // Always release the index, even if indexing fails.
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }
}
Output:
Indexing C:\John\EclipseNTNUProj\LuceneLab\data\doc1.TXT
Indexing C:\John\EclipseNTNUProj\LuceneLab\data\doc2.TXT
Indexing 2 files took 223 milliseconds
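
Note that the files in this run end in .TXT rather than .txt. They are still picked up because the filter in Listing 1.1 lowercases the file name before checking the extension. A minimal, stdlib-only sketch of that check follows; the class name TxtFilterSketch and the sample file names are made up for illustration.

```java
import java.io.File;
import java.io.FileFilter;

// Hypothetical helper class; it mirrors the test used by the filter in Listing 1.1.
final class TxtFilterSketch {
    // Accept any file whose name ends in ".txt", ignoring case.
    static final FileFilter TXT_ONLY = new FileFilter() {
        @Override
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    };

    public static void main(String[] args) {
        // The files do not need to exist: accept() only looks at the name.
        String[] names = {"doc1.TXT", "doc2.txt", "notes.md", "txt"};
        for (String name : names) {
            System.out.println(name + " -> " + TXT_ONLY.accept(new File(name)));
        }
    }
}
```

Only the first two names are accepted; "notes.md" has a different extension and "txt" has no dot before the suffix.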

In our example, each of the indexed files was small, but roughly 0.2 seconds (223 milliseconds in the run above) to index them is reasonably impressive. Indexing throughput is clearly important, and we cover it extensively in chapter 11. But generally, searching is far more important since an index is built once but searched many times.
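
The "build once, search many times" trade-off is easier to see with a toy inverted index. The sketch below is plain Java and is for intuition only: ToyInvertedIndex is a hypothetical class, and Lucene's real index format and text analysis are far more sophisticated than a map of words to document names.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index, for intuition only; this is not how Lucene stores its index.
final class ToyInvertedIndex {
    // term -> names of the documents that contain it
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    // "Indexing": split the text into lower-cased words once, up front.
    void add(String docName, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue; // split() can yield an empty first token
            }
            Set<String> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<String>();
                postings.put(term, docs);
            }
            docs.add(docName);
        }
    }

    // "Searching": a single map lookup, however much text was indexed.
    Set<String> search(String term) {
        Set<String> docs = postings.get(term.toLowerCase());
        return docs == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(docs);
    }
}
```

For example, after add("doc1.txt", "John likes Lucene") and add("doc2.txt", "Lucene in Action"), search("John") returns only doc1.txt, while search("lucene") returns both documents. The cost of scanning the text is paid once in add; each later search is just a lookup.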

- Searching an index 
Searching in Lucene is as fast and simple as indexing; the power of this functionality is astonishing, as chapters 3, 5, and 6 will show you. For now, let’s look at Searcher, a command-line program that we’ll use to search the index created by Indexer. 

USING SEARCHER TO IMPLEMENT A SEARCH 
The Searcher program, originally written for Erik’s introductory Lucene article on java.net, complements Indexer and provides command-line searching capability. Listing 1.2 shows Searcher in its entirety. It takes two command-line arguments: 
* The path to the index created with Indexer
* A query to use to search the index

Listing 1.2 Searcher, which searches a Lucene index 
package ch1;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Searcher {
    public static void search(String indexDir, String q) throws IOException, ParseException {
        // 3) Open index
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexSearcher is = new IndexSearcher(dir);
        try {
            // 4) Parse query, using the same analyzer that Indexer used
            QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
                    new StandardAnalyzer(Version.LUCENE_30));
            Query query = parser.parse(q);

            // 5) Search index, asking for the 10 best hits
            long start = System.currentTimeMillis();
            TopDocs hits = is.search(query, 10);
            long end = System.currentTimeMillis();

            // 6) Write search stats
            System.err.println("Found " + hits.totalHits + " document(s) (in "
                    + (end - start) + " milliseconds) that matched query '" + q
                    + "':");

            // 7) Retrieve matching docs
            for (ScoreDoc scoreDoc : hits.scoreDocs) {
                Document doc = is.doc(scoreDoc.doc);
                System.out.println(doc.get("fullpath"));
            }
        } finally {
            // 8) Close IndexSearcher, even if parsing or searching fails
            is.close();
        }
    }

    public static void main(String[] args) throws IllegalArgumentException,
            IOException, ParseException {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java "
                    + Searcher.class.getName() + " <index dir> <query>");
        }
        String indexDir = args[0];  // 1) Parse provided index directory
        String q = args[1];         // 2) Parse provided query string
        search(indexDir, q);
    }
}
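
One detail in Listing 1.2 is worth spelling out: is.search(query, 10) returns at most the 10 best-scoring hits in hits.scoreDocs, while hits.totalHits counts every matching document. The "Found N document(s)" line can therefore report more documents than the loop prints. A rough stdlib-only sketch of that relationship follows; TopHitsSketch and its fields are hypothetical stand-ins, not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy stand-in for the "top N hits" idea; these names are hypothetical, not Lucene's.
final class TopHitsSketch {
    final int totalHits;          // every match, including those not returned
    final List<Double> topScores; // at most n scores, best first

    private TopHitsSketch(int totalHits, List<Double> topScores) {
        this.totalHits = totalHits;
        this.topScores = topScores;
    }

    // Keep only the n highest scores, but remember how many matches there were.
    static TopHitsSketch top(List<Double> allScores, int n) {
        List<Double> sorted = new ArrayList<Double>(allScores);
        Collections.sort(sorted, Collections.reverseOrder());
        int kept = Math.min(n, sorted.size());
        List<Double> best = new ArrayList<Double>(sorted.subList(0, kept));
        return new TopHitsSketch(allScores.size(), best);
    }
}
```

With three matching scores and n set to 2, totalHits is 3 while topScores holds only the two best values.
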
RUNNING SEARCHER 
Next, we can use the following code to query the index we just built (stored in ./index). Assuming the documents we are looking for contain the keyword "John", the code looks like this:
package ch1;

public class SearcherEx1 {
    public static void main(String[] args) throws Exception {
        // Search the index in ./index for documents whose contents match "John".
        Searcher.search("./index", "John");
    }
}
Output:
Found 1 document(s) (in 7 milliseconds) that matched query 'John':
C:\John\EclipseNTNUProj\LuceneLab\data\doc1.TXT

You can use more sophisticated queries, such as 'patent AND freedom' or 'patent AND NOT apache' or '+copyright +developers', and so on. Chapters 3, 5, and 6 cover various aspects of searching, including Lucene’s query syntax.
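
As a rough mental model, these boolean operators behave like set operations on the documents that match each term. The sketch below is plain Java and purely conceptual: BooleanQuerySketch is a hypothetical class, and this is not how Lucene's QueryParser or its scoring is implemented.

```java
import java.util.Set;
import java.util.TreeSet;

// Conceptual sketch of boolean query semantics over sets of matching documents.
// Not Lucene code: real queries are parsed, scored, and executed very differently.
final class BooleanQuerySketch {
    // 'a AND b' (or '+a +b'): documents that match both terms.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<String>(a);
        result.retainAll(b); // set intersection
        return result;
    }

    // 'a AND NOT b': documents that match a but not b.
    static Set<String> andNot(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<String>(a);
        result.removeAll(b); // set difference
        return result;
    }
}
```

If "patent" matches documents d1, d2, and d3, "freedom" matches d2 and d3, and "apache" matches d3, then 'patent AND freedom' yields d2 and d3, and 'patent AND NOT apache' yields d1 and d2.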

Most of the code in these two programs is plumbing rather than Lucene itself: Indexer’s parsing of command-line arguments and directory listings to look for text files, and Searcher’s code that prints matched filenames based on a query to the standard output. But don’t let this fact, or the conciseness of the examples, tempt you into complacence: there’s a lot going on under the covers of Lucene. To effectively leverage Lucene, you must understand how it works and how to extend it when the need arises. The remainder of this book is dedicated to giving you these missing pieces. Next we’ll drill down into the core classes Lucene exposes for indexing and searching; see "Understanding the core searching/indexing classes".
