程式扎記: [ InAction Note ] Ch1. Meet Lucene

Lucene in action: a sample application:
To show you Lucene’s indexing and searching capabilities, we’ll use a pair of command-line applications: Indexer and Searcher. First we’ll index files in a directory; then we’ll search the created index. Before we can search with Lucene, we need to build an index, so we start with our Indexer application.

- Creating an index
We start with a simple class called Indexer, which indexes all files in a directory whose names end with the .txt extension. When Indexer completes execution, it leaves behind a Lucene index for its sibling, Searcher (presented next, under "Searching an index"). After the annotated code listing, we show you how to use Indexer; if it helps you to learn how Indexer is used before you see how it’s coded, go directly to the usage discussion that follows the code.

USING INDEXER TO INDEX TEXT FILES
Listing 1.1 shows the Indexer command-line program, originally written for Erik’s introductory Lucene article on java.net. It takes two arguments:

* A path to a directory where we store the Lucene index
* A path to a directory that contains the files we want to index

Listing 1.1 Indexer, which indexes .txt files

package ch1;  
  
import java.io.File;  
import java.io.FileFilter;  
import java.io.FileReader;  
import java.io.IOException;  
  
import org.apache.lucene.analysis.standard.StandardAnalyzer;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.document.Field;  
import org.apache.lucene.index.IndexWriter;  
import org.apache.lucene.store.Directory;  
import org.apache.lucene.store.FSDirectory;  
import org.apache.lucene.util.Version;  
  
public class Indexer {  
    private IndexWriter writer;  
  
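    // Package-private (not private) so that other classes in package ch1,
    // such as IndexerEx1, can import and instantiate it.
    static class TextFilesFilter implements FileFilter {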
        public boolean accept(File path) {  
            // 6) Index .txt only.  
            return path.getName().toLowerCase().endsWith(".txt");  
        }  
    }  
  
    public Indexer(String indexDir) throws IOException {  
        Directory dir = FSDirectory.open(new File(indexDir));  
        // 3) Create Lucene IndexWriter.  
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30),  
                true, IndexWriter.MaxFieldLength.UNLIMITED);  
    }  
      
    public void close() throws IOException {  
        // 4) Close IndexWriter  
        writer.close();  
    }  
      
    public int index(String dataDir, FileFilter filter) throws Exception {  
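        File[] files = new File(dataDir).listFiles();
        // listFiles() returns null if dataDir is not a directory or cannot be read.
        if (files == null) {
            throw new IOException("Cannot list directory: " + dataDir);
        }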
        for (File f : files) {  
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead()  
                    && (filter == null || filter.accept(f))) {  
                indexFile(f);  
            }  
        }  
        return writer.numDocs(); // 5) Return the number of indexed docs.  
    }  
      
    protected Document getDocument(File f) throws Exception {  
        Document doc = new Document();  
        doc.add(new Field("contents", new FileReader(f))); // 7) Index file content.  
        doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED)); // 8) Index filename  
        doc.add(new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED)); // 9) Index full path  
        return doc;  
    }  
      
    private void indexFile(File f) throws Exception {  
        System.out.println("Indexing " + f.getCanonicalPath());  
        Document doc = getDocument(f);  
        writer.addDocument(doc); // 10) Add doc to Lucene index  
    }  
      
    public static void main(String[] args) throws Exception {  
        if (args.length != 2) {  
            throw new IllegalArgumentException("Usage: java "  
                    + Indexer.class.getName() + " <index dir> <data dir>");
        }  
        String indexDir = args[0];  // 1) Create index in this directory  
        String dataDir = args[1];   // 2) Index *.txt from this directory  
        long start = System.currentTimeMillis();  
        Indexer indexer = new Indexer(indexDir);  
        int numIndexed;  
        try {  
            numIndexed = indexer.index(dataDir, new TextFilesFilter());  
        } finally {  
            indexer.close();  
        }  
        long end = System.currentTimeMillis();  
        System.out.println("Indexing " + numIndexed + " files took "  
                + (end - start) + " milliseconds");  
    }  
}  
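
The file-selection step inside index() can be tried on its own, without Lucene on the classpath. The following is a minimal sketch under stated assumptions: TxtFilterDemo and selectFiles are hypothetical names that are not part of Listing 1.1, and returning an empty list when listFiles() yields null is a defensive choice of this sketch. The filter itself applies the same case-insensitive ".txt" test as TextFilesFilter, so a file named doc1.TXT is accepted as well.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.ArrayList;
import java.util.List;

class TxtFilterDemo {
    // Same test as TextFilesFilter in Listing 1.1: case-insensitive ".txt" suffix.
    static final FileFilter TXT_ONLY = new FileFilter() {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    };

    // Hypothetical helper: returns the files that index() would hand to indexFile().
    static List<File> selectFiles(File dataDir, FileFilter filter) {
        List<File> selected = new ArrayList<File>();
        File[] files = dataDir.listFiles();
        if (files == null) { // not a directory, or the listing failed
            return selected;
        }
        for (File f : files) {
            if (!f.isDirectory() && !f.isHidden() && f.exists() && f.canRead()
                    && (filter == null || filter.accept(f))) {
                selected.add(f);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Defaults to ./data, the directory used in the examples of this post.
        String dir = args.length > 0 ? args[0] : "./data";
        for (File f : selectFiles(new File(dir), TXT_ONLY)) {
            System.out.println("Would index " + f.getName());
        }
    }
}
```

Running it against a data directory lists only the files that Indexer would pick up, which is a quick way to check the filter before building an index.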

This example intentionally focuses on plain text files with .txt extensions to keep things simple, while demonstrating Lucene’s usage and power. In chapter 7, we’ll show you how to index other common document types, such as Microsoft Word or Adobe PDF, using the Tika framework. Before seeing how to run Indexer, let’s talk a bit about the Version parameter you see as the first argument to StandardAnalyzer.

VERSION PARAMETER
Version is an enum whose constants, such as LUCENE_24 and LUCENE_29, reference Lucene’s minor releases. When you pass one of these values, it instructs Lucene to match the settings and behavior of that particular release. Lucene will also emulate bugs that were present in that release and fixed in later releases, if the Lucene developers felt that fixing the bug would break backward compatibility of existing indexes. For each class that accepts a Version parameter, you’ll have to consult the Javadocs to see what settings and bugs are changed across versions. This shows how seriously the Lucene developers take backward compatibility.
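
The idea of version-matched behavior can be illustrated with a toy sketch that does not use Lucene at all. Everything here is an assumption made for illustration: ToyVersion, its constants, the onOrAfter helper, and the invented rule that "releases" before V_29 did not lowercase tokens. It is not a description of what any real Lucene class does for a given Version.

```java
import java.util.ArrayList;
import java.util.List;

class VersionSketch {
    // Hypothetical stand-in for a version enum; constants are declared oldest first.
    enum ToyVersion {
        V_24, V_29, V_30;

        // True if this version is the same as, or newer than, the other one.
        boolean onOrAfter(ToyVersion other) {
            return compareTo(other) >= 0;
        }
    }

    // Hypothetical component whose behavior depends on the requested version.
    // Invented rule: versions before V_29 keep the original case (the "old bug"),
    // while V_29 and later lowercase every token.
    static List<String> tokenize(ToyVersion matchVersion, String text) {
        List<String> tokens = new ArrayList<String>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            tokens.add(matchVersion.onOrAfter(ToyVersion.V_29) ? raw.toLowerCase() : raw);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(ToyVersion.V_24, "Meet Lucene")); // [Meet, Lucene]
        System.out.println(tokenize(ToyVersion.V_30, "Meet Lucene")); // [meet, lucene]
    }
}
```

The point of the sketch is that the caller, not the library, decides which behavior it gets: an index built under the old rule keeps working as long as the same version constant is passed when the index is used.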

Let’s use Indexer to build our first Lucene search index!

RUNNING INDEXER
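Suppose the current directory contains a directory ./data to be indexed (holding the files doc1.txt and doc2.txt), and you want to store the resulting index in ./index. The following code uses the Indexer class to do the indexing: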

package ch1;  
  
import ch1.Indexer.TextFilesFilter;  
  
public class IndexerEx1 {  
    public static void main(String[] args)  throws Exception{  
        String indexDir = "./index";    // 1) Create index in this directory  
        String dataDir = "./data";      // 2) Index *.txt from this directory  
        long start = System.currentTimeMillis();  
        Indexer indexer = new Indexer(indexDir);  
        int numIndexed;  
        try {  
            numIndexed = indexer.index(dataDir, new TextFilesFilter());  
        } finally {  
            indexer.close();  
        }  
        long end = System.currentTimeMillis();  
        System.out.println("Indexing " + numIndexed + " files took "  
                + (end - start) + " milliseconds");  
  
    }  
}  

Output:

Indexing C:\John\EclipseNTNUProj\LuceneLab\data\doc1.TXT
Indexing C:\John\EclipseNTNUProj\LuceneLab\data\doc2.TXT
Indexing 2 files took 223 milliseconds

In our example, each of the indexed files was small, but roughly 0.2 seconds to index two text files is still reasonably impressive. Indexing throughput is clearly important, and we cover it extensively in chapter 11. In general, though, search speed matters even more, since an index is built once but searched many times.
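
Why "build once, search many times" pays off can be seen in a toy inverted index. This is only a sketch of the general idea, not Lucene's actual data structures: ToyInvertedIndex, its methods, and the sample document texts are made up for illustration, and tokenization is simply "lowercase and split on non-alphanumeric characters".

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ToyInvertedIndex {
    // term -> names of the documents that contain the term
    private final Map<String, Set<String>> postings = new HashMap<String, Set<String>>();

    // Indexing: done once per document. Lowercases the text and splits it
    // on anything that is not a letter or digit.
    void add(String docName, String text) {
        for (String term : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<String>();
                postings.put(term, docs);
            }
            docs.add(docName);
        }
    }

    // Searching: a single map lookup; the documents are never rescanned.
    Set<String> search(String term) {
        Set<String> docs = postings.get(term.toLowerCase());
        return docs == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(docs);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("doc1.txt", "John wrote these Lucene notes"); // made-up content
        index.add("doc2.txt", "Lucene indexes text files");     // made-up content
        System.out.println(index.search("John"));   // [doc1.txt]
        System.out.println(index.search("lucene")); // [doc1.txt, doc2.txt]
    }
}
```

All the tokenizing work happens in add(); each later search() is a lookup, which is why the cost of indexing is amortized over many queries.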

- Searching an index
Searching in Lucene is as fast and simple as indexing; the power of this functionality is astonishing, as chapters 3, 5, and 6 will show you. For now, let’s look at Searcher, a command-line program that we’ll use to search the index created by Indexer.

USING SEARCHER TO IMPLEMENT A SEARCH
The Searcher program, originally written for Erik’s introductory Lucene article on java.net, complements Indexer and provides command-line searching capability. Listing 1.2 shows Searcher in its entirety. It takes two command-line arguments:

* The path to the index created with Indexer
* A query to use to search the index

Listing 1.2 Searcher, which searches a Lucene index

package ch1;  
  
import java.io.File;  
import java.io.IOException;  
  
import org.apache.lucene.analysis.standard.StandardAnalyzer;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.queryParser.ParseException;  
import org.apache.lucene.queryParser.QueryParser;  
import org.apache.lucene.search.IndexSearcher;  
import org.apache.lucene.search.Query;  
import org.apache.lucene.search.ScoreDoc;  
import org.apache.lucene.search.TopDocs;  
import org.apache.lucene.store.Directory;  
import org.apache.lucene.store.FSDirectory;  
import org.apache.lucene.util.Version;  
  
public class Searcher {  
    public static void search(String indexDir, String q) throws IOException, ParseException {  
        // 3) Open index  
        Directory dir = FSDirectory.open(new File(indexDir));  
        IndexSearcher is = new IndexSearcher(dir);  
          
        // 4) Parse query
        QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",  
                new StandardAnalyzer(Version.LUCENE_30));  
        Query query = parser.parse(q);  
          
        // 5) Search index  
        long start = System.currentTimeMillis();  
        TopDocs hits = is.search(query, 10);  
        long end = System.currentTimeMillis();  
          
        // 6) Write search stat  
        System.err.println("Found " + hits.totalHits + " document(s) (in "  
                + (end - start) + " milliseconds) that matched query '" + q  
                + "':");  
          
        // 7) Retrieve matching docs  
        for (ScoreDoc scoreDoc : hits.scoreDocs) {  
            Document doc = is.doc(scoreDoc.doc);  
            System.out.println(doc.get("fullpath"));  
        }  
          
        // 8) Close IndexSearcher  
        is.close();  
    }  
      
    public static void main(String[] args) throws IllegalArgumentException,  
            IOException, ParseException {  
        if (args.length != 2) {  
            throw new IllegalArgumentException("Usage: java "  
                    + Searcher.class.getName() + " <index dir> <query>");
        }  
        String indexDir = args[0];  // 1) Parse provided index directory
        String q = args[1];         // 2) Parse provided query string
        search(indexDir, q);  
    }  
}  

RUNNING SEARCHER
Next, we can use the following code to query the index we just built (stored in ./index). Assuming the documents we are looking for contain the keyword "John", the code looks like this:

package ch1;  
  
public class SearcherEx1 {  
    public static void main(String[] args)  throws Exception{  
        Searcher.search("./index", "John");  
    }  
}  

Output:

Found 1 document(s) (in 7 milliseconds) that matched query 'John':
C:\John\EclipseNTNUProj\LuceneLab\data\doc1.TXT

You can use more sophisticated queries, such as 'patent AND freedom' or 'patent AND NOT apache' or '+copyright +developers', and so on. Chapters 3, 5, and 6 cover various aspects of searching, including Lucene’s query syntax.
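
What such boolean queries mean in terms of matching documents can be sketched with plain set operations. The sketch below is an illustration only, not how Lucene evaluates queries: BooleanQuerySketch and its helpers are hypothetical names, and the document sets are made up. It treats 'a AND b' (and likewise '+a +b', where both terms are required) as an intersection, and 'a AND NOT b' as a set difference.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class BooleanQuerySketch {
    // a AND b: documents that appear in both sets.
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<String>(a);
        result.retainAll(b);
        return result;
    }

    // a AND NOT b: documents in a that do not appear in b.
    static Set<String> andNot(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<String>(a);
        result.removeAll(b);
        return result;
    }

    // Convenience helper for building a sorted set of document names.
    static Set<String> docs(String... names) {
        return new TreeSet<String>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        // Made-up sets of documents containing each term.
        Set<String> patent = docs("a.txt", "b.txt", "c.txt");
        Set<String> freedom = docs("b.txt", "c.txt");
        Set<String> apache = docs("c.txt", "d.txt");

        System.out.println(and(patent, freedom));   // patent AND freedom    -> [b.txt, c.txt]
        System.out.println(andNot(patent, apache)); // patent AND NOT apache -> [a.txt, b.txt]
    }
}
```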

Most of the code in these two examples is plumbing around Lucene rather than Lucene itself: Indexer’s parsing of command-line arguments and directory listings to look for text files, and Searcher’s code that prints matched filenames to standard output based on a query. But don’t let this fact, or the conciseness of the examples, tempt you into complacence: there’s a lot going on under the covers of Lucene. To effectively leverage Lucene, you must understand how it works and how to extend it when the need arises. The remainder of this book is dedicated to giving you these missing pieces. Next we’ll drill down into the core classes Lucene exposes for indexing and searching (see "Understanding the core searching/indexing classes").

Sunday, October 7, 2012

[ InAction Note ] Ch1. Meet Lucene - A simple application
