程式扎記: [ InAction Note ] Ch4. Lucene’s analysis process

Preface:
Lucene includes several built-in analyzers, created by chaining together certain combinations of the built-in Tokenizers and TokenFilters. The primary ones are shown in table 4.3. We’ll discuss certain language-specific contrib analyzers in section 4.8.2 and the special PerFieldAnalyzerWrapper in section 4.7.2.

The built-in analyzers (WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, KeywordAnalyzer, and StandardAnalyzer) are designed to work with text in almost any Western (European-based) language. You can see the effect of each of these analyzers, except KeywordAnalyzer, in the output in section 4.1. WhitespaceAnalyzer and SimpleAnalyzer are truly trivial: the one-line description in table 4.3 pretty much sums them up, so we don’t cover them further here. We cover KeywordAnalyzer in section 4.7.3. We explore StopAnalyzer and StandardAnalyzer in more depth because they have nontrivial effects.
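To make the difference between the two trivial analyzers concrete, here is a minimal plain-Java sketch that only approximates their behavior. It assumes the usual one-line descriptions (WhitespaceAnalyzer splits on whitespace and changes nothing else; SimpleAnalyzer splits at non-letter characters and lowercases). The class and method names (TokenizeSketch, whitespaceTokens, simpleTokens) are made up for illustration, and no Lucene code is involved.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenizeSketch {
    // Approximates WhitespaceAnalyzer: split on whitespace, keep case and punctuation.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Approximates SimpleAnalyzer: a token is a run of letters, lowercased.
    static List<String> simpleTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespaceTokens(text)); // [XY&Z, Corporation, -, xyz@example.com]
        System.out.println(simpleTokens(text));     // [xy, z, corporation, xyz, example, com]
    }
}
```

The sketch returns plain strings; a real Lucene token also carries attributes such as offsets and position increments, which are ignored here.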

Visualizing analyzers:
Normally, the tokens produced by analysis are silently absorbed by indexing. Yet seeing the tokens is a great way to gain a concrete understanding of the analysis process. In this section we’ll show you how to do just that. Specifically, we’ll show you the source code that generated the token examples here. Along the way we’ll see that a token consists of several interesting attributes, including term, positionIncrement, offset, type, flags, and payload.

We begin with listing 4.1, AnalyzerDemo, which analyzes two predefined phrases using Lucene’s core analyzers. Each phrase is analyzed by all the analyzers, then the tokens are displayed with bracketed output to indicate what would be indexed.
Listing 4.1 AnalyzerDemo: seeing analysis in action

package ch3;  
  
import java.io.IOException;  
import john.utils.AnalyzerUtils;  
import org.apache.lucene.analysis.Analyzer;  
import org.apache.lucene.analysis.SimpleAnalyzer;  
import org.apache.lucene.analysis.StopAnalyzer;  
import org.apache.lucene.analysis.WhitespaceAnalyzer;  
import org.apache.lucene.analysis.standard.StandardAnalyzer;  
import org.apache.lucene.util.Version;  
  
public class AnalyzerDemo {  
    private static final String[] examples = {  
            "The quick brown fox jumped over the lazy dog",  
            "XY&Z Corporation - xyz@example.com" };  
  
    private static final Analyzer[] analyzers = new Analyzer[] {  
            new WhitespaceAnalyzer(), new SimpleAnalyzer(),  
            new StopAnalyzer(Version.LUCENE_30),  
            new StandardAnalyzer(Version.LUCENE_30) };  
  
    public static void main(String[] args) throws IOException {  
        String[] strings = examples;  
        if (args.length > 0) {  
            strings = args;  
        }  
        for (String text : strings) {  
            analyze(text);  
        }  
    }  
  
    private static void analyze(String text) throws IOException {  
        System.out.println("Analyzing \"" + text + "\"");  
        for (Analyzer analyzer : analyzers) {  
            String name = analyzer.getClass().getSimpleName();  
            System.out.println("  " + name + ":");  
            System.out.print("    ");  
            AnalyzerUtils.displayTokens(analyzer, text);  
            System.out.println("\n");  
        }  
    }  
}  

The real fun happens in AnalyzerUtils (listing 4.2), where the analyzer is applied to the text and the tokens are extracted. AnalyzerUtils passes text to an analyzer without indexing it and pulls the results in a manner similar to what happens during the indexing process under the covers of IndexWriter.
Listing 4.2 AnalyzerUtils: delving into an analyzer

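package john.utils;
import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class AnalyzerUtils {
    public static void displayTokens(Analyzer analyzer, String text) throws IOException {
        displayTokens(analyzer.tokenStream("contents", new StringReader(text)));
    }

    // Print each token's term text in brackets as the stream advances.
    public static void displayTokens(TokenStream stream) throws IOException {
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        while (stream.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
    }
}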

Execution result of Listing 4.1:

Generally you wouldn’t invoke the analyzer’s tokenStream method explicitly except for this type of diagnostic or informational purpose. Note that the field name contents is arbitrary in the displayTokens() method. We recommend keeping a utility like this handy to see what tokens emit from your analyzers of choice.

StopAnalyzer:
StopAnalyzer, beyond doing basic word splitting and lowercasing, also removes special words called stop words. Stop words are very common words, such as "the", that are assumed to carry very little standalone meaning for searching, since nearly every document contains them.

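Embedded in StopAnalyzer is the following set of common English stop words, defined as ENGLISH_STOP_WORDS_SET:

"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "such",
"that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"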

StopAnalyzer has a second constructor that allows you to pass your own set instead. Under the hood, StopAnalyzer creates a StopFilter to perform the filtering. Section 4.6.1 describes StopFilter in more detail.
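The split, lowercase, and filter steps described above can be sketched in plain Java. This is an approximation, not Lucene's StopFilter. The class name StopWordSketch, the method analyze, and the small SAMPLE_STOP_WORDS set are hypothetical names used only for illustration, and the set is a hand-picked subset, not Lucene's default. Passing a different set to analyze plays the role of the second constructor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // Illustrative subset only; NOT Lucene's full default stop-word set.
    static final Set<String> SAMPLE_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // Lowercase, split at non-letters, then drop any token found in stopWords.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dog";
        // Prints [quick, brown, fox, jumped, over, lazy, dog]
        System.out.println(analyze(text, SAMPLE_STOP_WORDS));
        // A custom set stands in for the constructor that accepts your own stop words.
        System.out.println(analyze(text, new HashSet<>(Arrays.asList("quick", "lazy"))));
    }
}
```

Unlike this sketch, a real stop filter works on a token stream, so it can also record the removed positions as position-increment gaps.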

StandardAnalyzer:
StandardAnalyzer holds the honor as the most generally useful built-in analyzer. A JFlex-based grammar underlies it, tokenizing with cleverness for the following lexical types: alphanumerics, acronyms, company names, email addresses, computer hostnames, numbers, words with an interior apostrophe, serial numbers, IP addresses, and Chinese and Japanese characters. StandardAnalyzer also includes stop-word removal, using the same mechanism as the StopAnalyzer (identical default English set, and an optional Set constructor to override). StandardAnalyzer makes a great first choice.

Using StandardAnalyzer is no different from using any of the other analyzers, as you can see from its use in section 4.1.1 and AnalyzerDemo (listing 4.1). Its unique effect, though, is apparent in the different treatment of text. For example, compare the different analyzers on the phrase “XY&Z Corporation - xyz@example.com” from section 4.1. StandardAnalyzer is the only one that kept XY&Z together as well as the email address xyz@example.com; both of these showcase the vastly more sophisticated analysis process.
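To see why type-aware tokenizing keeps XY&Z and the email address whole, here is a toy regex-based sketch in plain Java. It is not the JFlex grammar that StandardAnalyzer uses, and it covers only three of the lexical types listed above. The class name StandardLikeSketch and the three patterns are assumptions made up for this illustration; stop-word removal is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StandardLikeSketch {
    // Alternatives are tried in order, so the more specific shapes come first.
    private static final Pattern TOKEN = Pattern.compile(
            "[A-Za-z0-9._]+@[A-Za-z0-9.-]+"        // email-like: local@domain
            + "|[A-Za-z0-9]+(?:&[A-Za-z0-9]+)+"    // company-like: parts joined by '&'
            + "|[A-Za-z0-9]+");                    // plain alphanumeric run

    // Find each token shape in turn and lowercase it.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            result.add(m.group().toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [xy&z, corporation, xyz@example.com]
        System.out.println(tokens("XY&Z Corporation - xyz@example.com"));
    }
}
```

The toy patterns are deliberately loose (for example, a sentence-ending period right after an email address would be swallowed), which is one reason a real grammar is used instead.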

Which core analyzer should you use?
We’ve now seen the substantial differences in how each of the four core Lucene analyzers works. How do you choose the right one for your application? The answer may surprise you: most applications don’t use any of the built-in analyzers, and instead opt to create their own analyzer chain. For those applications that do use a core analyzer, StandardAnalyzer is likely the most common choice. The remaining core analyzers are usually far too simplistic for most applications, except perhaps for specific use cases (for example, a field that contains a list of part numbers might use WhitespaceAnalyzer). But these analyzers are great for test cases, and are indeed used heavily by Lucene’s unit tests.

With that in mind, and with a strong foundational knowledge of Lucene’s analysis process, consider what a custom chain usually has to handle. Typically an application has specific needs, such as customizing the stop-words list, performing special tokenization for application-specific tokens like part numbers or for synonym expansion, preserving case for certain tokens, or choosing a specific stemming algorithm. In fact, Solr makes it trivial to create your own analysis chain by expressing the chain directly as XML in its schema file (schema.xml).


Tuesday, December 25, 2012

[ InAction Note ] Ch4. Lucene’s analysis process - Using the built-in analyzers
