程式扎記: [ InAction Note ] Ch4. Lucene’s analysis process

Monday, April 29, 2013

[ InAction Note ] Ch4. Lucene’s analysis process - Stemming analysis

Preface:
Our final analyzer pulls out all the stops. It has a ridiculous, yet descriptive name: PositionalPorterStopAnalyzer. This analyzer removes stop words, leaving positional holes where words are removed, and leverages a stemming filter.

The PorterStemFilter is shown in the class hierarchy in figure 4.5, but it isn’t used by any built-in analyzer. It stems words using the Porter stemming algorithm created by Dr. Martin Porter, and it’s best defined in his own words:

The Porter stemming algorithm (or “Porter stemmer”) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.

In other words, the various forms of a word are reduced to a common root form. For example, the words breathe, breathes, breathing, and breathed, via the Porter stemmer, reduce to breath.

The Porter stemmer is one of many stemming algorithms. See section 8.2.1 for coverage of an extension to Lucene that implements the Snowball algorithm (also created by Dr. Porter). KStem is another stemming algorithm that has been adapted to Lucene (search Google for KStem and Lucene).

Next we’ll show how to use StopFilter to remove words but leave a positional hole behind, and then we’ll describe the full analyzer.

StopFilter leaves holes:
Stop-word removal brings up an interesting issue: what happens to the holes left by the words removed? Suppose you index "one is not enough". The tokens emitted from StopAnalyzer will be one and enough, with is and not thrown away. By default, StopAnalyzer accounts for the removed words by incrementing the position increment.

The output of AnalyzerUtils.displayTokensWithPositions illustrates this:

AnalyzerUtils.displayTokensWithPositions(new StopAnalyzer(Version.LUCENE_30),  
        "The quick brown fox jumps over the lazy dog");  

Output:

2: [quick]
3: [brown]
4: [fox]
5: [jumps]
6: [over]
8: [lazy]
9: [dog]

Positions 1 and 7 are missing because the was removed. If you need to disable the holes so that the position increment is always 1, use StopFilter's setEnablePositionIncrements method. But be careful when doing so: your index won't record the deleted words, so there can be surprising effects. For example, the phrase "one enough" will match the indexed phrase "one is not enough" if you don't preserve the holes!
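
To make the effect of the holes concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It assumes a simplified model: text is lowercased and split on non-letters, each surviving token's position is the running sum of position increments, and an exact phrase matches only when its terms sit at consecutive positions. The class PositionHoleDemo, its methods, and the tiny stop-word set are hypothetical names chosen for illustration; Lucene's real phrase matching is more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class PositionHoleDemo {
    // Illustrative stop words only; this is not Lucene's ENGLISH_STOP_WORDS_SET.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "is", "not", "the"));

    /** A surviving term together with its absolute position (first token is 1). */
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override
        public String toString() {
            return position + ": [" + term + "]";
        }
    }

    /** Removes stop words; with keepHoles, each removed word still consumes a position. */
    static List<Token> analyze(String text, boolean keepHoles) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int skipped = 0; // stop words removed since the last emitted token
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            if (STOP_WORDS.contains(raw)) {
                skipped++;
                continue;
            }
            // The position increment is 1 plus the number of holes, or always 1.
            position += keepHoles ? 1 + skipped : 1;
            skipped = 0;
            tokens.add(new Token(raw, position));
        }
        return tokens;
    }

    /** True if the phrase terms occur at consecutive positions in the token list. */
    static boolean matchesPhrase(List<Token> doc, String... phrase) {
        if (phrase.length == 0) {
            return false;
        }
        for (Token start : doc) {
            if (!start.term.equals(phrase[0])) {
                continue;
            }
            boolean all = true;
            for (int i = 1; i < phrase.length && all; i++) {
                all = hasTermAt(doc, phrase[i], start.position + i);
            }
            if (all) {
                return true;
            }
        }
        return false;
    }

    private static boolean hasTermAt(List<Token> doc, String term, int position) {
        for (Token t : doc) {
            if (t.position == position && t.term.equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "one is not enough";
        List<Token> withHoles = analyze(text, true);
        List<Token> withoutHoles = analyze(text, false);
        System.out.println("with holes:    " + withHoles);
        System.out.println("without holes: " + withoutHoles);
        System.out.println("phrase match with holes:    " + matchesPhrase(withHoles, "one", "enough"));
        System.out.println("phrase match without holes: " + matchesPhrase(withoutHoles, "one", "enough"));
    }
}
```

Under this model, enough lands at position 4 when holes are kept and at position 2 when they are not, so the exact phrase "one enough" matches only in the second case.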

Stepping back a bit, the primary reason to remove stop words is that these words typically have no special meaning; they are the "glue" words required in any language. The problem is that, because we've discarded them, we've lost some information, which may or may not matter for your application. For example, a nonexact search such as "a quick brown fox" can still match the document.

There’s an interesting alternative, called shingles, which are compound tokens created from multiple adjacent tokens. Lucene has a TokenFilter called ShingleFilter in the contrib analyzers module that creates shingles during analysis. We’ll describe it in more detail in section 8.2.3. With shingles, stop words are combined with adjacent words to make new tokens, such as the-quick. At search time, the same expansion is used. This enables precise phrase matching, because the stop words aren’t discarded. Using shingles yields good search performance because the number of documents containing the-quick is far fewer than the number containing the stop word the in any context.
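
The idea of a shingle can be sketched in a few lines of plain Java. This is only an illustration of the concept under simple assumptions: text is lowercased and split on non-letters, and every pair of adjacent tokens is joined with a hyphen, as in the the-quick example above. The class ShingleDemo and its method are hypothetical names. Lucene's ShingleFilter has its own options and defaults (for example its token separator and whether single tokens are also emitted), which this sketch does not try to reproduce.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ShingleDemo {
    /** Joins each pair of adjacent tokens into one compound token, e.g. "the-quick". */
    static List<String> bigramShingles(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!raw.isEmpty()) {
                tokens.add(raw);
            }
        }
        List<String> shingles = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            shingles.add(tokens.get(i) + "-" + tokens.get(i + 1));
        }
        return shingles;
    }

    public static void main(String[] args) {
        // prints [the-quick, quick-brown, brown-fox]
        System.out.println(bigramShingles("The quick brown fox"));
    }
}
```

Because the stop word survives inside the compound token, a query analyzed the same way can still tell "the quick" apart from "quick" on its own.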

Combining stemming and stop-word removal:
This custom analyzer uses a stop-word removal filter, enabled to maintain positional gaps and fed from a LowerCaseTokenizer. The results of the stop filter are fed to the Porter stemmer. Listing 4.12 shows the full implementation of this sophisticated analyzer. LowerCaseTokenizer kicks off the analysis process, feeding tokens through the stop-word removal filter and finally stemming the words using the built-in Porter stemmer.
- Listing 4.12 PositionalPorterStopAnalyzer: stemming and stop word removal

package ch4;

import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.PorterStemFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;

public class PositionalPorterStopAnalyzer extends Analyzer {
    private final Set<?> stopWords;

    // Default to Lucene's built-in English stop-word set.
    public PositionalPorterStopAnalyzer() {
        this(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
    }

    public PositionalPorterStopAnalyzer(Set<?> stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Lowercase and tokenize, then drop stop words while keeping positional holes.
        StopFilter stopFilter = new StopFilter(true, new LowerCaseTokenizer(reader), stopWords);
        stopFilter.setEnablePositionIncrements(true);
        // Stem the surviving tokens with the Porter algorithm.
        return new PorterStemFilter(stopFilter);
    }
}

You can then test it with the following code:

AnalyzerUtils.displayTokensWithPositions(new PositionalPorterStopAnalyzer(), "the quick fox jumps");  

The output will be:

2: [quick]
3: [fox]
4: [jump]

Note that "jumps" has been stemmed to "jump". For the implementation of displayTokensWithPositions, see the earlier note on visualizing token positions.
