程式扎記: [ InAction Note ] Ch4. Lucene’s analysis process - Synonyms, aliases, and words that mean the same

Sunday, January 6, 2013

Preface: 
How often have you searched for “spud” and been disappointed that the results didn’t include “potato”? Okay, maybe that precise example doesn’t happen often, but you get the idea: natural languages for some reason have evolved many ways to say the same thing. Such synonyms must be handled during searching, or your users won’t find their documents. 

Our next custom analyzer injects synonyms of words into the outgoing token stream during indexing but places the synonyms in the same position as the original word. By adding synonyms during indexing, searches will find documents that may not contain the original search terms but that match the synonyms of those words. We start with the test case showing how we expect our new analyzer to work, shown in listing 4.6. 
- Listing 4.6 Testing the synonym analyzer 
    public void testJumps() throws Exception {
        // Analyze a single word with SynonymAnalyzer
        SynonymAnalyzer synonymAnalyzer = new SynonymAnalyzer(new TestSynonymEngine());
        TokenStream stream = synonymAnalyzer.tokenStream("contents", new StringReader("jumps"));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
        int i = 0;
        String[] expected = new String[] { "jumps", "hops", "leaps" };
        while (stream.incrementToken()) {
            assertEquals(expected[i], term.term());
            // The original word advances the position by 1; each synonym stays on it (0)
            int expectedPos;
            if (i == 0) {
                expectedPos = 1;
            } else {
                expectedPos = 0;
            }
            assertEquals(expectedPos, posIncr.getPositionIncrement());
            i++;
        }
        assertEquals(3, i);
    }
Notice that our unit test shows not only that synonyms for the word jumps are emitted from the SynonymAnalyzer but also that the synonyms are placed in the same position (using an increment of 0) as the original word. Now that we see what behavior we expect of SynonymAnalyzer, let’s see how to build it. 

Creating SynonymAnalyzer: 
SynonymAnalyzer’s purpose is to first detect the occurrence of words that have synonyms, and second to insert the synonyms at the same position. Figure 4.6 graphically shows what our SynonymAnalyzer does to text input, and listing 4.7 is the implementation. 
 
Figure 4.6 SynonymAnalyzer visualized as factory automation 

- Listing 4.7 SynonymAnalyzer implementation 
    package ch4;

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public class SynonymAnalyzer extends Analyzer {
        private final SynonymEngine engine;

        public SynonymAnalyzer(SynonymEngine engine) {
            this.engine = engine;
        }

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Same chain as StandardAnalyzer: tokenize, normalize, lowercase
            LowerCaseFilter lowercaseFilter = new LowerCaseFilter(
                    new StandardFilter(new StandardTokenizer(Version.LUCENE_30, reader)));
            // true = preserve position increments, so removed stop words leave holes
            StopFilter stopFilter = new StopFilter(true, lowercaseFilter, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            // Last stage: inject synonyms at the position of the original token
            return new SynonymFilter(stopFilter, engine);
        }
    }
Once again, the analyzer code is minimal and simply chains a Tokenizer together with a series of TokenFilters; in fact, this is the StandardAnalyzer wrapped with an additional filter. (See table 4.1 for more on these basic analyzer building blocks.) The final TokenFilter in the chain is the new SynonymFilter (listing 4.8), which gets to the heart of the current discussion. When you’re injecting terms, buffering is needed. This filter uses a Stack as the buffer. 
- Listing 4.8 SynonymFilter: buffering tokens and emitting one at a time 
    package ch4;

    import java.io.IOException;
    import java.util.Stack;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.AttributeSource;

    public class SynonymFilter extends TokenFilter {
        public static final String TOKEN_TYPE_SYNONYM = "SYNONYM";
        private final Stack<String> synonymStack;
        private final SynonymEngine engine;
        private AttributeSource.State current;
        private final TermAttribute termAtt;
        private final PositionIncrementAttribute posIncrAtt;

        public SynonymFilter(TokenStream in, SynonymEngine engine) {
            super(in);
            synonymStack = new Stack<String>(); // 1. Define synonym buffer
            this.engine = engine;
            this.termAtt = addAttribute(TermAttribute.class);
            this.posIncrAtt = addAttribute(PositionIncrementAttribute.class);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (synonymStack.size() > 0) {
                String syn = synonymStack.pop(); // 2. Pop buffered synonyms
                System.out.printf("\t[Test] Emit alias(%s) from stack...\n", syn);
                restoreState(current);
                termAtt.setTermBuffer(syn);
                posIncrAtt.setPositionIncrement(0); // 3. Set position increment to 0
                return true;
            }
            if (!input.incrementToken()) // 4. Read next token
                return false;
            if (addAliasesToStack()) { // 5. Push synonyms onto stack
                System.out.printf("\t[Test] Pushed %d alias(es) onto stack...\n", synonymStack.size());
                current = captureState(); // 6. Save current token
            }
            return true; // 7. Return current token
        }

        private boolean addAliasesToStack() throws IOException {
            String[] synonyms = engine.getSynonyms(termAtt.term()); // 8. Retrieve synonyms
            if (synonyms == null) {
                return false;
            }
            for (String synonym : synonyms) {
                synonymStack.push(synonym); // 9. Push synonyms onto stack
            }
            return true;
        }
    }
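
The buffering pattern in listing 4.8 can be tried without Lucene on the classpath. What follows is a minimal sketch, not the book's code: `SynonymBufferSketch`, its `expand` method, and the `term/increment` output format are names made up for illustration, and a plain `Map` stands in for `SynonymEngine`. It also drains the buffer in a loop instead of returning one token per `incrementToken()` call. The point is only to show why `hops` comes out before `leaps`: synonyms are pushed in array order and popped in reverse.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch of the buffering in listing 4.8; no Lucene classes are used.
final class SynonymBufferSketch {

    // Returns each emitted token as "term/positionIncrement".
    static List<String> expand(List<String> tokens, Map<String, String[]> synonyms) {
        List<String> out = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();   // plays the role of synonymStack
        for (String token : tokens) {
            out.add(token + "/1");                  // the original token advances the position
            String[] syns = synonyms.get(token);
            if (syns != null) {
                for (String syn : syns) {
                    stack.push(syn);                // pushed in array order
                }
            }
            while (!stack.isEmpty()) {
                out.add(stack.pop() + "/0");        // popped in reverse, stacked on the same position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String[]> synonyms = new HashMap<>();
        synonyms.put("jumps", new String[] { "leaps", "hops" });
        System.out.println(expand(Arrays.asList("fox", "jumps"), synonyms));
    }
}
```

Running it prints `[fox/1, jumps/1, hops/0, leaps/0]`, which is the same order the test in listing 4.6 expects.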
 

The design of SynonymAnalyzer allows for pluggable SynonymEngine implementations. SynonymEngine is a one-method interface: 
    package ch4;

    import java.io.IOException;

    public interface SynonymEngine {
        String[] getSynonyms(String s) throws IOException;
    }
Using an interface for this design easily allows test implementations. We leave it as an exercise for you to create production-quality SynonymEngine implementations. For our examples, we use a simple test that’s hard-coded with a few synonyms: 
    package ch4;

    import java.util.HashMap;
    import java.util.Map;

    public class TestSynonymEngine implements SynonymEngine {
        // Hard-coded, one-way synonyms used only for testing
        private static final Map<String, String[]> map = new HashMap<String, String[]>();
        static {
            map.put("quick", new String[] { "fast", "speedy" });
            map.put("jumps", new String[] { "leaps", "hops" });
            map.put("over", new String[] { "above" });
            map.put("lazy", new String[] { "apathetic", "sluggish" });
            map.put("dog", new String[] { "canine", "pooch" });
        }

        // Returns null when the word has no synonyms
        public String[] getSynonyms(String s) {
            return map.get(s);
        }
    }
Notice that the synonyms generated by TestSynonymEngine are one-way: quick has the synonyms fast and speedy, but fast has no synonyms. In a real production environment, you should ensure all synonyms list one another as alternate synonyms, but because we’re using this for simple testing, it’s fine. 
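
One way to make synonyms list one another is to register them as groups rather than as one-way pairs. The following is a rough sketch under that assumption. `SymmetricSynonymEngine` and `addGroup` are hypothetical names, not part of the book or of Lucene, and the class leaves out `implements SynonymEngine` only so that it compiles on its own.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: every word in a group lists all the other words as synonyms.
final class SymmetricSynonymEngine {
    private final Map<String, Set<String>> map = new HashMap<>();

    // Registers a group of mutually interchangeable words.
    void addGroup(String... words) {
        for (String word : words) {
            Set<String> others = map.computeIfAbsent(word, k -> new LinkedHashSet<>());
            for (String other : words) {
                if (!other.equals(word)) {
                    others.add(other);
                }
            }
        }
    }

    // Same contract as SynonymEngine.getSynonyms: null when the word has no synonyms.
    String[] getSynonyms(String s) {
        Set<String> others = map.get(s);
        return (others == null || others.isEmpty()) ? null : others.toArray(new String[0]);
    }

    public static void main(String[] args) {
        SymmetricSynonymEngine engine = new SymmetricSynonymEngine();
        engine.addGroup("quick", "fast", "speedy");
        System.out.println(Arrays.toString(engine.getSynonyms("fast")));   // [quick, speedy]
        System.out.println(Arrays.toString(engine.getSynonyms("brown")));  // null
    }
}
```

With groups, a search for fast and a search for quick behave the same way, which the one-way test engine cannot offer.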

Setting the position increment seems powerful, and indeed it is. Modify increments only if you are aware of some odd cases that arise in searching, though. Because synonyms are indexed just like other terms, TermQuery works as expected. Also, PhraseQuery works as expected when we use a synonym in place of an original word. The SynonymAnalyzerTest test case in listing 4.9 demonstrates things working well using API-created queries. 
- Listing 4.9 SynonymAnalyzerTest: showing that synonym queries work 
    package ch4;

    import java.io.File;

    import junit.framework.TestCase;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class SynonymAnalyzerTest extends TestCase {
        public static Version LUCENE_VERSION = Version.LUCENE_30;
        public File idxRoot = new File("./test");
        private IndexSearcher searcher;
        public IndexWriter writer;
        private static SynonymAnalyzer synonymAnalyzer = new SynonymAnalyzer(new TestSynonymEngine());

        public void setUp() throws Exception {
            Directory directory = FSDirectory.open(idxRoot);
            IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION, synonymAnalyzer);
            writer = new IndexWriter(directory, iwConfig);
            writer.deleteAll();
            // Index a single document through SynonymAnalyzer
            Document doc = new Document();
            doc.add(new Field("content",
                    "The quick brown fox jumps over the lazy dog", Field.Store.YES,
                    Field.Index.ANALYZED));
            writer.addDocument(doc);
            // Open a near-real-time reader on the writer so the new document is searchable
            IndexReader reader = IndexReader.open(writer, true);
            searcher = new IndexSearcher(reader);
        }

        public void tearDown() throws Exception {
            searcher.close();
            writer.close();
        }

        public static int hitCount(IndexSearcher searcher, Query query) throws Exception {
            TopDocs matches = searcher.search(query, 10);
            return matches.totalHits;
        }

        public void testSearchByAPI() throws Exception {
            // A synonym alone matches
            TermQuery tq = new TermQuery(new Term("content", "hops"));
            assertEquals(1, hitCount(searcher, tq));
            // A phrase using the synonym in place of "jumps" matches too
            PhraseQuery pq = new PhraseQuery();
            pq.add(new Term("content", "fox"));
            pq.add(new Term("content", "hops"));
            assertEquals(1, hitCount(searcher, pq));
        }
    }
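
Why a phrase containing a synonym can match comes down to positions. The toy index below is only an illustration of that idea, assuming a phrase matches when its terms sit at consecutive positions. `PositionalIndexSketch`, `add`, and `matchesPhrase` are invented names, and real Lucene postings and PhraseQuery scoring are far more involved.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy positional index; it only mimics how positions make phrase matching work.
final class PositionalIndexSketch {
    private final Map<String, Set<Integer>> positions = new HashMap<>();
    private int position = 0;

    // Adds a term; an increment of 0 stacks it on the previous term's position.
    void add(String term, int increment) {
        position += increment;
        positions.computeIfAbsent(term, k -> new HashSet<>()).add(position);
    }

    // True when the terms occur at consecutive positions, like an exact phrase query.
    boolean matchesPhrase(String... terms) {
        if (terms.length == 0) {
            return false;
        }
        Set<Integer> starts = positions.get(terms[0]);
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                Set<Integer> p = positions.get(terms[i]);
                ok = p != null && p.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.add("fox", 1);
        index.add("jumps", 1);
        index.add("hops", 0);    // synonym shares the position of "jumps"
        index.add("leaps", 0);
        index.add("over", 1);
        System.out.println(index.matchesPhrase("fox", "hops"));    // true
        System.out.println(index.matchesPhrase("hops", "over"));   // true
        System.out.println(index.matchesPhrase("jumps", "hops"));  // false: same position, not adjacent
    }
}
```
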
The phrase “…fox jumps…” was indexed, and our SynonymAnalyzer injected hops in the same position as jumps. A TermQuery for hops succeeded, as did an exact PhraseQuery for “fox hops.” Excellent! Let’s test it with QueryParser. We’ll run two tests. The first one creates QueryParser using our SynonymAnalyzer and the second one using StandardAnalyzer, as shown in listing 4.10. 
- Listing 4.10 Testing SynonymAnalyzer with QueryParser 
    public void testWithQueryParser() throws Exception {
        // Parse the phrase with SynonymAnalyzer: synonyms are expanded in the query
        Query query = new QueryParser(Version.LUCENE_30, "content", synonymAnalyzer).parse("\"fox jumps\"");
        assertEquals(1, hitCount(searcher, query));
        System.out.println("With SynonymAnalyzer, \"fox jumps\" parses to " + query.toString("content"));
        // Parse the same phrase with StandardAnalyzer: no expansion
        query = new QueryParser(Version.LUCENE_30, "content", new StandardAnalyzer(Version.LUCENE_30)).parse("\"fox jumps\"");
        assertEquals(1, hitCount(searcher, query));
        System.out.println("With StandardAnalyzer, \"fox jumps\" parses to " + query.toString("content"));
    }
Both analyzers find the matching document just fine, which is great. The test produces the following output: 
With SynonymAnalyzer, "fox jumps" parses to "fox (jumps hops leaps)"
With StandardAnalyzer, "fox jumps" parses to "fox jumps"

As expected, with SynonymAnalyzer, words in our query were expanded to their synonyms. QueryParser is smart enough to notice that the tokens produced by the analyzer have zero position increment, and when that happens inside a phrase query, it creates a MultiPhraseQuery, described in section 5.3.

But expanding synonyms both at index time and at search time is wasteful and unnecessary: we only need synonym expansion during indexing or during searching, not both. If you choose to expand during indexing, the disk space consumed by your index will be somewhat larger, but searching may be faster because there are fewer search terms to visit. On the other hand, your synonyms have been baked into the index, so you don’t have the freedom to quickly change them and see the impact of such changes during searching. If instead you expand at search time, you can see fast turnaround when testing. These are simply trade-offs, and which option is best is your decision based on your application’s constraints. 
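
The trade-off can be made concrete with a few sets. This is a simplified sketch, not Lucene code: `ExpansionTradeoffSketch`, `expand`, and `matches` are illustrative names, a document is reduced to a set of terms, and the synonym table is symmetric on purpose so that both strategies can find the same document.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch comparing where synonym expansion happens; not Lucene code.
final class ExpansionTradeoffSketch {
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        // Symmetric on purpose so that both strategies can find the same documents.
        SYNONYMS.put("jumps", Arrays.asList("hops", "leaps"));
        SYNONYMS.put("hops", Arrays.asList("jumps", "leaps"));
        SYNONYMS.put("leaps", Arrays.asList("jumps", "hops"));
    }

    // Returns the given terms plus all of their synonyms.
    static Set<String> expand(List<String> terms) {
        Set<String> out = new LinkedHashSet<>();
        for (String term : terms) {
            out.add(term);
            out.addAll(SYNONYMS.getOrDefault(term, Collections.<String>emptyList()));
        }
        return out;
    }

    // A document matches when it shares at least one term with the query.
    static boolean matches(Set<String> indexedTerms, Set<String> queryTerms) {
        return !Collections.disjoint(indexedTerms, queryTerms);
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("fox", "jumps");
        List<String> query = Collections.singletonList("hops");

        // Expand at index time: larger index, one-term query.
        Set<String> bigIndex = expand(doc);
        System.out.println(bigIndex.size() + " indexed terms, match="
                + matches(bigIndex, new LinkedHashSet<>(query)));

        // Expand at search time: small index, larger query.
        Set<String> bigQuery = expand(query);
        System.out.println(bigQuery.size() + " query terms, match="
                + matches(new LinkedHashSet<>(doc), bigQuery));
    }
}
```

Both strategies find the document. The first stores four terms instead of two, and the second searches three terms instead of one, which is the cost each side of the trade-off pays.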

Visualizing token positions: 
Our AnalyzerUtils.displayTokens doesn’t show us all the information when dealing with analyzers that set position increments other than 1. To get a better view of these types of analyzers, we add an additional utility method, displayTokensWithPositions, to AnalyzerUtils, as shown in listing 4.11. 
- Listing 4.11 Visualizing the position increment of each token 
    public static void displayTokensWithPositions(Analyzer analyzer, String text) throws IOException {
        TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
        TermAttribute term = stream.addAttribute(TermAttribute.class);
        PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
        int position = 0;
        while (stream.incrementToken()) {
            int increment = posIncr.getPositionIncrement();
            // A positive increment starts a new position, so start a new output line
            if (increment > 0) {
                position = position + increment;
                System.out.println();
                System.out.print(position + ": ");
            }
            // Tokens with increment 0 are printed on the same line as the previous token
            System.out.print("[" + term.term() + "] ");
        }
        System.out.println();
    }
We wrote a quick piece of code to see what our SynonymAnalyzer is doing: 
    package ch4;

    import java.io.IOException;

    import john.utils.AnalyzerUtils;

    public class SynonymAnalyzerViewer {
        public static void main(String[] args) throws IOException {
            SynonymEngine engine = new TestSynonymEngine();
            AnalyzerUtils.displayTokensWithPositions(new SynonymAnalyzer(engine),
                    "The quick brown fox jumps over the lazy dog");
        }
    }
And we can now visualize the synonyms placed in the same positions as the original words: 
2: [quick] [speedy] [fast]
3: [brown]
4: [fox]
5: [jumps] [hops] [leaps]
6: [over] [above]
8: [lazy] [sluggish] [apathetic]
9: [dog] [pooch] [canine]

Each number on the left represents the token position. Positions 1 and 7 are missing because the StopFilter in listing 4.7 removed both occurrences of “the” while preserving position increments, so the stop words left holes. Multiple terms shown for a single position illustrate where synonyms were added.
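
How a removed stop word turns into a gap in the position numbers can be sketched in plain Java. This is an assumption-laden illustration rather than Lucene's StopFilter: `StopHoleSketch` and `filter` are made-up names, and the only behavior modeled is that each dropped token adds one to the position increment of the next kept token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a stop filter that preserves position increments; not Lucene code.
final class StopHoleSketch {

    // Returns "position: term" for each surviving token.
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        int position = 0;
        int skipped = 0;                      // stop words dropped since the last kept token
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                skipped++;                    // remember the hole instead of closing it
                continue;
            }
            int increment = 1 + skipped;      // the kept token jumps over the holes
            position += increment;
            skipped = 0;
            out.add(position + ": " + token);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the"));
        List<String> tokens = Arrays.asList(
                "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog");
        for (String line : filter(tokens, stopWords)) {
            System.out.println(line);
        }
    }
}
```

The printed positions run 2, 3, 4, 5, 6, 8, 9, the same numbering as in the output above.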
