How often have you searched for “spud” and been disappointed that the results didn’t include “potato”? Okay, maybe that precise example doesn’t happen often, but you get the idea: natural languages for some reason have evolved many ways to say the same thing. Such synonyms must be handled during searching, or your users won’t find their documents.
Our next custom analyzer injects synonyms of words into the outgoing token stream during indexing, placing each synonym in the same position as the original word. Because the synonyms are added during indexing, searches will find documents that may not contain the original search terms but that match the synonyms of those words. We start with the test case showing how we expect our new analyzer to work, shown in listing 4.6.
- Listing 4.6 Testing the synonym analyzer
- public void testJumps() throws Exception {
- // Analyze with SynonymAnalyzer
- SynonymAnalyzer synonymAnalyzer = new SynonymAnalyzer( new TestSynonymEngine());
- TokenStream stream = synonymAnalyzer.tokenStream("contents", new StringReader("jumps"));
- TermAttribute term = stream.addAttribute(TermAttribute.class);
- PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
- int i = 0;
- String[] expected = new String[] { "jumps", "hops", "leaps" };
- while (stream.incrementToken()) {
- assertEquals(expected[i], term.term());
- int expectedPos;
- if (i == 0) {
- expectedPos = 1;
- } else {
- expectedPos = 0;
- }
- assertEquals(expectedPos, posIncr.getPositionIncrement());
- i++;
- }
- assertEquals(3, i);
- }
Creating SynonymAnalyzer:
SynonymAnalyzer’s purpose is to first detect the occurrence of words that have synonyms, and second to insert the synonyms at the same position. Figure 4.6 graphically shows what our SynonymAnalyzer does to text input, and listing 4.7 is the implementation.
Figure 4.6 SynonymAnalyzer visualized as factory automation
- Listing 4.7 SynonymAnalyzer implementation
- package ch4;
- import java.io.Reader;
- import org.apache.lucene.analysis.Analyzer;
- import org.apache.lucene.analysis.LowerCaseFilter;
- import org.apache.lucene.analysis.StopAnalyzer;
- import org.apache.lucene.analysis.StopFilter;
- import org.apache.lucene.analysis.TokenStream;
- import org.apache.lucene.analysis.standard.StandardFilter;
- import org.apache.lucene.analysis.standard.StandardTokenizer;
- import org.apache.lucene.util.Version;
- public class SynonymAnalyzer extends Analyzer {
- private SynonymEngine engine;
- public SynonymAnalyzer(SynonymEngine engine) {
- this.engine = engine;
- }
- public TokenStream tokenStream(String fieldName, Reader reader) {
- LowerCaseFilter lowercaseFilter = new LowerCaseFilter(new StandardFilter(new StandardTokenizer(Version.LUCENE_30, reader)));
- StopFilter stopFilter = new StopFilter(true, lowercaseFilter,StopAnalyzer.ENGLISH_STOP_WORDS_SET);
- TokenStream result = new SynonymFilter(stopFilter, engine);
- return result;
- }
- }
- Listing 4.8 SynonymFilter: buffering tokens and emitting one at a time
- package ch4;
- import java.io.IOException;
- import java.util.Stack;
- import org.apache.lucene.analysis.TokenFilter;
- import org.apache.lucene.analysis.TokenStream;
- import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
- import org.apache.lucene.analysis.tokenattributes.TermAttribute;
- import org.apache.lucene.util.AttributeSource;
- public class SynonymFilter extends TokenFilter {
- public static final String TOKEN_TYPE_SYNONYM = "SYNONYM";
- private Stack<String> synonymStack;
- private SynonymEngine engine;
- private AttributeSource.State current;
- private final TermAttribute termAtt;
- private final PositionIncrementAttribute posIncrAtt;
- public SynonymFilter(TokenStream in, SynonymEngine engine) {
- super(in);
- synonymStack = new Stack<String>(); // 1. Define synonym buffer
- this.engine = engine;
- this.termAtt = addAttribute(TermAttribute.class);
- this.posIncrAtt = addAttribute(PositionIncrementAttribute.class);
- }
- public boolean incrementToken() throws IOException {
- if (synonymStack.size() > 0) {
- String syn = synonymStack.pop(); // 2. Pop buffered synonyms
- System.out.printf("\t[Test] Pop synonym (%s) from stack\n", syn);
- restoreState(current);
- termAtt.setTermBuffer(syn);
- posIncrAtt.setPositionIncrement(0); // 3. Set position increment to 0
- return true;
- }
- if (!input.incrementToken()) // 4. Read next token
- return false;
- if (addAliasesToStack()) { // 5. Push synonyms onto stack
- System.out.printf("\t[Test] Pushed synonyms, stack size is now %d\n", synonymStack.size());
- current = captureState(); // 6. Save current token
- }
- return true; // 7. Return current token
- }
- private boolean addAliasesToStack() throws IOException {
- String[] synonyms = engine.getSynonyms(termAtt.term()); // 8. Retrieve synonyms
- if (synonyms == null) {
- return false;
- }
- for (String synonym : synonyms) {
- synonymStack.push(synonym); // 9. Push synonyms onto stack
- }
- return true;
- }
- }
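The buffering in SynonymFilter can be easier to follow without the Lucene plumbing. The following is a minimal, Lucene-free sketch of the same idea, not the filter itself: the class name SynonymStackSketch and the "term/increment" string format are made up for illustration, and it assumes every original token arrives with a position increment of 1 (no holes from removed stop words). It also shows why the test in listing 4.6 expects "hops" before "leaps": synonyms are pushed in array order and popped in reverse.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Lucene-free model of SynonymFilter's buffering logic.
class SynonymStackSketch {

    // Returns one "term/increment" string per emitted token.
    static List<String> expand(List<String> tokens, Map<String, String[]> synonyms) {
        List<String> out = new ArrayList<String>();
        Deque<String> stack = new ArrayDeque<String>();
        for (String token : tokens) {
            // Assumption: the original token always advances the position by 1.
            out.add(token + "/1");
            String[] syns = synonyms.get(token);
            if (syns != null) {
                for (String syn : syns) {
                    stack.push(syn);           // buffer synonyms (step 9 in listing 4.8)
                }
            }
            while (!stack.isEmpty()) {
                out.add(stack.pop() + "/0");   // popped synonyms share the token's position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String[]> synonyms = new HashMap<String, String[]>();
        synonyms.put("jumps", new String[] { "leaps", "hops" });
        List<String> tokens = new ArrayList<String>();
        tokens.add("fox");
        tokens.add("jumps");
        System.out.println(expand(tokens, synonyms));
        // prints [fox/1, jumps/1, hops/0, leaps/0]
    }
}
```

The real filter emits one token per incrementToken() call and restores the captured state before overwriting the term, so attributes such as offsets are carried over from the original token; the sketch ignores all of that and keeps only the ordering and the increments.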
The design of SynonymAnalyzer allows for pluggable SynonymEngine implementations. SynonymEngine is a one-method interface:
- package ch4;
- import java.io.IOException;
- public interface SynonymEngine {
- String[] getSynonyms(String s) throws IOException;
- }
- package ch4;
- import java.util.HashMap;
- public class TestSynonymEngine implements SynonymEngine{
- private static HashMap<String, String[]> map = new HashMap<String, String[]>();
- static {
- map.put("quick", new String[] { "fast", "speedy" });
- map.put("jumps", new String[] { "leaps", "hops" });
- map.put("over", new String[] { "above" });
- map.put("lazy", new String[] { "apathetic", "sluggish" });
- map.put("dog", new String[] { "canine", "pooch" });
- }
- public String[] getSynonyms(String s) {
- return map.get(s);
- }
- }
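Because the engine is a one-method interface, other implementations can be swapped in without touching the analyzer or the filter. As one possible illustration, here is a sketch of an engine that stores symmetric synonym groups, so that "spud" maps to "potato" and "potato" maps back to "spud". GroupSynonymEngine, addGroup, and the wrapper class PluggableEngineSketch are hypothetical names, and the interface is redeclared inside the sketch only so that it compiles on its own.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PluggableEngineSketch {

    // Same shape as the SynonymEngine interface above, redeclared so the sketch stands alone.
    interface SynonymEngine {
        String[] getSynonyms(String s) throws IOException;
    }

    // Hypothetical engine: every word in a group is a synonym of every other word in it.
    static class GroupSynonymEngine implements SynonymEngine {
        private final Map<String, String[]> map = new HashMap<String, String[]>();

        void addGroup(String... words) {
            for (String word : words) {
                List<String> others = new ArrayList<String>();
                for (String other : words) {
                    if (!other.equals(word)) {
                        others.add(other);
                    }
                }
                map.put(word, others.toArray(new String[0]));
            }
        }

        // Returns null for unknown words, matching what TestSynonymEngine does.
        public String[] getSynonyms(String s) {
            return map.get(s);
        }
    }

    public static void main(String[] args) throws IOException {
        GroupSynonymEngine groups = new GroupSynonymEngine();
        groups.addGroup("spud", "potato", "tater");
        SynonymEngine engine = groups;  // usable wherever a SynonymEngine is expected
        System.out.println(Arrays.toString(engine.getSynonyms("spud")));
        // prints [potato, tater]
    }
}
```

TestSynonymEngine, by contrast, is one-directional: "jumps" yields "hops", but "hops" yields nothing. That is enough for index-time expansion as used here.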
Setting the position increment seems powerful, and indeed it is. You should modify increments only if you are aware of some odd cases that arise in searching, though. Because synonyms are indexed just like other terms, TermQuery works as expected. PhraseQuery also works as expected when we use a synonym in place of an original word. The SynonymAnalyzerTest test case in listing 4.9 demonstrates things working well using API-created queries.
- Listing 4.9 SynonymAnalyzerTest: showing that synonym queries work
- package ch4;
- import java.io.File;
- import junit.framework.TestCase;
- import org.apache.lucene.document.Document;
- import org.apache.lucene.document.Field;
- import org.apache.lucene.index.IndexReader;
- import org.apache.lucene.index.IndexWriter;
- import org.apache.lucene.index.IndexWriterConfig;
- import org.apache.lucene.index.Term;
- import org.apache.lucene.search.IndexSearcher;
- import org.apache.lucene.search.PhraseQuery;
- import org.apache.lucene.search.Query;
- import org.apache.lucene.search.TermQuery;
- import org.apache.lucene.search.TopDocs;
- import org.apache.lucene.store.Directory;
- import org.apache.lucene.store.FSDirectory;
- import org.apache.lucene.util.Version;
- public class SynonymAnalyzerTest extends TestCase {
- public static Version LUCENE_VERSION = Version.LUCENE_30;
- public File idxRoot = new File("./test");
- private IndexSearcher searcher;
- public IndexWriter writer;
- private static SynonymAnalyzer synonymAnalyzer = new SynonymAnalyzer(new TestSynonymEngine());
- public void setUp() throws Exception {
- Directory directory = FSDirectory.open(idxRoot);
- IndexWriterConfig iwConfig = new IndexWriterConfig(LUCENE_VERSION, synonymAnalyzer);
- writer = new IndexWriter(directory, iwConfig);
- writer.deleteAll();
- Document doc = new Document();
- doc.add(new Field("content",
- "The quick brown fox jumps over the lazy dog", Field.Store.YES,
- Field.Index.ANALYZED));
- writer.addDocument(doc);
- IndexReader reader = IndexReader.open(writer, true);
- searcher = new IndexSearcher(reader);
- }
- public void tearDown() throws Exception {
- searcher.close();
- writer.close();
- }
- public static int hitCount(IndexSearcher searcher, Query query) throws Exception
- {
- TopDocs matches = searcher.search(query, 10);
- return matches.totalHits;
- }
- public void testSearchByAPI() throws Exception {
- TermQuery tq = new TermQuery(new Term("content", "hops"));
- assertEquals(1, hitCount(searcher, tq));
- PhraseQuery pq = new PhraseQuery();
- pq.add(new Term("content", "fox"));
- pq.add(new Term("content", "hops"));
- assertEquals(1, hitCount(searcher, pq));
- }
- }
- Listing 4.10 Testing SynonymAnalyzer with QueryParser
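- // Add this method to SynonymAnalyzerTest; it also needs these imports:
- // import org.apache.lucene.analysis.standard.StandardAnalyzer;
- // import org.apache.lucene.queryParser.QueryParser;
- public void testWithQueryParser() throws Exception {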
- Query query = new QueryParser(Version.LUCENE_30, "content", synonymAnalyzer).parse("\"fox jumps\"");
- assertEquals(1, hitCount(searcher, query));
- System.out.println("With SynonymAnalyzer, \"fox jumps\" parses to " + query.toString("content"));
- query = new QueryParser(Version.LUCENE_30, "content", new StandardAnalyzer(Version.LUCENE_30)).parse("\"fox jumps\"");
- assertEquals(1, hitCount(searcher, query));
- System.out.println("With StandardAnalyzer, \"fox jumps\" parses to " + query.toString("content"));
- }
As expected, with SynonymAnalyzer, words in our query were expanded to their synonyms. QueryParser is smart enough to notice that the tokens produced by the analyzer have zero position increment, and when that happens inside a phrase query, it creates a MultiPhraseQuery, described in section 5.3.
But expanding synonyms at both indexing and search time, as the first query above does, is wasteful and unnecessary: we only need synonym expansion during indexing or during searching, not both. If you choose to expand during indexing, the disk space consumed by your index will be somewhat larger, but searching may be faster because there are fewer search terms to visit. On the other hand, your synonyms have been baked into the index, so you don’t have the freedom to quickly change them and see the impact of such changes during searching. If instead you expand at search time, you can see fast turnaround when testing. These are simply trade-offs, and which option is best is your decision based on your application’s constraints.
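To make the trade-off concrete, here is a rough, Lucene-free sketch that treats a document and a query as plain sets of terms. ExpansionTradeoffSketch, terms, and matches are hypothetical names, matching is reduced to "shares at least one term", and the single synonym entry is copied from TestSynonymEngine. It is a toy model of the idea, not how Lucene stores or searches an index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ExpansionTradeoffSketch {

    static final Map<String, String[]> SYNONYMS = new HashMap<String, String[]>();
    static {
        SYNONYMS.put("jumps", new String[] { "leaps", "hops" });
    }

    // Splits text into lowercase terms, optionally adding each term's synonyms.
    static Set<String> terms(String text, boolean expand) {
        Set<String> result = new HashSet<String>();
        for (String word : text.toLowerCase().split("\\s+")) {
            result.add(word);
            String[] syns = SYNONYMS.get(word);
            if (expand && syns != null) {
                result.addAll(Arrays.asList(syns));
            }
        }
        return result;
    }

    // Toy matching rule: the document matches if it shares any term with the query.
    static boolean matches(Set<String> indexed, Set<String> query) {
        for (String term : query) {
            if (indexed.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Index-time expansion: 4 indexed terms, 1 query term.
        System.out.println(matches(terms("fox jumps", true), terms("hops", false)));  // prints true
        // Query-time expansion: 2 indexed terms, 3 query terms.
        System.out.println(matches(terms("fox leaps", false), terms("jumps", true))); // prints true
        // No expansion on either side: the synonym is not found.
        System.out.println(matches(terms("fox jumps", false), terms("hops", false))); // prints false
    }
}
```

Either side alone is enough to bridge the synonym; the cost simply moves between a larger index and a larger query. Note that with a one-directional synonym map, query-time expansion only helps for words that appear as keys in the map.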
Visualizing token positions:
Our AnalyzerUtils.displayTokens doesn’t show us all the information when dealing with analyzers that set position increments other than 1. To get a better view of these types of analyzers, we add an additional utility method, displayTokensWithPositions, to AnalyzerUtils, as shown in listing 4.11.
- Listing 4.11 Visualizing the position increment of each token
- public static void displayTokensWithPositions(Analyzer analyzer, String text) throws IOException {
- TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
- TermAttribute term = stream.addAttribute(TermAttribute.class);
- PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
- int position = 0;
- while (stream.incrementToken()) {
- int increment = posIncr.getPositionIncrement();
- if (increment > 0) {
- position = position + increment;
- System.out.println();
- System.out.print(position + ": ");
- }
- System.out.print("[" + term.term() + "] ");
- }
- System.out.println();
- }
- package ch4;
- import java.io.IOException;
- import john.utils.AnalyzerUtils;
- public class SynonymAnalyzerViewer {
- public static void main(String[] args) throws IOException {
- SynonymEngine engine = new TestSynonymEngine();
- AnalyzerUtils.displayTokensWithPositions(new SynonymAnalyzer(engine),
- "The quick brown fox jumps over the lazy dog");
- }
- }
Each number on the left represents the token position. The numbers here are continuous, but they wouldn’t be if the analyzer left holes (as you’ll see with the next custom analyzer). Multiple terms shown for a single position illustrate where synonyms were added.
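The arithmetic behind those position numbers is just a running sum of the increments. The sketch below shows it without Lucene, using a made-up token stream in which a removed stop word leaves a hole; PositionSketch and byPosition are hypothetical names, and the terms and increments are hand-written rather than produced by a real analyzer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PositionSketch {

    // Groups terms by absolute position, where position is the running sum of increments.
    static Map<Integer, List<String>> byPosition(String[] terms, int[] increments) {
        Map<Integer, List<String>> positions = new LinkedHashMap<Integer, List<String>>();
        int position = 0;
        for (int i = 0; i < terms.length; i++) {
            position += increments[i];  // an increment of 0 keeps the previous position
            List<String> atPosition = positions.get(position);
            if (atPosition == null) {
                atPosition = new ArrayList<String>();
                positions.put(position, atPosition);
            }
            atPosition.add(terms[i]);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Hand-written stream for "the fox jumps": assume "the" was removed and left
        // a hole, so "fox" arrives with an increment of 2 instead of 1.
        String[] terms = { "fox", "jumps", "hops", "leaps" };
        int[] increments = { 2, 1, 0, 0 };
        System.out.println(byPosition(terms, increments));
        // prints {2=[fox], 3=[jumps, hops, leaps]}
    }
}
```

Position 1 never appears in the result because of the hole, and the three terms at position 3 are the original word followed by its synonyms.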