Tuesday, December 25, 2012

[ InAction Note ] Ch4. Lucene’s analysis process - Using the built-in analyzers

Preface: 
Lucene includes several built-in analyzers, created by chaining together certain combinations of the built-in Tokenizers and TokenFilters. The primary ones are shown in table 4.3. We’ll discuss certain language-specific contrib analyzers in section 4.8.2 and the special PerFieldAnalyzerWrapper in section 4.7.2.
 

The built-in analyzers (WhitespaceAnalyzer, SimpleAnalyzer, StopAnalyzer, KeywordAnalyzer, and StandardAnalyzer) are designed to work with text in almost any Western (European-based) language. You can see the effect of each of these analyzers, except KeywordAnalyzer, in the output in section 4.1. WhitespaceAnalyzer and SimpleAnalyzer are truly trivial: the one-line description in table 4.3 pretty much sums them up, so we don’t cover them further here. We cover KeywordAnalyzer in section 4.7.3. We explore the StopAnalyzer and StandardAnalyzer in more depth because they have nontrivial effects.

Visualizing analyzers: 
Normally, the tokens produced by analysis are silently absorbed by indexing. Yet seeing the tokens is a great way to gain a concrete understanding of the analysis process. In this section we’ll show you how to do just that. Specifically, we’ll show you the source code that generated the token examples here. Along the way we’ll see that a token consists of several interesting attributes, including term, positionIncrement, offset, type, flags, and payload.

We begin with listing 4.1, AnalyzerDemo, which analyzes two predefined phrases using Lucene’s core analyzers. Each phrase is analyzed by all the analyzers, then the tokens are displayed with bracketed output to indicate what would be indexed.
- Listing 4.1 AnalyzerDemo: seeing analysis in action 
    package ch3;

    import java.io.IOException;
    import john.utils.AnalyzerUtils;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class AnalyzerDemo {
        private static final String[] examples = {
                "The quick brown fox jumped over the lazy dog",
                "XY&Z Corporation - xyz@example.com" };

        private static final Analyzer[] analyzers = new Analyzer[] {
                new WhitespaceAnalyzer(), new SimpleAnalyzer(),
                new StopAnalyzer(Version.LUCENE_30),
                new StandardAnalyzer(Version.LUCENE_30) };

        public static void main(String[] args) throws IOException {
            // Analyze the command-line arguments if given, else the built-in examples.
            String[] strings = examples;
            if (args.length > 0) {
                strings = args;
            }
            for (String text : strings) {
                analyze(text);
            }
        }

        private static void analyze(String text) throws IOException {
            System.out.println("Analyzing \"" + text + "\"");
            // Run every core analyzer over the same text and print its tokens.
            for (Analyzer analyzer : analyzers) {
                String name = analyzer.getClass().getSimpleName();
                System.out.println("  " + name + ":");
                System.out.print("    ");
                AnalyzerUtils.displayTokens(analyzer, text);
                System.out.println("\n");
            }
        }
    }
The real fun happens in AnalyzerUtils (listing 4.2), where the analyzer is applied to the text and the tokens are extracted. AnalyzerUtils passes text to an analyzer without indexing it and pulls the results in a manner similar to what happens during the indexing process under the covers of IndexWriter.
Listing 4.2 AnalyzerUtils: delving into an analyzer 
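    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;

    public class AnalyzerUtils {
        // Analyze the text under an arbitrary field name and print its tokens.
        public static void displayTokens(Analyzer analyzer, String text) throws IOException {
            displayTokens(analyzer.tokenStream("contents", new StringReader(text)));
        }

        // Print the term of every token the stream produces, in brackets.
        public static void displayTokens(TokenStream stream) throws IOException {
            TermAttribute term = stream.addAttribute(TermAttribute.class);
            while (stream.incrementToken()) {
                System.out.print("[" + term.term() + "] ");
            }
        }
    }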
Execution result of Listing 4.1: 
 

Generally you wouldn’t invoke the analyzer’s tokenStream method explicitly except for this type of diagnostic or informational purpose. Note that the field name contents is arbitrary in the displayTokens() method. We recommend keeping a utility like this handy to see what tokens emit from your analyzers of choice. 
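
displayTokens() prints only the term of each token. As a rough illustration of two of the other attributes mentioned earlier (offset and positionIncrement), here is a small Python sketch that models a whitespace tokenizer. The Token class, its field names, and whitespace_tokens() are assumptions made up for this sketch; they are not Lucene's API, and the sketch does not try to reproduce Lucene's behavior exactly.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a token's attributes (not Lucene's API)."""
    term: str
    start_offset: int        # index of the token's first character
    end_offset: int          # index one past the token's last character
    position_increment: int  # distance from the previous token's position


def whitespace_tokens(text):
    """Split on whitespace, recording character offsets for every token."""
    return [Token(m.group(), m.start(), m.end(), 1)
            for m in re.finditer(r"\S+", text)]


for token in whitespace_tokens("XY&Z Corporation - xyz@example.com"):
    print("[%s] %d-%d (+%d)" % (token.term, token.start_offset,
                                token.end_offset, token.position_increment))
```

Every token here has a position increment of 1 because nothing is dropped; a filter that removes tokens could instead raise the increment of the next surviving token to record the gap.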

StopAnalyzer: 
StopAnalyzer, beyond doing basic word splitting and lowercasing, also removes special words called stop words. Stop words are words that are very common, such as the, and thus assumed to carry very little standalone meaning for searching since nearly every document will contain the word.

Embedded in StopAnalyzer is the following set of common English stop words, defined as ENGLISH_STOP_WORDS_SET
 

The StopAnalyzer has a second constructor that allows you to pass your own set instead. Under the hood, StopAnalyzer creates a StopFilter to perform the filtering. Section 4.6.1 describes StopFilter in more detail.
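
To make the three steps concrete (split into words, lowercase, drop stop words), here is a minimal Python sketch of the idea. It is not Lucene code: the function names are invented for illustration, and STOP_WORDS is a deliberately small, hypothetical set rather than Lucene's ENGLISH_STOP_WORDS_SET.

```python
import re

# Hypothetical, deliberately small stop-word set for illustration only;
# it is NOT Lucene's ENGLISH_STOP_WORDS_SET.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to"})


def letter_tokens(text):
    """Split the text at non-letter characters and lowercase each piece."""
    return [m.group().lower() for m in re.finditer(r"[^\W\d_]+", text)]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def stop_analyze(text, stop_words=STOP_WORDS):
    """Chain the two steps, the way an analyzer chains a tokenizer and a filter."""
    return remove_stop_words(letter_tokens(text), stop_words)


print(stop_analyze("The quick brown fox jumped over the lazy dog"))
# A caller-supplied set plays the role of the second constructor.
print(stop_analyze("The quick brown fox", stop_words={"quick"}))
```

Passing a different stop_words argument mirrors the idea of handing StopAnalyzer your own set: the tokenizing step stays the same and only the filter changes.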

StandardAnalyzer: 
StandardAnalyzer holds the honor as the most generally useful built-in analyzer. A JFlex-based grammar underlies it, tokenizing with cleverness for the following lexical types: alphanumerics, acronyms, company names, email addresses, computer hostnames, numbers, words with an interior apostrophe, serial numbers, IP addresses, and Chinese and Japanese characters. StandardAnalyzer also includes stop-word removal, using the same mechanism as the StopAnalyzer (identical default English set, and an optional Set constructor to override). StandardAnalyzer makes a great first choice. 

Using StandardAnalyzer is no different than using any of the other analyzers, as you can see from its use in section 4.1.1 and AnalyzerDemo (listing 4.1). Its unique effect, though, is apparent in the different treatment of text. For example, compare the different analyzers on the phrase “XY&Z Corporation - xyz@example.com” from section 4.1. StandardAnalyzer is the only one that kept XY&Z together as well as the email address xyz@example.com; both of these showcase the vastly more sophisticated analysis process.
 

Which core analyzer should you use? 
We’ve now seen the substantial differences in how each of the four core Lucene analyzers works. How do you choose the right one for your application? The answer may surprise you: most applications don’t use any of the built-in analyzers, and instead opt to create their own analyzer chain. For those applications that do use a core analyzer, StandardAnalyzer is likely the most common choice. The remaining core analyzers are usually far too simplistic for most applications, except perhaps for specific use cases (for example, a field that contains a list of part numbers might use WhitespaceAnalyzer). But these analyzers are great for test cases, and are indeed used heavily by Lucene’s unit tests. 

With that in mind, and now that you’re equipped with a strong foundational knowledge of Lucene’s analysis process, you are ready to build your own analyzers. Typically an application has specific needs, such as customizing the stop-words list, performing special tokenization for application-specific tokens like part numbers or for synonym expansion, preserving case for certain tokens, or choosing a specific stemming algorithm. In fact, Solr makes it trivial to create your own analysis chain by expressing the chain directly as XML in solrconfig.xml.

Monday, December 24, 2012

[ Python Article Collection ] Working around IronPython's missing ternary operator

Source: here
Preface: 
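Today a colleague was writing lambdas in IronPython (our program relies on IronPython's lambda support) and ran into a problem. Suppose we have the function c = a / b, where b may be 0, and in that case we want c = 0. Because this is a lambda expression, it has to be written as a single expression. Looking around, we found that IronPython does not support the ternary operator, and then came across a workaround written up by Snowdream: Python學習筆記(3) (Python Study Notes, part 3).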

Solution: 
The modified expression is: b != 0 and a / b or 0. Note the not-equal test here: when b = 0, a / b is never evaluated, which is exactly what we want.

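Searching the web again, we found that there is a cleaner way to write it:
    Result = A / B if B != 0 else 0
That is, if B is not 0 the expression is evaluated; otherwise 0 is returned.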
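
As a small sketch of the two approaches side by side (the function names are made up for illustration, and the code is written for standard Python 3 rather than tested on IronPython), the following also shows a well-known pitfall of the and/or idiom: when the middle value is itself falsy, the idiom falls through to the last operand.

```python
def safe_div_and_or(a, b):
    # Short-circuit idiom: a / b is only evaluated when b != 0.
    return b != 0 and a / b or 0


# Conditional expression, usable inside a lambda because it is one expression.
safe_div = lambda a, b: a / b if b != 0 else 0


def pick_and_or(cond, if_true, if_false):
    # Pitfall: a falsy if_true (0, '', [], ...) makes this return if_false.
    return cond and if_true or if_false


def pick_conditional(cond, if_true, if_false):
    return if_true if cond else if_false


print(safe_div_and_or(6, 3), safe_div(6, 3))  # same quotient from both
print(safe_div_and_or(6, 0), safe_div(6, 0))  # both fall back to 0
print(pick_and_or(True, 0, -1))               # -1, not the intended 0
print(pick_conditional(True, 0, -1))          # 0
```

For the division case the pitfall is harmless, because a falsy quotient (0) and the fallback value (0) happen to be the same; with any other fallback, the conditional expression is the safer choice.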

Friday, December 21, 2012

[ Python FAQ ] How to make List from Numpy Matrix in Python


Source: here
Question:
I am using the dot() function from numpy to multiply a 3x3 matrix with a 1x3 numpy.array. The output is, for example, this:
[[ 0.16666667 0.66666667 0.16666667]]

which is of type:
<class 'numpy.matrixlib.defmatrix.matrix'>
How can I convert this to a list? I know the result will always be a 1x3 matrix, so it should be converted to a list, because I need to be able to loop through it later to calculate the Pearson distance between two of those lists.

So to summarize: how can I make a list from this matrix?

Answer:
May not be the optimal way to do this but the following works:
    import numpy
    a = numpy.matrix([[0.16666667, 0.66666667, 0.16666667]])
    list(numpy.array(a).reshape(-1,))
or
    numpy.array(a).reshape(-1,).tolist()
or
    numpy.array(a)[0].tolist()
or
    numpy.array(a).flatten().tolist()
This message was edited 1 time. Last update was at 22/12/2012 10:44:48

Wednesday, December 19, 2012

[ Python Article Collection ] Python built-in functions map/reduce/filter

Source: here
Preface: 
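Python has a few interesting built-in functions: map, filter, and reduce. All three process a collection: filter is the easiest to understand and is used for filtering, map is used for mapping, and reduce is used for folding values together. They are the three workhorses of Python list processing.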

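filter(bool_func, iterable) function:
filter acts as a filter. It calls the boolean function bool_func on each element of iterable and returns a sequence of the elements for which bool_func returns true:
>>> a = [1, 2, 3, 4, 5, 6, 7]
>>> b = filter(lambda x: x > 5, a)
>>> b
[6, 7] # the elements greater than 5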

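If bool_func is None, the identity function is used, so every element of the list that evaluates to False is removed, as shown here:
>>> a = [0, 1, 2, False, [], 4, '']
>>> b = filter(None, a)
>>> b
[1, 2, 4] # the empty list, the empty string, 0, and False all count as False under the identity function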
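
The sessions above are from Python 2, where filter returns a list. As a minimal sketch for Python 3, where filter returns a lazy iterator instead, the same two calls can be wrapped in list() to get the same results (the variable names are only illustrative):

```python
a = [1, 2, 3, 4, 5, 6, 7]

# In Python 3, filter() is lazy; list() materializes the result.
greater_than_five = list(filter(lambda x: x > 5, a))
print(greater_than_five)

mixed = [0, 1, 2, False, [], 4, '']

# None as the function keeps only the elements that are truthy.
truthy = list(filter(None, mixed))
print(truthy)

# A list comprehension expresses the same filtering without a lambda.
greater_than_five_lc = [x for x in a if x > 5]
print(greater_than_five_lc)
```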

map(func, iterable, ...) function:
map applies func to every element of the given sequence and returns the mapped elements as a list:
>>> a = [1, 2, 3, 4, 5]
>>> b = map(lambda x: x + 3, a)
>>> b
[4, 5, 6, 7, 8] # every element of a with 3 added
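
The trailing "..." in the signature means map can take more than one iterable, in which case func receives one element from each per call. A minimal Python 3 sketch follows (variable names are illustrative; in Python 3 map returns a lazy iterator, hence the list() calls):

```python
a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]

# One iterable: func is called with one argument per element.
plus_three = list(map(lambda x: x + 3, a))
print(plus_three)

# Two iterables: func is called with one element from each, pairwise.
pair_sums = list(map(lambda x, y: x + y, a, b))
print(pair_sums)
```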

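reduce(func, iterable[, initializer]) function:
reduce takes a binary function func (it receives two arguments: the result of the previous step and the next element of the sequence) and applies it across the elements of iterable, each time combining the running result with the next value, until the sequence has been reduced to a single return value:
>>> a = [1, 2, 3, 4, 5]
>>> reduce(lambda x, y: x + y, a)
15 # computes ((((1+2)+3)+4)+5)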

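When the initializer argument is given (and is not None), it is treated as the first accumulated value, as if it were placed in front of the first element of iterable. When iterable is empty, initializer is returned (likewise, when no initializer is given and iterable contains only one element, that element is returned).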
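
The initializer rules above can be checked with a short sketch. This version is written for Python 3, where reduce is no longer a built-in and has to be imported from functools; the variable names are only illustrative.

```python
from functools import reduce  # Python 3: reduce lives in functools

a = [1, 2, 3, 4, 5]
add = lambda x, y: x + y

total = reduce(add, a)                 # ((((1+2)+3)+4)+5)
total_with_init = reduce(add, a, 100)  # (((((100+1)+2)+3)+4)+5)
empty_with_init = reduce(add, [], 0)   # empty iterable: initializer is returned
single = reduce(add, [42])             # one element, no initializer: that element

print(total, total_with_init, empty_with_init, single)

# With an empty iterable and no initializer there is nothing to return.
try:
    reduce(add, [])
    raised = False
except TypeError:
    raised = True
print(raised)
```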
