2013年4月30日 星期二

[ Alg info ] Kruskal’s algorithm (MST)

Preface: 
Kruskal's algorithm finds a minimum spanning tree (MST) of a weighted graph; it was published by Joseph Kruskal in 1956. Prim's algorithm and Borůvka's algorithm solve the same problem, and all three are applications of the greedy strategy. Unlike Borůvka's algorithm, Kruskal's algorithm remains valid when the graph contains edges of equal weight.

Description: 
Most MST algorithms follow the generic pseudo code below:
  GENERIC-MST(G,w)
      A = {}
      while A does not form a spanning tree
          find an edge (u,v) that is safe for A
          A = A ∪ {(u,v)}
      return A
A few notes on the pseudo code above:
* A is the set that holds the MST edges.
* "find an edge (u,v) that is safe for A" means an edge that can be added without creating a loop in the forest held in A (A stays acyclic).
* "A = A ∪ {(u,v)}" adds edge (u,v) to A.

In other words, this pseudo code guarantees:
* A is always a subset of the edge set of some final MST.
* A stays acyclic at all times (no cycle).
* Each step picks an edge that connects two separate trees in the graph formed by A (so no cycle can appear).
* In total, |V|-1 edges are selected.

Kruskal’s algorithm: 
Before reading the algorithm's pseudo code, we first need to know the routines it uses:
* MAKE-SET(v): create a new set containing v.
* FIND-SET(v): return the set containing v.
* UNION(u,v): merge the sets containing u and v.
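These three routines are usually backed by a disjoint-set (union-find) data structure. Below is a minimal Java sketch (the class and method names are my own, not from the text); find uses path compression, and union by rank is omitted for brevity:

```java
// Minimal disjoint-set (union-find) sketch; names are illustrative.
public class DisjointSet {
    private final int[] parent;

    // MAKE-SET for all vertices 0..n-1 at once.
    public DisjointSet(int n) {
        parent = new int[n];
        for (int v = 0; v < n; v++) parent[v] = v;  // each vertex starts in its own set
    }

    // FIND-SET(v): return the representative of v's set (with path compression).
    public int find(int v) {
        if (parent[v] != v) parent[v] = find(parent[v]);
        return parent[v];
    }

    // UNION(u,v): merge the sets containing u and v.
    public void union(int u, int v) {
        parent[find(u)] = find(v);
    }
}
```

Two vertices belong to the same set exactly when their representatives are equal, which is the test Kruskal's algorithm performs on every edge.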

Here is the pseudo code of the algorithm:
  MST-KRUSKAL(G,w)
      A = {}
      for each vertex v ∈ G.V
          MAKE-SET(v)
      sort G.E by w
      for each edge (u,v) ∈ G.E
          if FIND-SET(u) ≠ FIND-SET(v)
              A = A ∪ {(u,v)}
              UNION(u,v)
      return A
A brief walk-through:
1. Initially, every vertex belongs to its own set.
2. Sort the edges of the graph by weight in ascending order.
3. Loop over the sorted edges:
3.1 If the endpoints u and v of edge (u,v) belong to different sets, add the edge to A.
3.2 Union the sets containing u and v.
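The steps above can be sketched directly in Java. This is an illustrative implementation, not code from the original post (the Edge helper class and method names are assumptions): sort the edges by weight, then scan them with a simple union-find, keeping an edge only when its endpoints lie in different sets.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative Kruskal sketch; Edge and the inline union-find are hypothetical helpers.
public class Kruskal {
    static class Edge {
        final int u, v, w;
        Edge(int u, int v, int w) { this.u = u; this.v = v; this.w = w; }
    }

    // Returns the MST edge set A; vertices are numbered 0..n-1.
    public static List<Edge> mst(int n, List<Edge> edges) {
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;        // MAKE-SET for every vertex

        List<Edge> sorted = new ArrayList<>(edges);
        sorted.sort((a, b) -> Integer.compare(a.w, b.w)); // sort G.E by weight

        List<Edge> a = new ArrayList<>();                 // A = {}
        for (Edge e : sorted) {
            if (find(parent, e.u) != find(parent, e.v)) { // FIND-SET(u) != FIND-SET(v)
                a.add(e);                                 // A = A ∪ {(u,v)}
                parent[find(parent, e.u)] = find(parent, e.v); // UNION(u,v)
            }
        }
        return a;
    }

    private static int find(int[] parent, int v) {
        while (parent[v] != v) v = parent[v];
        return v;
    }
}
```

With sorting dominating, the running time is O(E log E), which equals O(E log V) since E < V².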

Example: 
Now let's walk through an example. Consider the graph below:
 
1. After sorting, the first edge taken is edge(0,5). Vertices 0 and 5 belong to different sets, so this edge is added to A.
 

2. The second edge taken is edge(2,3). Vertices 2 and 3 belong to different sets, so this edge is added to A.
 

3. The third edge taken is edge(1,6). Vertices 1 and 6 belong to different sets, so this edge is added to A.
 

4. The fourth edge taken is edge(1,2). Vertices 1 and 2 belong to different sets, so this edge is added to A.
 

5. The fifth edge taken is edge(3,6). Vertices 3 and 6 already belong to the same set {(1,2), (1,6), (2,3)}, so this edge is discarded!
 

6. The sixth edge taken is edge(3,4). Vertices 3 and 4 belong to different sets, so this edge is added to A.
 

7. The seventh edge taken is edge(4,6). Vertices 4 and 6 already belong to the same set {(1,2), (1,6), (2,3), (3,4)}, so this edge is discarded!
 

8. The eighth edge taken is edge(4,5). Vertices 4 and 5 belong to different sets, so this edge is added to A. A now holds 6 edges = |V|-1, so the MST is complete.
 

Supplement: 
[ Alg info ] Borůvka's algorithm (MST) 
[ Data Structures with Java ] Section 25.6 : Minimum Spanning Tree 
[ 資料結構 小學堂 ] 圖形結構 : 擴張樹 (Prim) 
[ 資料結構 小學堂 ] 圖形結構 : 擴張樹 - 求最小成本擴張樹 (Kruskal 演算法)

2013年4月29日 星期一

[ InAction Note ] Ch4. Lucene’s analysis process - Stemming analysis

Preface: 
Our final analyzer pulls out all the stops. It has a ridiculous, yet descriptive name: PositionalPorterStopAnalyzer. This analyzer removes stop words, leaving positional holes where words are removed, and leverages a stemming filter. 

The PorterStemFilter is shown in the class hierarchy in figure 4.5, but it isn’t used by any built-in analyzer. It stems words using the Porter stemming algorithm created by Dr. Martin Porter, and it’s best defined in his own words: 
The Porter stemming algorithm (or “Porter stemmer”) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.

In other words, the various forms of a word are reduced to a common root form. For example, the words breathe, breathes, breathing, and breathed, via the Porter stemmer, reduce to breath.

The Porter stemmer is one of many stemming algorithms. See section 8.2.1 for coverage of an extension to Lucene that implements the Snowball algorithm (also created by Dr. Porter). KStem is another stemming algorithm that has been adapted to Lucene (search Google for KStem and Lucene). 

Next we’ll show how to use StopFilter to remove words but leave a positional hole behind, and then we’ll describe the full analyzer. 

StopFilter leaves holes: 
Stop-word removal brings up an interesting issue: what happens to the holes left by the removed words? Suppose you index “one is not enough.” The tokens emitted from StopAnalyzer will be one and enough, with is and not thrown away. By default, StopAnalyzer accounts for the removed words by incrementing the position increment.

This is illustrated by the output of AnalyzerUtils.displayTokensWithPositions:
  AnalyzerUtils.displayTokensWithPositions(new StopAnalyzer(Version.LUCENE_30),
          "The quick brown fox jumps over the lazy dog");
Output: 
2: [quick]
3: [brown]
4: [fox]
5: [jumps]
6: [over]
8: [lazy]
9: [dog]

Positions 1 and 7 are missing due to the removal of the. If you need to disable the holes so that the position increment is always 1, use StopFilter’s setEnablePositionIncrements method. But be careful when doing so: your index won’t record the deleted words, so there can be surprising effects. For example, the phrase "one enough" will match the indexed phrase "one is not enough" if you don’t preserve the holes.

Stepping back a bit, the primary reason to remove stop words is that these words typically carry no special meaning; they are the “glue” words required in any language. The problem is that, because we’ve discarded them, we’ve lost some information, which may or may not be a problem for your application. For example, nonexact searches such as "a quick brown fox" can still match the document.

There’s an interesting alternative, called shingles, which are compound tokens created from multiple adjacent tokens. Lucene has a TokenFilter called ShingleFilter in the contrib analyzers module that creates shingles during analysis. We’ll describe it in more detail in section 8.2.3. With shingles, stop words are combined with adjacent words to make new tokens, such as the-quick. At search time, the same expansion is used. This enables precise phrase matching, because the stop words aren’t discarded. Using shingles yields good search performance because the number of documents containing the-quick is far fewer than the number containing the stop word the in any context.
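To make the idea concrete, here is a tiny hand-rolled sketch of bigram shingling in plain Java. It is not Lucene's ShingleFilter (which also handles token positions, separators, and many configuration options); the class name and the "-" separator merely mirror the the-quick example above.

```java
import java.util.ArrayList;
import java.util.List;

// Not Lucene's ShingleFilter: a minimal illustration of bigram shingles.
public class ShingleSketch {
    // Join each pair of adjacent tokens into one compound token.
    public static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + "-" + tokens.get(i + 1));
        }
        return out;
    }
}
```

For the tokens [the, quick, brown], this produces [the-quick, quick-brown]: the stop word the survives inside a compound token instead of being discarded.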

Combining stemming and stop-word removal: 
This custom analyzer uses a stop-word removal filter, enabled to maintain positional gaps and fed from a LowerCaseTokenizer. The results of the stop filter are fed to the Porter stemmer. Listing 4.12 shows the full implementation of this sophisticated analyzer. LowerCaseTokenizer kicks off the analysis process, feeding tokens through the stop-word removal filter and finally stemming the words using the built-in Porter stemmer. 
- Listing 4.12 PositionalPorterStopAnalyzer: stemming and stop word removal 
  package ch4;

  import java.io.Reader;
  import java.util.Set;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.LowerCaseTokenizer;
  import org.apache.lucene.analysis.PorterStemFilter;
  import org.apache.lucene.analysis.StopAnalyzer;
  import org.apache.lucene.analysis.StopFilter;
  import org.apache.lucene.analysis.TokenStream;

  public class PositionalPorterStopAnalyzer extends Analyzer {
      private Set stopWords;

      public PositionalPorterStopAnalyzer() {
          this(StopAnalyzer.ENGLISH_STOP_WORDS_SET);
      }

      public PositionalPorterStopAnalyzer(Set stopWords) {
          this.stopWords = stopWords;
      }

      public TokenStream tokenStream(String fieldName, Reader reader) {
          // Remove stop words but keep the positional holes they leave behind
          StopFilter stopFilter = new StopFilter(true, new LowerCaseTokenizer(reader), stopWords);
          stopFilter.setEnablePositionIncrements(true);
          return new PorterStemFilter(stopFilter);
      }
  }
Then you can test with the code below:
  AnalyzerUtils.displayTokensWithPositions(new PositionalPorterStopAnalyzer(), "the quick fox jumps");
The output will be: 
2: [quick]
3: [fox]
4: [jump]

The "jumps" has been stemmed to be "jump"! For the implementation of displayTokensWithPositions, please refer here on topic Visualizing token positions.

[ Java Sample Code ] How To Create XML File In Java – (DOM Parser)

Source from here
Preface: 
DOM provides many handy classes to create an XML file easily. First, create a Document with the DocumentBuilder class and define all the XML content (nodes and attributes) with the Element class. Finally, use the Transformer class to write the entire XML content to a stream output, typically a File.

In this tutorial, we show you how to use the DOM XML parser to create an XML file.

DOM Parser Example: 
At the end of the example, the following XML file named “file.xml” will be created:
<company>
  <Staff id="1">
    <firstname>yong</firstname>
    <lastname>mook kim</lastname>
    <nickname>mkyong</nickname>
    <salary>100000</salary>
  </Staff>
</company>

The sample code is as follows:
- WriteXMLFile.java – Java class to create a XML file. 
  package com.mkyong.core;

  import java.io.File;
  import javax.xml.parsers.DocumentBuilder;
  import javax.xml.parsers.DocumentBuilderFactory;
  import javax.xml.parsers.ParserConfigurationException;
  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerException;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.dom.DOMSource;
  import javax.xml.transform.stream.StreamResult;

  import org.w3c.dom.Attr;
  import org.w3c.dom.Document;
  import org.w3c.dom.Element;

  public class WriteXMLFile {

      public static void main(String argv[]) {

          try {

              DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
              DocumentBuilder docBuilder = docFactory.newDocumentBuilder();

              // root elements
              Document doc = docBuilder.newDocument();
              Element rootElement = doc.createElement("company");
              doc.appendChild(rootElement);

              // staff elements
              Element staff = doc.createElement("Staff");
              rootElement.appendChild(staff);

              // set attribute to staff element
              Attr attr = doc.createAttribute("id");
              attr.setValue("1");
              staff.setAttributeNode(attr);

              // shorter way:
              // staff.setAttribute("id", "1");

              // firstname elements
              Element firstname = doc.createElement("firstname");
              firstname.appendChild(doc.createTextNode("yong"));
              staff.appendChild(firstname);

              // lastname elements
              Element lastname = doc.createElement("lastname");
              lastname.appendChild(doc.createTextNode("mook kim"));
              staff.appendChild(lastname);

              // nickname elements
              Element nickname = doc.createElement("nickname");
              nickname.appendChild(doc.createTextNode("mkyong"));
              staff.appendChild(nickname);

              // salary elements
              Element salary = doc.createElement("salary");
              salary.appendChild(doc.createTextNode("100000"));
              staff.appendChild(salary);

              // write the content into an XML file
              TransformerFactory transformerFactory = TransformerFactory.newInstance();
              Transformer transformer = transformerFactory.newTransformer();
              DOMSource source = new DOMSource(doc);
              StreamResult result = new StreamResult(new File("C:\\file.xml"));

              // Output to console for testing
              // StreamResult result = new StreamResult(System.out);

              transformer.transform(source, result);

              System.out.println("File saved!");

          } catch (ParserConfigurationException pce) {
              pce.printStackTrace();
          } catch (TransformerException tfe) {
              tfe.printStackTrace();
          }
      }
  }
A new XML file is created at “C:\\file.xml”, encoded in UTF-8 by default. For debugging, you can change the StreamResult to output the XML content to your console.
  StreamResult result = new StreamResult(System.out);
  transformer.transform(source, result);
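If you also want the generated XML to be human-readable, the Transformer supports the standard OutputKeys.INDENT output property; the indent-amount attribute shown below is Xalan-specific (honored by the default JDK transformer). A minimal self-contained sketch, with hypothetical class and method names, writing to a StringWriter:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class PrettyPrintXml {
    // Serialize a DOM document with indentation enabled.
    public static String pretty(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");                         // standard output property
        t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2"); // Xalan-specific indent width
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny document similar in shape to the tutorial's example
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("company");
        doc.appendChild(root);
        root.appendChild(doc.createElement("Staff"));
        System.out.println(pretty(doc));
    }
}
```

Swap the StringWriter for a File-backed StreamResult to get an indented file on disk instead.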
Supplement: 
A comparison and explanation of four ways to process XML in Java (DOM, SAX, JDOM, DOM4J)

Source From  Here   Pivot Tables   We have seen how the  GroupBy  abstraction lets us explore relationships within a dataset. A pivot ta...