程式扎記

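Adapted from here
Preface :
Stanford parser is a parser that performs both phrase-structure and dependency analysis. There is plenty of material about it online, the Stanford NLP website documents it at length, and the README shipped with the code is also detailed. This post briefly records part of my learning process: the problems I ran into when training on a "Traditional Chinese" corpus, the parameters I ended up using, and a test run.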

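Training on a Chinese corpus :
The Stanford parser source can be used as downloaded, without any modification. The default training corpus is the English WSJ corpus, so training on Chinese has to be requested through the parameters:
- Training : the command for training on Chinese is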

> java -mx4000m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser
-PCFG # train only the Probabilistic Context Free Grammar
-vMarkov 1 # order of vertical Markovization (1 means no parent annotation)
-uwm 0 # unknown word model 0: use no language-specific heuristics for unknown word processing
-tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams # specify the TreebankLangParserParams, for when using a different language or treebank
-saveToSerializedFile train3.ser.gz # write the serialized model to train3.ser.gz
-maxLength 100 # Specify the longest sentence that will be parsed
-escaper edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper # Specify a class to do customized escaping of tokenized text
-train train_test.txt # training corpus file(s)
-segmentMarkov # Makes it build in a segmenter; this option can be omitted
-encoding UTF-8 # use UTF-8 encoding
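
Because the listing above annotates every flag with a `#` comment on its own line, it cannot be pasted into a shell as-is. One way around that is to assemble the same flags in Java and hand them to `ProcessBuilder`. The following is only a sketch: the class name `TrainCommand` and the method `build` are hypothetical names introduced here, and the jar, corpus and model file names are simply the ones used in the listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the Stanford parser): builds the training
// command from the listing above as an argument list.
final class TrainCommand {

    static List<String> build(String corpus, String modelOut) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-mx4000m", "-cp", "stanford-parser.jar",
                "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
                "-PCFG",
                "-vMarkov", "1",
                "-uwm", "0",
                "-tLPP", "edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams",
                "-saveToSerializedFile", modelOut,
                "-maxLength", "100",
                "-escaper", "edu.stanford.nlp.trees.international.pennchinese.ChineseEscaper",
                "-train", corpus,
                "-segmentMarkov",
                "-encoding", "UTF-8"));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("train_test.txt", "train3.ser.gz");
        // Print the command as a single line that can be copied into a shell.
        System.out.println(String.join(" ", cmd));
        // To launch it directly instead (assumes stanford-parser.jar is in the
        // working directory), something like:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping the flags in a list also makes it easy to swap the corpus or the output model name between experiments.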

The edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams setting is mandatory; without it, training on Chinese does not work. I overlooked this when I first started and kept getting :

Extracting PCFG...Exception in thread "main" java.lang.RuntimeException:
> TreeAnnotator: null head found for tree [suggesting incomplete/wrong
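
Since the exception above says nothing about the missing option, a small pre-flight check on the argument list can fail with a clearer message before training starts. This is a hypothetical guard written for this post, not something the parser provides; the class name `TlppCheck` and the method `hasChineseParams` are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-flight check, not part of the Stanford parser API.
final class TlppCheck {

    static final String CHINESE_PARAMS =
            "edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams";

    // True only if "-tLPP" is present and immediately followed by the
    // Chinese TreebankLangParserParams class name.
    static boolean hasChineseParams(List<String> args) {
        int i = args.indexOf("-tLPP");
        return i >= 0 && i + 1 < args.size() && CHINESE_PARAMS.equals(args.get(i + 1));
    }

    public static void main(String[] args) {
        List<String> bad = Arrays.asList("-PCFG", "-train", "train_test.txt");
        List<String> good = Arrays.asList(
                "-PCFG", "-tLPP", CHINESE_PARAMS, "-train", "train_test.txt");
        System.out.println(hasChineseParams(bad));   // prints false
        System.out.println(hasChineseParams(good));  // prints true
    }
}
```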

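When training you can choose between PCFG and Factored, and many other parameters are available; see the README for details. The training command above produces a .gz file (the serialized model), which can then be tested.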

- Testing :
You can test from the command line as follows:

> java -server -mx1800m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 200 -loadFromSerializedFile chinesePCFG.ser.gz -test ./corpus/ctb5/test.pid > ./test.result

Or write your own code that loads the trained model and parses a sentence :

package stanford.test;

import java.util.List;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.trees.Tree;

public class Test {

    /**
     * Loads the trained model and parses one pre-segmented sentence.
     *
     * @param args unused
     */
    public static void main(String[] args) {
        // The sentence must already be word-segmented; tokens are separated by spaces.
        String sentence = "我 到 她 家 等候";
        String[] sents = sentence.split(" ");
        // Load the serialized model produced by the training command.
        LexicalizedParser lp = LexicalizedParser.loadModel("train3.ser.gz");
        List<CoreLabel> rawWords = Sentence.toCoreLabelList(sents);
        Tree parse = lp.apply(rawWords);
        System.out.printf("\t[Info] Parsing result:\n%s\n", parse.toString());
    }
}

Output :

Loading parser from serialized file train3.ser.gz ... done [0.1 sec].
[Info] Parsing result:
(ROOT (S (NP (Nh 我)) (PP (P 到) (NP (Nh 她) (Nc 家)) (VK 等候))))
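
The bracketed result above can also be post-processed as plain text. As a minimal sketch that assumes only the printed string (it does not use the parser's own tree classes), the following pulls out each word together with its part-of-speech tag; the class name `TreeLeaves` and the method `taggedWords` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts word/tag pairs from a bracketed parse string.
final class TreeLeaves {

    // A preterminal has the form "(TAG word)" with no nested parentheses.
    private static final Pattern PRETERMINAL =
            Pattern.compile("\\(([^\\s()]+)\\s+([^\\s()]+)\\)");

    static List<String> taggedWords(String bracketed) {
        List<String> out = new ArrayList<>();
        Matcher m = PRETERMINAL.matcher(bracketed);
        while (m.find()) {
            // group(1) is the tag, group(2) is the word
            out.add(m.group(2) + "/" + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String tree = "(ROOT (S (NP (Nh 我)) (PP (P 到) (NP (Nh 她) (Nc 家)) (VK 等候))))";
        // prints [我/Nh, 到/P, 她/Nh, 家/Nc, 等候/VK]
        System.out.println(taggedWords(tree));
    }
}
```

The regular expression only matches innermost bracket pairs, so phrase-level nodes such as NP or PP are skipped.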

This message was edited 16 times. Last update was at 14/08/2012 17:19:20

Tuesday, August 14, 2012

[ Stanford parser ] Using the Training parameters for a Chinese corpus
