程式扎記: [ JLRToolkit ] Logistic Regression Toolkit

Tuesday, November 27, 2012

[ JLRToolkit ] Logistic Regression Toolkit - Usage tutorial

Preface:
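Recently I needed Logistic Regression for my PGM course, but we were not allowed to use existing tools orz, so I had to write one myself. The post [ ML In Action ] Logistic Regression already gives a brief explanation of how Logistic Regression works, along with Python sample code, so I won't go over the theory again here. Instead, this post explains how to use "JLRToolkit", a Logistic Regression toolkit I wrote in Java. (The complete project source can be downloaded here.)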

Data format:
This section describes the training data formats that the toolkit accepts. The simplest case is when every feature has a value; then you can use the following format:

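Here the symbol "\s" stands for a space. You can also use '\t' (Tab), but then you have to configure the tool to tell it which separator you are using. In the format above, the feature values appear in order on a single line, separated by spaces, and the last item is the label (class) value, which must be an integer. If there are only two classes, the value is 0 or 1; if there are more than two, the class values count up from 1: 1, 2, 3, and so on.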
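To make the dense format concrete, here is a minimal sketch of how one such line could be read. DenseLineParser and Record are hypothetical names of my own for illustration; this is not the parser inside JLRToolkit, only an assumption about what "feature values in order, integer label last" means in code.

```java
import java.util.Arrays;

/** Hypothetical helper (not part of JLRToolkit): parses one dense training line. */
final class DenseLineParser {
    /** Holds the feature vector and the integer label of one record. */
    static final class Record {
        final double[] features;
        final int label;

        Record(double[] features, int label) {
            this.features = features;
            this.label = label;
        }
    }

    /**
     * Splits the line on the given separator regex. Every item except the last
     * is a feature value; the last item is the integer label.
     */
    static Record parse(String line, String separatorRegex) {
        String[] items = line.trim().split(separatorRegex);
        if (items.length < 2) {
            throw new IllegalArgumentException("Need at least one feature and a label: " + line);
        }
        double[] features = new double[items.length - 1];
        for (int i = 0; i < features.length; i++) {
            features[i] = Double.parseDouble(items[i]);
        }
        int label = Integer.parseInt(items[items.length - 1]);
        return new Record(features, label);
    }

    public static void main(String[] args) {
        Record r = parse("-0.017612 14.053064 0", " ");
        System.out.println(Arrays.toString(r.features) + " -> label " + r.label);
    }
}
```

Passing "\t" instead of " " as the separator handles a Tab-delimited file in the same way.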

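Sometimes, however, your feature values are sparse, meaning that many features have no value (by default the toolkit uses 0 for a missing value). In that case you can supply the training data in the following format: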

That is, you use a feature id to tell the toolkit which feature the current value belongs to, so you don't have to enter features that have no value or whose value is zero. As before, the last item is the label/class value.

Usage Code Example:
To use the toolkit from your own code, you can follow the example here. First, suppose we have the following training data:
- testSet.txt

-0.017612 14.053064 0
-1.395634 4.662541 1
-0.752157 6.538620 0
-1.322371 7.152853 0
0.423363 11.054677 0
0.406704 7.067335 1
0.667394 12.741452 0
-2.460150 6.866805 1
...

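Take the first column as X (Feature1), the second column as Y (Feature2), and the third column as the label/class value (0 or 1), and plot the points on the plane as follows:

(Red points are the label=0 set; blue points are the label=1 set.)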

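We want Logistic Regression to find a linear equation w0 + w1X + w2Y = t that separates the points with label=1 from the points with label=0. In this equation X is the value in the first column of the data and Y is the value in the second column, so what we need to find is w0, w1, and w2. Notice that w0 is multiplied by the constant 1, so we add a column whose value is always 1 to the original training data; that column is what w0 is trained on. The data is rewritten as follows: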
- testSet2.txt

1 -0.017612 14.053064 0
1 -1.395634 4.662541 1
1 -0.752157 6.538620 0
1 -1.322371 7.152853 0
1 0.423363 11.054677 0
1 0.406704 7.067335 1
...
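The rewrite from testSet.txt to testSet2.txt can be sketched as follows. BiasColumn is a hypothetical helper of my own, not a class in JLRToolkit; it simply assumes that each record is one line of text and that the constant column goes first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of JLRToolkit): adds the constant column used to train w0. */
final class BiasColumn {
    /** Prepends "1" plus the separator to every non-blank line; blank lines are skipped. */
    static List<String> prepend(List<String> lines, String separator) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            out.add("1" + separator + trimmed);
        }
        return out;
    }

    public static void main(String[] args) {
        // The first two records of testSet.txt, held in memory for the demo.
        List<String> original = Arrays.asList(
                "-0.017612 14.053064 0",
                "-1.395634 4.662541 1");
        for (String line : prepend(original, " ")) {
            System.out.println(line);
        }
    }
}
```

For a real file, the lines could be read with java.nio.file.Files.readAllLines and written back with Files.write.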

Next, we can use the john.logisticreg.Train class from the package to run the Logistic Regression training. Sample code:

import java.io.File;
import john.logisticreg.Train;

Utils.SEP_CHAR="\t"; // set the separator char to Tab
Train train = new Train(0.001, 150); // ALPHA=0.001; loop iterations=150
train.start(new File("testSet2.txt")); // train on the file testSet2.txt
train.saveModel(new File("Test.model")); // save the trained weights matrix to the file Test.model

The execution log is as follows:

[Info] Total 100 records; Feature size=3; Label size=2...
[Info] Default label=1...
[Info] Label=0:[-1.65, -0.12, 0.35]...
[Info] Label=1:[3.53, 0.81, -0.53]...
[Info] Training done! (0 sec)
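The internals of Train are not shown in this post, so the following is only a rough sketch of how a learning rate (ALPHA) and an iteration count are commonly used in batch gradient ascent, a standard way to fit Logistic Regression. GradientAscentSketch and its methods are hypothetical names; starting the weights at zero and the update rule itself are my assumptions, not a description of what JLRToolkit actually does.

```java
/** Hypothetical sketch (not JLRToolkit's code): batch gradient ascent for a binary classifier. */
final class GradientAscentSketch {
    static double sigmoid(double t) {
        return 1.0 / (1.0 + Math.exp(-t));
    }

    /**
     * x: one row per record, including the constant-1 column; y: labels 0 or 1.
     * alpha: step size; iterations: number of passes over the whole data set.
     */
    static double[] train(double[][] x, int[] y, double alpha, int iterations) {
        int n = x[0].length;
        double[] w = new double[n]; // weights start at zero (an assumption)
        for (int it = 0; it < iterations; it++) {
            double[] gradient = new double[n];
            for (int i = 0; i < x.length; i++) {
                double t = 0.0;
                for (int j = 0; j < n; j++) {
                    t += w[j] * x[i][j];
                }
                double error = y[i] - sigmoid(t); // label minus predicted probability
                for (int j = 0; j < n; j++) {
                    gradient[j] += error * x[i][j];
                }
            }
            for (int j = 0; j < n; j++) {
                w[j] += alpha * gradient[j]; // move the weights uphill on the log-likelihood
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // A tiny made-up data set: constant column plus one feature.
        double[][] x = {{1, -2}, {1, -1}, {1, 1}, {1, 2}};
        int[] y = {0, 0, 1, 1};
        double[] w = train(x, y, 0.001, 150);
        System.out.println("w0=" + w[0] + ", w1=" + w[1]);
    }
}
```

A smaller alpha makes each step more cautious, and more iterations give the weights more time to settle, which is the trade-off behind the two constructor arguments above.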

Now we have a trained model, with the weights w0, w1, and w2 for each label. Taking the weights for label=1:

3.53 + 0.81X - 0.53Y = t

The sigmoid function returns exactly 0.5 at t=0, so the boundary between the two classes lies on the line t=0:

3.53 + 0.81X - 0.53Y = 0
Y = (3.53 + 0.81X) / 0.53
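As a quick check of the algebra, the sketch below plugs the label=1 weights from the training log into the equation. DecisionBoundary is a hypothetical class written for this post, not part of JLRToolkit, and the weights are the two-decimal values printed in the log, so borderline points may come out differently than with the full-precision model.

```java
/** Hypothetical sketch (not part of JLRToolkit): evaluates the label=1 line from the log. */
final class DecisionBoundary {
    // Weights for label=1 copied from the training log (rounded to two decimals there).
    static final double W0 = 3.53;
    static final double W1 = 0.81;
    static final double W2 = -0.53;

    static double sigmoid(double t) {
        return 1.0 / (1.0 + Math.exp(-t));
    }

    /** t = w0 + w1*X + w2*Y; t > 0 means the label=1 side of the line. */
    static double score(double x, double y) {
        return W0 + W1 * x + W2 * y;
    }

    /** Solves w0 + w1*X + w2*Y = 0 for Y, i.e. Y = (3.53 + 0.81X) / 0.53. */
    static double boundaryY(double x) {
        return -(W0 + W1 * x) / W2;
    }

    public static void main(String[] args) {
        // Two records taken from testSet.txt: the first has label 0, the second label 1.
        double[][] points = {{-0.017612, 14.053064}, {0.406704, 7.067335}};
        for (double[] p : points) {
            double probability = sigmoid(score(p[0], p[1]));
            System.out.println("X=" + p[0] + ", Y=" + p[1] + ", P(label=1)=" + probability);
        }
        System.out.println("Boundary at X=0: Y=" + boundaryY(0.0));
    }
}
```

Points above the line (larger Y) get a negative t and therefore a probability below 0.5 for label=1; points below it get a probability above 0.5.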

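If we draw this line on the coordinate plane from before (the green line), we can see that it separates the label=0 set (red points) from the label=1 set (blue points) quite well: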

To run prediction with the model we just trained, use the john.logisticreg.Predict class from the package:

import java.io.File;
import john.logisticreg.Predict;

Predict predict = new Predict(new File("Test.model")); // load the model from Test.model
System.out.printf("\t[Info] Loading model done...\n");
Utils.APPEND_ANSWER=true; // also write the answer to the output result
predict.start(new File("testSet2.txt"), new File("testPredict.txt")); // run prediction on "testSet2.txt" and write the results to "testPredict.txt"

The execution log is as follows:

[Info] Total 2 labels; Default label=1...
[Info] Loading model done...
[Info] Total predict 100 records...

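The "Default label=1" shown in the log comes from the fact that we build two classifiers, one for label=0 and one for label=1. When none of the classifiers can classify a testing record but an answer is still required, the default is to guess label=1 (because it appears most often in the training data. orz).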
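The idea of one classifier per label with a default fallback can be sketched like this. OneVsRestSketch is a hypothetical class, and the decision rule (a classifier claims a record only when its probability exceeds 0.5, and the highest probability wins) is my assumption for illustration; the post does not show how Predict decides internally. The weights are the rounded values from the training log.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch (not JLRToolkit's code): one classifier per label plus a default label. */
final class OneVsRestSketch {
    static double sigmoid(double t) {
        return 1.0 / (1.0 + Math.exp(-t));
    }

    /**
     * Returns the label whose classifier gives the highest probability above 0.5,
     * or defaultLabel when no classifier exceeds 0.5 (assumed rule).
     */
    static int predict(Map<Integer, double[]> weightsByLabel, double[] x, int defaultLabel) {
        int best = defaultLabel;
        double bestProbability = 0.5;
        for (Map.Entry<Integer, double[]> entry : weightsByLabel.entrySet()) {
            double[] w = entry.getValue();
            double t = 0.0;
            for (int j = 0; j < w.length; j++) {
                t += w[j] * x[j];
            }
            double probability = sigmoid(t);
            if (probability > bestProbability) {
                bestProbability = probability;
                best = entry.getKey();
            }
        }
        return best;
    }

    /** Weights per label as printed in the training log (rounded to two decimals). */
    static Map<Integer, double[]> loggedWeights() {
        Map<Integer, double[]> weights = new LinkedHashMap<>();
        weights.put(0, new double[] {-1.65, -0.12, 0.35});
        weights.put(1, new double[] {3.53, 0.81, -0.53});
        return weights;
    }

    public static void main(String[] args) {
        Map<Integer, double[]> weights = loggedWeights();
        // Made-up records in testSet2.txt layout: constant 1, then X, then Y.
        System.out.println(predict(weights, new double[] {1, 0, 14}, 1));
        System.out.println(predict(weights, new double[] {1, -5, 0}, 1));
    }
}
```

In the second record neither classifier reaches 0.5, so the default label is the only answer left, which is the situation the log message is preparing for.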

Console Mode Usage:
The package also provides a console mode. First, let's look at the parameters it offers: