程式扎記: [ ML In Action ] Predicting numeric values : regression

Monday, September 3, 2012

[ ML In Action ] Predicting numeric values : regression - Linear regression (1)

Preface :
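The earlier chapters (kNN, decision trees, etc.) all covered classification, which predicts discrete nominal values. The "regression" introduced here is the part of supervised learning that predicts continuous values. You usually start by choosing an equation. The values of its variables come from the data set, and the coefficients on those variables are collectively called regression weights. Our job is to find the regression weights that minimize the sum of squared differences between the value the equation computes for each data point and the expected value.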

Finding best-fit lines with linear regression :
With that introduction, here are the characteristics of linear regression:
Linear regression

Pros: Easy to interpret result, computationally inexpensive
Cons: Poorly models nonlinear data
Works with: Numeric values, nominal values

The regression discussed here is linear regression. The book's example uses the following regression equation:

HorsePower = 0.0015 * annualSalary - 0.99 * hoursListeningToPublicRadio

Here "annualSalary" and "hoursListeningToPublicRadio" are the feature(s), and "0.0015" and "-0.99" are the regression weights we want to find, chosen so that the computed "HorsePower" is as close as possible to the actual "HorsePower". The general procedure for regression is:

General approach to regression
1. Collect: Any method.
2. Prepare: We'll need numeric values for regression.
3. Analyze: It's helpful to visualize 2D plots. Also, we can visualize the regression weights if we apply shrinkage methods.
4. Train: Find the regression weights.
5. Test: We can measure R², or the correlation between the predicted values and the data, to gauge the success of our models.
6. Use: With regression, we can forecast a numeric value for a number of inputs. This is an improvement over classification because we’re predicting a continuous value rather than a discrete category.
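
Step 5 can be sketched in a few lines. This is only an illustration: the numbers in y_true and y_pred are made up (they are not from ex0.txt), and it assumes NumPy is available.

```python
import numpy as np

# Hypothetical toy data (not from ex0.txt): targets and model predictions.
y_true = np.array([3.1, 3.4, 3.9, 4.2, 4.8])
y_pred = np.array([3.0, 3.5, 3.8, 4.3, 4.7])

# Pearson correlation between the predictions and the targets.
corr = np.corrcoef(y_pred, y_true)[0, 1]

# Coefficient of determination: R^2 = 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(corr, r2)
```

Both numbers get closer to 1.0 as the predictions track the data more closely.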

Now for some math. Suppose the training data is in a matrix X and the regression weights we are looking for are in a vector w. The predicted value for a single data point x_1 is then:

$$y_1 = x_1^T w$$

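Let y_i be the correct value for data point x_i. The squared error between the correct values and the values given by the equation is computed as shown here. The goal is to find the regression weights that make this expression as small as possible (ideally 0):

$$\sum_{i=1}^{m} (y_i - x_i^T w)^2$$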
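
To make the squared error concrete, here is a minimal pure-Python sketch. The function name squared_error and the toy data are my own, chosen so that one weight vector fits exactly and another does not.

```python
def squared_error(X, y, w):
    """Sum over all rows of (y_i - x_i . w)^2."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        # Prediction for one row: dot product of features and weights.
        prediction = sum(x_ij * w_j for x_ij, w_j in zip(x_i, w))
        total += (y_i - prediction) ** 2
    return total

# Hypothetical toy data: the first column is a constant 1.0 (intercept).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]                      # exactly y = 1 + 2x

good = squared_error(X, y, [1.0, 2.0])   # weights that fit exactly
bad = squared_error(X, y, [0.0, 2.0])    # intercept off by 1

print(good, bad)
```

The better the weights, the smaller the returned value, which is the quantity regression tries to minimize.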

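With the appropriate derivation (take the derivative of the expression above with respect to w and set it to zero), we get the formula for the regression weights vector w:

$$\hat{w} = (X^T X)^{-1} X^T y$$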

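The regression weights vector w wears a hat "^" because it is the best solution derived from the training data and the formula, which is not necessarily the true best solution (the training data is only part of the data).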
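
For reference, the derivation can be sketched as below. It is the standard ordinary least squares argument, written with m for the number of data points and assuming X^T X is invertible.

```latex
\begin{aligned}
E(w) &= \sum_{i=1}^{m} (y_i - x_i^T w)^2 = (y - Xw)^T (y - Xw) \\
\frac{\partial E}{\partial w} &= -2 X^T (y - Xw) = 0 \\
X^T X w &= X^T y \\
\hat{w} &= (X^T X)^{-1} X^T y
\end{aligned}
```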

Next, a simple example. Suppose the training data (ex0.txt) is distributed as follows:

Next we write the function loadDataSet() to load the training data from ex0.txt (regression.py):

#!/usr/local/bin/python
# -*- coding: utf-8 -*-
from numpy import *

def loadDataSet(fileName):
    """ General function to parse tab-delimited floats. """
    dataMat = []; labelMat = []
    with open(fileName) as fr:                  # file is closed on exit
        for line in fr:
            curLine = line.strip().split('\t')
            numFeat = len(curLine) - 1          # last field is the label
            lineArr = [float(curLine[i]) for i in range(numFeat)]
            dataMat.append(lineArr)
            labelMat.append(float(curLine[-1]))
    return dataMat, labelMat

The training data can be loaded as follows:

>>> import regression
>>> from numpy import *
>>> xArr, yArr = regression.loadDataSet('ex0.txt')
>>> xArr[0:2]
[[1.0, 0.067732000000000001], [1.0, 0.42781000000000002]]
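
As an alternative to a hand-written parser, NumPy's loadtxt can read the same tab-delimited layout. The sketch below is not the book's code. It writes a few made-up rows to a temporary file so that it runs without ex0.txt.

```python
import os
import tempfile
import numpy as np

# Hypothetical sample rows in the same layout as ex0.txt:
# tab-separated floats, with the target value in the last column.
sample = "1.0\t0.1\t3.2\n1.0\t0.5\t3.9\n1.0\t0.9\t4.5\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

try:
    data = np.loadtxt(path, delimiter="\t")
finally:
    os.remove(path)          # clean up the temporary file

X = data[:, :-1]   # every column except the last: features
y = data[:, -1]    # last column: target values

print(X.shape, y)
```

The result is a pair of NumPy arrays rather than the nested lists returned by loadDataSet().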

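Next is the function standRegres(), which computes the regression weights. Because the formula uses a matrix inverse, we must make sure the inverse exists: linalg.det(xTx) == 0.0 means the inverse does not exist. The code is as follows: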

def standRegres(xArr,yArr):
    xMat = mat(xArr); yMat = mat(yArr).T        # yMat becomes a column vector
    xTx = xMat.T*xMat
    if linalg.det(xTx) == 0.0:                  # zero determinant: no inverse
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T*yMat)                  # w = (X^T X)^-1 X^T y
    return ws
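
To see when the inverse is missing, here is a small sketch with made-up data in which one feature column duplicates another. It checks the rank instead of comparing the determinant with 0.0, and falls back to the pseudo-inverse (numpy.linalg.pinv). This is a swapped-in technique, not what standRegres() does.

```python
import numpy as np

# Hypothetical data: the third column duplicates the second,
# so the columns of X are linearly dependent.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # y = 1 + 2x

xTx = X.T.dot(X)
rank = np.linalg.matrix_rank(xTx)    # 2, less than the 3 columns
singular = rank < xTx.shape[0]

# pinv returns the minimum-norm least-squares solution even though
# xTx has no ordinary inverse.
w = np.linalg.pinv(X).dot(y)

print(singular, X.dot(w))
```

The weights are no longer unique in this case, but the predictions X.dot(w) still reproduce y.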

We can then compute the regression weights as follows (xArr and yArr already hold the training data):

>>> ws = regression.standRegres(xArr, yArr)
>>> ws
matrix([[ 3.00774324],
        [ 1.69532264]])

In other words, our regression equation is:

Y = 1.69532264X + 3.00774324
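
The same kind of weights can also be obtained without forming the inverse explicitly. The sketch below uses numpy.linalg.lstsq on synthetic data that I generated to look roughly like the example (intercept 3.0, slope 1.7, small noise). It is not the real ex0.txt, so the numbers will differ from the ones above.

```python
import numpy as np

rng = np.random.RandomState(0)           # fixed seed for repeatability

# Hypothetical synthetic data: a constant 1.0 column, one feature
# in [0, 1), and a noisy linear target.
x = rng.uniform(0.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 3.0 + 1.7 * x + rng.normal(0.0, 0.05, size=200)

# Least-squares solve without computing (X^T X)^-1 explicitly.
w, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

# The same weights from the normal equation, for comparison.
w_normal = np.linalg.solve(X.T.dot(X), X.T.dot(y))

print(w, w_normal)
```

w[0] plays the role of the intercept and w[1] the slope, in the same order as the ws matrix returned by standRegres().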

Next we can use Matplotlib to plot this linear equation:

>>> xMat = mat(xArr)
>>> yMat = mat(yArr)
>>> yHat = xMat * ws # yi = xi*w
>>> import matplotlib.pyplot as plt
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.scatter(xMat[:,1].flatten().A[0], yMat.T[:,0].flatten().A[0])

>>> xCopy = xMat.copy()
>>> xCopy.sort(0)
>>> yHat = xCopy * ws
>>> ax.plot(xCopy[:,1], yHat)
[]
>>> plt.show()

The resulting plot looks like this:

Supplement :
* [ ML In Action ] Predicting numeric values : regression - Linear regression (1)
* [ ML In Action ] Predicting numeric values : regression - Linear regression (2)
* [ ML In Action ] Predicting numeric values : regression - Linear regression (3)
