Wednesday, December 14, 2016

[ Scikit-learn ] Training a machine learning model with scikit-learn

Source From Here
Preface

Agenda
* What is the K-nearest neighbors classification model?
* What are the four steps for model training and prediction in scikit-learn?
* How can I apply this pattern to other machine learning models?

Reviewing the iris dataset
* 150 observations
* 4 features (sepal length, sepal width, petal length, petal width)
* Response variable is the iris species
* Classification problem since response is categorical

How to use scikit-learn to train a model
Loading the Data
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> type(iris)
<class 'sklearn.datasets.base.Bunch'>
>>> X = iris.data # Store feature matrix in 'X'
>>> y = iris.target # Store response vector in 'y'
>>> print(X.shape)
(150, 4)
>>> print(y.shape)
(150,)
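
The dataset properties listed in the review above can be confirmed directly from the returned Bunch object:
>>> print(iris.feature_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
>>> print(iris.target_names)
['setosa' 'versicolor' 'virginica']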


scikit-learn 4-step modeling pattern
Step 1: Import the class you plan to use
>>> from sklearn.neighbors import KNeighborsClassifier

Step 2: "Instantiate" the "estimator" (here: KNeighborsClassifier)
# n_neighbors: number of neighbors to use by default for kneighbors queries.
>>> knn = KNeighborsClassifier(n_neighbors=1)
>>> knn
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=1, p=2,
weights='uniform')
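
The echoed repr above lists every hyperparameter; they can also be read programmatically with the standard get_params() method:
>>> knn.get_params()['n_neighbors']
1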

Step 3: Fit the model with data (aka "model training")
* The model learns the relationship between X and y
* Occurs in place
>>> knn.fit(X, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=1, p=2,
weights='uniform')
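
Notice that the fitted-estimator repr is echoed because fit() returns the estimator itself; a small check confirming training happens in place:
>>> knn.fit(X, y) is knn # fit() returns the same (now trained) object
True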

Step 4: Predict the response for a new observation
* New observations are called "out-of-sample" data
* Uses the information the model learned during training
>>> knn.predict([[3, 5, 4, 2]]) # Input must be 2D: a list of observations
array([2])

* Returns a NumPy array
* Can predict for multiple observations at once
>>> X_new = [[3,5,4,2], [5,4,3,2]]
>>> knn.predict(X_new)
array([2, 1])
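
The numeric predictions map back to species names through iris.target_names (continuing the session above):
>>> print(iris.target_names[knn.predict(X_new)])
['virginica' 'versicolor']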

Using a different value for K
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X, y)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')

>>> knn.predict(X_new)
array([1, 1])
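
For intuition about what n_neighbors controls, below is a minimal from-scratch sketch of the KNN prediction rule (Euclidean distance plus majority vote). knn_predict_one is a hypothetical helper for illustration, not scikit-learn's implementation; it reuses X and y from above:

import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x_new, k):
    # Euclidean distance from x_new to every training observation
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training observations
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

print(knn_predict_one(X, y, np.array([3, 5, 4, 2]), k=5))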


Using a different classification model
The consistent scikit-learn API makes it easy to swap in another model. Below, LogisticRegression is used instead:
>>> from sklearn.linear_model import LogisticRegression
>>> logreg = LogisticRegression() # Instantiate the model
>>> logreg.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

>>> logreg.predict(X_new)
array([2, 0])
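
Because every estimator exposes the same fit/predict interface, comparing several models is just a loop (a small sketch reusing X, y, and X_new from above):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

for model in (KNeighborsClassifier(n_neighbors=5), LogisticRegression()):
    model.fit(X, y) # identical training call for every estimator
    print(model.__class__.__name__, model.predict(X_new))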



Supplement
Previous section - Getting started in scikit-learn with the famous iris dataset
Next section - Comparing machine learning models in scikit-learn
Supervised Learning - 1.6 Nearest Neighbors
1.1.11. Logistic regression
In-depth introduction to machine learning in 15 hours of expert videos

