Programming Notes: [ Scikit-learn ] How to find the best model parameters in scikit-learn


Thursday, December 29, 2016


Source From Here 
Preface 

Agenda 
* How can K-fold cross-validation be used to search for an optimal tuning parameter?
* How can this process be made more efficient?
* How do you search for multiple tuning parameters at once?
* What do you do with those tuning parameters before making real predictions?
* How can the computational expense of this process be reduced?

Review of K-fold cross-validation 
Steps for cross-validation: 
* Dataset is split into K "folds" of equal size
* Each fold acts as the testing set 1 time, and acts as the training set K-1 times
* Average testing performance is used as the estimate of out-of-sample performance
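The split-and-rotate procedure above can be sketched in a few lines of plain Python. This is only an illustrative sketch of the index bookkeeping; in practice scikit-learn's `KFold` (or `cross_val_score` directly) does this for you, with options for shuffling:

```python
# Minimal sketch of K-fold index rotation: split the sample indices into K
# folds, then let each fold serve as the test set exactly once while the
# remaining K-1 folds form the training set.
def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold CV."""
    # Distribute samples as evenly as possible across the k folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

if __name__ == "__main__":
    # Each fold tests once; the other folds train.
    for train, test in kfold_indices(10, 5):
        print("test:", test, "train:", train)
```

Averaging the per-fold test scores then gives the cross-validated estimate of out-of-sample performance.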

Benefits of cross-validation: 
* More reliable estimate of out-of-sample performance than train/test split
* Can be used for selecting tuning parameters, choosing between models, and selecting features

Drawback of cross-validation: 
* Can be computationally expensive

Review of parameter tuning using cross_val_score 
Goal: Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset 
- test1.py 
#!/usr/bin/env python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions
import matplotlib.pyplot as plt

# Read in the iris data
iris = load_iris()

# Create X (features) and y (response)
X = iris.data
y = iris.target

# 10-fold cross-validation with K=5 for KNN
knn = KNeighborsClassifier(n_neighbors=5)
scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
print "KNN(n=5) with accuracy=%.02f on iris dataset." % scores.mean()

# Search for an optimal value of K for KNN
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

plt.plot(k_range, k_scores)
plt.xlabel("Value of K for KNN")
plt.ylabel("Cross-Validated Accuracy")
plt.show()

Execution result: 
(plot of "Value of K for KNN" versus "Cross-Validated Accuracy")
More efficient parameter tuning using GridSearchCV 
- test2.py 
#!/usr/bin/env python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions
import matplotlib.pyplot as plt

# Read in the iris data
iris = load_iris()

# Create X (features) and y (response)
X = iris.data
y = iris.target

# KNN estimator to be tuned (the 10-fold CV happens inside GridSearchCV)
knn = KNeighborsClassifier(n_neighbors=5)

# Define the parameter values that should be searched
k_range = range(1, 31)

# Create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range)

# Instantiate the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')

# Fit the grid with data
grid.fit(X, y)

# Examine the first tuple
print "Tuple0 using parameter=%s" % grid.grid_scores_[0].parameters
print "Tuple0 scores of 10-fold CV:\n%s\n" % grid.grid_scores_[0].cv_validation_scores
print "Tuple0 with mean of 10-fold CV score=%.02f" % grid.grid_scores_[0].mean_validation_score

# Create a list of the mean scores only
grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]

# Plot the results
#plt.plot(k_range, grid_mean_scores)
#plt.xlabel("Value of K for KNN")
#plt.ylabel("Cross-Validated Accuracy")
#plt.show()

# Examine the best model
print "Best score=%.02f" % grid.best_score_
print "Best param=%s" % grid.best_params_
print "Best estimator:\n%s\n" % grid.best_estimator_
What if we have more than one parameter to optimize? 

Searching multiple parameters simultaneously 
* Example: tuning max_depth and min_samples_leaf for DecisionTreeClassifier
* Could tune the parameters independently: change max_depth while leaving min_samples_leaf at its default value, and vice versa
* But, the best performance might be achieved when neither parameter is at its default value
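Searching the parameters together means grid search evaluates the full Cartesian product of the candidate values, which is why its cost grows multiplicatively. A small stdlib sketch (the max_depth and min_samples_leaf value lists below are made-up examples, not values from this post):

```python
from itertools import product

# GridSearchCV tries every combination in the Cartesian product of the
# parameter value lists: here 4 * 3 = 12 candidate models, and with 10-fold
# CV each candidate is fitted 10 times (120 fits in total).
param_grid = {
    'max_depth': [2, 4, 6, 8],
    'min_samples_leaf': [1, 5, 10],
}
combinations = list(product(*param_grid.values()))
print(len(combinations))  # 12 candidate parameter settings
for max_depth, min_samples_leaf in combinations[:3]:
    print(dict(max_depth=max_depth, min_samples_leaf=min_samples_leaf))
```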

- test3.py 
#!/usr/bin/env python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

# Read in the iris data
iris = load_iris()

# Create X (features) and y (response)
X = iris.data
y = iris.target

# KNN estimator to be tuned (the 10-fold CV happens inside GridSearchCV)
knn = KNeighborsClassifier(n_neighbors=5)

# Define the parameter values that should be searched
k_range = range(1, 31)
weight_options = ['uniform', 'distance']

# Create a parameter grid: map the parameter names to the values that should be searched
param_grid = dict(n_neighbors=k_range, weights=weight_options)

# Instantiate and fit the grid
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
grid.fit(X, y)

# Examine the first tuple
print "Tuple0 using parameter=%s" % grid.grid_scores_[0].parameters
print "Tuple0 scores of 10-fold CV:\n%s\n" % grid.grid_scores_[0].cv_validation_scores
print "Tuple0 with mean of 10-fold CV score=%.02f" % grid.grid_scores_[0].mean_validation_score

# Create a list of the mean scores only
grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]

# Examine the best model
print "Best score=%.02f" % grid.best_score_
print "Best param=%s" % grid.best_params_
print "Best estimator:\n%s\n" % grid.best_estimator_
Using the best parameters to make predictions 
# Train your model using all data and the best known parameters
knn = KNeighborsClassifier(n_neighbors=13, weights='uniform')
knn.fit(X, y)

# Make a prediction on out-of-sample data (one observation with 4 features)
knn.predict([[3, 5, 4, 2]])

# Shortcut: GridSearchCV automatically refits the best model using all of the data
grid.predict([[3, 5, 4, 2]])
Reducing Computational Expense Using RandomizedSearchCV 
* Searching many different parameters at once may be computationally infeasible. 
* RandomizedSearchCV searches a subset of the parameters, and you control the computational "budget" 
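For list-valued parameters, the sampling that RandomizedSearchCV performs can be sketched with the stdlib alone. This is only a sketch of the idea, not scikit-learn's implementation (RandomizedSearchCV can also draw from continuous scipy.stats distributions):

```python
import random

# Sketch of randomized search candidate selection: draw n_iter random
# parameter combinations instead of exhausting the full grid.
param_dist = {
    'n_neighbors': list(range(1, 31)),
    'weights': ['uniform', 'distance'],
}

def sample_candidates(param_dist, n_iter, seed=5):
    """Draw n_iter random parameter settings from lists of candidate values."""
    rng = random.Random(seed)  # seed plays the role of random_state
    return [{name: rng.choice(values) for name, values in param_dist.items()}
            for _ in range(n_iter)]

candidates = sample_candidates(param_dist, n_iter=10)
print(len(candidates))  # 10 sampled settings vs. 30 * 2 = 60 in the full grid
```

Each sampled setting is then evaluated with cross-validation exactly as in grid search; only the number of candidates changes.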
- test4.py 
#!/usr/bin/env python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection in newer versions

# Read in the iris data
iris = load_iris()

# Create X (features) and y (response)
X = iris.data
y = iris.target

# KNN estimator to be tuned (the 10-fold CV happens inside RandomizedSearchCV)
knn = KNeighborsClassifier(n_neighbors=5)

# Define the parameter values that should be searched
k_range = range(1, 31)
weight_options = ['uniform', 'distance']

# Specify "parameter distributions" rather than a "parameter grid"
param_dist = dict(n_neighbors=k_range, weights=weight_options)

# n_iter controls the number of searches
rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
rand.fit(X, y)
print rand.grid_scores_

# Examine the best model
print "Best score=%.02f" % rand.best_score_
print "Best param=%s" % rand.best_params_
print "Best estimator:\n%s\n" % rand.best_estimator_

# Run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
best_scores = []
for i in range(20):
    rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10)
    rand.fit(X, y)
    best_scores.append(round(rand.best_score_, 3))
print "Best scores collected over 20 runs:\n%s\n" % best_scores
Supplement 
Grid search user guide: http://scikit-learn.org/stable/module... 
GridSearchCV documentation: http://scikit-learn.org/stable/module... 
RandomizedSearchCV documentation: http://scikit-learn.org/stable/module... 
Comparing randomized search and grid search: http://scikit-learn.org/stable/auto_e... 
Randomized search video: https://youtu.be/0wUF_Ov8b0A?t=17m38s 
Randomized search notebook: http://nbviewer.ipython.org/github/am... 
Random Search for Hyper-Parameter Optimization (paper): http://www.jmlr.org/papers/volume13/b...
