程式扎記: [ Scikit-learn ] Selecting the best model in scikit-learn using cross-validation

Source From Here
Preface

Agenda

* What is the drawback of using the train/test split procedure for model evaluation?
* How does K-fold cross-validation overcome this limitation?
* How can cross-validation be used for selecting tuning parameters, choosing between models, and selecting features?
* What are some possible improvements to cross-validation?

Review of model evaluation procedures
Motivation:

We need a way to choose between machine learning models. The goal is to estimate the likely performance of a model on out-of-sample data.

Initial idea:

Train and test on the same data. But maximizing training accuracy rewards overly complex models that overfit the training data.

Alternative idea:

Train/test split:
* Split the dataset into two pieces, so that the model can be trained and tested on different data
* Testing accuracy is a better estimate of out-of-sample performance than training accuracy
* But it provides a high-variance estimate, since changing which observations happen to be in the testing set can significantly change the testing accuracy

Let's code:

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn import metrics
>>> iris = load_iris()  # Load the iris dataset
>>> X = iris.data
>>> y = iris.target
>>> # A different random_state gives a different accuracy score, which shows the high variance
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)  # Train the model
>>> y_pred = knn.predict(X_test)
>>> print(metrics.accuracy_score(y_test, y_pred))
0.973684210526
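
To see the variance concretely, the same split-and-score procedure can be repeated with several seeds. This is only a minimal sketch, assuming a recent scikit-learn release where train_test_split is imported from sklearn.model_selection; the seed range 0 to 9 is an arbitrary choice for illustration and is not from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

iris = load_iris()
X, y = iris.data, iris.target

# Repeat the train/test split with different seeds (0-9 is an arbitrary choice)
accuracies = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    accuracies.append(metrics.accuracy_score(y_test, knn.predict(X_test)))

# The gap between the lowest and highest score reflects the variance of the estimate
print("min = %.3f, max = %.3f" % (min(accuracies), max(accuracies)))
```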

K-fold cross-validation
Steps for K-fold cross-validation

1. Split the dataset into K equal partitions (or "folds")
2. Use fold 1 as the testing set and the union of the other folds as the training set.
3. Calculate testing accuracy.
4. Repeat steps 2 and 3 K times, using a different fold as the testing set each time.
5. Use the average testing accuracy as the estimate of out-of-sample accuracy.
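
The five steps above can be sketched by hand as follows. This is an illustration rather than the original author's code: it assumes a recent scikit-learn release (sklearn.model_selection), uses K=5 folds, and shuffles with a fixed seed because the iris observations are stored sorted by class. All of these are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Step 1: split the dataset into K folds (shuffled, since iris is sorted by class)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_accuracies = []
# Step 4: the loop repeats steps 2 and 3 with a different testing fold each time
for train_idx, test_idx in kf.split(X):
    # Step 2: one fold is the testing set, the union of the others is the training set
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X[train_idx], y[train_idx])
    # Step 3: calculate the testing accuracy for this fold
    fold_accuracies.append(knn.score(X[test_idx], y[test_idx]))

# Step 5: the average testing accuracy is the out-of-sample estimate
mean_accuracy = np.mean(fold_accuracies)
print("mean accuracy = %.3f" % mean_accuracy)
```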

Diagram of 5-fold cross-validation:

Below is the sample code to demonstrate the K-fold process:
- demo_kfolds.py

#!/usr/bin/env python
from sklearn.model_selection import KFold

# Simulate splitting a dataset of 25 observations into 5 folds
kf = KFold(n_splits=5, shuffle=False)

# Print the contents of each training and testing set
print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, (train, test) in enumerate(kf.split(range(25)), start=1):
    # str() is needed because NumPy arrays do not accept a format spec such as ^25
    print('{:^9} {} {:^25}'.format(iteration, train, str(test)))

The output looks like:

Comparing cross-validation to train/test split
Advantages of cross-validation:

* More accurate estimate of out-of-sample accuracy
* More "efficient" use of data (every observation is used for both training and testing)

Advantages of train/test split:

* Runs K times faster than K-fold cross-validation
* Simpler to examine the detailed results of the testing process

Cross-validation recommendations
1. K can be any number, but K=10 is generally recommended
2. For classification problems, stratified sampling (keeping the same proportion of each class in each training/testing set) is recommended for creating the folds. (scikit-learn's cross_val_score function does this by default.)
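
To make the stratification in recommendation 2 visible, the sketch below builds the folds explicitly with StratifiedKFold and counts the classes in each testing fold. It assumes a recent scikit-learn release (sklearn.model_selection); the variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

iris = load_iris()
X, y = iris.data, iris.target

# Build 10 stratified folds and count how many observations of each class
# land in every testing fold
skf = StratifiedKFold(n_splits=10)
test_class_counts = []
for train_idx, test_idx in skf.split(X, y):
    counts = np.bincount(y[test_idx], minlength=3)
    test_class_counts.append(counts.tolist())
    print(counts)
```

Because iris has 50 observations per class, each of the 10 testing folds contains 5 observations from every class.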

Cross-validation example: parameter tuning
Goal: Select the best tuning parameters (aka 'hyperparameters') for KNN on the iris dataset
- select_params.py

#!/usr/bin/env python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

iris = load_iris()  # Load the iris dataset
X = iris.data
y = iris.target

# Try K=1 through K=30 and record the mean 10-fold accuracy for each value
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

# Plot the value of K (x-axis) against the cross-validated accuracy (y-axis)
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-Validation Accuracy')
plt.show()

The output looks like:

As expected, the best value is neither the smallest K (low bias, high variance: overfitting) nor the largest K (high bias, low variance: underfitting). Here we select K=20 for the following sections.
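
Instead of reading the best value off the plot, the scores can also be inspected programmatically. The sketch below is only an illustration, assuming a recent scikit-learn release (sklearn.model_selection); best_score and best_ks are names introduced here, and several values of K may tie for the top score.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Mean 10-fold accuracy for K=1 through K=30
k_range = list(range(1, 31))
k_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                    cv=10, scoring='accuracy').mean()
    for k in k_range
]

# Collect every K whose mean accuracy matches the best one (ties are possible)
best_score = max(k_scores)
best_ks = [k for k, s in zip(k_range, k_scores) if np.isclose(s, best_score)]
print("best mean accuracy = %.3f for K in %s" % (best_score, best_ks))
```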

Cross-validation example: model selection
Goal: Compare the best KNN model with Logistic regression on the iris dataset
- compare_model.py

#!/usr/bin/env python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()  # Load the iris dataset
X = iris.data
y = iris.target

# Choose the best K=20 according to the previous experiment
knn = KNeighborsClassifier(n_neighbors=20)
# 10-fold cross-validation with the best KNN model
print("KNN(K=20) with accuracy mean under 10 fold = %.02f" % cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())

# 10-fold cross-validation with logistic regression
# (the default solver changed in scikit-learn 0.22, so recent releases may report a slightly different score)
logreg = LogisticRegression()
print("Logistic Regression with accuracy mean under 10 fold = %.02f" % cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())

The execution result:

KNN(K=20) with accuracy mean under 10 fold = 0.98
Logistic Regression with accuracy mean under 10 fold = 0.95

The result shows that the KNN model outperforms the logistic regression model.

Cross-validation example: feature selection
- select_features.py

#!/usr/bin/env python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Read in the advertising dataset
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')

# Create a Python list of three feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# Use the list to select a subset of the DataFrame (X)
X = data[feature_cols]

# Select the Sales column as the response
y = data.Sales

# 10-fold cross-validation with all three features
# ('neg_mean_squared_error' replaces the older 'mean_squared_error' scorer name)
lm = LinearRegression()
scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')

# The scorer returns negated MSE values (higher is better), so flip the sign
mse_scores = -scores

# Show the mean RMSE score
rmse_scores = np.sqrt(mse_scores)
print("The RMSE scores with full features selection = %.02f" % rmse_scores.mean())

# Repeat the 10-fold cross-validation without the 'Newspaper' feature
feature_cols = ['TV', 'Radio']
X = data[feature_cols]
print("The RMSE scores with features selection (Without 'Newspaper') = %.02f" % np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())

The execution output:

The RMSE scores with full features selection = 1.69
The RMSE scores with features selection (Without 'Newspaper') = 1.68

Note that the model without 'Newspaper' is slightly better than the model with all three features (it has a smaller RMSE).

Improvements to cross-validation
Repeated cross-validation

* Repeat cross-validation multiple times (with different random splits of the data) and average the results
* More reliable estimate of out-of-sample performance by reducing the variance associated with a single trial of cross-validation.
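
As a rough sketch of repeated cross-validation, scikit-learn's RepeatedStratifiedKFold can be passed as the cv argument of cross_val_score. This assumes a recent scikit-learn release; the number of repeats (5) and the seed are arbitrary values picked for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# 10-fold cross-validation repeated 5 times, each time with a different random split
# (n_repeats=5 and random_state=0 are arbitrary choices)
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
knn = KNeighborsClassifier(n_neighbors=20)
scores = cross_val_score(knn, X, y, cv=rskf, scoring='accuracy')

# 10 folds x 5 repeats = 50 scores, averaged into a single estimate
print("%d scores, mean accuracy = %.3f" % (len(scores), scores.mean()))
```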

Creating a hold-out set

* "Hold out" a portion of the data before beginning the model building process
* Locate the best model using cross-validation on the remaining data, and test it using the hold-out set.
* More reliable estimate of out-of-sample performance since hold-out set is truly out-of-sample
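
One possible way to set this up is sketched below, assuming a recent scikit-learn release. The 20% hold-out fraction, the seed, and the candidate values of K are all assumptions made for this example and do not come from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 20% of the data before any model building (fraction and seed are arbitrary)
X_work, X_holdout, y_work, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Locate the best model using cross-validation on the remaining data only
candidate_ks = [1, 5, 10, 20]  # hypothetical candidates for illustration
cv_means = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_means[k] = cross_val_score(knn, X_work, y_work, cv=10, scoring='accuracy').mean()
best_k = max(cv_means, key=cv_means.get)

# Test the selected model once on the hold-out set
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_work, y_work)
holdout_accuracy = final_model.score(X_holdout, y_holdout)
print("best K = %d, hold-out accuracy = %.3f" % (best_k, holdout_accuracy))
```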

Feature engineering and selection within cross-validation iterations

* Normally, feature engineering and selection occur before cross-validation
* Instead, perform all feature engineering and selection within each cross-validation iteration
* More reliable estimate of out-of-sample performance since it better mimics the application of the model to out-of-sample data
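
In scikit-learn, one common way to do this is to wrap the preprocessing steps and the model in a Pipeline, so that every step is re-fitted on the training folds of each iteration. The sketch below assumes a recent scikit-learn release; the particular steps (StandardScaler, SelectKBest with k=2) are picked only for illustration and are not from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# Scaling and feature selection are part of the estimator, so cross_val_score
# fits them on the training folds only and applies them to the testing fold
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=2),  # keep the 2 highest-scoring features (arbitrary)
    KNeighborsClassifier(n_neighbors=20),
)
scores = cross_val_score(pipe, X, y, cv=10, scoring='accuracy')
print("mean accuracy = %.3f" % scores.mean())
```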

Supplement
* Prev - Data science in Python: pandas, seaborn, scikit-learn
* Next - How to find the best model parameters in scikit-learn


Sunday, December 25, 2016
