程式扎記: [ Scikit-learn ] Selecting the best model in scikit-learn using cross-validation

Sunday, December 25, 2016

Source From Here 
Preface 

Agenda 
* What is the drawback of using the train/test split procedure for model evaluation?
* How does K-fold cross-validation overcome this limitation?
* How can cross-validation be used for selecting tuning parameters, choosing between models, and selecting features?
* What are some possible improvements to cross-validation?


Review of model evaluation procedures 
Motivation: 
We need a way to choose between machine learning models. The goal is to estimate the likely performance of a model on out-of-sample data.

Initial idea: 
Train and test on the same data. But maximizing training accuracy rewards overly complex models that overfit the training data.

Alternative idea: 
Train/test split:
* Split the dataset into two pieces, so that the model can be trained and tested on different data
* Testing accuracy is a better estimate of out-of-sample performance than training accuracy
* But it provides a high-variance estimate, since changing which observations happen to be in the testing set can significantly change testing accuracy

Let's code: 
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation before scikit-learn 0.18
>>> from sklearn.neighbors import KNeighborsClassifier
>>> from sklearn import metrics
>>> iris = load_iris()  # Load the iris dataset
>>> X = iris.data
>>> y = iris.target
>>> # A different random_state gives a different split and a different accuracy score (high variance)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
>>> knn = KNeighborsClassifier(n_neighbors=5)
>>> knn.fit(X_train, y_train)  # Train the model
>>> y_pred = knn.predict(X_test)
>>> print(metrics.accuracy_score(y_test, y_pred))
0.973684210526
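
To make the high-variance point concrete, here is a minimal sketch that repeats the same train/test split with several values of random_state and compares the resulting testing accuracies. The helper name split_accuracies and the choice of ten seeds are assumptions made for illustration; they are not part of the original walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def split_accuracies(seeds, n_neighbors=5):
    """Return the KNN testing accuracy obtained for each train/test split seed."""
    X, y = load_iris(return_X_y=True)
    accuracies = []
    for seed in seeds:
        # Only the seed changes, so any change in accuracy comes from the split itself.
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
        knn = KNeighborsClassifier(n_neighbors=n_neighbors)
        knn.fit(X_train, y_train)
        accuracies.append(accuracy_score(y_test, knn.predict(X_test)))
    return accuracies


accuracies = split_accuracies(seeds=range(10))  # hypothetical choice of 10 seeds
print("min = %.3f, max = %.3f" % (min(accuracies), max(accuracies)))
```

The gap between the minimum and maximum gives a rough feel for how much a single split can mislead.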

K-fold cross-validation 
Steps for K-fold cross-validation 
1. Split the dataset into K equal partitions (or "folds")
2. Use fold 1 as the testing set and the union of the other folds as the training set.
3. Calculate testing accuracy.
4. Repeat steps 2 and 3 K times, using a different fold as the testing set each time.
5. Use the average testing accuracy as the estimate of out-of-sample accuracy.
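
The five steps above can be sketched by hand with NumPy index arithmetic. This is an illustrative sketch, not how scikit-learn implements it: the function name manual_kfold_accuracy is hypothetical, and the rows are shuffled before splitting (an assumption added here because the iris rows are ordered by class).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier


def manual_kfold_accuracy(X, y, k=5, n_neighbors=5, seed=0):
    """Estimate out-of-sample accuracy by following the K-fold steps by hand."""
    # Step 1: shuffle the row indices and split them into K (nearly) equal folds.
    indices = np.random.RandomState(seed).permutation(len(X))
    folds = np.array_split(indices, k)
    fold_accuracies = []
    for i in range(k):
        # Step 2: fold i is the testing set; the union of the other folds is the training set.
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = KNeighborsClassifier(n_neighbors=n_neighbors)
        model.fit(X[train_idx], y[train_idx])
        # Step 3: calculate the testing accuracy for this fold.
        fold_accuracies.append(float(np.mean(model.predict(X[test_idx]) == y[test_idx])))
    # Step 4 is the loop above; step 5 averages the K testing accuracies.
    return float(np.mean(fold_accuracies)), fold_accuracies


X, y = load_iris(return_X_y=True)
mean_accuracy, fold_accuracies = manual_kfold_accuracy(X, y, k=5)
print("mean accuracy over 5 folds = %.3f" % mean_accuracy)
```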

Diagram of 5-fold cross-validation: 


Below is the sample code to demonstrate the K-fold process: 
- demo_kfolds.py 
    #!/usr/bin/env python
    from sklearn.model_selection import KFold

    # Simulate splitting a dataset of 25 observations into 5 folds
    kf = KFold(n_splits=5, shuffle=False)

    # Print the contents of each training and testing set
    print('{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
    for iteration, (train, test) in enumerate(kf.split(range(25)), start=1):
        # str() is needed because NumPy arrays do not accept the '^25' format spec
        print('{:^9} {} {:^25}'.format(iteration, train, str(test)))
The output looks like: 

Comparing cross-validation to train/test split 
Advantages of cross-validation: 
* More accurate estimate of out-of-sample accuracy
* More "efficient" use of data (every observation is used for both training and testing)

Advantages of train/test split: 
* Runs K times faster than K-fold cross-validation
* Simpler to examine the detailed results of the testing process

Cross-validation recommendations 
1. K can be any number, but K=10 is generally recommended 
2. For classification problems, stratified sampling (keeping the same proportion of each class in each training/testing set) is recommended for creating the folds. (scikit-learn's cross_val_score function does this by default for classifiers.) 
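
As a quick check of what stratified sampling means in practice, the sketch below builds 10 stratified folds on the iris dataset and counts how many observations of each class land in every testing fold. It uses StratifiedKFold directly; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)  # 150 observations, 50 per class

skf = StratifiedKFold(n_splits=10)
# Count how many observations of each class land in each testing fold.
test_class_counts = [np.bincount(y[test_idx], minlength=3).tolist()
                     for _, test_idx in skf.split(X, y)]
for fold, counts in enumerate(test_class_counts, start=1):
    print("fold %2d: class counts in testing set = %s" % (fold, counts))
```

Because iris has 50 observations per class, each of the 10 testing folds receives 5 observations from every class.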

Cross-validation example: parameter tuning 
Goal: Select the best tuning parameters (aka 'hyperparameters') for KNN on the iris dataset 
- select_params.py 
    #!/usr/bin/env python
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    import matplotlib.pyplot as plt

    iris = load_iris()  # Load the iris dataset
    X = iris.data
    y = iris.target

    # Try K=1 through K=30 and record the mean 10-fold accuracy for each
    k_range = range(1, 31)
    k_scores = []
    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
        k_scores.append(scores.mean())

    # Plot the value of K (x-axis) against the cross-validated accuracy (y-axis)
    plt.plot(k_range, k_scores)
    plt.xlabel('Value of K for KNN')
    plt.ylabel('Cross-Validation Accuracy')
    plt.show()
The resulting plot looks like: 

As expected, the best value is found neither at the lowest K (overfitting / high variance) nor at the highest K (underfitting / high bias). We select K=20 for the sections that follow. 
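
Reading the best K off a plot can be supplemented with a few lines of code. The sketch below recomputes the mean 10-fold accuracy for K=1 to 30 and lists every K that ties for the best score; the tie handling via np.isclose is an assumption of this sketch, and which K to pick among the ties remains a judgment call.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k_range = list(range(1, 31))
k_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                            cv=10, scoring='accuracy').mean()
            for k in k_range]

# Collect every K whose mean accuracy ties (within floating-point tolerance) for the best score.
best_score = max(k_scores)
best_ks = [k for k, score in zip(k_range, k_scores) if np.isclose(score, best_score)]
print("best mean accuracy = %.3f at K in %s" % (best_score, best_ks))
```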

Cross-validation example: model selection 
Goal: Compare the best KNN model with Logistic regression on the iris dataset 
- compare_model.py 
    #!/usr/bin/env python
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    iris = load_iris()  # Load the iris dataset
    X = iris.data
    y = iris.target

    # Choose the best K=20 according to the previous experiment
    knn = KNeighborsClassifier(n_neighbors=20)
    # 10-fold cross-validation with the best KNN model
    print("KNN(K=20) with accuracy mean under 10 fold = %.02f" % cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())

    # 10-fold cross-validation with logistic regression (default settings;
    # the default solver differs between scikit-learn releases, so the score may vary)
    logreg = LogisticRegression()
    print("Logistic Regression with accuracy mean under 10 fold = %.02f" % cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())
The execution result: 
KNN(K=20) with accuracy mean under 10 fold = 0.98
Logistic Regression with accuracy mean under 10 fold = 0.95

The result shows that the KNN model outperforms the logistic regression model. 

Cross-validation example: feature selection 
- select_features.py 
    #!/usr/bin/env python
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Read in the advertising dataset
    data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')

    # Create a Python list of three feature names
    feature_cols = ['TV', 'Radio', 'Newspaper']

    # Use the list to select a subset of the DataFrame (X)
    X = data[feature_cols]

    # Select the Sales column as the response
    y = data.Sales

    # 10-fold cross-validation with all three features
    # (the scorer returns negated MSE so that higher is always better)
    lm = LinearRegression()
    scores = cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')

    # Fix the sign of the MSE scores
    mse_scores = -scores

    # Show the mean RMSE score
    rmse_scores = np.sqrt(mse_scores)
    print("The RMSE scores with full features selection = %.02f" % rmse_scores.mean())

    # Repeat with 'Newspaper' excluded
    feature_cols = ['TV', 'Radio']
    X = data[feature_cols]
    print("The RMSE scores with features selection (Without 'Newspaper') = %.02f" % np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')).mean())
The execution output: 
The RMSE scores with full features selection = 1.69
The RMSE scores with features selection (Without 'Newspaper') = 1.68

Note that the model without 'Newspaper' is slightly better than the model with all three features (it has a smaller RMSE). 
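
The same comparison can be extended to every subset of candidate features. The sketch below does so on synthetic stand-in data, which is an assumption made so that it runs without a download; it is not the advertising dataset, and the column names signal_a, signal_b and noise as well as the helper cv_rmse are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, NOT the advertising dataset):
# 'signal_a' and 'signal_b' drive the response, 'noise' is unrelated to it.
rng = np.random.RandomState(0)
data = pd.DataFrame(rng.normal(size=(200, 3)), columns=['signal_a', 'signal_b', 'noise'])
y = 3.0 * data['signal_a'] + 2.0 * data['signal_b'] + rng.normal(scale=0.5, size=200)


def cv_rmse(feature_cols):
    """Mean 10-fold RMSE of a linear regression restricted to feature_cols."""
    scores = cross_val_score(LinearRegression(), data[list(feature_cols)], y,
                             cv=10, scoring='neg_mean_squared_error')
    return float(np.sqrt(-scores).mean())


# Score every non-empty subset of the candidate features.
rmse_by_subset = {subset: cv_rmse(subset)
                  for size in range(1, len(data.columns) + 1)
                  for subset in combinations(data.columns, size)}
for subset, rmse in sorted(rmse_by_subset.items(), key=lambda item: item[1]):
    print("%-30s RMSE = %.3f" % (", ".join(subset), rmse))
```

Exhaustive search is only practical for a handful of features, since the number of subsets doubles with each added column.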

Improvements to cross-validation 
Repeated cross-validation 
* Repeat cross-validation multiple times (with different random splits of the data) and average the results
* More reliable estimate of out-of-sample performance by reducing the variance associated with a single trial of cross-validation.
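
One way to sketch repeated cross-validation is scikit-learn's RepeatedStratifiedKFold splitter passed as the cv argument. The choices of 5 repeats, random_state=0 and K=20 (carried over from the earlier KNN example) are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 10-fold stratified cross-validation, repeated 5 times with different random splits.
repeated_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
knn = KNeighborsClassifier(n_neighbors=20)
scores = cross_val_score(knn, X, y, cv=repeated_cv, scoring='accuracy')

# One score per fold per repeat; the mean averages over all of them.
print("%d scores, mean accuracy = %.3f, std = %.3f" % (len(scores), scores.mean(), scores.std()))
```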

Creating a hold-out set 
* "Hold out" a portion of the data before beginning the model building process
* Locate the best model using cross-validation on the remaining data, and test it using the hold-out set.
* More reliable estimate of out-of-sample performance since hold-out set is truly out-of-sample
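
A minimal sketch of the hold-out workflow follows, assuming a 20% hold-out fraction and the K=1 to 30 search range used earlier; both are illustrative choices, not recommendations from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data before any model building (the fraction is an assumption).
X_build, X_holdout, y_build, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Locate the best K using cross-validation on the remaining data only.
cv_scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X_build, y_build,
                                cv=10, scoring='accuracy').mean()
             for k in range(1, 31)}
best_k = max(cv_scores, key=cv_scores.get)

# Refit the chosen model on all of the remaining data and test it once on the hold-out set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_build, y_build)
holdout_accuracy = final_model.score(X_holdout, y_holdout)
print("best K = %d, hold-out accuracy = %.3f" % (best_k, holdout_accuracy))
```

The hold-out set is touched exactly once, after the model has been chosen, which is what keeps it out-of-sample.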

Feature engineering and selection within cross-validation iterations 
* Normally, feature engineering and selection occur before cross-validation
* Instead, perform all feature engineering and selection within each cross-validation iteration
* More reliable estimate of out-of-sample performance since it better mimics the application of the model to out-of-sample data
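
In scikit-learn, one common way to keep feature engineering and selection inside each cross-validation iteration is to wrap those steps in a Pipeline and pass the pipeline to cross_val_score. The sketch below assumes standard scaling and a univariate selector that keeps 2 features; both are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling and feature selection are steps of the estimator, so cross_val_score
# refits them on the training folds of every iteration and never on the testing fold.
pipeline = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=2),  # keeping 2 features is an arbitrary choice
    KNeighborsClassifier(n_neighbors=20),
)
scores = cross_val_score(pipeline, X, y, cv=10, scoring='accuracy')
print("mean accuracy with in-fold preprocessing = %.3f" % scores.mean())
```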

Supplement 
* Prev - Data science in Python: pandas, seaborn, scikit-learn 
* Next - How to find the best model parameters in scikit-learn
