Preface
Agenda
- Review of model evaluation procedures
- K-fold cross-validation
- Comparing cross-validation to train/test split, plus recommendations
- Cross-validation examples: parameter tuning, model selection, and feature selection
- Improvements to cross-validation
Review of model evaluation procedures
Motivation: We need a way to choose between machine learning models; the goal is to estimate how well a model will perform on out-of-sample data.
Initial idea: Train and test on the same data, but maximizing training accuracy rewards overly complex models that overfit.
Alternative idea: Use a train/test split, so that testing accuracy better estimates out-of-sample performance, but this estimate has high variance because it depends on which observations happen to land in the testing set.
Let's code:
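The review code itself is not reproduced in this post, so here is a minimal stand-in sketch of the train/test split procedure being reviewed. It assumes the same iris/KNN setup used in the later examples and the same sklearn.cross_validation module (newer scikit-learn releases moved these helpers to sklearn.model_selection); the file name review_traintest.py is just a placeholder.
- review_traintest.py
- #!/usr/bin/env python
- from sklearn.datasets import load_iris
- from sklearn.cross_validation import train_test_split
- from sklearn.neighbors import KNeighborsClassifier
- from sklearn import metrics
- # Load the iris dataset and make a single train/test split
- iris = load_iris()
- X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=4)
- # Fit KNN on the training set and evaluate on the held-out testing set
- knn = KNeighborsClassifier(n_neighbors=5)
- knn.fit(X_train, y_train)
- print "Testing accuracy = %.02f" % metrics.accuracy_score(y_test, knn.predict(X_test))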
K-fold cross-validation
Steps for K-fold cross-validation
1. Split the dataset into K equal partitions (folds).
2. Use fold 1 as the testing set and the union of the remaining folds as the training set.
3. Calculate the testing accuracy.
4. Repeat steps 2-3 K times, using a different fold as the testing set each time.
5. Use the average testing accuracy as the estimate of out-of-sample accuracy.
Below is sample code demonstrating how the observations are assigned to folds:
- demo_kfolds.py
- #!/usr/bin/env python
- from sklearn.cross_validation import KFold
- # Simulate splitting a dataset of 25 observations into 5 folds
- kf = KFold(25, n_folds=5, shuffle=False)
- # Print the contents of each training and testing set
- print '{} {:^61} {}'.format('Iteration', 'Training set observations', 'Testing set observations')
- for iteration, data in enumerate(kf, start=1):
-     print '{:^9} {} {:^25}'.format(iteration, data[0], data[1])
Comparing cross-validation to train/test split
Advantages of cross-validation: a more accurate estimate of out-of-sample accuracy, and a more "efficient" use of the data, since every observation is used for both training and testing.
Advantages of train/test split: it runs K times faster than K-fold cross-validation, and it is simpler to examine the detailed results of the testing process.
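To make the variance point concrete, here is a hypothetical sketch (not from the original post) that repeats the train/test split with different random_state values and compares the spread of the resulting accuracies against the 10-fold cross-validation estimate. It assumes the iris/KNN setup and the old sklearn.cross_validation imports used elsewhere in the post; the file name compare_split_cv.py is a placeholder.
- compare_split_cv.py
- #!/usr/bin/env python
- from sklearn.datasets import load_iris
- from sklearn.cross_validation import train_test_split, cross_val_score
- from sklearn.neighbors import KNeighborsClassifier
- from sklearn import metrics
- iris = load_iris()
- X, y = iris.data, iris.target
- knn = KNeighborsClassifier(n_neighbors=5)
- # The testing accuracy from a single split changes with the split that is drawn
- for state in range(5):
-     X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=state)
-     knn.fit(X_train, y_train)
-     print "random_state=%d  testing accuracy = %.02f" % (state, metrics.accuracy_score(y_test, knn.predict(X_test)))
- # Cross-validation averages over many splits, giving a lower-variance estimate
- print "10-fold CV mean accuracy = %.02f" % cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean()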
Cross-validation recommendations
1. K can be any number, but K=10 is generally recommended
2. For classification problems, stratified sampling (keeping the same proportion of each class in every fold) is recommended for creating the folds. (scikit-learn's cross_val_score function does this by default when the estimator is a classifier.)
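To see what stratification does, here is a hypothetical check (not from the original post) that builds stratified folds for the iris labels and prints the class counts in each testing fold. It uses the same old sklearn.cross_validation module as the other examples (newer releases provide StratifiedKFold in sklearn.model_selection with a different calling convention); demo_stratified.py is a placeholder name.
- demo_stratified.py
- #!/usr/bin/env python
- import numpy as np
- from sklearn.datasets import load_iris
- from sklearn.cross_validation import StratifiedKFold
- y = load_iris().target
- # Each of the 10 folds keeps roughly the same class proportions as the full label vector
- skf = StratifiedKFold(y, n_folds=10, shuffle=False)
- for fold, (train_index, test_index) in enumerate(skf, start=1):
-     print "Fold %2d testing-set class counts: %s" % (fold, np.bincount(y[test_index]))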
Cross-validation example: parameter tuning
Goal: Select the best tuning parameters (aka 'hyperparameters') for KNN on the iris dataset
- select_params.py
- #!/usr/bin/env python
- from sklearn.datasets import load_iris
- from sklearn.neighbors import KNeighborsClassifier
- iris = load_iris() # Load the iris dataset
- X = iris.data
- y = iris.target
- from sklearn.cross_validation import cross_val_score
- k_range = range(1, 31)
- k_scores = []
- # Run 10-fold cross-validation for each candidate value of K and record the mean accuracy
- for k in k_range:
-     knn = KNeighborsClassifier(n_neighbors=k)
-     scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
-     k_scores.append(scores.mean())
- import matplotlib.pyplot as plt
- plt.plot(k_range, k_scores)
- plt.xlabel('value of K for KNN')
- plt.ylabel('Cross-Validation Accuracy')
- plt.show()
As expected, the best value is neither the smallest K (a very flexible model that overfits: low bias, high variance) nor the largest K (an overly simple model that underfits: high bias, low variance). Here we select K=20 for the following sections.
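If you prefer to read the winning value off programmatically rather than from the plot, a hypothetical follow-up to select_params.py (reusing its k_range and k_scores lists; not part of the original post) could look like this. Among tied values it picks the largest K, since a higher K gives a simpler, lower-variance model:
- # Hypothetical follow-up: among the best-scoring values of K, prefer the largest
- best_score = max(k_scores)
- best_k = max(k for k, s in zip(k_range, k_scores) if s == best_score)
- print "Best K = %d with mean cross-validation accuracy = %.03f" % (best_k, best_score)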
Cross-validation example: model selection
Goal: Compare the best KNN model with Logistic regression on the iris dataset
- compare_model.py
- #!/usr/bin/env python
- from sklearn.datasets import load_iris
- from sklearn.neighbors import KNeighborsClassifier
- iris = load_iris() # Load the iris dataset
- X = iris.data
- y = iris.target
- from sklearn.cross_validation import cross_val_score
- # Choose the best K=20 according to previous experiment
- knn = KNeighborsClassifier(n_neighbors=20)
- # 10-fold cross-validation with the best KNN model
- print "KNN(K=20) with accuracy mean under 10 fold = %.02f" % (cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())
- from sklearn.linear_model import LogisticRegression
- logreg = LogisticRegression()
- print "Logistic Regression with accuracy mean under 10 fold = %.02f" % (cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())
The result shows that the KNN model outperforms the logistic regression model on this dataset.
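The means alone hide how much the individual fold scores vary; a small hypothetical addition to compare_model.py (not in the original post) reports the standard deviation across the 10 folds as well, which helps judge whether the gap between the two models is meaningful:
- # Hypothetical addition: report fold-to-fold variability alongside the means
- knn_scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
- logreg_scores = cross_val_score(logreg, X, y, cv=10, scoring='accuracy')
- print "KNN (K=20):          mean = %.02f, std = %.02f" % (knn_scores.mean(), knn_scores.std())
- print "Logistic regression: mean = %.02f, std = %.02f" % (logreg_scores.mean(), logreg_scores.std())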
Cross-validation example: feature selection
- select_features.py
- #!/usr/bin/env python
- from sklearn.cross_validation import cross_val_score
- import pandas as pd
- import numpy as np
- from sklearn.linear_model import LinearRegression
- # Read in the advertising dataset
- data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv')
- # Create a Python list of three feature names
- feature_cols = ['TV', 'Radio', 'Newspaper']
- # Use the list to select a subset of the DataFrame (X)
- X = data[feature_cols]
- # Select the Sales column as the response
- y = data.Sales
- # 10-fold cross-validation with all three features
- lm = LinearRegression()
- scores = cross_val_score(lm, X, y, cv=10, scoring='mean_squared_error')
- # The MSE scorer returns negative values, so flip the sign
- mse_scores = -scores
- # Convert MSE to RMSE
- rmse_scores = np.sqrt(mse_scores)
- print "Mean RMSE with all three features (TV, Radio, Newspaper) = %.02f" % (rmse_scores.mean())
- feature_cols = ['TV', 'Radio']
- X = data[feature_cols]
- print "The RMSE scores with features selection (Without 'Newspaper') = %.02f" % (np.sqrt(-cross_val_score(lm, X, y, cv=10, scoring='mean_squared_error')).mean())
Notice that the feature set without 'Newspaper' gives a slightly smaller RMSE than using all three features, so dropping that feature slightly improves the model.
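The same comparison can be written as a loop over several candidate feature subsets; this is a hypothetical generalisation of select_features.py (reusing its lm, data, y, np, and cross_val_score objects), not code from the original post:
- # Hypothetical extension: score several candidate feature subsets with 10-fold CV
- candidate_subsets = [['TV', 'Radio', 'Newspaper'], ['TV', 'Radio'], ['TV'], ['Radio']]
- for cols in candidate_subsets:
-     scores = cross_val_score(lm, data[cols], y, cv=10, scoring='mean_squared_error')
-     print "Features %-28s mean RMSE = %.02f" % (cols, np.sqrt(-scores).mean())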
Improvements to cross-validation
Repeated cross-validation: repeat the K-fold procedure several times with different random splits of the data and average the results, to further reduce the variance of the estimate.
Creating a hold-out set: keep a portion of the data out of the cross-validation process entirely and use it only for a final evaluation of the chosen model.
Feature engineering and selection within cross-validation iterations: perform preprocessing and feature selection on the training folds only, inside every iteration, so that information from the testing fold does not leak into the model (see the sketch below).
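The last point is worth a sketch: if preprocessing such as feature scaling is fitted on the whole dataset before cross-validation, information leaks from the testing folds into training. Wrapping the steps in a Pipeline makes scikit-learn refit the preprocessing inside each iteration. This is a minimal illustration under the iris/KNN setup and old sklearn.cross_validation module used earlier, not code from the original post; demo_pipeline_cv.py is a placeholder name.
- demo_pipeline_cv.py
- #!/usr/bin/env python
- from sklearn.datasets import load_iris
- from sklearn.cross_validation import cross_val_score
- from sklearn.pipeline import make_pipeline
- from sklearn.preprocessing import StandardScaler
- from sklearn.neighbors import KNeighborsClassifier
- iris = load_iris()
- # The scaler is refit on the training folds only in every iteration,
- # so the testing fold never leaks into the preprocessing step
- pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20))
- print "10-fold CV mean accuracy with in-fold scaling = %.02f" % cross_val_score(pipe, iris.data, iris.target, cv=10, scoring='accuracy').mean()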
Supplement
* Prev - Data science in Python: pandas, seaborn, scikit-learn
* Next - How to find the best model parameters in scikit-learn