Review of model evaluation procedures
Steps for K-fold cross-validation
1. Split the dataset into K equal partitions (or "folds").
2. Use fold 1 as the testing set and the union of the other folds as the training set.
3. Calculate testing accuracy.
4. Repeat steps 2 and 3 K times, using a different fold as the testing set each time.
5. Use the average testing accuracy as the estimate of out-of-sample accuracy.
Below is sample code demonstrating the K-fold process:
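A minimal sketch, assuming a simulated dataset of 25 observations split into 5 folds:

```python
from sklearn.model_selection import KFold

# Simulate a dataset with 25 observations (numbered 0 through 24)
# and split it into 5 folds.
kf = KFold(n_splits=5, shuffle=False)

# Print the contents of the training and testing sets for each iteration.
print('{} {:^55} {}'.format('Iteration', 'Training set observations', 'Testing set observations'))
for iteration, (train_index, test_index) in enumerate(kf.split(range(25)), start=1):
    print('{:^9} {} {}'.format(iteration, train_index, test_index))
```

Each observation appears in the testing set exactly once, and in the training set K-1 times.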
Comparing cross-validation to train/test split
Advantages of cross-validation:
* More accurate estimate of out-of-sample accuracy
* More "efficient" use of data (every observation is used for both training and testing)
Advantages of train/test split:
* Runs K times faster than K-fold cross-validation
* Simpler to examine the detailed results of the testing process
Cross-validation recommendations
1. K can be any number, but K=10 is generally recommended
2. For classification problems, stratified sampling is recommended for creating the folds, meaning each response class is represented in roughly equal proportions in every training/testing set. (scikit-learn's cross_val_score function does this by default; see the sketch below.)
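A brief sketch of stratified folds, using the iris dataset as an example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# StratifiedKFold keeps the class proportions of y roughly equal in every fold.
skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Count how many observations of each class land in this testing fold.
    print('Fold {}: testing set class counts = {}'.format(fold, np.bincount(y[test_idx])))
```

Because iris has 50 observations of each of its 3 classes, every testing fold contains 10 observations per class.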
Cross-validation example: parameter tuning
Goal: Select the best tuning parameters (aka 'hyperparameters') for KNN on the iris dataset
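A sketch of the tuning loop, assuming 10-fold cross-validation and candidate K values from 1 to 30:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# For each candidate value of K, compute the mean 10-fold
# cross-validated accuracy of a KNN classifier.
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
    k_scores.append(scores.mean())

# Plot K (x-axis) versus cross-validated accuracy (y-axis).
plt.plot(k_range, k_scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Cross-validated accuracy')
plt.show()
```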
As expected, the best value is at neither extreme: very small values of K produce an overly complex model that overfits (high variance), while very large values of K produce an overly simple model that underfits (high bias). Here we select K=20 for the follow-up sections.
Cross-validation example: model selection
Goal: Compare the best KNN model with logistic regression on the iris dataset
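A sketch of the comparison, assuming K=20 for KNN (chosen above) and 10-fold cross-validation for both models:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation with the best KNN model.
knn = KNeighborsClassifier(n_neighbors=20)
print(cross_val_score(knn, X, y, cv=10, scoring='accuracy').mean())

# 10-fold cross-validation with logistic regression
# (max_iter raised to ensure the solver converges).
logreg = LogisticRegression(max_iter=1000)
print(cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean())
```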
The results show that the KNN model outperforms the logistic regression model.
Cross-validation example: feature selection
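Goal: Determine whether the 'Newspaper' feature improves a linear regression model on the Advertising dataset. Below is a sketch; the dataset URL and column names are assumptions, so adjust them to your copy of the data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Load the Advertising dataset (URL is an assumption; substitute your own copy).
data = pd.read_csv('https://www.statlearning.com/s/Advertising.csv', index_col=0)
y = data['Sales']
lm = LinearRegression()

# 10-fold cross-validated RMSE with all three features.
X = data[['TV', 'Radio', 'Newspaper']]
mse_scores = -cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print('RMSE with all features:', np.sqrt(mse_scores).mean())

# 10-fold cross-validated RMSE without 'Newspaper'.
X = data[['TV', 'Radio']]
mse_scores = -cross_val_score(lm, X, y, cv=10, scoring='neg_mean_squared_error')
print('RMSE without Newspaper:', np.sqrt(mse_scores).mean())
```

Note that cross_val_score's MSE scorer is negated (scikit-learn always maximizes scores), so the sign is flipped before taking the square root.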
Notice that the model without 'Newspaper' performs slightly better than the model with all of the features (it has a smaller cross-validated RMSE).
Improvements to cross-validation
* Creating a hold-out set
* Feature engineering and selection within cross-validation iterations (both ideas are sketched below)
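A combined sketch of both improvements; the 80/20 split, the StandardScaler step, and K=20 are illustrative assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold-out set: set aside 20% of the data before cross-validation,
# and touch it only once, at the very end, as a final check.
X_cv, X_holdout, y_cv, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Pipeline: the scaler is re-fit on the training folds inside each
# cross-validation iteration, so the testing fold never leaks into
# the preprocessing (feature engineering) step.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=20))
print('Cross-validated accuracy:', cross_val_score(pipe, X_cv, y_cv, cv=10).mean())

# Final evaluation on the untouched hold-out set.
pipe.fit(X_cv, y_cv)
print('Hold-out accuracy:', pipe.score(X_holdout, y_holdout))
```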