Wednesday, March 29, 2017

[ Intro2ML ] Ch6. Model Evaluation and Improvement - Cross Validation

Introduction 
Having discussed the fundamentals of supervised and unsupervised learning, and having explored a variety of machine learning algorithms, we will now dive more deeply into evaluating models and selecting parameters. We will focus on the supervised methods, regression and classification, as evaluating and selecting models in unsupervised learning is often a very qualitative process (as we saw in Chapter 3). 

To evaluate our supervised models, so far we have split our dataset into a training set and a test set using the train_test_split function, built a model on the training set by calling the fit method, and evaluated it on the test set using the score method, which for classification computes the fraction of correctly classified samples. Here’s an example of that process: 
- ch6_01.py 
  from sklearn.datasets import make_blobs
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  # create a synthetic dataset
  X, y = make_blobs(random_state=0)
  # split data and labels into a training and a test set
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  # instantiate a model and fit it to the training set
  logreg = LogisticRegression().fit(X_train, y_train)
  # evaluate the model on the test set
  print("Test set score: {:.2f}".format(logreg.score(X_test, y_test)))
Output: 
Test set score: 0.88

Remember, the reason we split our data into training and test sets is that we are interested in measuring how well our model generalizes to new, previously unseen data. We are not interested in how well our model fit the training set, but rather in how well it can make predictions for data that was not observed during training.

In this chapter, we will expand on two aspects of this evaluation. We will first introduce cross-validation, a more robust way to assess generalization performance, and discuss methods to evaluate classification and regression performance that go beyond the default measures of accuracy and R^2 provided by the score method. We will also discuss grid search, an effective method for adjusting the parameters in supervised models for the best generalization performance. 

Cross-Validation 
Cross-validation is a statistical method of evaluating generalization performance that is more stable and thorough than using a split into a training and a test set. In cross-validation, the data is instead split repeatedly and multiple models are trained. The most commonly used version of cross-validation is k-fold cross-validation, where k is a user-specified number, usually 5 or 10. When performing five-fold cross-validation, the data is first partitioned into five parts of (approximately) equal size, called folds. Next, a sequence of models is trained. The first model is trained using the first fold as the test set, and the remaining folds (2–5) are used as the training set. The model is built using the data in folds 2–5, and then the accuracy is evaluated on fold 1. Then another model is built, this time using fold 2 as the test set and the data in folds 1, 3, 4, and 5 as the training set. This process is repeated using folds 3, 4, and 5 as test sets. For each of these five splits of the data into training and test sets, we compute the accuracy. In the end, we have collected five accuracy values. The process is illustrated in Figure 6-1: 
Figure 6-1. Data splitting in five-fold cross-validation 
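The same procedure can be written out by hand. The following is a minimal sketch (not from the book) that reuses the make_blobs data from the first example, turns each index block into the test set exactly once, and collects the five accuracies; the variable names are illustrative only:
  # hand-rolled five-fold cross-validation, following the description above
  import numpy as np
  from sklearn.datasets import make_blobs
  from sklearn.linear_model import LogisticRegression

  X, y = make_blobs(random_state=0)             # 100 samples, classes already shuffled
  folds = np.array_split(np.arange(len(X)), 5)  # five (roughly) equal blocks of indices

  scores = []
  for i in range(5):
      test_idx = folds[i]  # fold i is the test set
      train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])  # the rest is training
      model = LogisticRegression().fit(X[train_idx], y[train_idx])
      scores.append(model.score(X[test_idx], y[test_idx]))
  print("Accuracies per fold: {}".format(scores))
Each of the five accuracy values comes from a model that never saw its test fold during training. In practice, scikit-learn does this bookkeeping for us, as shown next.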

Cross-Validation in scikit-learn 
Cross-validation is implemented in scikit-learn using the cross_val_score function from the model_selection module. The parameters of the cross_val_score function are the model we want to evaluate, the training data, and the ground-truth labels. Let’s evaluate LogisticRegression on the iris dataset: 
- ch6_t02.py 
  from sklearn.model_selection import cross_val_score
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression

  iris = load_iris()
  logreg = LogisticRegression()

  scores = cross_val_score(logreg, iris.data, iris.target)
  print("Cross-validation scores: {}".format(scores))
Output: 
Cross-validation scores: [ 0.96078431 0.92156863 0.95833333]

By default, cross_val_score performs three-fold cross-validation, returning three accuracy values. We can change the number of folds used by changing the cv parameter: 
  scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
  print("Cross-validation scores: {}".format(scores))
A common way to summarize the cross-validation accuracy is to compute the mean: 
  1. print("Average cross-validation score: {:.2f}".format(scores.mean()))  
Output: 
Average cross-validation score: 0.96

Using the mean cross-validation score, we can conclude that we expect the model to be around 96% accurate on average. Looking at all five scores produced by the five-fold cross-validation, we can also conclude that there is a relatively high variance in the accuracy between folds, ranging from 100% accuracy to 90% accuracy. This could imply that the model is very dependent on the particular folds used for training, but it could also just be a consequence of the small size of the dataset. 
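If you want to report that spread explicitly, a common informal summary is the standard deviation alongside the mean. A minimal sketch (not from the book), assuming scores still holds the array returned by cross_val_score with cv=5:
  # mean and spread of the per-fold accuracies
  print("Average cross-validation score: {:.2f}".format(scores.mean()))
  print("Standard deviation of the scores: {:.2f}".format(scores.std()))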

Benefits of Cross-Validation 
There are several benefits to using cross-validation instead of a single split into a training and a test set. First, remember that train_test_split performs a random split of the data. Imagine that we are “lucky” when randomly splitting the data, and all examples that are hard to classify end up in the training set. In that case, the test set will only contain “easy” examples, and our test set accuracy will be unrealistically high. Conversely, if we are “unlucky,” we might have randomly put all the hard-to-classify examples in the test set and consequently obtain an unrealistically low score. However, when using cross-validation, each example will be in the test set exactly once: each example is in one of the folds, and each fold is the test set once. Therefore, the model needs to generalize well to all of the samples in the dataset for all of the cross-validation scores (and their mean) to be high. 

Having multiple splits of the data also provides some information about how sensitive our model is to the selection of the training dataset. For the iris dataset, we saw accuracies between 90% and 100%. This is quite a range, and it provides us with an idea about how the model might perform in the worst case and best case scenarios when applied to new data. 

Another benefit of cross-validation as compared to using a single split of the data is that we use our data more effectively. When using train_test_split, we usually use 75% of the data for training and 25% of the data for evaluation. When using five-fold cross-validation, in each iteration we can use four-fifths of the data (80%) to fit the model. When using 10-fold cross-validation, we can use nine-tenths of the data (90%) to fit the model. More data will usually result in more accurate models. The main disadvantage of cross-validation is increased computational cost. As we are now training k models instead of a single model, cross-validation will be roughly k times slower than doing a single split of the data. 
TIP. 
It is important to keep in mind that cross-validation is not a way to build a model that can be applied to new data. Cross-validation does not return a model. When calling cross_val_score, multiple models are built internally, but the purpose of cross-validation is only to evaluate how well a given algorithm will generalize when trained on a specific dataset.
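In other words, once cross-validation has told you that the algorithm and its settings are acceptable, the model you actually use is fit separately. A minimal sketch (not from the book) of that two-step workflow on the iris data:
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  iris = load_iris()
  # step 1: estimate generalization performance (no fitted model is kept)
  scores = cross_val_score(LogisticRegression(), iris.data, iris.target, cv=5)
  print("Estimated accuracy: {:.2f}".format(scores.mean()))
  # step 2: fit the model you will actually deploy on all available training data
  final_model = LogisticRegression().fit(iris.data, iris.target)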

Stratified k-Fold Cross-Validation and Other Strategies 
Splitting the dataset into k folds by starting with the first one-k-th part of the data, as described in the previous section, might not always be a good idea. For example, let’s have a look at the iris dataset: 
  from sklearn.datasets import load_iris
  iris = load_iris()
  print("Iris labels:\n{}".format(iris.target))
Output: 
Iris labels:
[0 0 0 0 0 0 ... 0 1 1 1 1 1 1 ... 2 2]

As you can see, the first third of the data is class 0, the second third is class 1, and the last third is class 2. Imagine doing three-fold cross-validation on this dataset. The first fold would be only class 0, so in the first split of the data, the test set would be only class 0, and the training set would be only classes 1 and 2. As the classes in training and test sets would be different for all three splits, the three-fold cross-validation accuracy would be zero on this dataset. That is not very helpful, as we can do much better than 0% accuracy on iris.

As the simple k-fold strategy fails here, scikit-learn does not use it for classification, but rather uses stratified k-fold cross-validation. In stratified cross-validation, we split the data such that the proportions between classes are the same in each fold as they are in the whole dataset, as illustrated in Figure 6-2: 
Figure 6-2. Comparison of standard cross-validation and stratified cross-validation when the data is ordered by class label 

For example, if 90% of your samples belong to class A and 10% of your samples belong to class B, then stratified cross-validation ensures that in each fold, 90% of samples belong to class A and 10% of samples belong to class B. It is usually a good idea to use stratified k-fold cross-validation instead of k-fold cross-validation to evaluate a classifier, because it results in more reliable estimates of generalization performance. In the case of only 10% of samples belonging to class B, using standard k-fold cross-validation it might easily happen that one fold only contains samples of class A. Using this fold as a test set would not be very informative about the overall performance of the classifier. 
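You can inspect this behavior directly with the StratifiedKFold splitter. The following minimal sketch (not from the book) prints the class counts in each test fold of the iris data, which come out balanced even though the data is ordered by label:
  import numpy as np
  from sklearn.datasets import load_iris
  from sklearn.model_selection import StratifiedKFold

  iris = load_iris()
  for train_idx, test_idx in StratifiedKFold(n_splits=3).split(iris.data, iris.target):
      # each test fold contains roughly the same number of samples from every class
      print("class counts in test fold:", np.bincount(iris.target[test_idx]))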

For regression, scikit-learn uses the standard k-fold cross-validation by default. It would be possible to also try to make each fold representative of the different values the regression target has, but this is not a commonly used strategy and would be surprising to most users. 
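For completeness, here is a minimal sketch (not from the book) of cross-validating a regressor; cross_val_score then falls back to plain k-fold splitting, and its default score for regressors is R^2 rather than accuracy. The make_regression settings below are arbitrary:
  from sklearn.datasets import make_regression
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import cross_val_score

  # synthetic regression task; the exact parameters are arbitrary
  X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
  scores = cross_val_score(LinearRegression(), X, y, cv=5)
  print("R^2 per fold: {}".format(scores))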

MORE CONTROL OVER CROSS-VALIDATION 
We saw earlier that we can adjust the number of folds that are used in cross_val_score using the cv parameter. However, scikit-learn allows for much finer control over what happens during the splitting of the data by providing a cross-validation splitter as the cv parameter. For most use cases, the defaults of k-fold cross-validation for regression and stratified k-fold for classification work well, but there are some cases where you might want to use a different strategy. Say, for example, we want to use the standard k-fold cross-validation on a classification dataset to reproduce someone else’s results. To do this, we first have to import the KFold splitter class from the model_selection module and instantiate it with the number of folds we want to use: 
  from sklearn.model_selection import cross_val_score
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import KFold

  kfold = KFold(n_splits=5)
Then, we can pass the kfold splitter object as the cv parameter to cross_val_score: 
  iris = load_iris()
  logreg = LogisticRegression()
  scores = cross_val_score(logreg, iris.data, iris.target, cv=kfold)
  print("Cross-validation scores:\n{}".format(scores))
Output: 
Cross-validation scores:
[ 1. 0.93333333 0.43333333 0.96666667 0.43333333]

This way, we can verify that it is indeed a really bad idea to use three-fold (non-stratified) cross-validation on the iris dataset: 
  kfold = KFold(n_splits=3)
  print("Cross-validation scores:\n{}".format(
      cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
Output: 
Cross-validation scores:
[ 0. 0. 0.]

Remember: each fold corresponds to one of the classes in the iris dataset, and so nothing can be learned. Another way to resolve this problem is to shuffle the data instead of stratifying the folds, to remove the ordering of the samples by label. We can do that by setting the shuffle parameter of KFold to True. If we shuffle the data, we also need to fix the random_state to get a reproducible shuffling. Otherwise, each run of cross_val_score would yield a different result, as each time a different split would be used (this might not be a problem, but can be surprising). Shuffling the data before splitting it yields a much better result: 
  kfold = KFold(n_splits=3, shuffle=True, random_state=0)
  print("Cross-validation scores:\n{}".format(
      cross_val_score(logreg, iris.data, iris.target, cv=kfold)))
Output: 
Cross-validation scores:
[ 0.9 0.96 0.96]

LEAVE-ONE-OUT CROSS-VALIDATION 
Another frequently used cross-validation method is leave-one-out. You can think of leave-one-out cross-validation as k-fold cross-validation where each fold is a single sample. For each split, you pick a single data point to be the test set. This can be very time consuming, particularly for large datasets, but sometimes provides better estimates on small datasets: 
- ch6_t06.py 
  from sklearn.model_selection import cross_val_score
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import LeaveOneOut

  iris = load_iris()
  logreg = LogisticRegression()
  # leave-one-out: each of the 150 samples is the test set exactly once
  loo = LeaveOneOut()
  scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
  print("Number of cv iterations: ", len(scores))
  print("Mean accuracy: {:.2f}".format(scores.mean()))
Output: 
Number of cv iterations: 150
Mean accuracy: 0.95

SHUFFLE-SPLIT CROSS-VALIDATION 
Another, very flexible strategy for cross-validation is shuffle-split cross-validation. In shuffle-split cross-validation, each split samples train_size many points for the training set and test_size many (disjoint) points for the test set. This splitting is repeated n_splits times. Figure 6-3 illustrates running four iterations of splitting a dataset consisting of 10 points, with a training set of 5 points and test sets of 2 points each (you can use integers for train_size and test_size to use absolute sizes for these sets, or floating-point numbers to use fractions of the whole dataset): 
Figure 6-3. ShuffleSplit with 10 points, train_size=5, test_size=2, and n_splits=4 

The following code splits the dataset into 50% training set and 50% test set for 10 iterations: 
- ch6_t07.py 
  from sklearn.model_selection import cross_val_score
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import ShuffleSplit

  shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10)
  iris = load_iris()
  logreg = LogisticRegression()
  scores = cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)
  print("Cross-validation scores:\n{}".format(scores))
  print("Mean accuracy: {:.2f}".format(scores.mean()))
Output: 
Cross-validation scores:
[ 0.97333333 0.97333333 0.98666667 0.94666667 0.96 ... 0.93333333 0.86666667]

Shuffle-split cross-validation allows for control over the number of iterations independently of the training and test sizes, which can sometimes be helpful. It also allows for using only part of the data in each iteration, by providing train_size and test_size settings that don’t add up to one. Subsampling the data in this way can be useful for experimenting with large datasets. There is also a stratified variant of ShuffleSplit, aptly named StratifiedShuffleSplit, which can provide more reliable results for classification tasks. 
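As a minimal sketch (not from the book), the stratified variant is used the same way as ShuffleSplit; only the splitter class changes:
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

  iris = load_iris()
  # same split sizes as before, but each split preserves the class proportions
  stratified_split = StratifiedShuffleSplit(n_splits=10, test_size=.5,
                                            train_size=.5, random_state=0)
  scores = cross_val_score(LogisticRegression(), iris.data, iris.target,
                           cv=stratified_split)
  print("Mean accuracy: {:.2f}".format(scores.mean()))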

CROSS-VALIDATION WITH GROUPS 
Another very common setting for cross-validation is when there are groups in the data that are highly related. Say you want to build a system to recognize emotions from pictures of faces, and you collect a dataset of pictures of 100 people where each person is captured multiple times, showing various emotions. The goal is to build a classifier that can correctly identify emotions of people not in the dataset. You could use the default stratified cross-validation to measure the performance of a classifier here. However, it is likely that pictures of the same person will be in both the training and the test set. It will be much easier for a classifier to detect emotions in a face that is part of the training set, compared to a completely new face. To accurately evaluate the generalization to new faces, we must therefore ensure that the training and test sets contain images of different people. 

To achieve this, we can use GroupKFold, which takes an array of groups as argument that we can use to indicate which person is in the image. The groups array here indicates groups in the data that should not be split when creating the training and test sets, and should not be confused with the class label. This example of groups in the data is common in medical applications, where you might have multiple samples from the same patient, but are interested in generalizing to new patients. Similarly, in speech recognition, you might have multiple recordings of the same speaker in your dataset, but are interested in recognizing speech of new speakers. 

The following is an example of using a synthetic dataset with a grouping given by the groups array. The dataset consists of 12 data points, and for each of the data points, groups specifies which group (think patient) the point belongs to. The groups specify that there are four groups, and the first three samples belong to the first group, the next four samples belong to the second group, and so on: 
- ch6_t08.py 
  from sklearn.model_selection import cross_val_score
  from sklearn.datasets import make_blobs
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import GroupKFold

  # create synthetic dataset
  X, y = make_blobs(n_samples=12, random_state=0)
  # assume the first three samples belong to the same group,
  # then the next four, etc.
  groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]
  logreg = LogisticRegression()
  scores = cross_val_score(logreg, X, y, groups=groups, cv=GroupKFold(n_splits=3))
  print("Cross-validation scores:\n{}".format(scores))
Output: 
Cross-validation scores:
[ 0.75 0.8 0.66666667]

The samples don’t need to be ordered by group; we just did this for illustration purposes. The splits that are calculated based on these labels are visualized in Figure 6-4. As you can see, for each split, each group is either entirely in the training set or entirely in the test set: 
Figure 6-4. Label-dependent splitting with GroupKFold 
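You can check this property yourself by iterating over the splitter. The following minimal sketch (not from the book) reuses the synthetic data and groups from above and prints the indices of every split:
  from sklearn.datasets import make_blobs
  from sklearn.model_selection import GroupKFold

  X, y = make_blobs(n_samples=12, random_state=0)
  groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]
  for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
      # all indices of a given group appear on only one side of the split
      print("train:", train_idx, "test:", test_idx)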

There are more splitting strategies for cross-validation in scikit-learn, which allow for an even greater variety of use cases (you can find these in the scikit-learn user guide). However, the standard KFold, StratifiedKFold, and GroupKFold are by far the most commonly used ones. 

Supplement 
Selecting the best model in scikit-learn using cross-validation
