Thursday, March 30, 2017

[ Intro2ML ] Ch6. Model Evaluation and Improvement - Grid Search

Introduction 
Now that we know how to evaluate how well a model generalizes, we can take the next step and improve the model’s generalization performance by tuning its parameters. We discussed the parameter settings of many of the algorithms in scikit-learn in Chapters 2 and 3, and it is important to understand what the parameters mean before trying to adjust them. Finding the values of the important parameters of a model (the ones that provide the best generalization performance) is a tricky task, but necessary for almost all models and datasets. Because it is such a common task, there are standard methods in scikit-learn to help you with it. The most commonly used method is grid search, which basically means trying all possible combinations of the parameters of interest. 

Consider the case of a kernel SVM with an RBF (radial basis function) kernel, as implemented in the SVC class. As we discussed in Chapter 2, there are two important parameters: the kernel bandwidth, gamma, and the regularization parameter, C. Say we want to try the values 0.001, 0.01, 0.1, 1, 10, and 100 for the parameter C, and the same for gamma. Because we have six different settings for C and gamma that we want to try, we have 36 combinations of parameters in total. Looking at all possible combinations creates a table (or grid) of parameter settings for the SVM, as shown here: 
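
The grid described above can be enumerated directly. The following is a minimal sketch using only the standard library; the names values and grid are illustrative and do not come from the original listings.

```python
from itertools import product

# Candidate values from the text, used for both C and gamma
values = [0.001, 0.01, 0.1, 1, 10, 100]

# Every (C, gamma) pair is one cell of the grid: the Cartesian product
# of the two lists of candidate values
grid = [{'C': C, 'gamma': gamma} for C, gamma in product(values, values)]

print("Number of combinations: {}".format(len(grid)))
print("First three settings: {}".format(grid[:3]))
```

Each dictionary in grid can be passed to SVC as keyword arguments, which is what the loops in the next section do by hand.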


Simple Grid Search 
We can implement a simple grid search just as for loops over the two parameters, training and evaluating a classifier for each combination: 
- ch6_t09.py 
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, random_state=0)
    print("Size of training set: {}   size of test set: {}".format(X_train.shape[0], X_test.shape[0]))

    best_score = 0
    for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
        for C in [0.001, 0.01, 0.1, 1, 10, 100]:
            # for each combination of parameters, train an SVC
            svm = SVC(gamma=gamma, C=C)
            svm.fit(X_train, y_train)
            # evaluate the SVC on the test set
            score = svm.score(X_test, y_test)
            # if we got a better score, store the score and parameters
            if score > best_score:
                best_score = score
                best_parameters = {'C': C, 'gamma': gamma}

    print("Best score: {:.2f}".format(best_score))
    print("Best parameters: {}".format(best_parameters))
Output: 
Size of training set: 112 size of test set: 38
Best score: 0.97
Best parameters: {'C': 100, 'gamma': 0.001}

The Danger of Overfitting the Parameters and the Validation Set 
Given this result, we might be tempted to report that we found a model that performs with 97% accuracy on our dataset. However, this claim could be overly optimistic (or just wrong), for the following reason: we tried many different parameters and selected the one with best accuracy on the test set, but this accuracy won’t necessarily carry over to new data. Because we used the test data to adjust the parameters, we can no longer use it to assess how good the model is. This is the same reason we needed to split the data into training and test sets in the first place; we need an independent dataset to evaluate, one that was not used to create the model. 

One way to resolve this problem is to split the data again, so we have three sets: the training set to build the model, the validation (or development) set to select the parameters of the model, and the test set to evaluate the performance of the selected parameters. Figure 6-5 shows what this looks like: 
Figure 6-5. A threefold split of data into training set, validation set, and test set 

After selecting the best parameters using the validation set, we can rebuild a model using the parameter settings we found, but now training on both the training data and the validation data. This way, we can use as much data as possible to build our model. This leads to the following implementation: 
- ch6_t10.py 
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    # split data into train+validation set and test set
    X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
    # split train+validation set into training and validation sets
    X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
    print("Size of training set: {}   size of validation set: {}   size of test set:"
          " {}\n".format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))

    best_score = 0
    for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
        for C in [0.001, 0.01, 0.1, 1, 10, 100]:
            # for each combination of parameters, train an SVC
            svm = SVC(gamma=gamma, C=C)
            svm.fit(X_train, y_train)
            # evaluate the SVC on the validation set
            score = svm.score(X_valid, y_valid)
            # if we got a better score, store the score and parameters
            if score > best_score:
                best_score = score
                best_parameters = {'C': C, 'gamma': gamma}

    # rebuild a model on the combined training and validation set,
    # and evaluate it on the test set
    svm = SVC(**best_parameters)
    svm.fit(X_trainval, y_trainval)
    test_score = svm.score(X_test, y_test)
    print("Best score on validation set: {:.2f}".format(best_score))
    print("Best parameters: ", best_parameters)
    print("Test set score with best parameters: {:.2f}".format(test_score))
Output: 
Size of training set: 84 size of validation set: 28 size of test set: 38

Best score on validation set: 0.96
('Best parameters: ', {'C': 10, 'gamma': 0.001})
Test set score with best parameters: 0.92

The best score on the validation set is 96%: slightly lower than before, probably because we used less data to train the model (X_train is smaller now because we split our dataset twice). However, the score on the test set—the score that actually tells us how well we generalize—is even lower, at 92%. So we can only claim to classify new data 92% correctly, not 97% correctly as we thought before! 

The distinction between the training set, validation set, and test set is fundamentally important to applying machine learning methods in practice. Any choices made based on the test set accuracy “leak” information from the test set into the model. Therefore, it is important to keep a separate test set, which is only used for the final evaluation. It is good practice to do all exploratory analysis and model selection using the combination of a training and a validation set, and reserve the test set for a final evaluation—this is even true for exploratory visualization. Strictly speaking, evaluating more than one model on the test set and choosing the better of the two will result in an overly optimistic estimate of how accurate the model is. 

Grid Search with Cross-Validation 
While the method of splitting the data into a training, a validation, and a test set that we just saw is workable, and relatively commonly used, it is quite sensitive to how exactly the data is split. From the output of the previous code snippet we can see that the search on the validation set selects 'C': 10, 'gamma': 0.001 as the best parameters, while the search in “Simple Grid Search”, which was scored on the test set, selects 'C': 100, 'gamma': 0.001 as the best parameters. For a better estimate of the generalization performance, instead of using a single split into a training and a validation set, we can use cross-validation to evaluate the performance of each parameter combination. This method can be coded up as follows: 
- ch6_t11.py 
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import train_test_split

    iris = load_iris()
    # split data into train+validation set and test set
    X_trainval, X_test, y_trainval, y_test = train_test_split(iris.data, iris.target, random_state=0)
    # split train+validation set into training and validation sets
    # (only used to report the sizes; cross-validation below works on X_trainval)
    X_train, X_valid, y_train, y_valid = train_test_split(X_trainval, y_trainval, random_state=1)
    print("Size of training set: {}   size of validation set: {}   size of test set:"
          " {}\n".format(X_train.shape[0], X_valid.shape[0], X_test.shape[0]))

    best_score = 0
    for gamma in [0.001, 0.01, 0.1, 1, 10, 100]:
        for C in [0.001, 0.01, 0.1, 1, 10, 100]:
            # for each combination of parameters, train an SVC
            svm = SVC(gamma=gamma, C=C)
            # perform cross-validation
            scores = cross_val_score(svm, X_trainval, y_trainval, cv=5)
            # compute mean cross-validation accuracy
            score = np.mean(scores)
            # if we got a better score, store the score and parameters
            if score > best_score:
                best_score = score
                best_parameters = {'C': C, 'gamma': gamma}

    # rebuild a model on the combined training and validation set,
    # and evaluate it on the test set
    svm = SVC(**best_parameters)
    svm.fit(X_trainval, y_trainval)
    test_score = svm.score(X_test, y_test)
    print("Best cross-validation score: {:.2f}".format(best_score))
    print("Best parameters: ", best_parameters)
    print("Test set score with best parameters: {:.2f}".format(test_score))
Evaluating one particular setting of C and gamma with five-fold cross-validation means training five models, so covering all 36 combinations requires training 36 * 5 = 180 models. As you can imagine, the main downside of the use of cross-validation is the time it takes to train all these models.
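
To get a feel for this cost before launching a full search, one option is to time the cross-validation of a single setting and extrapolate. The snippet below is only a rough sketch: the chosen setting (C=1, gamma=0.1) is arbitrary, and fit times can differ a lot between settings, so the estimate is an order-of-magnitude guide at best.

```python
import time

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

n_settings = 6 * 6  # six candidate values each for C and gamma
n_folds = 5

# Time one cross-validation run (five fits) for a single, arbitrary setting
start = time.perf_counter()
scores = cross_val_score(SVC(C=1, gamma=0.1), X, y, cv=n_folds)
elapsed = time.perf_counter() - start

total_fits = n_settings * n_folds
print("Fits for one setting: {}".format(len(scores)))
print("Fits for the whole grid: {}".format(total_fits))
print("Rough time estimate for the grid: {:.2f} s".format(elapsed * n_settings))
```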
WARNING 
As we said earlier, cross-validation is a way to evaluate a given algorithm on a specific dataset. However, it is often used in conjunction with parameter search methods like grid search. For this reason, many people use the term cross-validation colloquially to refer to grid search with cross-validation.

The overall process of splitting the data, running the grid search, and evaluating the final parameters is illustrated in Figure 6-7: 
Figure 6-7. Overview of the process of parameter selection and model evaluation with GridSearchCV 

Because grid search with cross-validation is such a commonly used method to adjust parameters, scikit-learn provides the GridSearchCV class, which implements it in the form of an estimator. To use the GridSearchCV class, you first need to specify the parameters you want to search over using a dictionary. GridSearchCV will then perform all the necessary model fits. The keys of the dictionary are the names of parameters we want to adjust (as given when constructing the model—in this case, C and gamma), and the values are the parameter settings we want to try out. Trying the values 0.001, 0.01, 0.1, 1, 10, and 100 for C and gamma translates to the following dictionary: 
    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
                  'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
    print("Parameter grid:\n{}".format(param_grid))
Output: 
Parameter grid:
{'C': [0.001, 0.01, 0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

We can now instantiate the GridSearchCV class with the model (SVC), the parameter grid to search (param_grid), and the cross-validation strategy we want to use (say, five-fold stratified cross-validation): 
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC
    grid_search = GridSearchCV(SVC(), param_grid, cv=5)
GridSearchCV will use cross-validation in place of the split into a training and validation set that we used before. However, we still need to split the data into a training and a test set, to avoid overfitting the parameters: 
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
The grid_search object that we created behaves just like a classifier; we can call the standard methods fit, predict, and score on it. However, when we call fit, it will run cross-validation for each combination of parameters we specified in param_grid: 
    grid_search.fit(X_train, y_train)
Fitting the GridSearchCV object not only searches for the best parameters, but also automatically fits a new model on the whole training dataset with the parameters that yielded the best cross-validation performance. The GridSearchCV class provides a very convenient interface to access the retrained model using the predict and score methods. To evaluate how well the best found parameters generalize, we can call score on the test set: 
    print("Test set score: {:.2f}".format(grid_search.score(X_test, y_test)))
Output: 
Test set score: 0.97

Choosing the parameters using cross-validation, we actually found a model that achieves 97% accuracy on the test set. The important thing here is that we did not use the test set to choose the parameters. The parameters that were found are stored in the best_params_ attribute, and the best cross-validation accuracy (the mean accuracy over the different splits for this parameter setting) is stored in best_score_: 
    print("Best parameters: {}".format(grid_search.best_params_))
    print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
Output: 
Best parameters: {'C': 100, 'gamma': 0.01}
Best cross-validation score: 0.97

WARNING 
Again, be careful not to confuse best_score_ with the generalization performance of the model as computed by the score method on the test set. Using the score method (or evaluating the output of the predict method) employs a model trained on the whole training set. The best_score_ attribute stores the mean cross-validation accuracy, with cross-validation performed on the training set.
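
The difference between the two numbers can be made explicit by printing them side by side. This is a small self-contained sketch that repeats the setup from above; the variable names cv_score and test_score are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Mean accuracy over the five validation folds, computed on training data only
cv_score = grid_search.best_score_
# Accuracy of the model refitted on the whole training set, on unseen data
test_score = grid_search.score(X_test, y_test)
# score() on the search object uses the refitted best_estimator_
direct_score = grid_search.best_estimator_.score(X_test, y_test)

print("best_score_ (cross-validation): {:.2f}".format(cv_score))
print("score on test set:              {:.2f}".format(test_score))
print("same as best_estimator_.score:  {}".format(np.isclose(test_score, direct_score)))
```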

Sometimes it is helpful to have access to the actual model that was found—for example, to look at coefficients or feature importances. You can access the model with the best parameters trained on the whole training set using the best_estimator_ attribute: 
    print("Best estimator:\n{}".format(grid_search.best_estimator_))
Output: 
Best estimator:
SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

ANALYZING THE RESULT OF CROSS-VALIDATION 
It is often helpful to visualize the results of cross-validation, to understand how the model generalization depends on the parameters we are searching. As grid searches are quite computationally expensive to run, often it is a good idea to start with a relatively coarse and small grid. We can then inspect the results of the cross-validated grid search, and possibly expand our search. The results of a grid search can be found in the cv_results_ attribute, which is a dictionary storing all aspects of the search. It contains a lot of details, as you can see in the following output, and is best looked at after converting it to a pandas DataFrame (ch6_t12.py): 
    from ch6_t12 import *
    import pandas as pd
    # convert to DataFrame
    results = pd.DataFrame(grid_search.cv_results_)
    # show the first 5 rows
    print(results.head())
Output: 

Each row in results corresponds to one particular parameter setting. For each setting, the results of all cross-validation splits are recorded, as well as the mean and standard deviation over all splits. As we were searching a two-dimensional grid of parameters (C and gamma), this is best visualized as a heat map (Figure 6-8). First we extract the mean validation scores, then we reshape the scores so that the axes correspond to C and gamma: 
- ch6_t13.py 
    import numpy as np
    import pandas as pd
    import mglearn

    # grid_search and param_grid come from the previous listing
    from ch6_t12 import *

    # convert to DataFrame
    results = pd.DataFrame(grid_search.cv_results_)
    # show the first 5 rows
    print(results.head())

    # one row per value of C, one column per value of gamma
    scores = np.array(results.mean_test_score).reshape(6, 6)

    # plot the mean cross-validation scores
    mglearn.tools.heatmap(scores, xlabel='gamma', xticklabels=param_grid['gamma'],
                          ylabel='C', yticklabels=param_grid['C'], cmap="viridis")
Figure 6-8. Heat map of mean cross-validation score as a function of C and gamma 

Each point in the heat map corresponds to one run of cross-validation, with a particular parameter setting. The color encodes the cross-validation accuracy, with light colors meaning high accuracy and dark colors meaning low accuracy. You can see that SVC is very sensitive to the setting of the parameters. For many of the parameter settings, the accuracy is around 40%, which is quite bad; for other settings the accuracy is around 96%. We can take away from this plot several things. First, the parameters we adjusted are very important for obtaining good performance. Both parameters (C and gamma) matter a lot, as adjusting them can change the accuracy from 40% to 96%. Additionally, the ranges we picked for the parameters are ranges in which we see significant changes in the outcome. It’s also important to note that the ranges for the parameters are large enough: the optimum values for each parameter are not on the edges of the plot. 
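
If mglearn is not at hand, the same information can be read as a plain table. The sketch below is one possible way to do it with pandas; the helper columns C and gamma and the name table are illustrative, and the search is repeated here so the snippet runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)

results = pd.DataFrame(grid_search.cv_results_)
# The param_* columns hold the setting used in each row; convert them to
# plain floats so they sort numerically as row and column labels
results['C'] = results['param_C'].astype(float)
results['gamma'] = results['param_gamma'].astype(float)

# Rows are values of C, columns are values of gamma, cells are mean CV accuracy
table = results.pivot(index='C', columns='gamma', values='mean_test_score')
print(table.round(2))
```

Reading along a row or a column of table shows how quickly the accuracy changes when only one of the two parameters is varied.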

Now let’s look at some plots (shown in Figure 6-9) where the result is less ideal, because the search ranges were not chosen properly: 
Figure 6-9. Heat map visualizations of misspecified search grids 

The first panel shows no changes at all, with a constant color over the whole parameter grid. In this case, this is caused by improper scaling and range of the parameters C and gamma. However, if no change in accuracy is visible over the different parameter settings, it could also be that a parameter is just not important at all. It is usually good to try very extreme values first, to see if there are any changes in the accuracy as a result of changing a parameter. The second panel shows a vertical stripe pattern. This indicates that only the setting of the gamma parameter makes any difference. This could mean that the gamma parameter is searching over interesting values but the C parameter is not, or it could mean the C parameter is not important. 

The third panel shows changes in both C and gamma. However, we can see that in the entire bottom left of the plot, nothing interesting is happening. We can probably exclude the very small values from future grid searches. The optimum parameter setting is at the top right. As the optimum is on the border of the plot, we can expect that there might be even better values beyond this border, and we might want to change our search range to include more parameter values in this region. 
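
A check for "the optimum is on the border" is easy to automate. The helper on_grid_edge below is hypothetical (it is not part of scikit-learn), and both the narrow grid and the best_params value are made up for illustration rather than taken from a real search.

```python
def on_grid_edge(param_grid, best_params):
    """Return the names of parameters whose best value is the smallest
    or largest value that was tried for them."""
    edge = []
    for name, values in param_grid.items():
        if len(values) > 1 and best_params[name] in (min(values), max(values)):
            edge.append(name)
    return edge


# Hypothetical, deliberately narrow grid and a hypothetical search result
narrow_grid = {'C': [0.001, 0.01, 0.1],
               'gamma': [0.001, 0.01, 0.1]}
best_params = {'C': 0.1, 'gamma': 0.01}

# Parameters listed here are candidates for a wider search range
flagged = on_grid_edge(narrow_grid, best_params)
print("Best value on the edge of the grid for: {}".format(flagged))
```

With a fitted GridSearchCV object, grid_search.best_params_ could be passed in place of the made-up dictionary, as long as param_grid is a single dictionary.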

Tuning the parameter grid based on the cross-validation scores is perfectly fine, and a good way to explore the importance of different parameters. However, you should not test different parameter ranges on the final test set. As we discussed earlier, evaluation of the test set should happen only once we know exactly what model we want to use.

SEARCH OVER SPACES THAT ARE NOT GRIDS 
In some cases, trying all possible combinations of all parameters, as GridSearchCV usually does, is not a good idea. For example, SVC has a kernel parameter, and depending on which kernel is chosen, other parameters will be relevant. If kernel='linear', the model is linear, and only the C parameter is used. If kernel='rbf', both the C and gamma parameters are used (but not other parameters like degree). In this case, searching over all possible combinations of C, gamma, and kernel wouldn’t make sense: if kernel='linear', gamma is not used, and trying different values for gamma would be a waste of time. To deal with these kinds of “conditional” parameters, GridSearchCV allows the param_grid to be a list of dictionaries. Each dictionary in the list is expanded into an independent grid. A possible grid search involving kernel and parameters could look like this: 
    param_grid = [{'kernel': ['rbf'],
                   'C': [0.001, 0.01, 0.1, 1, 10, 100],
                   'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
                  {'kernel': ['linear'],
                   'C': [0.001, 0.01, 0.1, 1, 10, 100]}]
    print("List of grids:\n{}".format(param_grid))
In the first grid, the kernel parameter is always set to 'rbf' (note that the entry for kernel is a list of length one), and both the C and gamma parameters are varied. In the second grid, the kernel parameter is always set to 'linear', and only C is varied. Now let’s apply this more complex parameter search: 
    grid_search = GridSearchCV(SVC(), param_grid, cv=5)
    grid_search.fit(X_train, y_train)
    print("Best parameters: {}".format(grid_search.best_params_))
    print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
Output: 
Best parameters: {'C': 100, 'kernel': 'rbf', 'gamma': 0.01}
Best cross-validation score: 0.97
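
To see exactly which settings a list of dictionaries expands to, the grid can be passed to ParameterGrid, the scikit-learn class that is also used later in this post. The counting variables in this sketch are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [{'kernel': ['rbf'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
              {'kernel': ['linear'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100]}]

# Each dictionary is expanded on its own, and the results are concatenated
candidates = list(ParameterGrid(param_grid))

n_rbf = sum(1 for p in candidates if p['kernel'] == 'rbf')
n_linear = sum(1 for p in candidates if p['kernel'] == 'linear')
# No linear-kernel candidate carries a gamma entry
linear_has_gamma = any('gamma' in p for p in candidates
                       if p['kernel'] == 'linear')

print("Total candidates: {}".format(len(candidates)))
print("rbf: {}   linear: {}".format(n_rbf, n_linear))
print("Any linear candidate with gamma: {}".format(linear_has_gamma))
```

A single dictionary over kernel, C, and gamma would instead produce 2 * 6 * 6 = 72 candidates, 30 of which would repeat a linear model that ignores gamma.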


USING DIFFERENT CROSS-VALIDATION STRATEGIES WITH GRID SEARCH 
Similarly to cross_val_score, GridSearchCV uses stratified k-fold cross-validation by default for classification, and k-fold cross-validation for regression. However, you can also pass any cross-validation splitter, as described in “More control over cross-validation”, as the cv parameter in GridSearchCV. In particular, to get only a single split into a training and a validation set, you can use ShuffleSplit or StratifiedShuffleSplit with n_splits=1 (this argument was called n_iter in the older sklearn.cross_validation module). This might be helpful for very large datasets, or very slow models. 
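
As a sketch of the single-split case, the snippet below passes a StratifiedShuffleSplit to GridSearchCV. It uses the n_splits argument, which is the name the splitters in sklearn.model_selection accept; the test_size of 0.25 and the random_state are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, StratifiedShuffleSplit,
                                     train_test_split)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

# One stratified split: 75% of the training data to fit, 25% to validate
single_split = StratifiedShuffleSplit(n_splits=1, test_size=0.25,
                                      random_state=0)
grid_search = GridSearchCV(SVC(), param_grid, cv=single_split)
grid_search.fit(X_train, y_train)

print("Number of splits per setting: {}".format(grid_search.n_splits_))
print("Best parameters: {}".format(grid_search.best_params_))
```

This trains one model per setting instead of five, at the price of a noisier validation score.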

NESTED CROSS-VALIDATION 
In the preceding examples, we went from using a single split of the data into training, validation, and test sets to splitting the data into training and test sets and then performing cross-validation on the training set. But when using GridSearchCV as described earlier, we still have a single split of the data into training and test sets, which might make our results unstable and make us depend too much on this single split of the data. We can go a step further, and instead of splitting the original data into training and test sets once, use multiple splits of cross-validation. This will result in what is called nested cross-validation. In nested cross-validation, there is an outer loop over splits of the data into training and test sets. For each of them, a grid search is run (which might result in different best parameters for each split in the outer loop). Then, for each outer split, the test set score using the best settings is reported. 

The result of this procedure is a list of scores, not a model, and not a parameter setting. The scores tell us how well a model generalizes, given the best parameters found by the grid. As it doesn’t provide a model that can be used on new data, nested cross-validation is rarely used when looking for a predictive model to apply to future data. However, it can be useful for evaluating how well a given model works on a particular dataset.

Implementing nested cross-validation in scikit-learn is straightforward. We call cross_val_score with an instance of GridSearchCV as the model: 
    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(GridSearchCV(SVC(), param_grid, cv=5),
                             iris.data, iris.target, cv=5)
    print("Cross-validation scores: ", scores)
    print("Mean cross-validation score: ", scores.mean())
Output: 
Cross-validation scores: [ 0.967 1. 0.967 0.967 1. ]
Mean cross-validation score: 0.98

The result of our nested cross-validation can be summarized as “SVC can achieve 98% mean cross-validation accuracy on the iris dataset”—nothing more and nothing less. 

Here, we used stratified five-fold cross-validation in both the inner and the outer loop. As our param_grid contains 36 combinations of parameters, this results in a whopping 36 * 5 * 5 = 900 models being built, making nested cross-validation a very expensive procedure. Here, we used the same cross-validation splitter in the inner and the outer loop; however, this is not necessary and you can use any combination of cross-validation strategies in the inner and outer loops. It can be a bit tricky to understand what is happening in the single line given above, and it can be helpful to visualize it as for loops, as done in the following simplified implementation: 
    import numpy as np

    def nested_cv(X, y, inner_cv, outer_cv, Classifier, parameter_grid):
        outer_scores = []
        # for each split of the data in the outer cross-validation
        # (split method returns indices of training and test parts)
        for training_samples, test_samples in outer_cv.split(X, y):
            # the inner loop only ever sees the outer training part
            X_outer, y_outer = X[training_samples], y[training_samples]
            # find best parameter using inner cross-validation
            best_params = {}
            best_score = -np.inf
            # iterate over parameters
            for parameters in parameter_grid:
                # accumulate score over inner splits
                cv_scores = []
                # iterate over inner cross-validation; the indices it returns
                # are relative to X_outer, not to the full X
                for inner_train, inner_test in inner_cv.split(X_outer, y_outer):
                    # build classifier given parameters and training data
                    clf = Classifier(**parameters)
                    clf.fit(X_outer[inner_train], y_outer[inner_train])
                    # evaluate on inner test set
                    score = clf.score(X_outer[inner_test], y_outer[inner_test])
                    cv_scores.append(score)
                # compute mean score over inner folds
                mean_score = np.mean(cv_scores)
                if mean_score > best_score:
                    # if better than so far, remember parameters
                    best_score = mean_score
                    best_params = parameters
            # build classifier on best parameters using outer training set
            clf = Classifier(**best_params)
            clf.fit(X_outer, y_outer)
            # evaluate
            outer_scores.append(clf.score(X[test_samples], y[test_samples]))
        return np.array(outer_scores)
Now, let’s run this function on the iris dataset: 
    from sklearn.model_selection import ParameterGrid, StratifiedKFold
    scores = nested_cv(iris.data, iris.target, StratifiedKFold(5),
                       StratifiedKFold(5), SVC, ParameterGrid(param_grid))
    print("Cross-validation scores: {}".format(scores))
Output: 
Cross-validation scores: [ 0.967 1. 0.967 0.967 1. ]
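
As noted above, the inner and outer loops do not have to use the same splitter. The following sketch combines a three-fold stratified inner loop with a shuffled five-fold outer loop; this particular pairing and the random_state are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, KFold, StratifiedKFold,
                                     cross_val_score)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

# Inner loop: selects C and gamma on each outer training part
inner_cv = StratifiedKFold(n_splits=3)
# Outer loop: estimates how well the whole selection procedure generalizes
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(GridSearchCV(SVC(), param_grid, cv=inner_cv),
                         X, y, cv=outer_cv)
print("Outer-loop scores: {}".format(scores))
print("Mean: {:.2f}".format(scores.mean()))
```

Using fewer inner folds reduces the number of fits in the search to 36 * 3 * 5 = 540, plus one refit per outer split.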

PARALLELIZING CROSS-VALIDATION AND GRID SEARCH 
While running a grid search over many parameters and on large datasets can be computationally challenging, it is also embarrassingly parallel. This means that building a model using a particular parameter setting on a particular cross-validation split can be done completely independently from the other parameter settings and models. This makes grid search and cross-validation ideal candidates for parallelization over multiple CPU cores or over a cluster. You can make use of multiple cores in GridSearchCV and cross_val_score by setting the n_jobs parameter to the number of CPU cores you want to use. You can set n_jobs=-1 to use all available cores. 
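
A minimal sketch of the n_jobs parameter follows. It runs the same search serially and with two worker processes; the value 2 is an arbitrary choice for this example, and any speedup on a dataset as small as iris is likely to be eaten up by process start-up overhead.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}

# Serial search: all 180 fits run one after another in this process
serial = GridSearchCV(SVC(), param_grid, cv=5)
serial.fit(X_train, y_train)

# Parallel search: the independent fits are distributed over two workers
parallel = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=2)
parallel.fit(X_train, y_train)

print("Serial best parameters:   {}".format(serial.best_params_))
print("Parallel best parameters: {}".format(parallel.best_params_))
```

Because each fit is independent and deterministic here, both searches select the same parameters; only the wall-clock time differs.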

You should be aware that scikit-learn does not allow nesting of parallel operations. So, if you are using the n_jobs option on your model (for example, a random forest), you cannot use it in GridSearchCV to search over this model. If your dataset and model are very large, it might be that using many cores uses up too much memory, and you should monitor your memory usage when building large models in parallel. 

It is also possible to parallelize grid search and cross-validation over multiple machines in a cluster, although at the time of writing this is not supported within scikit-learn. It is, however, possible to use the IPython parallel framework for parallel grid searches, if you don’t mind writing the for loop over parameters as we did in “Simple Grid Search”. For Spark users, there is also the recently developed spark-sklearn package, which allows running a grid search over an already established Spark cluster.
