Thursday, December 29, 2016

[ Scikit-learn ] How to find the best model parameters in scikit-learn

Source From Here 
Preface 

Agenda 
* How can K-fold cross-validation be used to search for an optimal tuning parameter?
* How can this process be made more efficient?
* How do you search for multiple tuning parameters at once?
* What do you do with those tuning parameters before making real predictions?
* How can the computational expense of this process be reduced?

Review of K-fold cross-validation 
Steps for cross-validation: 
* Dataset is split into K "folds" of equal size
* Each fold acts as the testing set 1 time, and acts as the training set K-1 times
* Average testing performance is used as the estimate of out-of-sample performance

Benefits of cross-validation: 
* More reliable estimate of out-of-sample performance than train/test split
* Can be used for selecting tuning parameters, choosing between models, and selecting features

Drawback of cross-validation: 
* Can be computationally expensive
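
To make the fold mechanics above concrete, here is a minimal sketch (using the same pre-0.18 sklearn.cross_validation API as the scripts in this post) that prints which rows of a toy 25-observation dataset land in each fold:

  from sklearn.cross_validation import KFold

  # 5 folds over 25 observations: each fold holds 5 rows for testing
  kf = KFold(25, n_folds=5, shuffle=False)

  # Each iteration yields the row indices used for training and testing
  for fold, (train_idx, test_idx) in enumerate(kf, start=1):
      print "Fold %d - test rows: %s" % (fold, list(test_idx))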

Review of parameter tuning using cross_val_score 
Goal: Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset 
- test1.py 
  #!/usr/bin/env python
  from sklearn.datasets import load_iris
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.cross_validation import cross_val_score
  import matplotlib.pyplot as plt

  # Read in the iris data
  iris = load_iris()

  # Create X (features) and y (response)
  X = iris.data
  y = iris.target

  # 10-fold cross-validation with K=5 for KNN
  knn = KNeighborsClassifier(n_neighbors=5)
  scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
  print "KNN(n=5) with accuracy=%.02f on iris dataset." % scores.mean()

  # Search for an optimal value of K for KNN
  k_range = range(1, 31)
  k_scores = []
  for k in k_range:
      knn = KNeighborsClassifier(n_neighbors=k)
      scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
      k_scores.append(scores.mean())

  plt.plot(k_range, k_scores)
  plt.xlabel("Value of K for KNN")
  plt.ylabel("Cross-Validated Accuracy")
  plt.show()
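
To read the optimal K off the list instead of eyeballing the plot, a small addition (not in the original script) works:

  # k_range[i] maps the position of the best mean CV score back to a K value
  best_index = k_scores.index(max(k_scores))
  print "Best K=%d with accuracy=%.03f" % (k_range[best_index], max(k_scores))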

Execution result: (plot of cross-validated accuracy versus the value of K for KNN)

More efficient parameter tuning using GridSearchCV 
- test2.py 
  #!/usr/bin/env python
  from sklearn.datasets import load_iris
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.cross_validation import cross_val_score
  import matplotlib.pyplot as plt
  from sklearn.grid_search import GridSearchCV

  # Read in the iris data
  iris = load_iris()

  # Create X (features) and y (response)
  X = iris.data
  y = iris.target

  # KNN estimator; its n_neighbors will be overridden by the grid search
  knn = KNeighborsClassifier(n_neighbors=5)

  # Define the parameter values that should be searched
  k_range = range(1, 31)

  # Create a parameter grid: map the parameter names to the values that should be searched
  param_grid = dict(n_neighbors=k_range)

  # Instantiate the grid
  grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')

  # Fit the grid with data
  grid.fit(X, y)

  # Examine the first tuple
  print "Tuple0 using parameter=%s" % grid.grid_scores_[0].parameters
  print "Tuple0 scores of 10-fold CV:\n%s\n" % grid.grid_scores_[0].cv_validation_scores
  print "Tuple0 with mean of 10-fold CV score=%.02f" % grid.grid_scores_[0].mean_validation_score

  # Create a list of the mean scores only
  grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]

  # Plot the results
  #plt.plot(k_range, grid_mean_scores)
  #plt.xlabel("Value of K for KNN")
  #plt.ylabel("Cross-Validated Accuracy")
  #plt.show()

  # Examine the best model
  print "Best score=%.02f" % grid.best_score_
  print "Best param=%s" % grid.best_params_
  print "Best estimator:\n%s\n" % grid.best_estimator_
What if we have more than one parameter to optimize? 

Searching multiple parameters simultaneously 
* Example: tuning max_depth and min_samples_leaf for DecisionTreeClassifier
* Could tune parameters independently: change max_depth while leaving min_samples_leaf at its default value, and vice versa
* But, best performance might be achieved when neither parameter is at its default value
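
A grid over two parameters is the Cartesian product of their value lists, so the cost multiplies quickly. A minimal sketch with ParameterGrid (which lives alongside GridSearchCV in sklearn.grid_search) makes the size of the search explicit:

  from sklearn.grid_search import ParameterGrid

  # 30 values of n_neighbors x 2 weight options = 60 candidate models;
  # with cv=10 each candidate costs 10 fits, i.e. 600 model fits in total
  param_grid = dict(n_neighbors=range(1, 31), weights=['uniform', 'distance'])
  grid_points = list(ParameterGrid(param_grid))
  print len(grid_points)   # 60
  print grid_points[0]     # e.g. {'n_neighbors': 1, 'weights': 'uniform'}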

- test3.py 
  #!/usr/bin/env python
  from sklearn.datasets import load_iris
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.cross_validation import cross_val_score
  import matplotlib.pyplot as plt
  from sklearn.grid_search import GridSearchCV

  # Read in the iris data
  iris = load_iris()

  # Create X (features) and y (response)
  X = iris.data
  y = iris.target

  # KNN estimator; its parameters will be overridden by the grid search
  knn = KNeighborsClassifier(n_neighbors=5)

  # Define the parameter values that should be searched
  k_range = range(1, 31)
  weight_options = ['uniform', 'distance']

  # Create a parameter grid: map the parameter names to the values that should be searched
  param_grid = dict(n_neighbors=k_range, weights=weight_options)

  # Instantiate and fit the grid
  grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
  grid.fit(X, y)

  # Examine the first tuple
  print "Tuple0 using parameter=%s" % grid.grid_scores_[0].parameters
  print "Tuple0 scores of 10-fold CV:\n%s\n" % grid.grid_scores_[0].cv_validation_scores
  print "Tuple0 with mean of 10-fold CV score=%.02f" % grid.grid_scores_[0].mean_validation_score

  # Create a list of the mean scores only
  grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]

  # Plot the results
  #plt.plot(k_range, grid_mean_scores)
  #plt.xlabel("Value of K for KNN")
  #plt.ylabel("Cross-Validated Accuracy")
  #plt.show()

  # Examine the best model
  print "Best score=%.02f" % grid.best_score_
  print "Best param=%s" % grid.best_params_
  print "Best estimator:\n%s\n" % grid.best_estimator_
Using the best parameters to make predictions 
  # Train your model using all data and the best known parameters
  knn = KNeighborsClassifier(n_neighbors=13, weights='uniform')
  knn.fit(X, y)

  # Make a prediction on out-of-sample data
  knn.predict([3, 5, 4, 2])

  # Shortcut: GridSearchCV automatically refits the best model using all of the data
  grid.predict([3, 5, 4, 2])
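
The shortcut works because GridSearchCV, with its default refit=True, refits best_estimator_ on the full dataset once the search finishes, so the two calls below should agree:

  # grid.predict delegates to the refitted best_estimator_
  print grid.predict([3, 5, 4, 2])
  print grid.best_estimator_.predict([3, 5, 4, 2])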
Reducing Computational Expense Using RandomizedSearchCV 
* Searching many different parameters at once may be computationally infeasible. 
* RandomizedSearchCV searches a subset of the parameters, and you control the computational "budget" 
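For parameters with many (or continuous) values, the usual approach (not shown in test4.py below) is to pass a scipy.stats distribution instead of a list, so each of the n_iter candidates samples a fresh value. A minimal sketch drawing n_neighbors from a randint distribution:

  from scipy.stats import randint
  from sklearn.grid_search import RandomizedSearchCV
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.datasets import load_iris

  iris = load_iris()

  # randint(1, 31) draws an integer in [1, 30] each time a candidate is sampled
  param_dist = dict(n_neighbors=randint(1, 31), weights=['uniform', 'distance'])
  rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist, cv=10,
                            scoring='accuracy', n_iter=10, random_state=5)
  rand.fit(iris.data, iris.target)
  print rand.best_params_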
- test4.py 
  #!/usr/bin/env python
  from sklearn.datasets import load_iris
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.cross_validation import cross_val_score
  import matplotlib.pyplot as plt
  from sklearn.grid_search import RandomizedSearchCV

  # Read in the iris data
  iris = load_iris()

  # Create X (features) and y (response)
  X = iris.data
  y = iris.target

  # KNN estimator; its parameters will be overridden by the randomized search
  knn = KNeighborsClassifier(n_neighbors=5)

  # Define the parameter values that should be searched
  k_range = range(1, 31)
  weight_options = ['uniform', 'distance']

  # Specify "parameter distributions" rather than a "parameter grid"
  param_dist = dict(n_neighbors=k_range, weights=weight_options)

  # n_iter controls the number of random parameter combinations that are tried
  rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
  rand.fit(X, y)
  print rand.grid_scores_

  # Examine the best model
  print "Best score=%.02f" % rand.best_score_
  print "Best param=%s" % rand.best_params_
  print "Best estimator:\n%s\n" % rand.best_estimator_

  # Run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
  best_scores = []
  for i in range(20):
      rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10)
      rand.fit(X, y)
      best_scores.append(round(rand.best_score_, 3))
  print "Best scores collected over 20 runs:\n%s\n" % best_scores
Supplement 
Grid search user guide: http://scikit-learn.org/stable/module... 
GridSearchCV documentation: http://scikit-learn.org/stable/module... 
RandomizedSearchCV documentation: http://scikit-learn.org/stable/module... 
Comparing randomized search and grid search: http://scikit-learn.org/stable/auto_e... 
Randomized search video: https://youtu.be/0wUF_Ov8b0A?t=17m38s 
Randomized search notebook: http://nbviewer.ipython.org/github/am... 
Random Search for Hyper-Parameter Optimization (paper): http://www.jmlr.org/papers/volume13/b...
