Thursday, December 29, 2016

[ Scikit-learn ] How to find the best model parameters in scikit-learn

Source From Here 
Preface 

Agenda 
* How can K-fold cross-validation be used to search for an optimal tuning parameter?
* How can this process be made more efficient?
* How do you search for multiple tuning parameters at once?
* What do you do with those tuning parameters before making real predictions?
* How can the computational expense of this process be reduced?

Review of K-fold cross-validation 
Steps for cross-validation: 
* Dataset is split into K "folds" of equal size
* Each fold acts as the testing set 1 time, and acts as the training set K-1 times
* Average testing performance is used as the estimate of out-of-sample performance

Benefits of cross-validation: 
* More reliable estimate of out-of-sample performance than train/test split
* Can be used for selecting tuning parameters, choosing between models, and selecting features

Drawback of cross-validation: 
* Can be computationally expensive
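
To make the steps above concrete, here is a minimal sketch (my addition, not from the original post) that just prints how the folds are laid out. It assumes the modern sklearn.model_selection API; the scripts below use the older sklearn.cross_validation module that was current when this post was written:
  from sklearn.datasets import load_iris
  from sklearn.model_selection import KFold

  iris = load_iris()
  X = iris.data

  # Split the 150 iris samples into 5 folds of equal size
  kf = KFold(n_splits=5, shuffle=True, random_state=1)
  for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
      # Each fold acts as the testing set once; the remaining 4 folds form the training set
      print("Fold %d: train on %d samples, test on %d samples" % (fold, len(train_idx), len(test_idx)))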

Review of parameter tuning using cross_val_score 
Goal: Select the best tuning parameters (aka "hyperparameters") for KNN on the iris dataset 
- test1.py 
  #!/usr/bin/env python
  from sklearn.datasets import load_iris
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.cross_validation import cross_val_score
  import matplotlib.pyplot as plt

  # Read in the iris data
  iris = load_iris()

  # Create X (features) and y (response)
  X = iris.data
  y = iris.target

  # 10-fold cross-validation with K=5 for KNN
  knn = KNeighborsClassifier(n_neighbors=5)
  scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
  print "KNN(n=5) with accuracy=%.02f on iris dataset." % scores.mean()

  # Search for an optimal value of K for KNN
  k_range = range(1, 31)
  k_scores = []
  for k in k_range:
      knn = KNeighborsClassifier(n_neighbors=k)
      scores = cross_val_score(knn, X, y, cv=10, scoring='accuracy')
      k_scores.append(scores.mean())

  plt.plot(k_range, k_scores)
  plt.xlabel("Value of K for KNN")
  plt.ylabel("Cross-Validated Accuracy")
  plt.show()

Execution result: (plot of cross-validated accuracy versus the value of K for KNN; image not preserved)
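
As a possible follow-up (not in the original post), the best K can also be extracted programmatically instead of eyeballing the plot, assuming k_range and k_scores from test1.py are in scope:
  # Hypothetical follow-up: pick the K with the highest mean CV accuracy
  import numpy as np
  best_k = k_range[int(np.argmax(k_scores))]
  print("Best K=%d with cross-validated accuracy=%.03f" % (best_k, max(k_scores)))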
More efficient parameter tuning using GridSearchCV 
- test2.py 
  #!/usr/bin/env python
  from sklearn.datasets import load_iris
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.grid_search import GridSearchCV
  import matplotlib.pyplot as plt

  # Read in the iris data
  iris = load_iris()

  # Create X (features) and y (response)
  X = iris.data
  y = iris.target

  # Instantiate the model (n_neighbors will be tuned by the grid search)
  knn = KNeighborsClassifier(n_neighbors=5)

  # Define the parameter values that should be searched
  k_range = range(1, 31)

  # Create a parameter grid: map the parameter names to the values that should be searched
  param_grid = dict(n_neighbors=k_range)

  # Instantiate the grid
  grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')

  # Fit the grid with data
  grid.fit(X, y)

  # Examine the first tuple
  print "Tuple0 using parameter=%s" % grid.grid_scores_[0].parameters
  print "Tuple0 scores of 10-fold CV:\n%s\n" % grid.grid_scores_[0].cv_validation_scores
  print "Tuple0 with mean of 10-fold CV score=%.02f" % grid.grid_scores_[0].mean_validation_score

  # Create a list of the mean scores only
  grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]

  # Plot the results
  #plt.plot(k_range, grid_mean_scores)
  #plt.xlabel("Value of K for KNN")
  #plt.ylabel("Cross-Validated Accuracy")
  #plt.show()

  # Examine the best model
  print "Best score=%.02f" % grid.best_score_
  print "Best param=%s" % grid.best_params_
  print "Best estimator:\n%s\n" % grid.best_estimator_
What if we have more than one parameter to optimize? 

Searching multiple parameters simultaneously 
* Example: tuning max_depth and min_samples_leaf for DecisionTreeClassifier
* Could tune the parameters independently: change max_depth while leaving min_samples_leaf at its default value, and vice versa
* But the best performance might be achieved when neither parameter is at its default value

- test3.py 
  #!/usr/bin/env python
  from sklearn.datasets import load_iris
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.grid_search import GridSearchCV

  # Read in the iris data
  iris = load_iris()

  # Create X (features) and y (response)
  X = iris.data
  y = iris.target

  # Instantiate the model (n_neighbors and weights will be tuned by the grid search)
  knn = KNeighborsClassifier(n_neighbors=5)

  # Define the parameter values that should be searched
  k_range = range(1, 31)
  weight_options = ['uniform', 'distance']

  # Create a parameter grid: map the parameter names to the values that should be searched
  param_grid = dict(n_neighbors=k_range, weights=weight_options)

  # Instantiate and fit the grid
  grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy')
  grid.fit(X, y)

  # Examine the first tuple
  print "Tuple0 using parameter=%s" % grid.grid_scores_[0].parameters
  print "Tuple0 scores of 10-fold CV:\n%s\n" % grid.grid_scores_[0].cv_validation_scores
  print "Tuple0 with mean of 10-fold CV score=%.02f" % grid.grid_scores_[0].mean_validation_score

  # Create a list of the mean scores only
  grid_mean_scores = [result.mean_validation_score for result in grid.grid_scores_]

  # Examine the best model
  print "Best score=%.02f" % grid.best_score_
  print "Best param=%s" % grid.best_params_
  print "Best estimator:\n%s\n" % grid.best_estimator_
Using the best parameters to make predictions 
  # Train your model using all data and the best known parameters
  knn = KNeighborsClassifier(n_neighbors=13, weights='uniform')
  knn.fit(X, y)

  # Make a prediction on out-of-sample data (one sample with 4 feature values)
  knn.predict([[3, 5, 4, 2]])

  # Shortcut: GridSearchCV automatically refits the best model using all of the data
  grid.predict([[3, 5, 4, 2]])
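The shortcut on the last line works because GridSearchCV is constructed with refit=True by default: once the search finishes, it automatically retrains the best estimator on all of the data, and grid.predict() delegates to that refitted model.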
Reducing computational expense using RandomizedSearchCV 
* Searching many different parameters at once may be computationally infeasible. 
* RandomizedSearchCV searches a subset of the parameters, and you control the computational "budget" 
- test4.py 
  #!/usr/bin/env python
  from sklearn.datasets import load_iris
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.grid_search import RandomizedSearchCV

  # Read in the iris data
  iris = load_iris()

  # Create X (features) and y (response)
  X = iris.data
  y = iris.target

  # Instantiate the model (n_neighbors and weights will be tuned by the randomized search)
  knn = KNeighborsClassifier(n_neighbors=5)

  # Define the parameter values that should be searched
  k_range = range(1, 31)
  weight_options = ['uniform', 'distance']

  # Specify "parameter distributions" rather than a "parameter grid"
  param_dist = dict(n_neighbors=k_range, weights=weight_options)

  # n_iter controls the number of parameter settings that are sampled
  rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10, random_state=5)
  rand.fit(X, y)
  print rand.grid_scores_

  # Examine the best model
  print "Best score=%.02f" % rand.best_score_
  print "Best param=%s" % rand.best_params_
  print "Best estimator:\n%s\n" % rand.best_estimator_

  # Run RandomizedSearchCV 20 times (with n_iter=10) and record the best score
  best_scores = []
  for i in range(20):
      rand = RandomizedSearchCV(knn, param_dist, cv=10, scoring='accuracy', n_iter=10)
      rand.fit(X, y)
      best_scores.append(round(rand.best_score_, 3))
  print "Best scores collected over 20 runs:\n%s\n" % best_scores
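One point the "parameter distributions" comment hints at: RandomizedSearchCV can also sample from continuous distributions (any scipy.stats object with an rvs() method), not just from lists. A small sketch (my addition, using the modern sklearn.model_selection module and assuming scipy is installed):
  from scipy.stats import randint
  from sklearn.datasets import load_iris
  from sklearn.neighbors import KNeighborsClassifier
  from sklearn.model_selection import RandomizedSearchCV

  iris = load_iris()
  X, y = iris.data, iris.target

  # randint(1, 31) is a frozen distribution; RandomizedSearchCV draws candidates via rvs()
  param_dist = dict(n_neighbors=randint(1, 31), weights=['uniform', 'distance'])
  rand = RandomizedSearchCV(KNeighborsClassifier(), param_dist, cv=10,
                            scoring='accuracy', n_iter=10, random_state=5)
  rand.fit(X, y)
  print("Best score=%.02f with %s" % (rand.best_score_, rand.best_params_))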
Supplement 
Grid search user guide: http://scikit-learn.org/stable/module... 
GridSearchCV documentation: http://scikit-learn.org/stable/module... 
RandomizedSearchCV documentation: http://scikit-learn.org/stable/module... 
Comparing randomized search and grid search: http://scikit-learn.org/stable/auto_e... 
Randomized search video: https://youtu.be/0wUF_Ov8b0A?t=17m38s 
Randomized search notebook: http://nbviewer.ipython.org/github/am... 
Random Search for Hyper-Parameter Optimization (paper): http://www.jmlr.org/papers/volume13/b...
