Wednesday, September 16, 2020

[ Python 常見問題 ] Explicitly specifying test/train sets in GridSearchCV

Source From Here
Question
I have a question about the cv parameter of sklearn's GridSearchCV. I'm working with data that has a time component to it, so random shuffling within KFold cross-validation doesn't seem sensible.

Instead, I want to explicitly specify cutoffs for training, validation, and test data within a GridSearchCV. Can I do this?

To better illuminate the question, here's how I would do that manually:
import numpy as np
import pandas as pd

# Scikit-Learn for fitting models
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# For plotting in the notebook
import matplotlib
import matplotlib.pyplot as plt

np.random.seed(42)

# "True" generating function representing a process in real life
def true_gen(X):
    y = np.sin(1.2 * X * np.pi)
    return(y)

# x values and y values with a small amount of random noise
X = np.sort(np.random.rand(120))
y = true_gen(X) + 0.1 * np.random.randn(len(X))
y[:20]
Output:
array([-0.04938892,  0.04475764,  0.05647044, -0.02814329,  0.15889086,
        0.19578347,  0.17473167,  0.19376568,  0.09578607,  0.20072338,
        0.24125928,  0.19713621,  0.27002241,  0.36786014,  0.54856942,
        0.41307619,  0.44881168,  0.42829488,  0.25213667,  0.49932245])
Then split into training and testing sets:
random_ind = np.random.choice(list(range(120)), size=120, replace=False)
xt = X[random_ind]
yt = y[random_ind]

# Training and testing observations
train_size = int(0.7 * len(X))
X_train = xt[:train_size]
X_test = xt[train_size:]

y_train = yt[:train_size]
y_test = yt[train_size:]

# Model the true curve
x_linspace = np.linspace(0, 1, 1000)
y_true = true_gen(x_linspace)

# Visualize observations and true curve
plt.plot(X_train, y_train, 'ko', label='Train')
plt.plot(X_test, y_test, 'ro', label='Test')
plt.plot(x_linspace, y_true, 'b-', linewidth=2, label='True Function')
plt.legend()
plt.xlabel('x'); plt.ylabel('y'); plt.title('Data');


So I want sklearn's GridSearchCV to skip cross-validation and instead use X_test/y_test for evaluation.

HowTo
You can use PredefinedSplit, which does exactly this. You need to make the following changes to your code:
xt_df = pd.DataFrame({'x': xt})

# The indices which have the value -1 will be kept in train
train_indices = np.full((100,), -1, dtype=int)
train_indices
Output:
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
Then build the testing indices:
# Indices with zero or positive values are assigned to the test fold
test_indices = np.full((20,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)
test_fold
Output:
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0])
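As a side note, the same `test_fold` array can be built in one step; a small sketch using NumPy's `np.r_` concatenation shorthand:

```python
import numpy as np

# -1 marks rows always kept in train; 0 assigns the last 20 rows
# to test fold 0 — equivalent to the np.full/np.append steps above
test_fold = np.r_[np.full(100, -1, dtype=int), np.zeros(20, dtype=int)]
```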
Prepare the predefined split:
from sklearn.model_selection import PredefinedSplit

ps = PredefinedSplit(test_fold)
# Check how many splits will be done, based on test_fold
ps.get_n_splits()  # Should output 1

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
Output:
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
 96 97 98 99] TEST: [100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
 118 119]
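Note that GridSearchCV's cv parameter also accepts a plain iterable of (train_indices, test_indices) pairs, so the single split above could equally be written out explicitly; a minimal sketch verifying the two are equivalent:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

test_fold = np.r_[np.full(100, -1, dtype=int), np.zeros(20, dtype=int)]

# An explicit list of (train, test) index pairs — also valid as cv
custom_cv = [(np.arange(100), np.arange(100, 120))]

# PredefinedSplit yields the same single split
ps = PredefinedSplit(test_fold)
ps_train, ps_test = next(iter(ps.split()))
```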
Next:
# And now, pass this `ps` to the cv param of GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

features = PolynomialFeatures(degree=5, include_bias=False)
model = Ridge(random_state=44)
pipe = make_pipeline(
    features,
    model
)
pipe
Output:
Pipeline(steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=5, include_bias=False)),
                ('ridge', Ridge(random_state=44))])
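The step names that make_pipeline auto-generates (the lowercased class names) are what determine the `<step>__<param>` prefixes used in the parameter grid; they can be inspected via `named_steps`:

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    PolynomialFeatures(degree=5, include_bias=False),
    Ridge(random_state=44)
)

# make_pipeline names each step after its lowercased class name;
# these names become the '<step>__<param>' keys in a GridSearchCV grid
step_names = list(pipe.named_steps)
```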
Next:
param_grid = {
    'polynomialfeatures__degree': range(1, 10),
    'ridge__alpha': np.linspace(0, 1, 11)
}
grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=ps)

# Here, pass the full xt_df and yt; `ps` handles the train/test split
grid_search.fit(xt_df, yt)
print("Best parameters : %s" % grid_search.best_params_)
Output:
Best parameters : {'polynomialfeatures__degree': 4, 'ridge__alpha': 0.0}
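One caveat worth noting: with the default refit=True, best_estimator_ is refit on all rows passed to fit, including the 20 test rows, so scoring it on those rows directly would be optimistic. A self-contained sketch (re-generating similar data without the shuffle step, as an assumption) that refits on the training portion only before scoring:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

np.random.seed(42)
X = np.sort(np.random.rand(120)).reshape(-1, 1)  # 2-D for sklearn
y = np.sin(1.2 * X[:, 0] * np.pi) + 0.1 * np.random.randn(120)

# First 100 rows are train (-1), last 20 rows are test fold 0
ps = PredefinedSplit(np.r_[np.full(100, -1, dtype=int),
                           np.zeros(20, dtype=int)])

pipe = make_pipeline(PolynomialFeatures(include_bias=False),
                     Ridge(random_state=44))
param_grid = {
    'polynomialfeatures__degree': range(1, 10),
    'ridge__alpha': np.linspace(0, 1, 11),
}
grid_search = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=ps)
grid_search.fit(X, y)

# Refit the winning configuration on the training rows only,
# then score on the 20 held-out rows
best = grid_search.best_estimator_.fit(X[:100], y[:100])
test_mse = mean_squared_error(y[100:], best.predict(X[100:]))
```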


