程式扎記

Source From Here
Question
I have a question about the cv parameter of sklearn's GridSearchCV. I'm working with data that has a time component, so random shuffling within KFold cross-validation doesn't seem sensible.

Instead, I want to explicitly specify cutoffs for training, validation, and test data within a GridSearchCV. Can I do this?

To illustrate the question, here's how I would do that manually:

import numpy as np  
import pandas as pd  
  
# Scikit-Learn for fitting models  
from sklearn.preprocessing import PolynomialFeatures  
from sklearn.linear_model import LinearRegression  
from sklearn.model_selection import cross_val_score  
from sklearn.metrics import mean_squared_error  
  
# For plotting in the notebook  
import matplotlib  
import matplotlib.pyplot as plt  
  
np.random.seed(42)  
  
# "True" generating function representing a process in real life  
def true_gen(X):  
    y = np.sin(1.2 * X * np.pi)   
    return y
  
# x values and y value with a small amount of random noise  
X = np.sort(np.random.rand(120))  
y = true_gen(X) + 0.1 * np.random.randn(len(X))  
y[:20]  

Output:

array([-0.04938892,  0.04475764,  0.05647044, -0.02814329,  0.15889086,  
        0.19578347,  0.17473167,  0.19376568,  0.09578607,  0.20072338,  
        0.24125928,  0.19713621,  0.27002241,  0.36786014,  0.54856942,  
        0.41307619,  0.44881168,  0.42829488,  0.25213667,  0.49932245])  

Next, build the training and test sets:

random_ind = np.random.choice(list(range(120)), size = 120, replace=False)  
xt = X[random_ind]  
yt = y[random_ind]  
  
# Training and testing observations  
train_size = int(0.7 * len(X))   
X_train = xt[:train_size]  
X_test = xt[train_size:]  
  
y_train = yt[:train_size]  
y_test = yt[train_size:]  
  
# Model the true curve  
x_linspace = np.linspace(0, 1, 1000)  
y_true = true_gen(x_linspace)  
  
# Visualize observations and true curve  
plt.plot(X_train, y_train, 'ko', label = 'Train');   
plt.plot(X_test, y_test, 'ro', label = 'Test')  
plt.plot(x_linspace, y_true, 'b-', linewidth = 2, label = 'True Function')  
plt.legend()  
plt.xlabel('x'); plt.ylabel('y'); plt.title('Data');  

So I want sklearn's GridSearchCV to skip cross-validation and evaluate on X_test/y_test instead.
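The question mentions a manual approach but never shows the search loop itself. A minimal sketch of what it could look like is below, assuming a Ridge pipeline and a small illustrative parameter grid; the data is regenerated with its own RandomState, so the numbers will not match the outputs above, and the names X_tr, X_te, y_tr, y_te are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data in the spirit of the question (own seed, so values differ)
rng = np.random.RandomState(42)
X_all = np.sort(rng.rand(120)).reshape(-1, 1)
y_all = np.sin(1.2 * X_all.ravel() * np.pi) + 0.1 * rng.randn(120)

# Fixed hold-out: shuffle once, then cut at 70%
order = rng.permutation(120)
cut = int(0.7 * 120)
X_tr, y_tr = X_all[order[:cut]], y_all[order[:cut]]
X_te, y_te = X_all[order[cut:]], y_all[order[cut:]]

# Illustrative grid (an assumption, not the grid used later in the post)
grid = ParameterGrid({
    "polynomialfeatures__degree": [1, 2, 3, 4, 5],
    "ridge__alpha": [0.01, 0.1, 1.0],
})

best_score, best_params = -np.inf, None
for params in grid:
    pipe = make_pipeline(PolynomialFeatures(include_bias=False), Ridge())
    pipe.set_params(**params)
    pipe.fit(X_tr, y_tr)
    score = pipe.score(X_te, y_te)  # R^2 on the fixed hold-out set
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

This loop does by hand what GridSearchCV does when it is given exactly one train/test split.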

HowTo
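You can use PredefinedSplit, which is designed for exactly this. It takes an array test_fold with one entry per sample: -1 keeps the sample in the training set for every split, while a value of 0 or higher names the test fold the sample belongs to. Note that the example below holds out the last 20 of the 120 samples (a 100/20 split) rather than the 70/30 split used in the question. Make the following changes to your code: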

xt_df = pd.DataFrame({'x':xt})  
  
# The indices which have the value -1 will be kept in train.  
train_indices = np.full((100,), -1, dtype=int)  
train_indices  

Output:

array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])  

Next, build the test indices:

# The indices which have zero or positive values, will be kept in test  
test_indices = np.full((20,), 0, dtype=int)  
test_fold = np.append(train_indices, test_indices)  
test_fold  

Output:

array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,  0,  0,  
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  
        0])  
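The sizes 100 and 20 are hard-coded above. A minimal sketch of deriving the same kind of array from a cutoff is shown here; make_test_fold is a hypothetical helper (not part of scikit-learn), and the 70% cutoff is taken from the question's train_size.

```python
import numpy as np

def make_test_fold(n_samples, train_size):
    """Hypothetical helper: -1 for the first train_size rows, 0 for the rest."""
    test_fold = np.full(n_samples, -1, dtype=int)
    test_fold[train_size:] = 0
    return test_fold

# Same 70% cutoff as train_size in the question
test_fold_70_30 = make_test_fold(120, int(0.7 * 120))
print((test_fold_70_30 == -1).sum(), (test_fold_70_30 == 0).sum())
```

The array only makes sense if the rows passed to fit are ordered the same way, with the training rows first and the held-out rows last.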

Prepare the predefined split:

from sklearn.model_selection import PredefinedSplit  
ps = PredefinedSplit(test_fold)  
# Check how many splits will be done, based on test_fold  
ps.get_n_splits()  # Should output 1  
  
for train_index, test_index in ps.split():  
    print("TRAIN:", train_index, "TEST:", test_index)  

Output:

TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23  
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47  
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71  
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95  
96 97 98 99] TEST: [100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117  
118 119]  
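PredefinedSplit is not limited to a single split. As a small illustration on a toy array (an assumption for demonstration, unrelated to the data above), using two distinct non-negative values in test_fold yields two splits, and rows marked -1 stay in the training set for both:

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1: always in training; 0 and 1: two separate test folds
toy_fold = np.array([-1, -1, -1, 0, 0, 1, 1])
toy_ps = PredefinedSplit(toy_fold)

n_toy_splits = toy_ps.get_n_splits()
toy_splits = list(toy_ps.split())

print(n_toy_splits)
for train_index, test_index in toy_splits:
    print("TRAIN:", train_index, "TEST:", test_index)
```

Each split trains on every row outside the current test fold, so rows from fold 1 are used for training when fold 0 is the test fold.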

# And now, send this `ps` to cv param in GridSearchCV  
from sklearn.model_selection import GridSearchCV  
from sklearn.linear_model import Ridge  
from sklearn.preprocessing import PolynomialFeatures  
from sklearn.pipeline import make_pipeline  
  
features = PolynomialFeatures(degree=5, include_bias=False)  
model = Ridge(random_state=44)  
pipe = make_pipeline(  
    features,  
    model  
)  
pipe  

Output:

Pipeline(steps=[('polynomialfeatures',  
                 PolynomialFeatures(degree=5, include_bias=False)),  
                ('ridge', Ridge(random_state=44))])  

param_grid = {  
    'polynomialfeatures__degree': range(1, 10),  
    'ridge__alpha': np.linspace(0, 1, 11)  
}  
grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=ps,
)
  
# Pass the full data (train + test rows); ps decides which rows are held out
grid_search.fit(xt_df, yt)  
print("Best parameters : %s" % grid_search.best_params_)  

Output:

Best parameters : {'polynomialfeatures__degree': 4, 'ridge__alpha': 0.0}
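One thing worth keeping in mind: GridSearchCV defaults to refit=True, so after the search the best estimator is refit on all rows passed to fit, including the held-out fold. The sketch below, on a small synthetic dataset with assumed names (X_demo, y_demo, fold), shows one way to inspect the single-split score and to fit the final model on the training rows only by passing refit=False.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Small synthetic regression problem (illustrative only)
rng = np.random.RandomState(0)
X_demo = rng.rand(60, 1)
y_demo = 3.0 * X_demo.ravel() + 0.05 * rng.randn(60)

# First 40 rows train, last 20 rows are the held-out fold
fold = np.r_[np.full(40, -1), np.zeros(20, dtype=int)]
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1.0]},
    cv=PredefinedSplit(fold),
    refit=False,  # do not refit on train + held-out rows
)
search.fit(X_demo, y_demo)

# With one split, split0_test_score and mean_test_score are the same values
print(search.cv_results_["split0_test_score"])
print(search.best_params_)

# Fit the final model on the training rows only
final = Ridge(**search.best_params_).fit(X_demo[:40], y_demo[:40])
print(final.score(X_demo[40:], y_demo[40:]))
```

With refit=False there is no best_estimator_ attribute, which is why the final model is fitted explicitly here.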


Wednesday, September 16, 2020

[ Python FAQ ] Explicitly specifying test/train sets in GridSearchCV
