Question
I have a question about the cv parameter of sklearn's GridSearchCV. My data has a time component, so randomly shuffling it inside KFold cross-validation does not seem sensible.
Instead, I want to explicitly specify the cutoffs for the training, validation, and test data within GridSearchCV. Can I do this?
To illustrate the question, here is how I would do that manually:
- import numpy as np
- import pandas as pd
- # Scikit-Learn for fitting models
- from sklearn.preprocessing import PolynomialFeatures
- from sklearn.linear_model import LinearRegression
- from sklearn.model_selection import cross_val_score
- from sklearn.metrics import mean_squared_error
- # For plotting in the notebook
- import matplotlib
- import matplotlib.pyplot as plt
- np.random.seed(42)
- # "True" generating function representing a process in real life
- def true_gen(X):
-     y = np.sin(1.2 * X * np.pi)
-     return y
- # x values and y value with a small amount of random noise
- X = np.sort(np.random.rand(120))
- y = true_gen(X) + 0.1 * np.random.randn(len(X))
- y[:20]
Output:
array([-0.04938892, 0.04475764, 0.05647044, -0.02814329, 0.15889086,
       0.19578347, 0.17473167, 0.19376568, 0.09578607, 0.20072338,
       0.24125928, 0.19713621, 0.27002241, 0.36786014, 0.54856942,
       0.41307619, 0.44881168, 0.42829488, 0.25213667, 0.49932245])
- random_ind = np.random.choice(list(range(120)), size = 120, replace=False)
- xt = X[random_ind]
- yt = y[random_ind]
- # Training and testing observations
- train_size = int(0.7 * len(X))
- X_train = xt[:train_size]
- X_test = xt[train_size:]
- y_train = yt[:train_size]
- y_test = yt[train_size:]
- # Model the true curve
- x_linspace = np.linspace(0, 1, 1000)
- y_true = true_gen(x_linspace)
- # Visualize observations and true curve
- plt.plot(X_train, y_train, 'ko', label = 'Train');
- plt.plot(X_test, y_test, 'ro', label = 'Test')
- plt.plot(x_linspace, y_true, 'b-', linewidth = 2, label = 'True Function')
- plt.legend()
- plt.xlabel('x'); plt.ylabel('y'); plt.title('Data');
So I want sklearn's GridSearchCV to skip cross-validation and use X_test/y_test for testing instead.
HowTo
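You can use PredefinedSplit, which does exactly this. In the example below, the first 100 rows of the shuffled data are always kept in the training set and the last 20 rows form the single test fold. Make the following changes to your code: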
- xt_df = pd.DataFrame({'x':xt})
- # Entries with the value -1 are always kept in the training set.
- train_indices = np.full((100,), -1, dtype=int)
- train_indices
Output:
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1])
- # Entries with a value of 0 or greater give the index of the test fold the row belongs to.
- test_indices = np.full((20,), 0, dtype=int)
- test_fold = np.append(train_indices, test_indices)
- test_fold
Output:
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0])
- from sklearn.model_selection import PredefinedSplit
- ps = PredefinedSplit(test_fold)
- # Check how many splits will be made, based on test_fold
- ps.get_n_splits() # Should output 1
- for train_index, test_index in ps.split():
-     print("TRAIN:", train_index, "TEST:", test_index)
Output:
TRAIN: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
96 97 98 99] TEST: [100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117
118 119]
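The hard-coded sizes above (100 and 20) can also be derived from a single cutoff index, which is closer to the "explicit cutoff" asked for in the question. The following is a minimal sketch under that assumption; `make_test_fold` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit


def make_test_fold(n_samples, cutoff):
    """Hypothetical helper: -1 (always train) before `cutoff`, 0 (test fold) from it on."""
    if not 0 < cutoff < n_samples:
        raise ValueError("cutoff must lie strictly between 0 and n_samples")
    test_fold = np.full(n_samples, -1, dtype=int)
    test_fold[cutoff:] = 0
    return test_fold


# Same layout as in the example: 120 rows, the last 20 held out
test_fold = make_test_fold(n_samples=120, cutoff=100)
ps = PredefinedSplit(test_fold)

# A single predefined fold yields exactly one (train, test) pair
train_index, test_index = next(ps.split())
print(ps.get_n_splits(), len(train_index), len(test_index))  # 1 100 20
```

Because the rows keep their original order, this only respects the time component if the data is sorted by time before the cutoff is applied.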
- # Now pass `ps` as the cv parameter of GridSearchCV
- from sklearn.model_selection import GridSearchCV
- from sklearn.linear_model import Ridge
- from sklearn.preprocessing import PolynomialFeatures
- from sklearn.pipeline import make_pipeline
- features = PolynomialFeatures(degree=5, include_bias=False)
- model = Ridge(random_state=44)
- pipe = make_pipeline(
-     features,
-     model
- )
- pipe
Output:
Pipeline(steps=[('polynomialfeatures',
                 PolynomialFeatures(degree=5, include_bias=False)),
                ('ridge', Ridge(random_state=44))])
- param_grid = {
-     'polynomialfeatures__degree': range(1, 10),
-     'ridge__alpha': np.linspace(0, 1, 11)
- }
- grid_search = GridSearchCV(
-     estimator=pipe,
-     param_grid=param_grid,
-     cv=ps)
- # Pass the full data (all 120 rows); `ps` decides which rows are used for training and which for testing
- grid_search.fit(xt_df, yt)
- print("Best parameters : %s" % grid_search.best_params_)
Output:
Best parameters : {'polynomialfeatures__degree': 4, 'ridge__alpha': 0.0}
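One detail to keep in mind: with the default `refit=True`, GridSearchCV refits the best configuration on all rows passed to `fit`, including the held-out fold. The sketch below shows one way to avoid that, assuming you want the final model trained on the training rows only. The synthetic data and the smaller parameter grid are stand-ins chosen for illustration, not the values from the example above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (assumption: 120 rows, one feature)
rng = np.random.default_rng(0)
X = rng.random((120, 1))
y = np.sin(1.2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(120)

# First 100 rows always train, last 20 rows form the single held-out fold
test_fold = np.append(np.full(100, -1), np.zeros(20, dtype=int))
ps = PredefinedSplit(test_fold)

pipe = make_pipeline(PolynomialFeatures(include_bias=False), Ridge())
param_grid = {
    "polynomialfeatures__degree": [1, 2, 3, 4],
    "ridge__alpha": [0.01, 0.1, 1.0],
}

# refit=False: score the candidates only, do not refit on all 120 rows
search = GridSearchCV(pipe, param_grid, cv=ps, refit=False)
search.fit(X, y)

# With one predefined split there is exactly one per-split score column
split_cols = [k for k in search.cv_results_
              if k.startswith("split") and k.endswith("_test_score")]
print(split_cols)  # ['split0_test_score']

# Fit the winning configuration on the 100 training rows only
train_idx, valid_idx = next(ps.split())
best = pipe.set_params(**search.best_params_)
best.fit(X[train_idx], y[train_idx])
print(best.score(X[valid_idx], y[valid_idx]))
```

The score printed at the end should match `search.best_score_`, since both come from the same training rows, the same held-out rows, and the same parameters.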