The Pipeline class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. Similarly, the last step could be regression or clustering instead of classification.
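As a rough illustration of such a four-step pipeline, the sketch below chains a feature-expansion step, a selection step, a scaler, and a classifier. The choice of PolynomialFeatures (standing in for feature extraction), SelectKBest with k=20, the cancer dataset, and the step names are all assumptions made for this example, not something prescribed by the text:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# four steps: expand features, select a subset, scale, classify
# (step names and hyperparameters are illustrative assumptions)
four_step_pipe = Pipeline([
    ("extract", PolynomialFeatures(degree=2, include_bias=False)),
    ("select", SelectKBest(f_classif, k=20)),
    ("scale", StandardScaler()),
    ("classify", LogisticRegression(max_iter=1000)),
])
four_step_pipe.fit(X_train, y_train)
test_accuracy = four_step_pipe.score(X_test, y_test)

print("Number of steps: {}".format(len(four_step_pipe.steps)))
print("Test accuracy: {:.2f}".format(test_accuracy))
```

Swapping the last step for a regressor or a clustering estimator would leave the first three steps unchanged.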
The only requirement for estimators in a pipeline is that all but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step. Internally, during the call to Pipeline.fit, the pipeline calls fit and then transform on each step in turn, with the input given by the output of the transform method of the previous step. For the last step in the pipeline, just fit is called.
Brushing over some finer details, this is implemented as follows. Remember that pipeline.steps is a list of tuples, so pipeline.steps[0][1] is the first estimator, pipeline.steps[1][1] is the second estimator, and so on:
- def fit(self, X, y):
-     X_transformed = X
-     for name, estimator in self.steps[:-1]:
-         # iterate over all but the final step
-         # fit and transform the data
-         X_transformed = estimator.fit_transform(X_transformed, y)
-     # fit the last step
-     self.steps[-1][1].fit(X_transformed, y)
-     return self
- def predict(self, X):
-     X_transformed = X
-     for step in self.steps[:-1]:
-         # iterate over all but the final step
-         # transform the data
-         X_transformed = step[1].transform(X_transformed)
-     # predict using the last step
-     return self.steps[-1][1].predict(X_transformed)
Figure 7-3. Overview of the pipeline training and prediction process
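One way to convince yourself of this description is to repeat the prediction by hand on a fitted pipeline and compare it with the pipeline's own output. The following is a minimal sketch under the assumption of a MinMaxScaler followed by an SVC on the cancer dataset; it mirrors the simplified logic above rather than the library's actual source:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(MinMaxScaler(), SVC())
pipe.fit(X_train, y_train)

# redo by hand what predict does, using the already fitted steps
X_manual = X_test
for name, estimator in pipe.steps[:-1]:
    # every step except the last one only transforms the data
    X_manual = estimator.transform(X_manual)
# the last step makes the prediction
manual_pred = pipe.steps[-1][1].predict(X_manual)

same = np.array_equal(manual_pred, pipe.predict(X_test))
print("Manual chaining matches pipe.predict: {}".format(same))
```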
The pipeline is actually even more general than this. There is no requirement for the last step in a pipeline to have a predict function, and we could create a pipeline just containing, for example, a scaler and PCA. Then, because the last step (PCA) has a transform method, we could call transform on the pipeline to get the output of PCA.transform applied to the data that was processed by the previous step. The last step of a pipeline is only required to have a fit method.
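A small sketch of such a transformer-only pipeline, assuming StandardScaler as the scaler, two principal components, and the cancer dataset as input (all choices made for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# no classifier or regressor at the end: the last step is a transformer
transform_pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_pca = transform_pipe.fit(X).transform(X)

print("Original shape: {}".format(X.shape))
print("Transformed shape: {}".format(X_pca.shape))

# PCA has no predict method, so the pipeline does not expose one either
has_predict = hasattr(transform_pipe, "predict")
print("Pipeline has predict: {}".format(has_predict))
```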
Convenient Pipeline Creation with make_pipeline
Creating a pipeline using the syntax described earlier is sometimes a bit cumbersome, and we often don’t need user-specified names for each step. There is a convenience function, make_pipeline, that will create a pipeline for us and automatically name each step based on its class. The syntax for make_pipeline is as follows:
- import numpy as np
- from sklearn.pipeline import Pipeline
- from sklearn.pipeline import make_pipeline
- from sklearn.preprocessing import MinMaxScaler
- from sklearn.svm import SVC
- # standard syntax
- pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])
- # abbreviated syntax
- pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))
- print("Pipeline steps:\n{}".format(pipe_short.steps))
The steps are named minmaxscaler and svc. In general, the step names are just lowercase versions of the class names. If multiple steps have the same class, a number is appended:
- from sklearn.preprocessing import StandardScaler
- from sklearn.decomposition import PCA
- pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
- print("Pipeline steps:\n{}".format(pipe.steps))
As you can see, the first StandardScaler step was named standardscaler-1 and the second standardscaler-2. However, in such settings it might be better to use the Pipeline construction with explicit names, to give more semantic names to each step.
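For comparison, the same three steps with explicit names might look like the sketch below. The names scale_inputs, pca, and scale_components are hypothetical and chosen only to make the role of each step readable:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# explicit, descriptive names instead of standardscaler-1 / standardscaler-2
pipe_named = Pipeline([
    ("scale_inputs", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("scale_components", StandardScaler()),
])

step_names = [name for name, _ in pipe_named.steps]
print("Step names: {}".format(step_names))

# the names also become the prefixes of the parameter keys
param_keys = pipe_named.get_params()
print("pca__n_components: {}".format(param_keys["pca__n_components"]))
```

The chosen names are what you would later write in a parameter grid, for example pca__n_components.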
Accessing Step Attributes
As we discussed earlier in this chapter, one of the main reasons to use pipelines is for doing grid searches. A common task is to access some of the steps of a pipeline inside a grid search. Let’s grid search a LogisticRegression classifier on the cancer dataset, using Pipeline and StandardScaler to scale the data before passing it to the LogisticRegression classifier. First we create a pipeline using the make_pipeline function:
- from sklearn.pipeline import Pipeline
- import numpy as np
- from sklearn.pipeline import make_pipeline
- from sklearn.preprocessing import StandardScaler
- from sklearn.svm import SVC
- from sklearn.linear_model import LogisticRegression
- from sklearn.datasets import load_breast_cancer
- cancer = load_breast_cancer()
- pipe = make_pipeline(StandardScaler(), LogisticRegression())
- param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}
- from sklearn.model_selection import train_test_split
- X_train, X_test, y_train, y_test = train_test_split(
- cancer.data, cancer.target, random_state=4)
- from sklearn.model_selection import GridSearchCV
- grid = GridSearchCV(pipe, param_grid, cv=5)
- grid.fit(X_train, y_train)
- print("Best estimator:\n{}".format(grid.best_estimator_))
This best_estimator_ in our case is a pipeline with two steps, standardscaler and logisticregression. To access the logisticregression step, we can use the named_steps attribute of the pipeline, as explained earlier:
- print("Logistic regression step:\n{}".format(
- grid.best_estimator_.named_steps["logisticregression"]))
Now that we have the trained LogisticRegression instance, we can access the coefficients (weights) associated with each input feature:
- print("Logistic regression coefficients:\n{}".format(
- grid.best_estimator_.named_steps["logisticregression"].coef_))
This might be a somewhat lengthy expression, but often it comes in handy in understanding your models.
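To make the raw coefficient array easier to read, you could pair each weight with its feature name. The sketch below fits the pipeline directly instead of running a grid search, purely to stay short; max_iter=1000 and the decision to show the five largest weights are assumptions for this example:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(cancer.data, cancer.target)

# coef_ has shape (1, n_features) for a binary problem; flatten it
coef = pipe.named_steps["logisticregression"].coef_.ravel()

# sort features by the absolute size of their weight, largest first
order = np.argsort(np.abs(coef))[::-1]
top_features = [(cancer.feature_names[i], coef[i]) for i in order[:5]]
for name, weight in top_features:
    print("{:<25} {: .3f}".format(name, weight))
```

With a fitted grid search, the same lookup would go through grid.best_estimator_.named_steps instead of pipe.named_steps.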
Grid-Searching Preprocessing Steps and Model Parameters
Using pipelines, we can encapsulate all the processing steps in our machine learning workflow in a single scikit-learn estimator. Another benefit of doing this is that we can now adjust the parameters of the preprocessing using the outcome of a supervised task like regression or classification. In previous chapters, we used polynomial features on the boston dataset before applying the ridge regressor. Let’s model that using a pipeline instead. The pipeline contains three steps—scaling the data, computing polynomial features, and ridge regression:
- import numpy as np
- from sklearn.pipeline import Pipeline
- from sklearn.pipeline import make_pipeline
- from sklearn.datasets import load_boston
- from sklearn.model_selection import train_test_split
- from sklearn.preprocessing import StandardScaler
- from sklearn.linear_model import Ridge
- boston = load_boston()
- X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,
- random_state=0)
- from sklearn.preprocessing import PolynomialFeatures
- pipe = make_pipeline(
- StandardScaler(),
- PolynomialFeatures(),
- Ridge())
- from sklearn.model_selection import GridSearchCV
- param_grid = {'polynomialfeatures__degree': [1, 2, 3],
- 'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
- grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
- grid.fit(X_train, y_train)
- import os
- # only draw the heat map when a display is available
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     import matplotlib.pyplot as plt
-     plt.matshow(grid.cv_results_['mean_test_score'].reshape(3, -1),
-                 vmin=0, cmap="viridis")
-     plt.xlabel("ridge__alpha")
-     plt.ylabel("polynomialfeatures__degree")
-     plt.xticks(range(len(param_grid['ridge__alpha'])), param_grid['ridge__alpha'])
-     plt.yticks(range(len(param_grid['polynomialfeatures__degree'])),
-                param_grid['polynomialfeatures__degree'])
-     plt.colorbar()
-     plt.show()
Looking at the results produced by the cross-validation, we can see that using polynomials of degree two helps, but that degree-three polynomials are much worse than either degree one or two. This is reflected in the best parameters that were found:
- print("Best parameters: {}".format(grid.best_params_))
- # which leads to the following score:
- print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))
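Note that load_boston was removed from scikit-learn in version 1.2, so the listing above only runs on older releases. The following sketch repeats the same search on the bundled diabetes dataset instead. That substitution is an assumption made here to keep the example runnable, so the scores and the best parameters will not match the ones discussed in the text:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# diabetes stands in for the boston data (an assumption, see above)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())
param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

# rows follow polynomialfeatures__degree, columns follow ridge__alpha
mean_scores = grid.cv_results_['mean_test_score'].reshape(3, -1)
print("Mean CV scores (degree x alpha):\n{}".format(mean_scores.round(2)))
print("Best parameters: {}".format(grid.best_params_))
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))
```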
Let’s run a grid search without polynomial features for comparison:
- param_grid = {'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
- pipe = make_pipeline(StandardScaler(), Ridge())
- grid = GridSearchCV(pipe, param_grid, cv=5)
- grid.fit(X_train, y_train)
- print("Score without poly features: {:.2f}".format(grid.score(X_test, y_test)))
As we would expect from the grid search results visualized in Figure 7-4, using no polynomial features leads to decidedly worse results. Searching over preprocessing parameters together with model parameters is a very powerful strategy. However, keep in mind that GridSearchCV tries all possible combinations of the specified parameters. Each parameter you add multiplies the number of models that need to be built by the number of values tried for it, so the grid grows exponentially with the number of parameters.
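The size of a grid can be checked before any model is trained. The sketch below uses ParameterGrid to count the candidates for the polynomial-plus-ridge grid from this section, and then for a grid with one more parameter. Adding polynomialfeatures__interaction_only is a hypothetical extension chosen only to show the effect:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
n_folds = 5

# 3 degrees x 6 alphas
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * n_folds
print("Candidates: {}, fits with cv={}: {}".format(
    n_candidates, n_folds, n_fits))

# one extra parameter with two values doubles the work
# (hypothetical extension of the grid, for illustration only)
bigger_grid = dict(param_grid,
                   polynomialfeatures__interaction_only=[True, False])
n_candidates_bigger = len(ParameterGrid(bigger_grid))
n_fits_bigger = n_candidates_bigger * n_folds
print("Candidates: {}, fits with cv={}: {}".format(
    n_candidates_bigger, n_folds, n_fits_bigger))
```

With the default refit=True, GridSearchCV trains one additional model on the whole training set after the search.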
Grid-Searching Which Model To Use
You can go even further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say, whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully, since trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC on the cancer dataset. We know that the SVC might need the data to be scaled, so we also search over whether to use StandardScaler or no preprocessing. For the RandomForestClassifier, we know that no preprocessing is necessary. We start by defining the pipeline with explicitly named steps: one for the preprocessing and one for the classifier. We can instantiate this using SVC and StandardScaler:
- from sklearn.pipeline import Pipeline
- from sklearn.model_selection import train_test_split
- from sklearn.preprocessing import StandardScaler
- from sklearn.ensemble import RandomForestClassifier
- from sklearn.svm import SVC
- pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])
- from sklearn.model_selection import GridSearchCV
- from sklearn.datasets import load_breast_cancer
- cancer = load_breast_cancer()
- param_grid = [
- {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
- 'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
- 'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
- {'classifier': [RandomForestClassifier(n_estimators=100)],
- 'preprocessing': [None], 'classifier__max_features': [1, 2, 3]}]
- X_train, X_test, y_train, y_test = train_test_split(
- cancer.data, cancer.target, random_state=0)
- grid = GridSearchCV(pipe, param_grid, cv=5)
- grid.fit(X_train, y_train)
- print("Best params:\n{}\n".format(grid.best_params_))
- print("Best cross-validation score: {:.2f}".format(grid.best_score_))
- print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))
The outcome of the grid search is that SVC with StandardScaler preprocessing, C=10, and gamma=0.01 gave the best result.
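The scaler comparison mentioned at the start of this section (StandardScaler versus MinMaxScaler) works the same way: the preprocessing step itself becomes a searched parameter. The sketch below is a deliberately small version that keeps SVC at its defaults and uses the string "passthrough", which recent scikit-learn versions accept as a way to skip a step. The grid contents are assumptions for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("preprocessing", StandardScaler()), ("classifier", SVC())])

# search over the step itself: two scalers, or no scaling at all
scaler_grid = {"preprocessing": [StandardScaler(), MinMaxScaler(),
                                 "passthrough"]}
grid = GridSearchCV(pipe, scaler_grid, cv=5)
grid.fit(X_train, y_train)

n_candidates = len(grid.cv_results_["params"])
print("Candidates tried: {}".format(n_candidates))
print("Chosen preprocessing: {}".format(grid.best_params_["preprocessing"]))
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
```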