程式扎記: [ Intro2ML ] Ch7. Algorithm Chains and Pipelines

The General Pipeline Interface
The Pipeline class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. Similarly, the last step could be regression or clustering instead of classification.

The only requirement for estimators in a pipeline is that all but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step. Internally, during the call to Pipeline.fit, the pipeline calls fit and then transform on each step in turn, with the input given by the output of the transform method of the previous step. For the last step in the pipeline, just fit is called.

Brushing over some finer details, this is implemented as follows. Remember that pipeline.steps is a list of tuples, so pipeline.steps[0][1] is the first estimator, pipeline.steps[1][1] is the second estimator, and so on:

def fit(self, X, y):  
    X_transformed = X  
    for name, estimator in self.steps[:-1]:  
        # iterate over all but the final step  
        # fit and transform the data  
        X_transformed = estimator.fit_transform(X_transformed, y)  
    # fit the last step  
    self.steps[-1][1].fit(X_transformed, y)  
    return self  

When predicting using Pipeline, we similarly transform the data using all but the last step, and then call predict on the last step:

def predict(self, X):  
    X_transformed = X  
    for step in self.steps[:-1]:  
        # iterate over all but the final step  
        # transform the data  
        X_transformed = step[1].transform(X_transformed)  
    # predict using the last step  
    return self.steps[-1][1].predict(X_transformed)  

The process is illustrated in Figure 7-3 for two transformers, T1 and T2, and a classifier (called Classifier).

Figure 7-3. Overview of the pipeline training and prediction process
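
As a rough check of the logic sketched above, the following example fits a Pipeline and then repeats the same steps by hand, comparing the two sets of predictions. The choice of MinMaxScaler, SVC, and the cancer dataset is only for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# let the pipeline chain the two steps
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# the same two steps, chained by hand
scaler = MinMaxScaler()
svm = SVC()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform on all but the last step
svm.fit(X_train_scaled, y_train)                 # only fit on the last step
manual_pred = svm.predict(scaler.transform(X_test))  # transform, then predict

same_predictions = np.array_equal(pipe_pred, manual_pred)
print("Identical predictions: {}".format(same_predictions))
```

Both routes should give the same predictions, since the pipeline does nothing beyond calling the same methods in the same order.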

The pipeline is actually even more general than this. There is no requirement for the last step in a pipeline to have a predict function, and we could create a pipeline just containing, for example, a scaler and PCA. Then, because the last step (PCA) has a transform method, we could call transform on the pipeline to get the output of PCA.transform applied to the data that was processed by the previous step. The last step of a pipeline is only required to have a fit method.
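
A minimal sketch of such a transformer-only pipeline, assuming the cancer dataset and two principal components purely for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# no predictor at the end: the last step (PCA) only has fit and transform
pipe = Pipeline([("scaler", StandardScaler()), ("pca", PCA(n_components=2))])
pipe.fit(X)

# transform on the pipeline = PCA.transform applied to the scaled data
X_pca = pipe.transform(X)
print("Transformed shape: {}".format(X_pca.shape))
```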

Convenient Pipeline Creation with make_pipeline
Creating a pipeline using the syntax described earlier is sometimes a bit cumbersome, and we often don’t need user-specified names for each step. There is a convenience function, make_pipeline, that will create a pipeline for us and automatically name each step based on its class. The syntax for make_pipeline is as follows:

import numpy as np  
from sklearn.pipeline import Pipeline  
from sklearn.pipeline import make_pipeline  
from sklearn.preprocessing import MinMaxScaler  
from sklearn.svm import SVC  
  
# standard syntax  
pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])  
# abbreviated syntax  
pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))  

The pipeline objects pipe_long and pipe_short do exactly the same thing, but pipe_short has steps that were automatically named. We can see the names of the steps by looking at the steps attribute:

print("Pipeline steps:\n{}".format(pipe_short.steps))  

Output:

Pipeline steps:
[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
('svc', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto',
kernel='rbf', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=0.001,
verbose=False))]

The steps are named minmaxscaler and svc. In general, the step names are just lowercase versions of the class names. If multiple steps have the same class, a number is appended:

from sklearn.preprocessing import StandardScaler  
from sklearn.decomposition import PCA  
  
pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())  
print("Pipeline steps:\n{}".format(pipe.steps))  

Output:

Pipeline steps:
[('standardscaler-1', StandardScaler(copy=True, with_mean=True, with_std=True)),
('pca', PCA(copy=True, iterated_power=4, n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)),
('standardscaler-2', StandardScaler(copy=True, with_mean=True, with_std=True))]

As you can see, the first StandardScaler step was named standardscaler-1 and the second standardscaler-2. However, in such settings it might be better to use the Pipeline construction with explicit names, to give more semantic names to each step.
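
For comparison, here is the same three-step pipeline built with explicit names. The names scale_raw and scale_components are made up for this example; any strings that describe the role of each step would do:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# hypothetical, more descriptive step names
pipe = Pipeline([("scale_raw", StandardScaler()),
                 ("pca", PCA(n_components=2)),
                 ("scale_components", StandardScaler())])

step_names = [name for name, _ in pipe.steps]
print("Step names: {}".format(step_names))

# the names are also the keys of named_steps
second_scaler = pipe.named_steps["scale_components"]
```

These names are also what you would use as prefixes in a parameter grid, for example pca__n_components.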

Accessing Step Attributes
As we discussed earlier in this chapter, one of the main reasons to use pipelines is for doing grid searches. A common task is to access some of the steps of a pipeline inside a grid search. Let’s grid search a LogisticRegression classifier on the cancer dataset, using Pipeline and StandardScaler to scale the data before passing it to the LogisticRegression classifier. First we create a pipeline using the make_pipeline function:

from sklearn.pipeline import Pipeline  
import numpy as np  
from sklearn.pipeline import make_pipeline  
from sklearn.preprocessing import StandardScaler  
from sklearn.svm import SVC  
from sklearn.linear_model import LogisticRegression  
from sklearn.datasets import load_breast_cancer  
  
cancer = load_breast_cancer()  
pipe = make_pipeline(StandardScaler(), LogisticRegression())  

Next, we create a parameter grid. As explained in Chapter 2, the regularization parameter to tune for LogisticRegression is the parameter C. We use a logarithmic grid for this parameter, searching between 0.01 and 100. Because we used the make_pipeline function, the name of the LogisticRegression step in the pipeline is the lowercased class name, logisticregression. To tune the parameter C, we therefore have to specify a parameter grid for logisticregression__C:

param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}  
  
from sklearn.model_selection import train_test_split  
X_train, X_test, y_train, y_test = train_test_split(  
    cancer.data, cancer.target, random_state=4)  
  
from sklearn.model_selection import GridSearchCV  
grid = GridSearchCV(pipe, param_grid, cv=5)  
grid.fit(X_train, y_train)  
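
If you are unsure which step__parameter names a pipeline accepts, one way to find out is to list the keys returned by get_params. The snippet below is a small sketch of that idea; the variable name tunable is just for illustration:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# parameters of the individual steps contain a double underscore
tunable = sorted(name for name in pipe.get_params() if "__" in name)
for name in tunable:
    print(name)
```

Any of the printed names can be used as a key in param_grid.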

So how do we access the coefficients of the best LogisticRegression model that was found by GridSearchCV? From Chapter 5 we know that the best model found by GridSearchCV, trained on all the training data, is stored in grid.best_estimator_:

print("Best estimator:\n{}".format(grid.best_estimator_))  

Output:

Best estimator:
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))])

This best_estimator_ in our case is a pipeline with two steps, standardscaler and logisticregression. To access the logisticregression step, we can use the named_steps attribute of the pipeline, as explained earlier:

print("Logistic regression step:\n{}".format(  
      grid.best_estimator_.named_steps["logisticregression"]))  

Output:

Logistic regression step:
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

Now that we have the trained LogisticRegression instance, we can access the coefficients (weights) associated with each input feature:

print("Logistic regression coefficients:\n{}".format(  
      grid.best_estimator_.named_steps["logisticregression"].coef_))  

Output:

Logistic regression coefficients:
[[-0.38856355 -0.37529972 -0.37624793 ...]]

This might be a somewhat lengthy expression, but often it comes in handy in understanding your models.
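
One possible way to make the coefficients easier to read is to pair them with the feature names of the dataset. The following self-contained sketch repeats the grid search and then lists the five coefficients with the largest magnitude. Setting max_iter=1000 is an assumption made here only to avoid convergence warnings in recent scikit-learn versions:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=4)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# pull the fitted classifier out of the best pipeline
logreg = grid.best_estimator_.named_steps["logisticregression"]

# coef_ has shape (1, n_features) for a binary problem
coef_by_feature = dict(zip(cancer.feature_names, logreg.coef_[0]))
top5 = sorted(coef_by_feature.items(), key=lambda kv: abs(kv[1]), reverse=True)[:5]
for name, coef in top5:
    print("{:<25} {: .3f}".format(name, coef))
```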

Grid-Searching Preprocessing Steps and Model Parameters
Using pipelines, we can encapsulate all the processing steps in our machine learning workflow in a single scikit-learn estimator. Another benefit of doing this is that we can now adjust the parameters of the preprocessing using the outcome of a supervised task like regression or classification. In previous chapters, we used polynomial features on the boston dataset before applying the ridge regressor. Let’s model that using a pipeline instead. The pipeline contains three steps—scaling the data, computing polynomial features, and ridge regression:

import numpy as np  
from sklearn.pipeline import Pipeline  
from sklearn.pipeline import make_pipeline  
from sklearn.datasets import load_boston  
from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  
from sklearn.linear_model import Ridge  
  
boston = load_boston()  
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,  
                                                    random_state=0)  
  
  
from sklearn.preprocessing import PolynomialFeatures  
pipe = make_pipeline(  
    StandardScaler(),  
    PolynomialFeatures(),  
    Ridge())  

How do we know which degrees of polynomials to choose, or whether to choose any polynomials or interactions at all? Ideally, we want to select the degree parameter based on the outcome of the regression. Using our pipeline, we can search over the degree parameter together with the parameter alpha of Ridge. To do this, we define a param_grid that contains both, appropriately prefixed by the step names:

from sklearn.model_selection import GridSearchCV  
param_grid = {'polynomialfeatures__degree': [1, 2, 3],  
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}  
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)  
grid.fit(X_train, y_train)  

We can visualize the outcome of the cross-validation using a heat map (Figure 7-4), as we did in Chapter 6:

import os  
dmode = os.environ.get('DISPLAY', '')  
if dmode:  
    import matplotlib.pyplot as plt  
    plt.matshow(grid.cv_results_['mean_test_score'].reshape(3, -1),  
                vmin=0, cmap="viridis")  
    plt.xlabel("ridge__alpha")  
    plt.ylabel("polynomialfeatures__degree")  
    plt.xticks(range(len(param_grid['ridge__alpha'])), param_grid['ridge__alpha'])  
    plt.yticks(range(len(param_grid['polynomialfeatures__degree'])),  
               param_grid['polynomialfeatures__degree'])  
  
    plt.colorbar()  
    plt.show()  

Figure 7-4. Heat map of mean cross-validation score as a function of the degree of the polynomial features and alpha parameter of Ridge
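
When no display is available, the same grid of scores can be printed as a small table. The sketch below is not the experiment from the text: load_boston was removed from scikit-learn in version 1.2, so it assumes load_diabetes as a stand-in regression dataset, and the scores it prints will differ from those in Figure 7-4:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# stand-in dataset (assumption), not the boston data used in the text
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

degrees = [1, 2, 3]
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
param_grid = {'polynomialfeatures__degree': degrees, 'ridge__alpha': alphas}

pipe = make_pipeline(StandardScaler(), PolynomialFeatures(), Ridge())
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

# candidates are ordered with the alphabetically last parameter (ridge__alpha)
# varying fastest, so each row corresponds to one degree
scores = grid.cv_results_['mean_test_score'].reshape(len(degrees), len(alphas))
for degree, row in zip(degrees, scores):
    print("degree={}: {}".format(degree, np.round(row, 2)))
```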

Looking at the results produced by the cross-validation, we can see that using polynomials of degree two helps, but that degree-three polynomials are much worse than either degree one or two. This is reflected in the best parameters that were found:

print("Best parameters: {}".format(grid.best_params_))  
# Which lead to the following score:  
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))  

Output:

Best parameters: {'ridge__alpha': 10, 'polynomialfeatures__degree': 2}
Test-set score: 0.77

Let’s run a grid search without polynomial features for comparison:

param_grid = {'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}  
pipe = make_pipeline(StandardScaler(), Ridge())  
grid = GridSearchCV(pipe, param_grid, cv=5)  
grid.fit(X_train, y_train)  
print("Score without poly features: {:.2f}".format(grid.score(X_test, y_test)))  

Output:

Score without poly features: 0.63

As we would expect looking at the grid search results visualized in Figure 7-4, using no polynomial features leads to decidedly worse results. Searching over preprocessing parameters together with model parameters is a very powerful strategy. However, keep in mind that GridSearchCV tries all possible combinations of the specified parameters. Therefore, adding more parameters to your grid exponentially increases the number of models that need to be built.
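
To get a feeling for how quickly the search grows, the number of candidates can be counted with ParameterGrid before running anything. In this sketch the extra parameter polynomialfeatures__interaction_only is added only as an example of a further setting one might want to search over:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

n_candidates = len(ParameterGrid(param_grid))   # 3 * 6 combinations
cv = 5
n_fits = n_candidates * cv                      # one model per candidate and fold
print("Candidates: {}, fits with cv={}: {}".format(n_candidates, cv, n_fits))

# adding one more parameter with two values doubles the number of candidates
bigger_grid = dict(param_grid)
bigger_grid['polynomialfeatures__interaction_only'] = [True, False]
n_candidates_bigger = len(ParameterGrid(bigger_grid))
print("Candidates with one more parameter: {}".format(n_candidates_bigger))
```

Each additional parameter multiplies the number of candidates by the number of values it can take, and GridSearchCV fits one more model at the end when it refits the best candidate on the whole training set.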

Grid-Searching Which Model To Use
You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say, whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC on the cancer dataset. We know that the SVC might need the data to be scaled, so we also search over whether to use StandardScaler or no preprocessing. For the RandomForestClassifier, we know that no preprocessing is necessary. We start by defining the pipeline. Here, we explicitly name the steps. We want two steps, one for the preprocessing and then a classifier. We can instantiate this using SVC and StandardScaler:

from sklearn.model_selection import train_test_split  
from sklearn.pipeline import Pipeline  
from sklearn.preprocessing import StandardScaler  
from sklearn.svm import SVC  
  
# two explicitly named steps: one preprocessing slot and one classifier slot  
pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])  

Now we can define the param_grid to search over. We want the classifier to be either RandomForestClassifier or SVC. Because they have different parameters to tune, and need different preprocessing, we can make use of the list of search grids we discussed in “Search over spaces that are not grids”. To assign an estimator to a step, we use the name of the step as the parameter name. When we want to skip a step in the pipeline (for example, because we don’t need preprocessing for the RandomForestClassifier), we can set that step to None:

from sklearn.ensemble import RandomForestClassifier  
from sklearn.model_selection import GridSearchCV  
from sklearn.datasets import load_breast_cancer  
cancer = load_breast_cancer()  
# one grid per classifier, because each has its own parameters and preprocessing  
param_grid = [  
    {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],  
     'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],  
     'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},  
    {'classifier': [RandomForestClassifier(n_estimators=100)],  
     'preprocessing': [None], 'classifier__max_features': [1, 2, 3]}]  
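
To see what setting a step to None does, the following sketch disables the preprocessing step with set_params and compares the result with a bare SVC. It is only meant as an illustration; newer scikit-learn releases also accept the string 'passthrough' for the same purpose:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])

# skip the preprocessing step: the data goes to the classifier unchanged
pipe.set_params(preprocessing=None)
pipe.fit(X_train, y_train)

# a classifier trained directly on the unscaled data
bare_svc = SVC().fit(X_train, y_train)

step_skipped = np.array_equal(pipe.predict(X_test), bare_svc.predict(X_test))
print("Same predictions as a bare SVC: {}".format(step_skipped))
```

This is what GridSearchCV does internally for every candidate in the grid whose preprocessing entry is None.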

Now we can instantiate and run the grid search as usual, here on the cancer dataset:

X_train, X_test, y_train, y_test = train_test_split(  
    cancer.data, cancer.target, random_state=0)  
  
grid = GridSearchCV(pipe, param_grid, cv=5)  
grid.fit(X_train, y_train)  
  
print("Best params:\n{}\n".format(grid.best_params_))  
print("Best cross-validation score: {:.2f}".format(grid.best_score_))  
print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))  

Output:

Best params:
{'classifier__gamma': 0.01, 'preprocessing': StandardScaler(copy=True, with_mean=True, with_std=True), 'classifier': SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False), 'classifier__C': 10}

Best cross-validation score: 0.99
Test-set score: 0.98

The outcome of the grid search is that SVC with StandardScaler preprocessing, C=10, and gamma=0.01 gave the best result.

Thursday, April 6, 2017