Thursday, April 6, 2017

[ Intro2ML ] Ch7. Algorithm Chains and Pipelines - Pipeline

The General Pipeline Interface 
The Pipeline class is not restricted to preprocessing and classification, but can in fact join any number of estimators together. For example, you could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. Similarly, the last step could be regression or clustering instead of classification. 

The only requirement for estimators in a pipeline is that all but the last step need to have a transform method, so they can produce a new representation of the data that can be used in the next step. Internally, during the call to Pipeline.fit, the pipeline calls fit and then transform on each step in turn, with the input given by the output of the transform method of the previous step. For the last step in the pipeline, just fit is called. 

Brushing over some finer details, this is implemented as follows. Remember that pipeline.steps is a list of tuples, so pipeline.steps[0][1] is the first estimator, pipeline.steps[1][1] is the second estimator, and so on: 
  def fit(self, X, y):
      X_transformed = X
      for name, estimator in self.steps[:-1]:
          # iterate over all but the final step
          # fit and transform the data
          X_transformed = estimator.fit_transform(X_transformed, y)
      # fit the last step
      self.steps[-1][1].fit(X_transformed, y)
      return self
When predicting using Pipeline, we similarly transform the data using all but the last step, and then call predict on the last step: 
  def predict(self, X):
      X_transformed = X
      for step in self.steps[:-1]:
          # iterate over all but the final step
          # transform the data
          X_transformed = step[1].transform(X_transformed)
      # predict using the last step
      return self.steps[-1][1].predict(X_transformed)
The process is illustrated in Figure 7-3 for two transformers, T1 and T2, and a classifier (called Classifier). 
Figure 7-3. Overview of the pipeline training and prediction process 
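To make the fit and predict loops above concrete, here is a minimal sketch that chains a transformer and a classifier by hand and compares the result with a Pipeline. The dataset (load_breast_cancer), the choice of MinMaxScaler and SVC, and all variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Manual chaining: fit_transform on the transformer, then fit on the last step
scaler = MinMaxScaler()
svm = SVC()
svm.fit(scaler.fit_transform(X_train), y_train)
# At prediction time: transform only, then predict
manual_pred = svm.predict(scaler.transform(X_test))

# The same two steps wrapped in a Pipeline
pipe = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC())])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

same = np.array_equal(manual_pred, pipe_pred)
print("Identical predictions:", same)
```

Both routes apply the same operations in the same order, so the predictions should agree; the pipeline simply saves us from keeping track of the intermediate data.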

The pipeline is actually even more general than this. There is no requirement for the last step in a pipeline to have a predict function, and we could create a pipeline just containing, for example, a scaler and PCA. Then, because the last step (PCA) has a transform method, we could call transform on the pipeline to get the output of PCA.transform applied to the data that was processed by the previous step. The last step of a pipeline is only required to have a fit method.
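As a small sketch of such a transformer-only pipeline, the following builds a scaler followed by PCA and calls transform on the pipeline itself. The cancer dataset and the variable names are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# No classifier at the end: the last step (PCA) only needs a fit method.
# Because PCA also has transform, we can call transform on the pipeline.
pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
pca_pipe.fit(X)
X_pca = pca_pipe.transform(X)

print("Original shape:", X.shape)
print("Transformed shape:", X_pca.shape)
```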

Convenient Pipeline Creation with make_pipeline 
Creating a pipeline using the syntax described earlier is sometimes a bit cumbersome, and we often don’t need user-specified names for each step. There is a convenience function, make_pipeline, that will create a pipeline for us and automatically name each step based on its class. The syntax for make_pipeline is as follows: 
  import numpy as np
  from sklearn.pipeline import Pipeline
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import MinMaxScaler
  from sklearn.svm import SVC

  # standard syntax
  pipe_long = Pipeline([("scaler", MinMaxScaler()), ("svm", SVC(C=100))])
  # abbreviated syntax
  pipe_short = make_pipeline(MinMaxScaler(), SVC(C=100))
The pipeline objects pipe_long and pipe_short do exactly the same thing, but pipe_short has steps that were automatically named. We can see the names of the steps by looking at the steps attribute: 
  print("Pipeline steps:\n{}".format(pipe_short.steps))
Output: 
Pipeline steps:
[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
('svc', SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma='auto',
kernel='rbf', max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=0.001,
verbose=False))]

The steps are named minmaxscaler and svc. In general, the step names are just lowercase versions of the class names. If multiple steps have the same class, a number is appended: 
  from sklearn.preprocessing import StandardScaler
  from sklearn.decomposition import PCA

  pipe = make_pipeline(StandardScaler(), PCA(n_components=2), StandardScaler())
  print("Pipeline steps:\n{}".format(pipe.steps))
Output: 
Pipeline steps:
[('standardscaler-1', StandardScaler(copy=True, with_mean=True, with_std=True)),
('pca', PCA(copy=True, iterated_power=4, n_components=2, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)),
('standardscaler-2', StandardScaler(copy=True, with_mean=True, with_std=True))]

As you can see, the first StandardScaler step was named standardscaler-1 and the second standardscaler-2. However, in such settings it might be better to use the Pipeline construction with explicit names, to give more semantic names to each step.
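A possible version of the same three-step pipeline with explicit names might look like the sketch below. The step names scale_inputs and scale_components are hypothetical, picked only to show how a name can describe what a step does.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical step names that describe the role of each scaler
pipe_named = Pipeline([
    ("scale_inputs", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("scale_components", StandardScaler()),
])

step_names = [name for name, _ in pipe_named.steps]
print("Step names:", step_names)
```

With names like these, a grid-search key such as scale_inputs__with_mean is easier to read than standardscaler-1__with_mean.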

Accessing Step Attributes 
As we discussed earlier in this chapter, one of the main reasons to use pipelines is for doing grid searches. A common task is to access some of the steps of a pipeline inside a grid search. Let’s grid search a LogisticRegression classifier on the cancer dataset, using Pipeline and StandardScaler to scale the data before passing it to the LogisticRegression classifier. First we create a pipeline using the make_pipeline function: 
  from sklearn.pipeline import Pipeline
  import numpy as np
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.svm import SVC
  from sklearn.linear_model import LogisticRegression
  from sklearn.datasets import load_breast_cancer

  cancer = load_breast_cancer()
  pipe = make_pipeline(StandardScaler(), LogisticRegression())
Next, we create a parameter grid. As explained in Chapter 2, the regularization parameter to tune for LogisticRegression is the parameter C. We use a logarithmic grid for this parameter, searching between 0.01 and 100. Because we used the make_pipeline function, the name of the LogisticRegression step in the pipeline is the lowercased class name, logisticregression. To tune the parameter C, we therefore have to specify a parameter grid for logisticregression__C:
  param_grid = {'logisticregression__C': [0.01, 0.1, 1, 10, 100]}

  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(
      cancer.data, cancer.target, random_state=4)

  from sklearn.model_selection import GridSearchCV
  grid = GridSearchCV(pipe, param_grid, cv=5)
  grid.fit(X_train, y_train)
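If you are unsure what a step is called or which parameters it exposes, one way to check is to list the keys returned by get_params on the pipeline. The sketch below rebuilds the same two-step pipeline so that it runs on its own; the variable step_params is an illustrative name.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Keys containing "__" are parameters of individual steps,
# written as <step name>__<parameter name>
step_params = sorted(k for k in pipe.get_params() if "__" in k)
for name in step_params:
    print(name)
```

Any of the printed keys can be used directly as a key in param_grid.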
So how do we access the coefficients of the best LogisticRegression model that was found by GridSearchCV? From Chapter 5 we know that the best model found by GridSearchCV, trained on all the training data, is stored in grid.best_estimator_:
  print("Best estimator:\n{}".format(grid.best_estimator_))
Output: 
Best estimator:
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))])

This best_estimator_ in our case is a pipeline with two steps, standardscaler and logisticregression. To access the logisticregression step, we can use the named_steps attribute of the pipeline, as explained earlier: 
  print("Logistic regression step:\n{}".format(
        grid.best_estimator_.named_steps["logisticregression"]))
Output: 
Logistic regression step:
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)

Now that we have the trained LogisticRegression instance, we can access the coefficients (weights) associated with each input feature: 
  print("Logistic regression coefficients:\n{}".format(
        grid.best_estimator_.named_steps["logisticregression"].coef_))
Output: 
Logistic regression coefficients:
[[-0.38856355 -0.37529972 -0.37624793 ...]]

This might be a somewhat lengthy expression, but often it comes in handy in understanding your models. 
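One way to put the coefficients to use is to pair them with the feature names of the dataset. The following is a rough sketch that fits its own pipeline rather than reusing the grid object; C=0.1 mirrors the value shown in the output above, while max_iter=1000 and the variable names are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
pipe = make_pipeline(StandardScaler(),
                     LogisticRegression(C=0.1, max_iter=1000))
pipe.fit(cancer.data, cancer.target)

# coef_ has shape (1, n_features) for a binary problem; take the single row
coef = pipe.named_steps["logisticregression"].coef_[0]

# Sort features by the absolute size of their weight, largest first
order = np.argsort(np.abs(coef))[::-1]
top_features = [(str(cancer.feature_names[i]), float(coef[i]))
                for i in order[:5]]
for name, weight in top_features:
    print("{:<25s} {: .3f}".format(name, weight))
```

Because the data was standardized before fitting, the magnitudes of the weights are at least roughly comparable across features.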

Grid-Searching Preprocessing Steps and Model Parameters 
Using pipelines, we can encapsulate all the processing steps in our machine learning workflow in a single scikit-learn estimator. Another benefit of doing this is that we can now adjust the parameters of the preprocessing using the outcome of a supervised task like regression or classification. In previous chapters, we used polynomial features on the boston dataset before applying the ridge regressor. Let’s model that using a pipeline instead. The pipeline contains three steps—scaling the data, computing polynomial features, and ridge regression: 
  import numpy as np
  from sklearn.pipeline import Pipeline
  from sklearn.pipeline import make_pipeline
  from sklearn.datasets import load_boston
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import StandardScaler
  from sklearn.preprocessing import PolynomialFeatures
  from sklearn.linear_model import Ridge

  # Note: load_boston was removed in scikit-learn 1.2,
  # so this listing needs an older release to run as written
  boston = load_boston()
  X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                      random_state=0)

  # scale, expand to polynomial features, then fit a ridge regressor
  pipe = make_pipeline(
      StandardScaler(),
      PolynomialFeatures(),
      Ridge())
How do we know which degrees of polynomials to choose, or whether to choose any polynomials or interactions at all? Ideally we want to select the degree parameter based on the outcome of the regression. Using our pipeline, we can search over the degree parameter together with the parameter alpha of Ridge. To do this, we define a param_grid that contains both, appropriately prefixed by the step names:
  from sklearn.model_selection import GridSearchCV
  param_grid = {'polynomialfeatures__degree': [1, 2, 3],
                'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
  grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, n_jobs=-1)
  grid.fit(X_train, y_train)
We can visualize the outcome of the cross-validation using a heat map (Figure 7-4), as we did in Chapter 6: 
  import os
  # only plot when a display is available
  dmode = os.environ.get('DISPLAY', '')
  if dmode:
      import matplotlib.pyplot as plt
      # one row per degree, one column per alpha
      plt.matshow(grid.cv_results_['mean_test_score'].reshape(3, -1),
                  vmin=0, cmap="viridis")
      plt.xlabel("ridge__alpha")
      plt.ylabel("polynomialfeatures__degree")
      plt.xticks(range(len(param_grid['ridge__alpha'])), param_grid['ridge__alpha'])
      plt.yticks(range(len(param_grid['polynomialfeatures__degree'])),
                 param_grid['polynomialfeatures__degree'])

      plt.colorbar()
      plt.show()
Figure 7-4. Heat map of mean cross-validation score as a function of the degree of the polynomial features and alpha parameter of Ridge 
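The reshape(3, -1) call in the plotting code depends on the order in which the candidates are stored in cv_results_. As a rough check, and assuming GridSearchCV enumerates a single dictionary grid in the same order as sklearn.model_selection.ParameterGrid, the sketch below lists the degree of every candidate; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# ParameterGrid sorts the keys by name and varies the last key fastest
candidates = list(ParameterGrid(param_grid))
degrees = [c['polynomialfeatures__degree'] for c in candidates]
alphas_first_row = [c['ridge__alpha'] for c in candidates[:6]]

print("Degrees in candidate order:", degrees)
print("Alphas for the first six candidates:", alphas_first_row)
```

Under that assumption, each block of six consecutive scores shares one degree, which is why reshaping into three rows gives one row per degree and one column per alpha.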

Looking at the results produced by the cross-validation, we can see that using polynomials of degree two helps, but that degree-three polynomials are much worse than either degree one or two. This is reflected in the best parameters that were found: 
  print("Best parameters: {}".format(grid.best_params_))
  # Which lead to the following score:
  print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))
Output: 
Best parameters: {'ridge__alpha': 10, 'polynomialfeatures__degree': 2}
Test-set score: 0.77

Let’s run a grid search without polynomial features for comparison: 
  param_grid = {'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
  pipe = make_pipeline(StandardScaler(), Ridge())
  grid = GridSearchCV(pipe, param_grid, cv=5)
  grid.fit(X_train, y_train)
  print("Score without poly features: {:.2f}".format(grid.score(X_test, y_test)))
Output: 
Score without poly features: 0.63

As we would expect looking at the grid search results visualized in Figure 7-4, using no polynomial features leads to decidedly worse results. Searching over preprocessing parameters together with model parameters is a very powerful strategy. However, keep in mind that GridSearchCV tries all possible combinations of the specified parameters. Therefore, adding more parameters to your grid exponentially increases the number of models that need to be built.
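To get a feel for how quickly the work grows, you can count the candidates before running a search. The sketch below uses ParameterGrid for the counting; the extra parameter polynomialfeatures__interaction_only is added here purely as an illustration and is not part of the search above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'polynomialfeatures__degree': [1, 2, 3],
              'ridge__alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
cv = 5

# Number of parameter combinations: 3 degrees times 6 alphas
n_candidates = len(ParameterGrid(param_grid))
# Each candidate is fitted once per fold (a final refit comes on top)
n_fits = n_candidates * cv

# Illustrative assumption: one more parameter with two values
bigger_grid = dict(param_grid,
                   polynomialfeatures__interaction_only=[True, False])
n_bigger = len(ParameterGrid(bigger_grid))

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_fits)
print("Candidates with one more parameter:", n_bigger)
```

Every additional parameter multiplies the number of candidates by the number of values it can take, so a few extra parameters can make a search impractical.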

Grid-Searching Which Model To Use 
You can even go further in combining GridSearchCV and Pipeline: it is also possible to search over the actual steps being performed in the pipeline (say, whether to use StandardScaler or MinMaxScaler). This leads to an even bigger search space and should be considered carefully. Trying all possible solutions is usually not a viable machine learning strategy. However, here is an example comparing a RandomForestClassifier and an SVC on the cancer dataset. We know that the SVC might need the data to be scaled, so we also search over whether to use StandardScaler or no preprocessing. For the RandomForestClassifier, we know that no preprocessing is necessary. We start by defining the pipeline. Here, we explicitly name the steps. We want two steps, one for the preprocessing and then a classifier. We can instantiate this using SVC and StandardScaler:
  from sklearn.pipeline import Pipeline
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import StandardScaler
  from sklearn.svm import SVC

  # two named steps: one for preprocessing, one for the classifier
  pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])
Now we can define the param_grid to search over. We want the classifier to be either RandomForestClassifier or SVC. Because they have different parameters to tune, and need different preprocessing, we can make use of the list of search grids we discussed in “Search over spaces that are not grids”. To assign an estimator to a step, we use the name of the step as the parameter name. When we want to skip a step in the pipeline (for example, because we don’t need preprocessing for the RandomForest), we can set that step to None:
  from sklearn.model_selection import GridSearchCV
  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import RandomForestClassifier
  cancer = load_breast_cancer()
  param_grid = [
      {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
       'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
       'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100]},
      {'classifier': [RandomForestClassifier(n_estimators=100)],
       'preprocessing': [None], 'classifier__max_features': [1, 2, 3]}]
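Each entry of such a grid amounts to a set of parameter assignments on the pipeline. As a minimal sketch of what one candidate from the second dictionary looks like when applied by hand with set_params (the grid search performs a comparable assignment on a copy of the pipeline for every candidate, as far as I understand it):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])

# Swap in a different classifier, skip preprocessing, and set a
# parameter of the new classifier, all through step names
pipe.set_params(classifier=RandomForestClassifier(n_estimators=100),
                preprocessing=None,
                classifier__max_features=2)

classifier_name = type(pipe.named_steps['classifier']).__name__
print("Classifier step:", classifier_name)
print("Preprocessing step:", pipe.named_steps['preprocessing'])
```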
Now we can instantiate and run the grid search as usual, here on the cancer dataset: 
  X_train, X_test, y_train, y_test = train_test_split(
      cancer.data, cancer.target, random_state=0)

  grid = GridSearchCV(pipe, param_grid, cv=5)
  grid.fit(X_train, y_train)

  print("Best params:\n{}\n".format(grid.best_params_))
  print("Best cross-validation score: {:.2f}".format(grid.best_score_))
  print("Test-set score: {:.2f}".format(grid.score(X_test, y_test)))
Output: 
Best params:
{'classifier__gamma': 0.01, 'preprocessing': StandardScaler(copy=True, with_mean=True, with_std=True), 'classifier': SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False), 'classifier__C': 10}

Best cross-validation score: 0.99
Test-set score: 0.98

The outcome of the grid search is that SVC with StandardScaler preprocessing, C=10, and gamma=0.01 gave the best result.
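Rather than reading the winner off the printed dictionary, you can also inspect the steps of best_estimator_ directly. The following self-contained sketch runs a deliberately reduced version of the search so that it finishes quickly; the smaller grids, cv=3, n_estimators=20, and random_state=0 are assumptions made for speed and are not the settings used above, so the result may differ.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])

# Reduced grids (assumption): 2 * 2 * 2 SVC candidates plus 3 forest candidates
param_grid = [
    {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
     'classifier__C': [1, 10], 'classifier__gamma': [0.01, 0.1]},
    {'classifier': [RandomForestClassifier(n_estimators=20, random_state=0)],
     'preprocessing': [None], 'classifier__max_features': [1, 2, 3]}]

grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)

# Look at the steps of the refitted best pipeline
best_pipe = grid.best_estimator_
best_classifier = type(best_pipe.named_steps['classifier']).__name__
uses_scaling = best_pipe.named_steps['preprocessing'] is not None

print("Best classifier:", best_classifier)
print("Uses scaling:", uses_scaling)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
```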
