Summary of scikit-learn methods and usage
In this chapter, we briefly recapitulate the main parts of the scikit-learn API that we have seen so far, as well as show some ways to simplify your code.
The Estimator Interface
All algorithms in scikit-learn, whether preprocessing, supervised learning, or unsupervised learning algorithms, are implemented as classes. These classes are called estimators in scikit-learn. To apply an algorithm, you first have to instantiate an object of the particular class:
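For example, instantiating a logistic regression model might look like this (a minimal sketch; the object is not yet fit to any data):

```python
from sklearn.linear_model import LogisticRegression

# Instantiating an estimator creates an unfitted model object;
# parameters such as the regularization strength C are set here.
logreg = LogisticRegression(C=1.0)
```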
The estimator class contains the algorithm, and also stores the model that is learned from data using the algorithm. Constructing the model object is also the time to set any parameters of the model. These parameters include regularization, complexity control, number of clusters to find, and so on, as we discussed in detail in Chapter 2 and Chapter 3.
All estimators have a fit method, which is used to build the model. The fit method always requires as its first argument the data X, represented as a numpy array or a scipy sparse matrix with continuous (floating-point) entries, where each row represents a single data point. Supervised algorithms additionally require a y argument, a one-dimensional numpy array containing target values for regression or classification, i.e. the known output labels or responses.
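A minimal sketch of a fit call, using small made-up arrays for X and y:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per data point, one column per feature, floating-point entries.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 2.5]])
# y: one target label per row of X.
y = np.array([0, 0, 1, 1])

logreg = LogisticRegression()
logreg.fit(X, y)  # builds the model from the data
```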
There are two main ways to apply a learned model in scikit-learn. To create a prediction in the form of a new output like y, you use the predict method. To create a new representation of the input data X, you use the transform method. Table api_summary summarizes the use-cases of the predict and transform methods.
Table api_summary: use cases of the predict and transform methods

estimator.predict(X_test)        estimator.transform(X_test)
Classification                   Preprocessing
Regression                       Dimensionality reduction
Clustering                       Feature extraction
                                 Feature selection
Additionally, all supervised models have a score(X_test, y_test) method that allows an evaluation of the model. Here, X_train and y_train refer to the training data and training labels, while X_test and y_test refer to the test data and test labels (if applicable).
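For example, evaluating a classifier with score might look like this (a sketch; the iris dataset and train_test_split here are stand-ins for whatever data and split you use):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
# For classifiers, score returns the mean accuracy on the given data.
print(logreg.score(X_test, y_test))
```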
Fit resets a model
An important property of scikit-learn models is that calling fit will always reset everything a model previously learned. So if you build a model on one dataset, and then call fit again on a different dataset, the model will “forget” everything it learned from the first data. You can call fit as often as you like on a model, and the outcome will be the same as calling fit on a “new” model:
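A small sketch illustrating this with made-up data: fitting on a second dataset discards everything learned from the first.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X1 = np.array([[0.0], [1.0], [2.0], [3.0]])
y1 = np.array([0, 0, 1, 1])
X2 = X1.copy()
y2 = np.array([1, 1, 0, 0])  # the opposite labeling

logreg = LogisticRegression()
logreg.fit(X1, y1)
logreg.fit(X2, y2)  # discards everything learned from X1, y1
# The model now reflects only the second dataset:
print(logreg.predict(X2))
```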
Next, we will go into several shortcuts that allow you to write less code for common tasks, and speed up some computations. The first way to write more compact code is to make use of method chaining.
Method chaining
The fit method of all scikit-learn models returns self. This allows you to write code like this:
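For instance (a minimal sketch; the tiny X_train and y_train arrays are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])

# fit returns self, so the constructor call and fit can be chained:
logreg = LogisticRegression().fit(X_train, y_train)
```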
Here, we used the return value of fit (which is self) to assign the trained model to the variable logreg. This concatenation of method calls (here __init__ and then fit) is known as method chaining. Another common application of method chaining in scikit-learn is to fit and predict in one line:
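A sketch of fitting and predicting in one line, again with small made-up arrays:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[1.5], [2.5]])

# fit returns the fitted model, so predict can be chained onto it:
logreg = LogisticRegression()
y_pred = logreg.fit(X_train, y_train).predict(X_test)
```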
Finally, you can even do model instantiation, fitting and predicting in one line:
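Continuing with the same made-up X_train, y_train, and X_test from the previous sketch:

```python
# Instantiation, fitting, and prediction in a single expression:
y_pred = LogisticRegression().fit(X_train, y_train).predict(X_test)
```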
This very short variant is not ideal, though. A lot is happening in a single line, which might make the code hard to read. Additionally, the fitted logistic regression model isn't stored in any variable, so we can't inspect it or use it to predict on any other data.
Shortcuts and efficient alternatives
Often, you want to fit a model on some dataset, and then immediately predict on the same data, or transform it. These are very common tasks, which can often be computed more efficiently than simply calling fit and then predict, or fit and then transform. For this use case, all models that have a predict method also have a fit_predict method, and all models that have a transform method also have a fit_transform method. Here is an example using PCA:
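A sketch of the two equivalent ways to compute the PCA-transformed training data, using random data purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data purely for illustration.
X = np.random.RandomState(0).normal(size=(100, 5))

pca = PCA(n_components=2)
# Calling fit and then transform in sequence:
X_pca = pca.fit(X).transform(X)
# The same result, often computed more efficiently:
X_pca_2 = pca.fit_transform(X)
```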
For some unsupervised methods that we saw in Chapter 3, like some clustering and manifold learning methods, fit_transform and fit_predict are the only options. For example, DBSCAN does not have a predict method, only fit_predict, and t-SNE does not have a transform method, only fit_transform. t-SNE and DBSCAN are algorithms that cannot be applied to new data; they can only be applied to the training data.
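For example, clustering the training data with DBSCAN (a sketch with random data for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Random data purely for illustration.
X = np.random.RandomState(0).normal(size=(100, 2))

# DBSCAN has no predict method; cluster assignments for the training
# data come from fit_predict (noise points are labeled -1):
clusters = DBSCAN().fit_predict(X)
print(clusters)
```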
Important Attributes
scikit-learn has some standard attributes that allow you to inspect what a model learned. All these attributes are available after the call to fit, and, as we mentioned before, all attributes learned from the data are marked with a trailing underscore. We already discussed common attributes like coef_ and intercept_ for linear models, feature_importances_ for tree-based models, components_ for PCA and NMF, cluster_centers_ for k-means, and labels_ for clustering algorithms.
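As a quick reminder, a sketch with a linear regression model and made-up data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.1, 1.9, 3.2])

lr = LinearRegression().fit(X, y)
# Attributes estimated from the data carry a trailing underscore:
print(lr.coef_, lr.intercept_)
```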
Additionally, for classifiers, classes_ contains the names of the classes the classifier was trained on, that is, the unique entries of the training labels y_train:
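A minimal sketch with string labels, showing that classes_ reflects the unique training labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(["blue", "blue", "red", "red"])

logreg = LogisticRegression().fit(X, y)
# classes_ holds the unique labels seen during training:
print(logreg.classes_)        # ['blue' 'red']
# Predictions are expressed in terms of these classes:
print(logreg.predict(X[:1]))
```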