Another useful part of the scikit-learn interface that we haven’t talked about yet is the ability of classifiers to provide uncertainty estimates of predictions.
Often, you are not only interested in which class a classifier predicts for a certain test point, but also how certain it is that this is the right class. In practice, different kinds of mistakes lead to very different outcomes in real-world applications. Imagine a medical application testing for cancer: a false positive prediction might lead to a patient undergoing additional tests, while a false negative might lead to a serious disease going untreated.
We will go into this topic in more detail in Chapter 6 (Model Selection).
There are two different functions in scikit-learn that can be used to obtain uncertainty estimates from classifiers: decision_function and predict_proba. Most (but not all) classifiers have at least one of them, and many classifiers have both. Let’s look at what these two functions do on a synthetic two-dimensional dataset, using a GradientBoostingClassifier, which has both a decision_function and a predict_proba method.
- ch2_t38.py
- from sklearn.ensemble import GradientBoostingClassifier
- from sklearn.datasets import make_blobs, make_circles
- from sklearn.model_selection import train_test_split
- import numpy as np
- # X, y = make_blobs(centers=2, random_state=59)
- X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
- # we rename the classes "blue" and "red" for illustration purposes:
- y_named = np.array(["blue", "red"])[y]
- # we can call train_test_split with arbitrarily many arrays;
- # all will be split in a consistent manner
- X_train, X_test, y_train_named, y_test_named, y_train, y_test = \
- train_test_split(X, y_named, y, random_state=0)
- # build the gradient boosting model
- gbrt = GradientBoostingClassifier(random_state=0)
- gbrt.fit(X_train, y_train_named)
- print("accuracy on training set: %f" % gbrt.score(X_train, y_train_named))
- print("accuracy on test set: %f" % gbrt.score(X_test, y_test_named))
In the binary classification case, the return value of decision_function is of shape (n_samples,): it returns one floating-point number for each sample.
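For example, continuing with the gbrt model fit in ch2_t38.py above, we can check the shape directly:
- print("X_test.shape: %s" % str(X_test.shape))
- print("decision function shape: %s" % str(gbrt.decision_function(X_test).shape))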
This value encodes how strongly the model believes a data point belongs to the “positive” class, in this case class 1. Positive values indicate a preference for the positive class, and negative values indicate a preference for the “negative” class, that is, the other class.
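Printing the first few entries shows a mix of positive and negative scores (the exact values depend on the fitted model and your scikit-learn version):
- # show the first few decision_function values for the test set
- print("decision function:\n%s" % gbrt.decision_function(X_test)[:6])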
We can recover the prediction by looking only at the sign of the decision function:
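A minimal sketch of this comparison, continuing with the fitted gbrt model (True corresponds to the second class, “red”):
- # compare the sign of the decision function to the predictions
- print("thresholded decision function:\n%s" % (gbrt.decision_function(X_test) > 0))
- print("predictions:\n%s" % gbrt.predict(X_test))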
For binary classification, the “negative” class is always the first entry of the classes_ attribute, and the “positive” class is the second entry of classes_. So if you want to fully recover the output of predict, you need to make use of the classes_ attribute:
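For example:
- # make the boolean True/False into 0 and 1
- greater_zero = (gbrt.decision_function(X_test) > 0).astype(int)
- # use 0 and 1 as indices into classes_
- pred = gbrt.classes_[greater_zero]
- # pred is the same as the output of gbrt.predict
- print("pred is equal to predictions: %s" % np.all(pred == gbrt.predict(X_test)))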
The range of decision_function can be arbitrary, and depends on the data and the model parameters:
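For instance, the minimum and maximum on our test set:
- decision_values = gbrt.decision_function(X_test)
- print("decision function minimum: %.2f maximum: %.2f" % (decision_values.min(), decision_values.max()))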
This arbitrary scaling often makes the output of decision_function hard to interpret. Below we plot the decision_function for all points in the 2D plane using color coding, next to a visualization of the decision boundary, as we saw in Chapter 2. We show training points as circles and test data as triangles.
- import matplotlib.pyplot as plt
- import mglearn
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     fig, axes = plt.subplots(1, 2, figsize=(13, 5))
-     mglearn.tools.plot_2d_separator(gbrt, X, ax=axes[0], alpha=.4, fill=True, cm=mglearn.cm2)
-     scores_image = mglearn.tools.plot_2d_scores(gbrt, X, ax=axes[1], alpha=.4, cm='bwr')
-     for ax in axes:
-         # plot training points as circles and test points as triangles
-         ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=mglearn.cm2, s=60, marker='^')
-         ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=mglearn.cm2, s=60)
-     plt.colorbar(scores_image, ax=axes.tolist())
-     plt.show()
Encoding not only the predicted outcome but also how certain the classifier is provides additional information. However, in this visualization it is hard to make out the boundary between the two classes.
Predicting probabilities
The output of predict_proba, in contrast, is a probability for each class, and is often easier to understand. It is always of shape (n_samples, 2) for binary classification.
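Continuing with the gbrt model from above:
- print("shape of probabilities: %s" % str(gbrt.predict_proba(X_test).shape))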
The first entry in each row is the estimated probability of the first class, and the second entry is the estimated probability of the second class. Because it is a probability, the output of predict_proba is always between 0 and 1, and the entries for both classes always sum to 1.
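For example:
- # show the first few entries of predict_proba
- print("predicted probabilities:\n%s" % gbrt.predict_proba(X_test[:6]))
- # the entries in each row sum to one
- print("row sums: %s" % gbrt.predict_proba(X_test[:6]).sum(axis=1))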
Because the probabilities for the two classes sum to one, exactly one of the classes is above 50% certainty. That class is the one that is predicted.
You can see in the output above that the classifier is relatively certain for most points. How well the reported uncertainty reflects the actual uncertainty in the data depends on the model and the parameters. A more overfit model tends to make more certain predictions, even if they might be wrong, while a less complex model usually has more uncertainty in its predictions. A model is called calibrated if the reported uncertainty actually matches how correct it is: in a calibrated model, a prediction made with 70% certainty would be correct 70% of the time.
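As a quick sketch of how you might check this (an addition using scikit-learn’s calibration_curve, not part of the original example; with a test set this small the bins are very noisy):
- # compare predicted probabilities to the observed fraction of "red" per bin
- from sklearn.calibration import calibration_curve
- prob_red = gbrt.predict_proba(X_test)[:, 1]  # probability of the "red" class
- frac_red, mean_pred = calibration_curve(y_test_named == "red", prob_red, n_bins=5)
- print("observed fraction of red per bin: %s" % frac_red)
- print("mean predicted probability per bin: %s" % mean_pred)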
Below we show again the decision boundary on the dataset, next to the class probabilities for the blue class:
- fig, axes = plt.subplots(1, 2, figsize=(13, 5))
- mglearn.tools.plot_2d_separator(gbrt, X, ax=axes[0], alpha=.4, fill=True, cm=mglearn.cm2)
- scores_image = mglearn.tools.plot_2d_scores(gbrt, X, ax=axes[1], alpha=.4, cm='bwr', function='predict_proba')
- for ax in axes:
-     # plot training points as circles and test points as triangles
-     ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=mglearn.cm2, s=60, marker='^')
-     ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=mglearn.cm2, s=60)
- plt.colorbar(scores_image, ax=axes.tolist())
- plt.show()
The boundaries in this plot are much more well-defined, and the small areas of uncertainty are clearly visible. The scikit-learn website (Footnote: http://scikit-learn.org/stable/auto_examples/class...lot_classifier_comparison.html) has a great comparison of many models and what their uncertainty estimates look like. We reproduced the figure below, and encourage you to go through the example there.
Uncertainty in multi-class classification
So far we have only talked about uncertainty estimates in binary classification. But the decision_function and predict_proba methods also work in the multi-class setting. Let’s apply them to the iris dataset, which is a three-class classification dataset:
- ch2_t39.py
- # load and split the iris dataset (three classes)
- from sklearn.ensemble import GradientBoostingClassifier
- from sklearn.datasets import load_iris
- from sklearn.model_selection import train_test_split
- import numpy as np
- iris = load_iris()
- X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)
- # build the gradient boosting model
- gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
- gbrt.fit(X_train, y_train)
- print("accuracy on training set: %f" % gbrt.score(X_train, y_train))
- print("accuracy on test set: %f" % gbrt.score(X_test, y_test))
- # in the multi-class case, decision_function has shape (n_samples, n_classes)
- print("shape of decision_function: %s" % str(gbrt.decision_function(X_test).shape))
- print("first 6 rows of decision_function:\n%s" % str(gbrt.decision_function(X_test[:6, :])))
The output of predict_proba has the same shape, (n_samples, n_classes). Again, the probabilities for the possible classes for each data point sum to one:
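For example, on the iris test set:
- print("shape of probabilities: %s" % str(gbrt.predict_proba(X_test).shape))
- # show the first few rows; each row sums to one
- print("predicted probabilities:\n%s" % gbrt.predict_proba(X_test)[:6])
- print("row sums: %s" % gbrt.predict_proba(X_test)[:6].sum(axis=1))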
We can again recover the predictions by computing the argmax of predict_proba:
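A sketch, continuing from ch2_t39.py:
- # the column with the largest probability is the predicted class
- print("argmax of predicted probabilities:\n%s" % np.argmax(gbrt.predict_proba(X_test), axis=1)[:6])
- print("predictions:\n%s" % gbrt.predict(X_test)[:6])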
To summarize, predict_proba and decision_function always have shape (n_samples, n_classes), apart from the special case of decision_function in the binary case. In the binary case, decision_function only has one column, corresponding to the “positive” class, which is mostly for historical reasons. When there are n_classes many columns, you can recover the prediction by simply computing the argmax across columns.
Be careful, though, if your classes are strings, or if they are integers that are not consecutive or do not start at 0. If you want to compare results obtained with predict to results obtained via decision_function or predict_proba, make sure to use the classes_ attribute of the classifier to get the actual class names:
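A minimal sketch, refitting the iris model on string class names for illustration (iris.target_names is part of the dataset):
- # refit on string labels so classes_ contains names instead of integers
- named_y_train = iris.target_names[y_train]
- gbrt.fit(X_train, named_y_train)
- print("unique classes in training data: %s" % gbrt.classes_)
- # argmax gives integer column indices; classes_ maps them back to names
- argmax_dec_func = np.argmax(gbrt.decision_function(X_test), axis=1)
- print("argmax of decision function: %s" % argmax_dec_func[:6])
- print("argmax combined with classes_: %s" % gbrt.classes_[argmax_dec_func][:6])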
Supplement
* Selecting the best model in scikit-learn using cross-validation