Monday, February 13, 2017

[ Intro2ML ] Ch2. Supervised Learning - Uncertainty estimates from classifiers

Uncertainty estimates from classifiers 
Another useful part of the scikit-learn interface that we haven’t talked about yet is the ability of classifiers to provide uncertainty estimates of predictions. 

Often, you are not only interested in which class a classifier predicts for a certain test point, but also how certain it is that this is the right class. In practice, different kinds of mistakes lead to very different outcomes in real world applications. Imagine a medical application testing for cancer. Making a false positive prediction might lead to a patient undergoing additional tests, while a false negative prediction might lead to a serious disease not being treated. 

We will go into this topic in more detail in Chapter 6 (Model Selection). 

There are two different functions in scikit-learn that can be used to obtain uncertainty estimates from classifiers: decision_function and predict_proba. Most (but not all) classifiers have at least one of them, and many classifiers have both. Let’s look at what these two functions do on a synthetic two-dimensional dataset, using a GradientBoostingClassifier, which has both a decision_function and a predict_proba method:
- ch2_t38.py 
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_blobs, make_circles
from sklearn.model_selection import train_test_split
import numpy as np

# X, y = make_blobs(centers=2, random_state=59)
X, y = make_circles(noise=0.25, factor=0.5, random_state=1)

# we rename the classes "blue" and "red" for illustration purposes:
y_named = np.array(["blue", "red"])[y]

# we can call train_test_split with arbitrarily many arrays;
# all will be split in a consistent manner
X_train, X_test, y_train_named, y_test_named, y_train, y_test = \
    train_test_split(X, y_named, y, random_state=0)

# build the gradient boosting model
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)

print("accuracy on training set: %f" % gbrt.score(X_train, y_train_named))
print("accuracy on test set: %f" % gbrt.score(X_test, y_test_named))
The Decision Function 
In the binary classification case, the return value of decision_function has the shape (n_samples,): it returns one floating-point number for each sample: 
>>> from ch2_t38 import *
accuracy on training set: 1.000000
accuracy on test set: 0.840000

>>> print(X_test.shape)
(25, 2)
>>> print(gbrt.decision_function(X_test).shape)
(25,)

This value encodes how strongly the model believes a data point belongs to the “positive” class, in this case class 1. Positive values indicate a preference for the positive class, and negative values indicate a preference for the “negative” class, that is, the other class: 
>>> gbrt.decision_function(X_test)[:6] # show the first few entries of decision_function
array([ 4.13592629, -1.7016989 , -3.95106099, -3.62599351, 4.28986668, 3.66166106])

We can recover the prediction by looking only at the sign of the decision function: 
>>> print(gbrt.decision_function(X_test) > 0)
[ True False False...]
>>> print(gbrt.predict(X_test))
['red' 'blue' 'blue' ...]

For binary classification, the “negative” class is always the first entry of the classes_ attribute, and the “positive” class is the second entry of classes_. So if you want to fully recover the output of predict, you need to make use of the classes_ attribute: 
>>> greater_zero = (gbrt.decision_function(X_test) > 0).astype(int) # make the boolean True/False into 0 and 1
>>> pred = gbrt.classes_[greater_zero] # use 0 and 1 as indices into classes_
>>> np.all(pred == gbrt.predict(X_test)) # pred is the same as the output of gbrt.predict
True

The range of decision_function can be arbitrary, and depends on the data and the model parameters: 
>>> decision_function = gbrt.decision_function(X_test)
>>> np.min(decision_function), np.max(decision_function)
(-7.6909717730121798, 4.289866676868515)

This arbitrary scaling makes the output of decision_function often hard to interpret. Below we plot the decision_function for all points in the 2d plane using a color coding, next to a visualization of the decision boundary, as we saw it in Chapter 2. We show training points as circles and test data as triangles. 
import matplotlib.pyplot as plt
import mglearn
import os

# only plot when a display is available
dmode = os.environ.get('DISPLAY', '')
if dmode:
    fig, axes = plt.subplots(1, 2, figsize=(13, 5))
    mglearn.tools.plot_2d_separator(gbrt, X, ax=axes[0], alpha=.4, fill=True, cm=mglearn.cm2)
    scores_image = mglearn.tools.plot_2d_scores(gbrt, X, ax=axes[1], alpha=.4, cm='bwr')
    for ax in axes:
        # plot training and test points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=mglearn.cm2, s=60, marker='^')
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=mglearn.cm2, s=60)
    plt.colorbar(scores_image, ax=axes.tolist())
    plt.show()

Encoding not only the predicted outcome, but also how certain the classifier is provides additional information. However, in this visualization, it is hard to make out the boundary between the two classes. 

Predicting probabilities 
The output of predict_proba, in contrast, is a probability for each class, which is often more easily understood than the arbitrary scores of decision_function. It is always of shape (n_samples, 2) for binary classification: 
>>> from ch2_t38 import *
>>> gbrt.predict_proba(X_test).shape
(25, 2)

The first entry in each row is the estimated probability of the first class, and the second entry is the estimated probability of the second class. Because it is a probability, the output of predict_proba is always between 0 and 1, and the entries for the two classes always sum to 1: 
>>> np.set_printoptions(suppress=True, precision=3)
>>> gbrt.predict_proba(X_test[:6]) # show the first few entries of predict_proba
array([[ 0.016, 0.984],
[ 0.846, 0.154],
[ 0.981, 0.019],
...]

Because the probabilities for the two classes sum to one, exactly one of the classes is above 50% certainty. That class is the one that is predicted. 
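
To make this explicit, here is a minimal sketch (reusing gbrt, X_test, and np from the session above) that recovers the output of predict from predict_proba:
>>> proba = gbrt.predict_proba(X_test)
>>> pred = gbrt.classes_[np.argmax(proba, axis=1)] # pick the more probable column per row
>>> np.all(pred == gbrt.predict(X_test))
True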

You can see in the output above that the classifier is relatively certain for most points. How well the reported uncertainty reflects the actual uncertainty in the data depends on the model and its parameters. A more overfit model tends to make more certain predictions, even when they are wrong, while a less complex model usually reports more uncertainty in its predictions. A model is called calibrated if the reported uncertainty actually matches how correct it is: in a calibrated model, a prediction made with 70% certainty would be correct 70% of the time. 
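
A quick way to probe calibration is sklearn.calibration.calibration_curve, which bins the predicted probabilities and compares them with the observed fraction of positives. A minimal sketch, reusing gbrt, X_test, and y_test_named from ch2_t38.py (n_bins=5 is an arbitrary choice here, and 25 test points make the estimate quite noisy):
from sklearn.calibration import calibration_curve

# compare the mean predicted probability of "red" in each bin with the
# observed fraction of "red" samples in that bin; for a well calibrated
# model the two arrays are close to each other
prob_red = gbrt.predict_proba(X_test)[:, 1]  # classes_ is ["blue", "red"]
frac_pos, mean_pred = calibration_curve(y_test_named == "red", prob_red, n_bins=5)
print(mean_pred)  # average predicted probability per bin
print(frac_pos)   # observed fraction of "red" per bin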

Below we show again the decision boundary on the dataset, next to the class probabilities for the blue class: 
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
mglearn.tools.plot_2d_separator(gbrt, X, ax=axes[0], alpha=.4, fill=True, cm=mglearn.cm2)
scores_image = mglearn.tools.plot_2d_scores(gbrt, X, ax=axes[1], alpha=.4, cm='bwr', function='predict_proba')
for ax in axes:
    # plot training and test points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=mglearn.cm2, s=60, marker='^')
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=mglearn.cm2, s=60)
plt.colorbar(scores_image, ax=axes.tolist())
plt.show()

The boundaries in this plot are much better defined, and the small areas of uncertainty are clearly visible. The scikit-learn website (Footnote: http://scikit-learn.org/stable/auto_examples/class...lot_classifier_comparison.html) has a great comparison of many models and what their uncertainty estimates look like. 

We reproduced the figure below, and encourage you to go through the example there. 

Uncertainty in multi-class classification 
So far we have only talked about uncertainty estimates in binary classification. But the decision_function and predict_proba methods also work in the multi-class setting. Let’s apply them to the iris dataset, which is a three-class classification dataset: 
- ch2_t39.py 
# load and split the iris dataset
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# build the gradient boosting model
gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbrt.fit(X_train, y_train)

print("accuracy on training set: %f" % gbrt.score(X_train, y_train))
print("accuracy on test set: %f" % gbrt.score(X_test, y_test))
print("shape of decision_function: %s" % str(gbrt.decision_function(X_test).shape))
print("first 6 entries of decision_function:\n%s\n" % str(gbrt.decision_function(X_test[:6, :])))
In the multi-class case, decision_function has the shape (n_samples, n_classes), and each column encodes a “certainty score” for one class, where a large score means the class is more likely and a small score means it is less likely. You can recover the predictions from these scores by finding the maximum entry for each data point: 
>>> from ch2_t39 import *
>>> print(np.argmax(gbrt.decision_function(X_test), axis=1))
[1 0 2 1 1 0 1 2 ...]
>>> print(gbrt.predict(X_test))
[1 0 2 1 1 0 1 2 1 1 2 ...]

The output of predict_proba has the same shape, (n_samples, n_classes). Again, the probabilities for the possible classes for each data point sum to one: 
>>> print(gbrt.predict_proba(X_test)[:3]) # show the first three entries of predict_proba
[[ 0.10664722 0.7840248 0.10932798]
[ 0.78880668 0.10599243 0.10520089]
[ 0.10231173 0.10822274 0.78946553]]

>>> print("sums: %s" % gbrt.predict_proba(X_test)[:3].sum(axis=1)) # show that sums across rows are one
sums: [ 1. 1. 1.]

We can again recover the predictions by computing the argmax of predict_proba:
>>> print(np.argmax(gbrt.predict_proba(X_test[:3]), axis=1))
[1 0 2]
>>> print(gbrt.predict(X_test[:3]))
[1 0 2]
>>> gbrt.classes_
array([0, 1, 2])

To summarize, predict_proba and decision_function always have shape (n_samples, n_classes), apart from the special case of decision_function in the binary case. In the binary case, decision_function has only one column, corresponding to the “positive” class, which is mostly for historical reasons. When there are n_classes columns, you can recover the prediction by simply computing the argmax across columns. 

Be careful, though, if your classes are strings, or if you use integers that are not consecutive and starting from 0. If you want to compare results obtained with predict to results obtained via decision_function or predict_proba, make sure to use the classes_ attribute of the classifier to get the actual class names. 
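
For example, here is a minimal sketch extending ch2_t39.py (named_target is introduced here purely for illustration, built from iris.target_names):
# represent each class by its name instead of its integer index
named_target = iris.target_names[y_train]
gbrt.fit(X_train, named_target)

argmax_dec_func = np.argmax(gbrt.decision_function(X_test), axis=1)
print(argmax_dec_func[:3])                 # column indices, e.g. [1 0 2]
print(gbrt.classes_[argmax_dec_func[:3]])  # the corresponding class names
print(gbrt.predict(X_test[:3]))            # agrees with the line above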

Supplement 
Selecting the best model in scikit-learn using cross-validation
