Monday, February 13, 2017

[ Intro2ML ] Ch2. Supervised Learning - Uncertainty estimates from classifiers

Uncertainty estimates from classifiers 
Another useful part of the scikit-learn interface that we haven’t talked about yet is the ability of classifiers to provide uncertainty estimates of predictions. 

Often, you are not only interested in which class a classifier predicts for a certain test point, but also how certain it is that this is the right class. In practice, different kinds of mistakes lead to very different outcomes in real world applications. Imagine a medical application testing for cancer. Making a false positive prediction might lead to a patient undergoing additional tests, while a false negative prediction might lead to a serious disease not being treated. 

We will go into this topic in more detail in Chapter 6 (Model Selection). 

There are two different functions in scikit-learn that can be used to obtain uncertainty estimates from classifiers: decision_function and predict_proba. Most (but not all) classifiers have at least one of them, and many classifiers have both. Let’s look at what these two functions do on a synthetic two-dimensional dataset, using a GradientBoostingClassifier, which has both a decision_function and a predict_proba method:
- ch2_t38.py 
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_blobs, make_circles
from sklearn.model_selection import train_test_split
import numpy as np

# X, y = make_blobs(centers=2, random_state=59)
X, y = make_circles(noise=0.25, factor=0.5, random_state=1)

# we rename the classes "blue" and "red" for illustration purposes:
y_named = np.array(["blue", "red"])[y]

# we can call train_test_split with arbitrarily many arrays;
# all will be split in a consistent manner
X_train, X_test, y_train_named, y_test_named, y_train, y_test = \
    train_test_split(X, y_named, y, random_state=0)

# build the gradient boosting model
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train_named)

print("accuracy on training set: %f" % gbrt.score(X_train, y_train_named))
print("accuracy on test set: %f" % gbrt.score(X_test, y_test_named))
The Decision Function 
In the binary classification case, the return value of decision_function has the shape (n_samples,): it returns one floating-point number for each sample: 
>>> from ch2_t38 import *
accuracy on training set: 1.000000
accuracy on test set: 0.840000

>>> print(X_test.shape)
(25, 2)
>>> print(gbrt.decision_function(X_test).shape)
(25,)

This value encodes how strongly the model believes a data point belongs to the “positive” class, in this case class 1. Positive values indicate a preference for the positive class, and negative values indicate a preference for the “negative” class, that is, the other class: 
>>> gbrt.decision_function(X_test)[:6] # show the first few entries of decision_function
array([ 4.13592629, -1.7016989 , -3.95106099, -3.62599351, 4.28986668, 3.66166106])

We can recover the prediction by looking only at the sign of the decision function: 
>>> print(gbrt.decision_function(X_test) > 0)
[ True False False...]
>>> print(gbrt.predict(X_test))
['red' 'blue' 'blue' ...]

For binary classification, the “negative” class is always the first entry of the classes_ attribute, and the “positive” class is the second entry of classes_. So if you want to fully recover the output of predict, you need to make use of the classes_ attribute: 
>>> greater_zero = (gbrt.decision_function(X_test) > 0).astype(int) # make the boolean True/False into 0 and 1
>>> pred = gbrt.classes_[greater_zero] # use 0 and 1 as indices into classes_
>>> np.all(pred == gbrt.predict(X_test)) # pred is the same as the output of gbrt.predict
True

The range of decision_function can be arbitrary, and depends on the data and the model parameters: 
>>> decision_function = gbrt.decision_function(X_test)
>>> np.min(decision_function), np.max(decision_function)
(-7.6909717730121798, 4.289866676868515)

This arbitrary scaling makes the output of decision_function often hard to interpret. Below we plot the decision_function for all points in the 2d plane using a color coding, next to a visualization of the decision boundary, as we saw it in Chapter 2. We show training points as circles and test data as triangles. 
import matplotlib.pyplot as plt
import mglearn
import os

# only plot when a display is available
dmode = os.environ.get('DISPLAY', '')
if dmode:
    fig, axes = plt.subplots(1, 2, figsize=(13, 5))
    mglearn.tools.plot_2d_separator(gbrt, X, ax=axes[0], alpha=.4, fill=True, cm=mglearn.cm2)
    scores_image = mglearn.tools.plot_2d_scores(gbrt, X, ax=axes[1], alpha=.4, cm='bwr')
    for ax in axes:
        # plot training and test points
        ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=mglearn.cm2, s=60, marker='^')
        ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=mglearn.cm2, s=60)
    plt.colorbar(scores_image, ax=axes.tolist())
    plt.show()

Encoding not only the predicted outcome, but also how certain the classifier is provides additional information. However, in this visualization, it is hard to make out the boundary between the two classes. 

Predicting probabilities 
The output of predict_proba, in contrast, is a probability for each class, which is often more easily understood than the arbitrary scores of decision_function. It is always of shape (n_samples, 2) for binary classification: 
>>> from ch2_t38 import *
>>> gbrt.predict_proba(X_test).shape
(25, 2)

The first entry in each row is the estimated probability of the first class, and the second entry is the estimated probability of the second class. Because it is a probability, the output of predict_proba is always between 0 and 1, and the entries for the two classes always sum to 1: 
>>> np.set_printoptions(suppress=True, precision=3)
>>> gbrt.predict_proba(X_test[:6]) # show the first few entries of predict_proba
array([[ 0.016, 0.984],
[ 0.846, 0.154],
[ 0.981, 0.019],
...]

Because the probabilities for the two classes sum to one, exactly one of the classes is above 50% certainty. That class is the one that is predicted. 
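
To make this explicit, here is a minimal sketch (reusing gbrt, X_test, and np from the session above) that recovers the output of predict from predict_proba:
>>> proba = gbrt.predict_proba(X_test)
>>> pred = gbrt.classes_[np.argmax(proba, axis=1)] # pick the more probable column per row
>>> np.all(pred == gbrt.predict(X_test))
True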

You can see in the output above that the classifier is relatively certain for most points. How well the reported uncertainty reflects the actual uncertainty in the data depends on the model and its parameters. A more overfit model tends to make more certain predictions, even when they are wrong, while a less complex model usually reports more uncertainty in its predictions. A model is called calibrated if the reported uncertainty actually matches how correct it is: in a calibrated model, a prediction made with 70% certainty would be correct 70% of the time. 
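
A quick way to probe calibration is sklearn.calibration.calibration_curve, which bins the predicted probabilities and compares them with the observed fraction of positives. A minimal sketch, reusing gbrt, X_test, and y_test_named from ch2_t38.py (n_bins=5 is an arbitrary choice here, and 25 test points make the estimate quite noisy):
from sklearn.calibration import calibration_curve

# compare the mean predicted probability of "red" in each bin with the
# observed fraction of "red" samples in that bin; for a well calibrated
# model the two arrays are close to each other
prob_red = gbrt.predict_proba(X_test)[:, 1]  # classes_ is ["blue", "red"]
frac_pos, mean_pred = calibration_curve(y_test_named == "red", prob_red, n_bins=5)
print(mean_pred)  # average predicted probability per bin
print(frac_pos)   # observed fraction of "red" per bin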

Below we show again the decision boundary on the dataset, next to the class probabilities for the blue class: 
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
mglearn.tools.plot_2d_separator(gbrt, X, ax=axes[0], alpha=.4, fill=True, cm=mglearn.cm2)
scores_image = mglearn.tools.plot_2d_scores(gbrt, X, ax=axes[1], alpha=.4, cm='bwr', function='predict_proba')
for ax in axes:
    # plot training and test points
    ax.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=mglearn.cm2, s=60, marker='^')
    ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=mglearn.cm2, s=60)
plt.colorbar(scores_image, ax=axes.tolist())
plt.show()

The boundaries in this plot are much better defined, and the small areas of uncertainty are clearly visible. The scikit-learn website (Footnote: http://scikit-learn.org/stable/auto_examples/class...lot_classifier_comparison.html) has a great comparison of many models and what their uncertainty estimates look like. 

We reproduced the figure below, and encourage you to go through the example there. 

Uncertainty in multi-class classification 
So far we have only talked about uncertainty estimates in binary classification. But the decision_function and predict_proba methods also work in the multi-class setting. Let’s apply them to the iris dataset, which is a three-class classification dataset: 
- ch2_t39.py 
# load and split the iris dataset
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# build the gradient boosting model
gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbrt.fit(X_train, y_train)

print("accuracy on training set: %f" % gbrt.score(X_train, y_train))
print("accuracy on test set: %f" % gbrt.score(X_test, y_test))
print("shape of decision_function: %s" % str(gbrt.decision_function(X_test).shape))
print("first 6 entries of decision_function:\n%s\n" % str(gbrt.decision_function(X_test[:6, :])))
In the multi-class case, decision_function has the shape (n_samples, n_classes), and each column encodes a “certainty score” for one class, where a large score means the class is more likely and a small score means it is less likely. You can recover the predictions from these scores by finding the maximum entry for each data point: 
>>> from ch2_t39 import *
>>> print(np.argmax(gbrt.decision_function(X_test), axis=1))
[1 0 2 1 1 0 1 2 ...]
>>> print(gbrt.predict(X_test))
[1 0 2 1 1 0 1 2 1 1 2 ...]

The output of predict_proba has the same shape, (n_samples, n_classes). Again, the probabilities for the possible classes for each data point sum to one: 
>>> print(gbrt.predict_proba(X_test)[:3]) # show the first three entries of predict_proba
[[ 0.10664722 0.7840248 0.10932798]
[ 0.78880668 0.10599243 0.10520089]
[ 0.10231173 0.10822274 0.78946553]]

>>> print("sums: %s" % gbrt.predict_proba(X_test)[:3].sum(axis=1)) # show that sums across rows are one
sums: [ 1. 1. 1.]

We can again recover the predictions by computing the argmax of predict_proba:
>>> print(np.argmax(gbrt.predict_proba(X_test[:3]), axis=1))
[1 0 2]
>>> print(gbrt.predict(X_test[:3]))
[1 0 2]
>>> gbrt.classes_
array([0, 1, 2])

To summarize, predict_proba and decision_function always have shape (n_samples, n_classes), apart from the special case of decision_function in the binary case. In the binary case, decision_function has only one column, corresponding to the “positive” class, which is mostly for historical reasons. When there are n_classes columns, you can recover the prediction by simply computing the argmax across columns. 

Be careful, though, if your classes are strings, or if you use integers that are not consecutive and starting from 0. If you want to compare results obtained with predict to results obtained via decision_function or predict_proba, make sure to use the classes_ attribute of the classifier to get the actual class names. 
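
For example, here is a minimal sketch extending ch2_t39.py (named_target is introduced here purely for illustration, built from iris.target_names):
# represent each class by its name instead of its integer index
named_target = iris.target_names[y_train]
gbrt.fit(X_train, named_target)

argmax_dec_func = np.argmax(gbrt.decision_function(X_test), axis=1)
print(argmax_dec_func[:3])                 # column indices, e.g. [1 0 2]
print(gbrt.classes_[argmax_dec_func[:3]])  # the corresponding class names
print(gbrt.predict(X_test[:3]))            # agrees with the line above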

Supplement 
Selecting the best model in scikit-learn using cross-validation
