程式扎記

Source From Here
Preface

Agenda

* What is the purpose of model evaluation, and what are some common evaluation procedures?
* What is the usage of classification accuracy, and what are its limitations?
* How does a confusion matrix describe the performance of a classifier?
* What metrics can be computed from a confusion matrix?
* How can you adjust classifier performance by changing the classification threshold?
* What is the purpose of an ROC curve?
* How does Area Under the Curve (AUC) differ from classification accuracy?

Review of model evaluation

* Need a way to choose between models: different model types, tuning parameters, and features
* Use a model evaluation procedure to estimate how well a model will generalize to out-of-sample data
* Requires a model evaluation metric to quantify the model performance

Model evaluation procedures
1. Training and testing on the same data

Rewards overly complex models that "overfit" the training data and won't necessarily generalize

2. Train/test split

Split the dataset into two pieces, so that the model can be trained and tested on different data. Better estimate of out-of-sample performance, but still a "high variance" estimate. Useful due to its speed, simplicity and flexibility.

3. K-fold cross validation

Systematically create "K" train/test splits and average the results together. Even better estimate of out-of-sample performance. Runs "K" times which is slower than train/test split.
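To make the "K" train/test splits concrete, here is a minimal plain-Python sketch of how the row indices could be divided into folds. The function name k_fold_indices and the toy sizes (10 samples, 3 folds) are made up for illustration, and this is not scikit-learn's own implementation:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds (illustrative helper)."""
    # Spread the remainder over the first folds so fold sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # Everything outside the current test fold is used for training
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train: %s test: %s" % (train_idx, test_idx))
```

Every observation lands in the testing fold exactly once, which is why averaging the K scores gives a steadier estimate than a single train/test split.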

Model evaluation metrics

* Regression problems: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
* Classification problems: Classification accuracy
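As a quick reminder of what the three regression metrics measure, here is a small worked example in plain Python. The true and predicted values are made-up numbers chosen so the arithmetic is easy to follow:

```python
import math

# Made-up true and predicted values, for illustration only
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]

errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean Absolute Error: average size of the errors
mae = sum(abs(e) for e in errors) / len(errors)
# Mean Squared Error: average of the squared errors (punishes large errors more)
mse = sum(e ** 2 for e in errors) / len(errors)
# Root Mean Squared Error: square root of MSE, back in the units of y
rmse = math.sqrt(mse)

print("MAE=%.02f MSE=%.02f RMSE=%.02f" % (mae, mse, rmse))
```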

Classification accuracy
- test.py

#!/usr/bin/env python
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(url, header=None, names=col_names)

# Print the first 5 rows of data
print(pima.head())

# Question: Can we predict the diabetes status of a patient given their health measurements?

# Define X and y
feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
X = pima[feature_cols]
y = pima.label

# Split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Make class predictions for the testing set
y_pred_class = logreg.predict(X_test)

# Calculate accuracy
from sklearn import metrics
print("Classification accuracy=%.02f (Logistic)" % metrics.accuracy_score(y_test, y_pred_class))

# Null accuracy: accuracy that could be achieved by always predicting the most frequent class
print("Distribution of label:\n%s\n" % y_test.value_counts())
# Calculate null accuracy (for binary classification problems coded as 0/1)
print("Null accuracy=%.02f" % (max(y_test.mean(), 1-y_test.mean())))
# For multi-class classification problems
# print("Null accuracy=%.02f" % (y_test.value_counts().iloc[0] / float(len(y_test))))

# Compare the true and predicted response values

# Print the first 25 true and predicted responses
print("True: %s" % y_test.values[0:25])
print("Pred: %s" % y_pred_class[0:25])

Conclusion:

* Classification accuracy is the easiest classification metric to understand
* But, it does not tell you the underlying distribution of response values
* And, it does not tell you what "types" of errors your classifier is making
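To see why accuracy alone can mislead, consider a small plain-Python sketch with made-up labels: 90 negatives, 10 positives, and a "classifier" that always predicts 0. The data is hypothetical and only meant to show the comparison against null accuracy:

```python
from collections import Counter

# Hypothetical, heavily imbalanced labels: 90 negatives and 10 positives
y_true = [0] * 90 + [1] * 10
# A useless model that always predicts the majority class
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Null accuracy: share of the most frequent class in the true labels
most_common_class, most_common_count = Counter(y_true).most_common(1)[0]
null_accuracy = most_common_count / len(y_true)

print("Accuracy=%.02f" % accuracy)
print("Null accuracy=%.02f (always predicting class %s)" % (null_accuracy, most_common_class))
```

The accuracy looks high, yet it is no better than the null accuracy and the model never detects a single positive case.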

Confusion Matrix

A table that describes the performance of a classification model

* Every observation in the testing set is represented in exactly one box
* It's a 2x2 matrix because there are 2 response classes
* The format shown here is not universal

Basic terminology

* True Positive (TP): We correctly predicted that they do have diabetes
* True Negative (TN): We correctly predicted that they don't have diabetes
* False Positive (FP): We incorrectly predicted that they do have diabetes (a "Type I error")
* False Negative (FN): We incorrectly predicted that they don't have diabetes (a "Type II error")

Let's check how our model is doing by printing the first 25 true and predicted responses:

# Print the first 25 true and predicted responses
print("True: %s" % y_test.values[0:25])
print("Pred: %s" % y_pred_class[0:25])

Output:

True: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]
Pred: [0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]

You can see that our model tends to predict 0 even when the actual class is 1! Now let's extract TP/TN/FP/FN from the confusion matrix:

confusion = metrics.confusion_matrix(y_test, y_pred_class)  
TP = confusion[1, 1]  
TN = confusion[0, 0]  
FP = confusion[0, 1]  
FN = confusion[1, 0]  
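To check our understanding of the four cells, we can tally them by hand for the 25 true/predicted pairs printed above. This is a plain-Python sketch that covers only those 25 rows, not the whole testing set, so the counts will differ from what metrics.confusion_matrix returns for the full data:

```python
# The first 25 true and predicted responses, copied from the output above
y_true = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
TP = sum(1 for t, p in pairs if t == 1 and p == 1)
TN = sum(1 for t, p in pairs if t == 0 and p == 0)
FP = sum(1 for t, p in pairs if t == 0 and p == 1)
FN = sum(1 for t, p in pairs if t == 1 and p == 0)

# Same layout as the indexing above: rows are actual classes, columns are predicted classes
confusion_25 = [[TN, FP],
                [FN, TP]]
print(confusion_25)
print("TP=%d TN=%d FP=%d FN=%d" % (TP, TN, FP, FN))
```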

Metrics computed from a confusion matrix

Classification Accuracy: Overall, how often is the classifier correct?

# Classification Accuracy = metrics.accuracy_score(y_test, y_pred_class)
print("Classification Accuracy=%.02f" % ((TP+TN)/float(TP+TN+FP+FN)))

Classification Error: Overall, how often is the classifier incorrect?

# Classification Error = 1 - metrics.accuracy_score(y_test, y_pred_class)
print("Classification Error=%.02f" % ((FP+FN)/float(TP+TN+FP+FN)))

Sensitivity: When the actual value is positive, how often is the prediction correct?
* How "sensitive" is the classifier to detecting positive instances?
* Also known as "True Positive Rate" or "Recall"

# Sensitivity = metrics.recall_score(y_test, y_pred_class)
print("Sensitivity=%.02f" % (TP/float(TP+FN)))

Specificity: When the actual value is negative, how often is the prediction correct?
* How "specific" (or "selective") is the classifier in predicting positive instances?

# Specificity
print("Specificity=%.02f" % (TN/float(TN+FP)))

False Positive Rate: When the actual value is negative, how often is the prediction incorrect?

# False Positive Rate
print("False Positive Rate=%.02f" % (FP/float(TN+FP)))

Precision: When a positive value is predicted, how often is the prediction correct?
* How "precise" is the classifier when predicting positive instances?

# Precision = metrics.precision_score(y_test, y_pred_class)
print("Precision=%.02f" % (TP/float(TP+FP)))

Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.
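As an illustration of two of these, here is a plain-Python sketch of the F1 score and the Matthews correlation coefficient (MCC) using their standard definitions. The four counts are illustrative: they are what you get by tallying the 25 true/predicted pairs printed earlier, not the full testing set:

```python
import math

# Illustrative counts, tallied by hand from the 25 rows printed earlier
TP, TN, FP, FN = 2, 15, 1, 7

precision = TP / float(TP + FP)
recall = TP / float(TP + FN)

# F1 score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Matthews correlation coefficient: ranges from -1 to +1, 0 means no better than chance
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

print("F1=%.02f" % f1)
print("MCC=%.02f" % mcc)
```

With scikit-learn, the same quantities are available as metrics.f1_score and metrics.matthews_corrcoef, both taking the true values first and the predicted classes second.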
Conclusion:

* Confusion matrix gives you a more complete picture of how your classifier is performing
* Also allows you to compute various classification metrics, and these metrics can guide your model selection

Which metrics should you focus on?
Choice of metrics depends on your business objective. For example:

* Spam filter (positive class is "spam"): Optimize for precision or specificity, because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)
* Fraudulent transaction detector (positive class is "fraud"): Optimize for sensitivity, because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)

Is it possible for us to adjust threshold to favor high Sensitivity or high Specificity?

Adjusting the classification threshold
Let's check what logistic regression predicts for the first 10 instances:

# Print the first 10 predicted responses
print("First 10 predicted responses:\n%s\n" % logreg.predict(X_test)[0:10])

# Print the first 10 predicted probabilities of class membership
print("First 10 predicted probabilities:\n%s\n" % logreg.predict_proba(X_test)[0:10, :])

# Print the first 10 predicted probabilities for class 1
print("First 10 predicted probabilities for class 1:\n%s\n" % logreg.predict_proba(X_test)[0:10, 1])

Let's store the predicted probabilities for class 1 and plot them as a histogram:

# Store the predicted probabilities for class 1
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# Import matplotlib and use a larger font for the plots
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14

# Histogram of predicted probabilities
plt.hist(y_pred_prob, bins=8)
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of diabetes')
plt.ylabel('Frequency')
plt.show()

The resulting histogram shows that most predicted probabilities fall below 0.5, so this model tends to predict class 0.

We can lower the threshold for predicting diabetes (0.5 by default) in order to increase the sensitivity of the classifier. Let's see what happens when we set the threshold to 0.3:

# Predict diabetes if the predicted probability is greater than 0.3
from sklearn.preprocessing import binarize
# binarize expects a 2D array, so wrap the probabilities in a list and take row 0
y_pred_class = binarize([y_pred_prob], threshold=0.3)[0]

# Print the first 10 predicted probabilities
print(y_pred_prob[0:10])

# Print the first 10 predicted classes with the lower threshold
print(y_pred_class[0:10])

# Previous confusion matrix (default threshold of 0.5)
print("Confusion matrix with threshold=0.5:\n%s\n" % confusion)

# New confusion matrix (threshold of 0.3)
confusion_new = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion_new[1, 1]
TN = confusion_new[0, 0]
FP = confusion_new[0, 1]
FN = confusion_new[1, 0]
print("Confusion matrix with threshold=0.3:\n%s\n" % confusion_new)

# Sensitivity has increased (used to be 0.24)
print("Current Sensitivity=%.02f (Used to be 0.24)" % (TP/float(TP+FN)))

# Specificity has decreased (used to be 0.91)
print("Current Specificity=%.02f (Used to be 0.91)" % (TN/float(TN+FP)))

Notice that decreasing the threshold causes sensitivity to increase and specificity to decrease!

Conclusion:

* Threshold of 0.5 is used by default (for binary problems) to convert predicted probabilities into class predictions
* Threshold can be adjusted to increase sensitivity or specificity
* Sensitivity and specificity have an inverse relationship
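The trade-off is easy to reproduce on a tiny example. The sketch below uses made-up labels and probabilities (not the diabetes data) and two hypothetical helpers, classify and sensitivity_specificity, to compare thresholds 0.5 and 0.3:

```python
def classify(probs, threshold):
    """Predict class 1 when the probability is greater than the threshold."""
    return [1 if p > threshold else 0 for p in probs]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / float(tp + fn), tn / float(tn + fp)


# Made-up true labels and predicted probabilities of class 1
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.8, 0.6, 0.4, 0.35, 0.45, 0.32, 0.2, 0.15, 0.1, 0.05]

sens_05, spec_05 = sensitivity_specificity(y_true, classify(y_prob, 0.5))
sens_03, spec_03 = sensitivity_specificity(y_true, classify(y_prob, 0.3))

print("threshold=0.5: sensitivity=%.02f specificity=%.02f" % (sens_05, spec_05))
print("threshold=0.3: sensitivity=%.02f specificity=%.02f" % (sens_03, spec_03))
```

Lowering the threshold catches the two positives that scored below 0.5, at the cost of flagging two negatives that scored above 0.3.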

ROC Curves and Area Under the Curve (AUC)
Question: Wouldn't it be nice if we could see how sensitivity and specificity are affected by various thresholds, without actually changing the threshold?
Answer: Plot the ROC curve!

* ROC curve can help you to choose a threshold that balances sensitivity and specificity in a way that makes sense for your particular context
* You can't actually see the thresholds used to generate the curve on the ROC curve itself
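To get a feel for where the points of an ROC curve come from, here is a plain-Python sketch that sweeps over every distinct predicted probability as a threshold and records the (false positive rate, true positive rate) pair at each step. The data and the helper names roc_points and trapezoid_auc are made up for illustration; this is a simplified picture, not scikit-learn's implementation:

```python
def roc_points(y_true, y_prob):
    """Return (fpr, tpr) pairs, from the strictest threshold to the loosest."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Threshold above every score: nothing is predicted positive
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_prob), reverse=True):
        # Predict positive when the score is at least the threshold
        tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
        fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
        points.append((fp / float(n_neg), tp / float(n_pos)))
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Made-up true labels and predicted probabilities of class 1
y_true = [1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, y_prob)
auc = trapezoid_auc(points)
for fpr_i, tpr_i in points:
    print("FPR=%.02f TPR=%.02f" % (fpr_i, tpr_i))
print("AUC=%.02f" % auc)
```

Each threshold contributes one point, which is why the thresholds themselves are not visible on the finished curve.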

Let's create a function that reports the sensitivity and specificity for a given threshold:

# Compute the ROC curve values that evaluate_threshold relies on
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)

def evaluate_threshold(t):
    print('Sensitivity: %.02f' % tpr[thresholds > t][-1])
    print('Specificity: %.02f' % (1 - fpr[thresholds > t][-1]))

evaluate_threshold(0.5)
evaluate_threshold(0.3)

The output:

Sensitivity: 0.24
Specificity: 0.91
Sensitivity: 0.73
Specificity: 0.62

As expected, the lower threshold gives higher sensitivity and lower specificity. Next, we can draw the ROC curve with the code below:

# IMPORTANT: first argument is true values, second argument is predicted probabilities  
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)  
plt.plot(fpr, tpr)  
plt.xlim([0.0, 1.0])  
plt.ylim([0.0, 1.0])  
plt.title('ROC curve for diabetes classifier')  
plt.xlabel('False Positive Rate (1 - Specificity)')  
plt.ylabel('True Positive Rate (Sensitivity)')  
plt.grid(True)  
plt.show()  

AUC is the fraction of the ROC plot area that lies underneath the curve (computed with sklearn.metrics.roc_auc_score):

# IMPORTANT: first argument is true values, second argument is predicted probabilities
print("AUC=%.02f" % metrics.roc_auc_score(y_test, y_pred_prob))

The output:

AUC=0.72

* AUC is useful as a single number summary of classifier performance.
* If you randomly chose one positive and one negative observation, AUC represents the likelihood that your classifier will assign a higher predicted probability to the positive observation.
* AUC is useful even when there is high class imbalance (unlike classification accuracy)
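The "random positive versus random negative" interpretation can be checked directly by comparing every positive/negative pair. Below is a plain-Python sketch with made-up labels and probabilities; the helper name pairwise_auc is hypothetical, and counting ties as half a win is the usual convention:

```python
def pairwise_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs where the positive gets the higher score."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Made-up true labels and predicted probabilities of class 1
y_true = [1, 0, 1, 0, 0]
y_prob = [0.9, 0.6, 0.55, 0.3, 0.1]

auc = pairwise_auc(y_true, y_prob)
print("AUC=%.02f" % auc)
```

Here 5 of the 6 positive/negative pairs are ranked correctly. Only the ordering of the scores matters, not their absolute values, which is part of why AUC holds up under class imbalance.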

You can also use it as the scoring metric when doing cross-validation:

# Calculate cross-validated AUC
from sklearn.model_selection import cross_val_score
print(cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean())

Confusion matrix advantages:

* Allows you to calculate a variety of metrics
* Useful for multi-class problems (more than two response classes)
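To illustrate the multi-class case, here is a plain-Python sketch that builds a 3x3 confusion matrix by hand. The class names and predictions are made up; rows are actual classes and columns are predicted classes, the same layout used for the 2x2 case above:

```python
# Made-up three-class example, for illustration only
labels = ["cat", "dog", "bird"]
y_true = ["cat", "cat", "cat", "dog", "dog", "bird", "bird", "bird"]
y_pred = ["cat", "cat", "dog", "dog", "dog", "bird", "cat", "bird"]

index = {label: i for i, label in enumerate(labels)}
matrix = [[0] * len(labels) for _ in labels]
for t, p in zip(y_true, y_pred):
    # Row = actual class, column = predicted class
    matrix[index[t]][index[p]] += 1

for label, row in zip(labels, matrix):
    print("%-5s %s" % (label, row))

# Correct predictions sit on the diagonal
accuracy = sum(matrix[i][i] for i in range(len(labels))) / float(len(y_true))
# Per-class recall: diagonal cell divided by its row total
recalls = [matrix[i][i] / float(sum(matrix[i])) for i in range(len(labels))]
print("Accuracy=%.02f" % accuracy)
```

The off-diagonal cells show exactly which classes get confused with each other, something a single accuracy number cannot tell you.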

ROC/AUC advantages:

* Does not require you to set a classification threshold
* Still useful when there is high class imbalance

Thursday, January 5, 2017

[ Scikit-learn ] How to evaluate a classifier in scikit-learn
