程式扎記: [ Scikit-learn ] How to evaluate a classifier in scikit-learn

Thursday, January 5, 2017

Source From Here
Preface

Agenda
* What is the purpose of model evaluation, and what are some common evaluation procedures?
* What is the usage of classification accuracy, and what are its limitations?
* How does a confusion matrix describe the performance of a classifier?
* What metrics can be computed from a confusion matrix?
* How can you adjust classifier performance by changing the classification threshold?
* What is the purpose of an ROC curve?
* How does Area Under the Curve (AUC) differ from classification accuracy?


Review of model evaluation
* Need a way to choose between models: different model types, tuning parameters, and features
* Use a model evaluation procedure to estimate how well a model will generalize to out-of-sample data
* Requires a model evaluation metric to quantify the model performance

Model evaluation procedures
1. Training and testing on the same data
Rewards overly complex models that "overfit" the training data and won't necessarily generalize

2. Train/test split
Split the dataset into two pieces, so that the model can be trained and tested on different data. Better estimate of out-of-sample performance, but still a "high variance" estimate. Useful due to its speed, simplicity and flexibility.

3. K-fold cross validation
Systematically create "K" train/test splits and average the results together. Even better estimate of out-of-sample performance. Trains and tests the model "K" times, so it is slower than a single train/test split.
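
To make the idea of "K" train/test splits concrete, here is a minimal pure-Python sketch of how the row indices could be divided. The helper name `k_fold_indices` is made up for this illustration, and it assumes contiguous, unshuffled folds; in practice scikit-learn's `sklearn.model_selection.KFold` does this job.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop


# Hypothetical example: 10 observations split into 3 folds
folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    print("train=%s test=%s" % (train_idx, test_idx))
```

Every observation lands in the testing set exactly once, which is why averaging the K scores gives a steadier estimate than a single split.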

Model evaluation metrics
* Regression problems: Mean Absolute Error, Mean Squared Error, Root Mean Squared Error
* Classification problems: Classification accuracy
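
As a quick reminder of what the three regression metrics measure, the sketch below computes them by hand. The function name and the numbers are made up for illustration; scikit-learn offers `metrics.mean_absolute_error` and `metrics.mean_squared_error` for real use.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, MSE, RMSE) for two equally long lists of numbers."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    n = float(len(errors))
    mae = sum(abs(e) for e in errors) / n   # mean of absolute errors
    mse = sum(e * e for e in errors) / n    # mean of squared errors
    rmse = math.sqrt(mse)                   # same units as the response
    return mae, mse, rmse


# Hypothetical true and predicted values
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]
mae, mse, rmse = regression_errors(y_true, y_pred)
print("MAE=%.02f MSE=%.02f RMSE=%.02f" % (mae, mse, rmse))
```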

Classification accuracy
- test.py
    #!/usr/bin/env python
    import pandas as pd
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
    col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
    pima = pd.read_csv(url, header=None, names=col_names)

    # Print the first 5 rows of data
    print(pima.head())

    # Question: Can we predict the diabetes status of a patient given their health measurements?

    # Define X and y
    feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
    X = pima[feature_cols]
    y = pima.label

    # Split X and y into training and testing sets
    # (train_test_split lives in sklearn.model_selection since scikit-learn 0.18)
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Train a logistic regression model on the training set
    from sklearn.linear_model import LogisticRegression
    logreg = LogisticRegression()
    logreg.fit(X_train, y_train)

    # Make class predictions for the testing set
    y_pred_class = logreg.predict(X_test)

    # Calculate accuracy
    from sklearn import metrics
    print("Classification accuracy=%.02f (Logistic)" % metrics.accuracy_score(y_test, y_pred_class))

    # Null accuracy: accuracy that could be achieved by always predicting the most frequent class
    print("Distribution of label:\n%s\n" % y_test.value_counts())
    # Calculate null accuracy (for binary classification problems coded as 0/1)
    print("Null accuracy=%.02f" % (max(y_test.mean(), 1 - y_test.mean())))
    # For multi-class classification problems
    # print("Null accuracy=%.02f" % (y_test.value_counts().iloc[0] / float(len(y_test))))

    # Compare the true and predicted response values:
    # print the first 25 true and predicted responses
    print("True:", y_test.values[0:25])
    print("Pred:", y_pred_class[0:25])
Conclusion:
* Classification accuracy is the easiest classification metric to understand
* But, it does not tell you the underlying distribution of response values
* And, it does not tell you what "types" of errors your classifier is making
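
The null accuracy idea from the script also works for any number of classes. Here is a small standalone sketch (the helper name `null_accuracy` is made up for this illustration), applied to the 25 true responses that appear in the output further below:

```python
from collections import Counter


def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / float(len(labels))


# The first 25 true responses from the testing set
y_true = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,
          0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
print("Null accuracy=%.02f" % null_accuracy(y_true))  # prints Null accuracy=0.64
```

A classifier is only interesting if its accuracy is clearly above this baseline.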

Confusion Matrix

A table that describes the performance of a classification model
* Every observation in the testing set is represented in exactly one box
* It's a 2x2 matrix because there are 2 response classes
* The format shown here is not universal

Basic terminology
* True Positive (TP): We correctly predicted that they do have diabetes
* True Negative (TN): We correctly predicted that they don't have diabetes
* False Positive (FP): We incorrectly predicted that they do have diabetes (a "Type I error")
* False Negative (FN): We incorrectly predicted that they don't have diabetes (a "Type II error")

Let's see how our predictions compare with the true values for the first 25 responses:
    # Print the first 25 true and predicted responses
    print("True:", y_test.values[0:25])
    print("Pred:", y_pred_class[0:25])
Output:
True: [1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0]
Pred: [0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]

You can see that our model tends to predict 0 even when the actual class is 1! Now let's extract TP/TN/FP/FN from the confusion matrix:
    # Rows are the actual classes, columns are the predicted classes
    confusion = metrics.confusion_matrix(y_test, y_pred_class)
    TP = confusion[1, 1]
    TN = confusion[0, 0]
    FP = confusion[0, 1]
    FN = confusion[1, 0]
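
To see where these four numbers come from, the following standalone sketch counts them by hand for the 25 true/predicted pairs printed above. The helper `confusion_counts` is a made-up name for this illustration, not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count (TP, TN, FP, FN) for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1   # predicted positive, actually positive
            else:
                fp += 1   # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1   # predicted negative, actually positive
            else:
                tn += 1   # predicted negative, actually negative
    return tp, tn, fp, fn


# The 25 true and predicted responses shown in the output above
y_true = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0,
          0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("TP=%d TN=%d FP=%d FN=%d" % (tp, tn, fp, fn))  # prints TP=2 TN=15 FP=1 FN=7
```

Seven of the nine diabetic patients in this sample are missed, which plain accuracy alone would not reveal.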


Metrics computed from a confusion matrix

Classification Accuracy: Overall, how often is the classifier correct?
    # Classification Accuracy = metrics.accuracy_score(y_test, y_pred_class)
    print("Classification Accuracy=%.02f" % ((TP + TN) / float(TP + TN + FP + FN)))

Classification Error: Overall, how often is the classifier incorrect?
    # Classification Error = 1 - metrics.accuracy_score(y_test, y_pred_class)
    print("Classification Error=%.02f" % ((FP + FN) / float(TP + TN + FP + FN)))

Sensitivity: When the actual value is positive, how often is the prediction correct?
* How "sensitive" is the classifier to detecting positive instances?
* Also known as "True Positive Rate" or "Recall"
    # Sensitivity = metrics.recall_score(y_test, y_pred_class)
    print("Sensitivity=%.02f" % (TP / float(TP + FN)))

Specificity: When the actual value is negative, how often is the prediction correct?
* How "specific" (or "selective") is the classifier in predicting positive instances?
    # Specificity
    print("Specificity=%.02f" % (TN / float(TN + FP)))

False Positive Rate: When the actual value is negative, how often is the prediction incorrect?
    # False Positive Rate = 1 - Specificity
    print("False Positive Rate=%.02f" % (FP / float(TN + FP)))

Precision: When a positive value is predicted, how often is the prediction correct?
* How "precise" is the classifier when predicting positive instances?
    # Precision = metrics.precision_score(y_test, y_pred_class)
    print("Precision=%.02f" % (TP / float(TP + FP)))
Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.
Conclusion
* Confusion matrix gives you a more complete picture of how your classifier is performing
* Also allows you to compute various classification metrics, and these metrics can guide your model selection
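
As a rough sketch of how the metrics above (plus the F1 score and Matthews correlation coefficient mentioned earlier) follow from the four counts, here is a standalone helper. The name `summarize_confusion` is made up for this illustration, and the counts are the ones you get by tallying the 25 sample responses printed earlier (TP=2, TN=15, FP=1, FN=7). It assumes no denominator is zero; scikit-learn provides `metrics.f1_score` and `metrics.matthews_corrcoef` for real use.

```python
import math


def summarize_confusion(tp, tn, fp, fn):
    """Derive common metrics from binary confusion-matrix counts.

    Assumes every denominator is non-zero (no empty row or column).
    """
    total = float(tp + tn + fp + fn)
    sensitivity = tp / float(tp + fn)
    specificity = tn / float(tn + fp)
    precision = tp / float(tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Counts tallied from the 25 sample responses
scores = summarize_confusion(tp=2, tn=15, fp=1, fn=7)
for name in sorted(scores):
    print("%s=%.02f" % (name, scores[name]))
```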

Which metrics should you focus on?
Choice of metrics depends on your business objective. For example:
* Spam filter (positive class is "spam"): Optimize for precision or specificity because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)
* Fraudulent transaction detector (positive class is "fraud"): Optimize for sensitivity because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)

Can we adjust the classification threshold to favor high sensitivity or high specificity?

Adjusting the classification threshold
Let's look at what logistic regression predicts for the first 10 instances:
    # Print the first 10 predicted responses
    print("First 10 predicted responses:\n%s\n" % logreg.predict(X_test)[0:10])

    # Print the first 10 predicted probabilities of class membership
    print("First 10 predicted probabilities:\n%s\n" % logreg.predict_proba(X_test)[0:10, :])

    # Print the first 10 predicted probabilities for class 1
    print("First 10 predicted probabilities for class 1:\n%s\n" % logreg.predict_proba(X_test)[0:10, 1])

Let's store the predicted probabilities for class 1 and draw them as a histogram:
    # Store the predicted probabilities for class 1
    y_pred_prob = logreg.predict_proba(X_test)[:, 1]

    # Import matplotlib and set a readable font size
    import matplotlib.pyplot as plt
    plt.rcParams['font.size'] = 14

    # Histogram of predicted probabilities
    plt.hist(y_pred_prob, bins=8)
    plt.xlim(0, 1)
    plt.title('Histogram of predicted probabilities')
    plt.xlabel('Predicted probability of diabetes')
    plt.ylabel('Frequency')
    plt.show()
The resulting histogram indicates that this model tends to predict class 0.


We can lower the threshold for predicting diabetes (0.5 by default) in order to increase the sensitivity of the classifier. Let's see what happens when we set the threshold to 0.3:
    # Predict diabetes if the predicted probability is greater than 0.3
    # (binarize expects a 2D array, so pass a single row and take it back out)
    from sklearn.preprocessing import binarize
    y_pred_class = binarize(y_pred_prob.reshape(1, -1), threshold=0.3)[0]

    # Print the first 10 predicted probabilities
    print(y_pred_prob[0:10])

    # Print the first 10 predicted classes with the lower threshold
    print(y_pred_class[0:10])

    # Previous confusion matrix (default threshold of 0.5)
    print("Confusion matrix with threshold=0.5:\n%s\n" % confusion)

    # New confusion matrix (threshold of 0.3)
    confusion_new = metrics.confusion_matrix(y_test, y_pred_class)
    TP = confusion_new[1, 1]
    TN = confusion_new[0, 0]
    FP = confusion_new[0, 1]
    FN = confusion_new[1, 0]
    print("Confusion matrix with threshold=0.3:\n%s\n" % confusion_new)

    # Sensitivity has increased (used to be 0.24)
    print("Current Sensitivity=%.02f (Used to be 0.24)" % (TP / float(TP + FN)))

    # Specificity has decreased (used to be 0.91)
    print("Current Specificity=%.02f (Used to be 0.91)" % (TN / float(TN + FP)))
We can see that lowering the threshold causes sensitivity to increase and specificity to decrease!

Conclusion:
* Threshold of 0.5 is used by default (for binary problems) to convert predicted probabilities into class predictions
* Threshold can be adjusted to increase sensitivity or specificity
* Sensitivity and specificity have an inverse relationship
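
The trade-off can be reproduced on a tiny example without any library. The labels and probabilities below are made up for illustration, and the helper names are hypothetical; the comparison uses "greater than", matching the comment in the binarize code above.

```python
def classify(probabilities, threshold):
    """Predict 1 when the probability is strictly greater than the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / float(tp + fn), tn / float(tn + fp)


# Hypothetical true labels and predicted probabilities for class 1
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.80, 0.45, 0.35, 0.20, 0.55, 0.40, 0.25, 0.15, 0.10, 0.05]

results = {}
for threshold in (0.5, 0.3):
    results[threshold] = sensitivity_specificity(y_true, classify(y_prob, threshold))
    print("threshold=%.1f sensitivity=%.02f specificity=%.02f"
          % ((threshold,) + results[threshold]))
```

Lowering the threshold from 0.5 to 0.3 catches more of the positives at the price of more false alarms.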


ROC Curves and Area Under the Curve (AUC)
Question: Wouldn't it be nice if we could see how sensitivity and specificity are affected by various thresholds, without actually changing the threshold?
Answer: Plot the ROC curve!
* ROC curve can help you to choose a threshold that balances sensitivity and specificity in a way that makes sense for your particular context
* You can't actually see the thresholds used to generate the curve on the ROC curve itself
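
Conceptually, each point on an ROC curve is the (false positive rate, true positive rate) pair obtained at one threshold. The sketch below builds those points by hand on made-up data; `roc_points` is a hypothetical helper that predicts positive when the probability is at least the threshold, and it is only meant to illustrate the idea behind `metrics.roc_curve`, not to replace it.

```python
def roc_points(y_true, y_prob):
    """Return (threshold, FPR, TPR) triples, one per distinct score, highest first."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = []
    for threshold in sorted(set(y_prob), reverse=True):
        # Predict positive for every observation scoring at least the threshold
        tp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 1)
        fp = sum(1 for t, p in zip(y_true, y_prob) if p >= threshold and t == 0)
        points.append((threshold, fp / float(n_neg), tp / float(n_pos)))
    return points


# Hypothetical true labels and predicted probabilities for class 1
y_true = [1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
points = roc_points(y_true, y_prob)
for threshold, fpr_value, tpr_value in points:
    print("threshold=%.1f FPR=%.02f TPR=%.02f" % (threshold, fpr_value, tpr_value))
```

As the threshold drops, both rates can only stay the same or grow, which is why the curve climbs from the lower left to the upper right.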

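Let's create a function that reports the sensitivity and specificity at a given threshold, using the values returned by metrics.roc_curve:
    # roc_curve returns the false positive rates, the true positive rates
    # and the thresholds at which they were computed
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)

    def evaluate_threshold(t):
        print('Sensitivity: %.02f' % tpr[thresholds > t][-1])
        print('Specificity: %.02f' % (1 - fpr[thresholds > t][-1]))

    evaluate_threshold(0.5)
    evaluate_threshold(0.3)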
The output:
Sensitivity: 0.24
Specificity: 0.91
Sensitivity: 0.73
Specificity: 0.62

As expected, the lower threshold gives higher sensitivity and lower specificity. Next, we can draw the ROC curve with the code below:
    # IMPORTANT: first argument is true values, second argument is predicted probabilities
    fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.0])
    plt.title('ROC curve for diabetes classifier')
    plt.xlabel('False Positive Rate (1 - Specificity)')
    plt.ylabel('True Positive Rate (Sensitivity)')
    plt.grid(True)
    plt.show()


AUC is the fraction of the ROC plot that lies underneath the curve (see sklearn.metrics.roc_auc_score):
    # IMPORTANT: first argument is true values, second argument is predicted probabilities
    print("AUC=%.02f" % metrics.roc_auc_score(y_test, y_pred_prob))
The output:
AUC=0.72

* AUC is useful as a single number summary of classifier performance.
* If you randomly chose one positive and one negative observation, AUC represents the likelihood that your classifier will assign a higher predicted probability to the positive observation.
* AUC is useful even when there is high class imbalance (unlike classification accuracy)
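
The "randomly chosen pair" interpretation can be checked directly by comparing every positive observation with every negative one. The sketch below does that on made-up data; `pairwise_auc` is a hypothetical helper that counts ties as half a win, and it is quadratic in the number of observations, so it is for illustration only.

```python
def pairwise_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [p for t, p in zip(y_true, y_prob) if t == 1]
    negatives = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0   # positive ranked above negative
            elif pos == neg:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical true labels and predicted probabilities for class 1
y_true = [1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
auc = pairwise_auc(y_true, y_prob)
print("AUC=%.02f" % auc)
```

Here 8 of the 9 positive/negative pairs are ranked correctly, so the AUC is 8/9.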

You can also use it as the scoring metric during cross-validation:
    # Calculate cross-validated AUC
    # (cross_val_score lives in sklearn.model_selection since scikit-learn 0.18)
    from sklearn.model_selection import cross_val_score
    print(cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean())

Confusion matrix advantages:
* Allows you to calculate a variety of metrics
* Useful for multi-class problems (more than two response classes)

ROC/AUC advantages
* Does not require you to set a classification threshold
* Still useful when there is high class imbalance

