Source From Here
Preface
Agenda
Review of model evaluation
Model evaluation procedures
1. Training and testing on the same data
2. Train/test split
3. K-fold cross validation
Model evaluation metrics
Classification accuracy
- test.py
- #!/usr/bin/env python
- import pandas as pd
- url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
- col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
- pima = pd.read_csv(url, header=None, names=col_names)
- # print the first 5 rows of data
- print pima.head()
- # Question: Can we predict the diabetes status of a patient given their health measurements?
- # Define X and y
- feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
- X = pima[feature_cols]
- y = pima.label
- # Split X and y into training and testing sets
- from sklearn.cross_validation import train_test_split
- X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
- # Train a logistic regression model on the training set
- from sklearn.linear_model import LogisticRegression
- logreg = LogisticRegression()
- logreg.fit(X_train, y_train)
- # Make class predictions for the testing set
- y_pred_class = logreg.predict(X_test)
- # Calculate accuracy
- from sklearn import metrics
- #print metrics.accuracy_score(y_test, y_pred_class)
- print "Classification accuracy=%.02f (Logistic)" % metrics.accuracy_score(y_test, y_pred_class)
- # Null accuracy: accuracy that could be achieved by always predicting the most frequent class
- print "Distribution of lable:\n%s\n" % y_test.value_counts()
- # The percentage of ones is y_test.mean(); the percentage of zeros is 1 - y_test.mean()
- # Calculate null accuracy (for binary classification problems coded as 0/1)
- print "Null accuracy=%.02f" % (max(y_test.mean(), 1-y_test.mean()))
- # For multi-class classification problems
- # print "Null accuracy=%.02f" % (y_test.value_counts().head(1) / len(y_test))
- # Comparing the true and predicted response values
- # Print the first 25 true and predicted responses
- print "True:", y_test.values[0:25]
- print "Pred:", y_pred_class[0:25]
Confusion Matrix
A table that describes the performance of a classification model
Basic terminology
* True Positive (TP): the model correctly predicts the positive class (predicts diabetes, and the patient actually has diabetes)
* True Negative (TN): the model correctly predicts the negative class
* False Positive (FP): the model incorrectly predicts the positive class (a "Type I error")
* False Negative (FN): the model incorrectly predicts the negative class (a "Type II error")
Let's check how our predictions look by comparing the first 25 true and predicted responses:
- # Print the first 25 true and predicted responses
- print "True:", y_test.values[0:25]
- print "Pred:", y_pred_class[0:25]
You can observe that the model often predicts 0 when the actual class is 1, i.e. it misses many positive cases! Now let's extract TP/TN/FP/FN from the confusion matrix:
- # Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
- confusion = metrics.confusion_matrix(y_test, y_pred_class)
- print "Confusion matrix:\n%s\n" % confusion
- TP = confusion[1, 1]  # Actually positive, predicted positive
- TN = confusion[0, 0]  # Actually negative, predicted negative
- FP = confusion[0, 1]  # Actually negative, predicted positive
- FN = confusion[1, 0]  # Actually positive, predicted negative
Metrics computed from a confusion matrix
Classification Accuracy: Overall, how often is the classifier correct?
- # Classification Accuracy = metrics.accuracy_score(y_test, y_pred_class)
- print "Classification Accuracy=%.02f" % ((TP+TN)/float(TP+TN+FP+FN))
Classification Error: Overall, how often is the classifier incorrect?
- # Classification Error = 1 - metrics.accuracy_score(y_test, y_pred_class)
- print "Classification Error=%.02f" % ((FP+FN)/float(TP+TN+FP+FN))
Sensitivity: When the actual value is positive, how often is the prediction correct?
* How "sensitive" is the classifier to detecting positive instances?
* Also known as "True Positive Rate" or "Recall"
- # Sensitivity = metrics.recall_score(y_test, y_pred_class)
- print "Sensitivity=%.02f" % ((TP/float(TP+FN)))
Specificity: When the actual value is negative, how often is the prediction correct?
* How "specific" (or "selective") is the classifier in predicting positive instances?
- # Specificity
- print "Specificity=%.02f" % ((TN/float(TN+FP)))
False Positive Rate: When the actual value is negative, how often is the prediction incorrect?
- # False Positive Rate
- print "False Positive Rate=%.02f" % (FP/float(TN+FP))
Precision: When a positive value is predicted, how often is the prediction correct?
* How "precise" is the classifier when predicting positive instances?
- # Precision = metrics.precision_score(y_test, y_pred_class)
- print "Precision=%.02f" % (TP/float(TP+FP))
Conclusion
Which metrics should you focus on?
The choice of metric depends on your business objective. For example, a spam filter should optimize for precision or specificity, because false positives (good mail routed to the spam folder) are more costly than false negatives, whereas a fraud detector should optimize for sensitivity, because false negatives (missed fraud) are more costly than false positives.
A convenient way to inspect several of these metrics at once is shown below. After that, let's see whether we can adjust the classification threshold to favor high sensitivity or high specificity.
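sklearn's classification_report prints per-class precision, recall and F1 in one call; a quick sketch reusing y_test and y_pred_class:
- # Per-class precision, recall and F1 score in a single report
- print metrics.classification_report(y_test, y_pred_class)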
Adjusting the classification threshold
Let's check how the logistic regression model predicts the first 10 instances:
- # Print the first 10 predicted responses
- print "First 10 predicted responses:\n%s\n" % logreg.predict(X_test)[0:10]
- # Print the first 10 predicted probabilities of class membership
- print "First 10 predicted probabilities:\n%s\n" % logreg.predict_proba(X_test)[0:10, :]
- # Print the first 10 predicted probabilities for class 1
- print "First 10 predicted probabilities for class 1:\n%s\n" % logreg.predict_proba(X_test)[0:10, 1]
Let's store the predicted probabilities for class 1 and plot them as a histogram:
- # Store the predicted probabilities for class 1
- y_pred_prob = logreg.predict_proba(X_test)[:, 1]
- # Plotting setup (use %matplotlib inline if running in a notebook)
- import matplotlib.pyplot as plt
- plt.rcParams['font.size'] = 14
- # Histogram of predicted probabilities
- plt.hist(y_pred_prob, bins=8)
- plt.xlim(0, 1)
- plt.title('Histogram of predicted probabilities')
- plt.xlabel('Predicted probability of diabetes')
- plt.ylabel('Frequency')
- plt.show()
The histogram shows that most of the predicted probabilities fall below 0.5, so the model tends to predict class 0. We can decrease the threshold (currently 0.5) for predicting diabetes in order to increase the sensitivity of the classifier. Let's see what happens if we lower the threshold to 0.3:
- # predict diabetes if the predicted probability is greater than 0.3
- from sklearn.preprocessing import binarize
- # binarize expects a 2-D array, so reshape the 1-D probability array first
- y_pred_class = binarize(y_pred_prob.reshape(1, -1), threshold=0.3)[0]
- # Print the first 10 predicted probabilities
- print y_pred_prob[0:10]
- # Print the first 10 predicted classes with the lower threshold
- print y_pred_class[0:10]
- # Previous confusion matrix (default threshold of 0.5)
- print "Confusion matrix with threshold=0.5:\n%s\n" % confusion
- # New confusion matrix (threshold of 0.3)
- confusion_new = metrics.confusion_matrix(y_test, y_pred_class)
- TP = confusion_new[1, 1]
- TN = confusion_new[0, 0]
- FP = confusion_new[0, 1]
- FN = confusion_new[1, 0]
- print "Confusion matrix with threshold=0.3:\n%s\n" % confusion_new
- # Sensitivity has increased (used to be 0.24)
- print "Current Sensitivity=%.02f (Used to be 0.24)" % ((TP/float(TP+FN)))
- # Specificity has decreased (used to be 0.91)
- print "Current Specificity=%.02f (Used to be 0.91)" % ((TN/float(TN+FP)))
This confirms that decreasing the threshold increases sensitivity and decreases specificity!
Conclusion:
* By default, a 0.5 threshold is used to convert predicted probabilities into class predictions.
* The threshold can be adjusted to increase sensitivity or specificity.
* Sensitivity and specificity have an inverse relationship: increasing one decreases the other.
ROC Curves and Area Under the Curve (AUC)
Question: Wouldn't it be nice if we could see how sensitivity and specificity are affected by various thresholds, without actually changing the threshold?
Answer: Plot the ROC curve!
Let's create a function to calculate the sensitivity and specificity for a specific threshold:
- # fpr, tpr and thresholds come from the ROC computation (the same call is used for the plot below)
- fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
- def evaluate_threshold(t):
-     # Report sensitivity/specificity at the smallest ROC threshold above t
-     print 'Sensitivity: %.02f' % tpr[thresholds > t][-1]
-     print 'Specificity: %.02f' % (1 - fpr[thresholds > t][-1])
- evaluate_threshold(0.5)
- evaluate_threshold(0.3)
As expected, the lower threshold gives higher sensitivity and lower specificity. Next, we can draw the ROC curve with the code below:
- # IMPORTANT: first argument is true values, second argument is predicted probabilities
- fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
- plt.plot(fpr, tpr)
- plt.xlim([0.0, 1.0])
- plt.ylim([0.0, 1.0])
- plt.title('ROC curve for diabetes classifier')
- plt.xlabel('False Positive Rate (1 - Specificity)')
- plt.ylabel('True Positive Rate (Sensitivity)')
- plt.grid(True)
- plt.show()
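To connect the curve back to evaluate_threshold, you can mark the point whose threshold is closest to 0.5 on the plot. A sketch, assuming fpr, tpr and thresholds from metrics.roc_curve above:
- import numpy as np
- # Index of the ROC point whose threshold is closest to 0.5
- idx = np.argmin(np.abs(thresholds - 0.5))
- plt.plot(fpr, tpr)
- plt.scatter([fpr[idx]], [tpr[idx]], c='red', label='threshold ~ 0.5')
- plt.xlabel('False Positive Rate (1 - Specificity)')
- plt.ylabel('True Positive Rate (Sensitivity)')
- plt.legend(loc='lower right')
- plt.show()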
AUC is the percentage of the ROC plot that is underneath the curve; it can be computed with sklearn.metrics.roc_auc_score:
- # IMPORTANT: first argument is true values, second argument is predicted probabilities
- print "AUC=%.02f" % metrics.roc_auc_score(y_test, y_pred_prob)
You can also use AUC as the scoring metric when doing cross-validation:
- # Calculate cross-validated AUC
- from sklearn.cross_validation import cross_val_score
- print cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
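For comparison, the same 10-fold cross-validation can be scored with plain classification accuracy instead; a sketch reusing logreg, X and y:
- # Cross-validated classification accuracy for the same model and features
- print cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()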
Confusion matrix advantages:
* Lets you compute a wide variety of metrics (accuracy, error, sensitivity, specificity, false positive rate, precision, ...)
* Remains useful for multi-class problems
ROC/AUC advantages:
* Does not require you to commit to a single classification threshold
* Summarizes how sensitivity and specificity trade off across all thresholds in a single plot/number