Preface
Agenda
* How do I choose which model to use for my supervised learning task?
* How do I choose the best tuning parameters for that model?
* How do I estimate the likely performance of my model on out-of-sample data?
Review
* Classification task: Predicting the species of an unknown iris
* Used three classification models: KNN(K=1), KNN(K=5), logistic regression
* Need a way to choose between the models
Evaluation Procedure
Evaluation procedure #1: Train and test on the entire dataset
1. Train the model on the entire dataset.
2. Test the model on the same dataset, and evaluate how well we did by comparing the predicted response values with the true response values
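A minimal sketch of this procedure on the iris data, assuming scikit-learn's built-in loader (max_iter is raised only so the solver converges cleanly; it is not part of the procedure):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# load the iris data: X holds the four measurements, y the species labels
iris = load_iris()
X, y = iris.data, iris.target

# step 1: train the model on the entire dataset
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)

# step 2: predict on the same dataset we trained on
y_pred = logreg.predict(X)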
Logistic regression
Classification accuracy:
* Proportion of correct predictions
* Common evaluation metric for classification problems
Known as training accuracy when you train and test the model on the same data
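Continuing the sketch above, training accuracy is just scikit-learn's accuracy_score applied to predictions made on the training data:

from sklearn import metrics

# proportion of correct predictions on the data the model was trained on
print(metrics.accuracy_score(y, y_pred))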
KNN(K=5)
KNN(K=1)
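The same check for both KNN models, reusing X, y, and metrics from the sketch above, shows why training accuracy can mislead:

from sklearn.neighbors import KNeighborsClassifier

# training accuracy for KNN with K=5 and K=1
for k in [5, 1]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    y_pred = knn.predict(X)
    print(k, metrics.accuracy_score(y, y_pred))

# K=1 recalls the memorized label of each training observation
# (every point is its own nearest neighbor), so it scores 1.0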
Can we conclude that KNN(K=1) has the best performance? No: KNN(K=1) has simply memorized the training data, so it recalls the correct label for every training observation:
* Goal is to estimate likely performance of a model on out-of-sample data
* But, maximizing training accuracy rewards overly complex models that won't necessarily generalize
* Unnecessarily complex models overfit the training data
In the usual overfitting illustration, the green line is influenced by noise and overfits, while the black line learns the signal and captures the general structure of the training data.
Evaluation procedure #2: Train/test split
1. Split the dataset into two pieces: a training set and a testing set
2. Train the model on the training set.
3. Test the model on the testing set, and evaluate how well we did.
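A sketch of this procedure with scikit-learn (train_test_split lives in sklearn.model_selection in current releases; test_size=0.4 and random_state=4 are illustrative choices, not requirements):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# step 1: split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=4)

# step 2: train the model on the training set
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# step 3: test the model on the testing set
y_pred = logreg.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))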
What did this accomplish?
* Model can be trained and tested on different data, so the testing set acts as a stand-in for out-of-sample data
* Response values are known for the testing set, and thus predictions can be evaluated.
* Testing accuracy is a better estimate than training accuracy of out-of-sample performance.
Logistic regression performance
KNN(K=1) Performance
KNN(K=5) Performance
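One way to produce the testing accuracies named above, reusing the split from the sketch (the exact numbers depend on the random_state):

from sklearn.neighbors import KNeighborsClassifier

# testing accuracy for KNN with K=1 and K=5 on the held-out data
for k in [1, 5]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(k, metrics.accuracy_score(y_test, y_pred))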
* Training accuracy rises as model complexity increases
* Testing accuracy penalizes models that are too complex or not complex enough.
* For KNN models, complexity is determined by the value of K (lower value = more complex)
Making predictions on out-of-sample data
# try K=1 through K=25 and record testing accuracy
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import matplotlib.pyplot as plt

k_range = range(1, 26)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(metrics.accuracy_score(y_test, y_pred))

# plot the relationship between K and testing accuracy
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Testing Accuracy')
plt.show()
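Once the plot suggests a good value of K, one reasonable final step is to retrain on the entire dataset before predicting for new observations (K=11 and the feature values below are illustrative, not prescribed):

# retrain with the chosen K on all available data,
# so the final model sees every observation
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X, y)

# predict the species for a made-up out-of-sample observation
print(knn.predict([[3, 5, 4, 2]]))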
Supplement
* Previous section - Training a machine learning model with scikit-learn
* Next section - Data science in Python: pandas, seaborn, scikit-learn
* Quora explanation of overfitting: http://www.quora.com/What-is-an-intuitive-explanat...-overfitting/answer/Jessica-Su
* Estimating prediction error (video): https://www.youtube.com/watch?v=_2ij6eaaSl0&t=2m34s
* Understanding the Bias-Variance Tradeoff: http://scott.fortmann-roe.com/docs/BiasVariance.html
* Guiding questions for that article: https://github.com/justmarkham/DAT5/blob/master/homework/06_bias_variance.md
* Visualizing bias and variance (video): http://work.caltech.edu/library/081.html