程式扎記: [Udemy] Ensemble Machine Learning in Python: Random Forest, AdaBoost

Tuesday, August 1, 2017

[Udemy] Ensemble Machine Learning in Python: Random Forest, AdaBoost - Part1

Source From Here
Getting Started

Outline & Motivation (link)
Motivation

* ML/AI has become popular in recent years
* Amazing results: ML can analyze and predict disease on par with human experts
* AlphaGo/Deep Reinforcement Learning beat the world champion at the strategy game Go
* Self-driving cars -> will remove element of human error
* Google announced they are "machine learning first"
* ML is embedded into many different types of products in many industries
* Will open up a wide array of career opportunities

Outline

* Bias-variance trade-off
* Bootstrap
* Bagging (applying bootstrap to ML models)
* Random forest
* AdaBoost

Where to get the Code and Data (link)
* Github Location

All Data is the Same (link)

Plug-and-Play (link)

* A lot of people ask "Why is there so much math in ML?"
* Sorry to burst your bubble: Machine learning IS math
* NOT plug-and-play into Scikit-learn
* At your real job, you will probably plug-and-play all the time
* But to be a good data scientist, this would be in addition to learning how the algorithms work
* To understand why ensembles are good for plug-and-play, you need theory from this course.

Bias-Variance Trade-Off

Bias-Variance Key Terms (link)

Irreducible error

* Data-generating processes are noisy
* Noise is by definition random (not deterministic)
* Can't predict its values, only its statistics (like mean & variance)
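A minimal sketch of the irreducible error idea, assuming a made-up true function `true_f` and a made-up noise level `noise_std` (neither comes from the course). Even when we predict with the true f(x) itself, the mean squared error does not go to zero; it settles near the noise variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

noise_std = 0.5  # assumed noise level, not taken from the course
x = rng.uniform(0, 1, size=100_000)
y = true_f(x) + rng.normal(0, noise_std, size=x.shape)

# Predict with the true function itself: the only error left is the noise.
mse_of_true_f = np.mean((y - true_f(x)) ** 2)
print(mse_of_true_f, noise_std ** 2)
```

The two printed numbers should be close: we cannot predict individual noise values, but we can recover their variance.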

Bias

* Bias refers to the delta between your average model and the true f(x)
* Some sources call the square of this quantity the bias; we won't: bias = E[f(x) - f_hat(x)]

Variance

* Has nothing to do with accuracy
* Variance just measures how "inconsistent" a predictor is, over different training sets
* Remember: the goal is not to achieve the lowest possible error on the training points
* Goal is to find true f(x)
* Being close to training points is only a proxy solution
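The definitions of bias and variance above can be estimated by simulation. The sketch below is an illustration under stated assumptions, not the course code: `true_f`, the noise level, and the polynomial degrees are all hypothetical, and for simplicity the inputs are fixed so that only the noise differs between training sets.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical true function f(x), chosen only for illustration.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)        # fixed inputs; only the noise changes
x_test = np.linspace(0.05, 0.95, 50)
noise_std = 0.3                        # assumed noise level

def bias_and_variance(degree, n_datasets=300):
    """Fit one polynomial per simulated training set and summarize the fits."""
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        y_train = true_f(x_train) + rng.normal(0, noise_std, x_train.size)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[i] = np.polyval(coefs, x_test)
    # Bias: true f(x) minus the average model, same sign as in the notes.
    bias = true_f(x_test) - preds.mean(axis=0)
    # Variance: how much the predictions move from one training set to another.
    variance = preds.var(axis=0)
    return np.mean(bias ** 2), np.mean(variance)

bias2_simple, var_simple = bias_and_variance(degree=1)
bias2_flexible, var_flexible = bias_and_variance(degree=9)
print("degree 1:", bias2_simple, var_simple)
print("degree 9:", bias2_flexible, var_flexible)
```

Note that the variance is computed without ever looking at the true f(x), which is why it "has nothing to do with accuracy".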

Model complexity

* You might assume linear models are not complex because nonlinear models are more "expressive"
* Linear doesn't necessarily mean not complex
* Large D linear model can be more complex than small D nonlinear model
* "Complexity" is not a universal measurement

Bias-Variance Trade-Off (link)

* In ML we strive to minimize error
* We've already seen the best we can do is the irreducible error
* We can achieve this when we know the true f(x)
* In this case the reducible part of the errors is 0
* Goal is to make bias and variance as small as possible!

* Is it possible to achieve lower bias and lower variance at the same time?
* Trade-off occurs in the context of altering the complexity of the same model
* What if we combine models?
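To see why combining models is worth asking about, here is a rough sketch that averages several high-variance models. It assumes each model gets its own independent training set, which is unrealistic in practice (the bootstrap and bagging topics in the outline deal with that); `true_f`, the degree, and the noise level are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    # Hypothetical true function, for illustration only.
    return np.sin(2 * np.pi * x)

x_train = np.linspace(0, 1, 30)
x_test = np.linspace(0.05, 0.95, 50)

def one_model_prediction(degree=9, noise_std=0.3):
    # Train a flexible (high-variance) model on a freshly drawn training set.
    y = true_f(x_train) + rng.normal(0, noise_std, x_train.size)
    return np.polyval(np.polyfit(x_train, y, degree), x_test)

def ensemble_prediction(n_models=10):
    # Average the predictions of several independently trained models.
    return np.mean([one_model_prediction() for _ in range(n_models)], axis=0)

single = np.array([one_model_prediction() for _ in range(200)])
ensemble = np.array([ensemble_prediction() for _ in range(200)])

var_single = single.var(axis=0).mean()
var_ensemble = ensemble.var(axis=0).mean()
print(var_single, var_ensemble)
```

Averaging leaves the average model (and so the bias) unchanged while shrinking the variance, so the trade-off within a single model does not have to apply to a combination of models.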

Bias-Variance Decomposition (link)

Expected error = bias^2 + variance + irreducible error
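Written out with the symbols used in these notes, the decomposition looks like the following. This assumes the usual setup, which the notes do not spell out: targets are y = f(x) + ε with zero-mean noise of variance σ², the noise at the test point is independent of the fitted model, and the expectation is taken over training sets and noise.

```latex
y = f(x) + \varepsilon, \qquad
\mathbb{E}[\varepsilon] = 0, \qquad
\operatorname{Var}(\varepsilon) = \sigma^2

\mathbb{E}\left[\big(y - \hat{f}(x)\big)^2\right]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The cross terms vanish because the noise has zero mean and is independent of the fitted model, and because the deviation of f_hat(x) from its own mean averages to zero.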

Polynomial Regression Demo (link)
Here we use simple sample code (bias_variance_demo.py) to show the bias-variance trade-off by fitting linear regression on polynomial features of different degrees. First, a selection of the results for different degrees:

The plot below shows the trend of bias and variance as the degree increases:

(degree up -> bias down, variance up)

Finally, the optimal degree sits at the bottom of the test error curve:
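The course script itself is not reproduced here. As a stand-in, the sketch below sweeps the polynomial degree on synthetic data and records training and test error; `true_f`, the sample sizes, and the noise level are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    # Hypothetical true function, not the one used in bias_variance_demo.py.
    return np.sin(2 * np.pi * x)

def make_data(n, noise_std=0.3):
    x = rng.uniform(0, 1, n)
    return x, true_f(x) + rng.normal(0, noise_std, n)

x_train, y_train = make_data(40)
x_test, y_test = make_data(1000)

degrees = list(range(1, 11))
train_mse, test_mse = [], []
for d in degrees:
    coefs = np.polyfit(x_train, y_train, d)  # least-squares polynomial fit
    train_mse.append(np.mean((np.polyval(coefs, x_train) - y_train) ** 2))
    test_mse.append(np.mean((np.polyval(coefs, x_test) - y_test) ** 2))

# The "optimal" degree is the one at the bottom of the test error curve.
best_degree = degrees[int(np.argmin(test_mse))]
for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={d:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
print("best degree by test error:", best_degree)
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones; test error is the curve that eventually turns back up.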

K-Nearest Neighbor and Decision Tree Demo (link)
This part uses sample code (knn_dt_demo.py) to show what "low bias & high variance" and "high bias & low variance" look like for decision trees and K-Nearest Neighbor. First, let's take a look at the regression task:

Next is the classification task:
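The snippet below is not knn_dt_demo.py; it is a small stand-in for the regression case only, using scikit-learn's `KNeighborsRegressor` and `DecisionTreeRegressor` on synthetic data. The true function, sample sizes, and hyperparameter values are assumptions picked to make the two regimes visible.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical true function, for illustration only.
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.05, 0.95, 50).reshape(-1, 1)

def bias2_and_variance(make_model, n_datasets=100, n_train=60, noise_std=0.3):
    """Train a fresh model on each simulated training set and compare the fits."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, (n_train, 1))
        y = true_f(x).ravel() + rng.normal(0, noise_std, n_train)
        preds.append(make_model().fit(x, y).predict(x_test))
    preds = np.array(preds)
    bias2 = np.mean((true_f(x_test).ravel() - preds.mean(axis=0)) ** 2)
    return bias2, preds.var(axis=0).mean()

results = {
    "knn k=1": bias2_and_variance(lambda: KNeighborsRegressor(n_neighbors=1)),
    "knn k=30": bias2_and_variance(lambda: KNeighborsRegressor(n_neighbors=30)),
    "tree full depth": bias2_and_variance(lambda: DecisionTreeRegressor()),
    "tree depth=1": bias2_and_variance(lambda: DecisionTreeRegressor(max_depth=1)),
}
for name, (b2, var) in results.items():
    print(f"{name:16s} bias^2={b2:.3f}  variance={var:.3f}")
```

Small K and unrestricted tree depth follow the training points closely (low bias, high variance); large K and shallow trees smooth heavily (high bias, low variance).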

Cross-Validation as a Method for Optimizing Model Complexity (link)
Cross-Validation

* Cross-validation can help us to optimize the bias-variance trade-off
* We've already looked at cross-validation as a way of choosing hyperparameters.
* Motivation: we didn't just want good training error, we wanted good generalization error too.
* In the polynomial regression example, we saw that the test error tracks the sum bias^2 + variance (plus the irreducible error, which we cannot change).
* So by optimizing test error, we optimize bias-variance as well

Sample code of K-Fold Cross Validation:

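import numpy as np
# Manual K-fold cross-validation; X, Y, model, and K are assumed to be defined.
N = len(X)   # N is the number of records in the dataset
sz = N // K  # fold size; integer division so it can be used as a slice index
scores = []
for i in range(K):
    # Fold i is held out for validation; the remaining records are used for training.
    Xvalid, Yvalid = X[i*sz:(i+1)*sz], Y[i*sz:(i+1)*sz]
    Xtrain = np.concatenate((X[:i*sz], X[(i+1)*sz:]), axis=0)
    Ytrain = np.concatenate((Y[:i*sz], Y[(i+1)*sz:]), axis=0)
    model.fit(Xtrain, Ytrain)
    scores.append(model.score(Xvalid, Yvalid))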

For Scikit-learn (Obtaining predictions by cross-validation):

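from sklearn import datasets, metrics, svm
from sklearn.model_selection import cross_val_predict
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)  # example classifier (an assumption); any estimator works
predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
metrics.accuracy_score(iris.target, predicted)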
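To tie this back to model complexity, the sketch below uses `cross_val_score` to pick K for a K-Nearest Neighbor classifier on the same iris data. The candidate K values are arbitrary choices for this example, not values from the course.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
k_values = [1, 3, 5, 7, 9, 11, 15, 21, 31]  # arbitrary candidate complexities

mean_scores = {}
for k in k_values:
    clf = KNeighborsClassifier(n_neighbors=k)
    # For classifiers, an integer cv uses stratified K-fold splits.
    scores = cross_val_score(clf, iris.data, iris.target, cv=10)
    mean_scores[k] = scores.mean()

# Keep the K with the best average validation accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print("best K:", best_k, "mean CV accuracy:", mean_scores[best_k])
```

Small K corresponds to a more complex model and large K to a simpler one, so scanning K with cross-validation is one way of searching along the bias-variance trade-off.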

Supplement
* ML In Action - Improving classification with the AdaBoost meta-algorithm
* Bias, Variance, and Overfitting – Machine Learning Overview part 4 of 4
* Scikit-learn - Selecting the best model in scikit-learn using cross-validation
* Intro2ML - Ch6. Model Evaluation and Improvement - Cross Validation
