## Tuesday, August 1, 2017

### [Udemy] Ensemble Machine Learning in Python: Random Forest, AdaBoost - Part1

Source From Here
Getting Started

Motivation
* ML/AI has become popular in recent years
* Amazing results: ML can analyze and predict disease on par with human experts
* AlphaGo/Deep Reinforcement Learning beat the world champion at the strategy game Go
* Self-driving cars will remove the element of human error
* Google announced they are "machine learning first"
* ML is embedded into many different types of products in many industries
* Will open up a wide array of career opportunities

Outline
* Bootstrap
* Bagging (applying bootstrap to ML models)
* Random forest

Where to get the Code and Data (link)
GitHub Location

All Data is the Same (link)

* A lot of people ask "Why is there so much math in ML?"
* Sorry to burst your bubble: Machine learning IS math
* NOT plug-and-play into Scikit-learn
* At your real job, you will probably plug-and-play all the time
* But to be a good data scientist, this would be in addition to learning how the algorithms work
* To understand why ensembles are good for plug-and-play, you need theory from this course.

Irreducible error
* Data-generating processes are noisy
* Noise is by definition random (not deterministic)
* Can't predict its values, only its statistics (like mean & variance)

Bias
* Bias refers to the delta between your average model and the true f(x)
* Some sources call the square of this quantity the bias; we won't. Here, bias = E[f(x) - f_hat(x)], where the expectation is taken over training sets

Variance
* Has nothing to do with accuracy
* Variance just measures how "inconsistent" a predictor is, over different training sets
* Remember: the goal is not to achieve the lowest possible training error
* Goal is to find true f(x)
* Being close to training points is only a proxy solution
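
The definitions of bias and variance above can be estimated empirically by refitting the same kind of model on many independently drawn training sets. The following is a minimal sketch, not code from the course: the true function `f`, the noise level, the polynomial model, and the helper name `estimate_bias_variance` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Assumed "true" function for this illustration
    return np.sin(x)

def estimate_bias_variance(degree, n_sets=200, n_train=30, noise_std=0.3):
    """Fit a polynomial of the given degree on many noisy training sets and
    return (mean squared bias, mean variance) over a fixed test grid."""
    x_test = np.linspace(0, 2 * np.pi, 50)
    preds = np.empty((n_sets, x_test.size))
    for s in range(n_sets):
        x = rng.uniform(0, 2 * np.pi, n_train)
        y = f(x) + rng.normal(0, noise_std, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[s] = np.polyval(coefs, x_test)
    avg_pred = preds.mean(axis=0)   # the "average model"
    bias = f(x_test) - avg_pred     # bias = E[f(x) - f_hat(x)]
    variance = preds.var(axis=0)    # inconsistency across training sets
    return float(np.mean(bias ** 2)), float(np.mean(variance))

for d in (1, 3, 5):
    b2, var = estimate_bias_variance(d)
    print(f"degree={d}: bias^2={b2:.4f}, variance={var:.4f}")
```

Under these assumptions, the straight-line fit should show the largest squared bias and the smallest variance, while the degree-5 fit shows the opposite pattern.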

Model complexity
* You might assume linear models are not complex because nonlinear models are more "expressive"
* Linear doesn't necessarily mean not complex
* Large D linear model can be more complex than small D nonlinear model
* "Complexity" is not a universal measurement

* In ML we strive to minimize error
* We've already seen the best we can do is the irreducible error
* We can achieve this when we know the true f(x)
* In this case the reducible part of the errors is 0
* Goal is to make bias and variance as small as possible!
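
As a quick numerical check of the claim that knowing the true f(x) leaves only the irreducible error, the sketch below scores the true function itself against noisy targets. The linear `f`, the noise level `noise_std`, and the sample size are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_std = 0.5   # assumed noise level for this illustration
n = 100_000

def f(x):
    # Assumed true function
    return 2.0 * x + 1.0

x = rng.uniform(-1, 1, n)
y = f(x) + rng.normal(0, noise_std, n)   # observed targets = signal + noise

# Even the true f(x) cannot predict the noise, so its error is about noise_std**2
mse_true_f = float(np.mean((y - f(x)) ** 2))
print(f"MSE of the true f(x): {mse_true_f:.4f}")
print(f"noise variance:       {noise_std ** 2:.4f}")
```

The two printed numbers should be close to each other: the remaining error is the variance of the noise, which no model can remove.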

* Is it possible to achieve lower bias and lower variance at the same time?
* Trade-off occurs in the context of altering the complexity of the same model
* What if we combine models?

Expected error = bias^2 + variance + irreducible error
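
Written out for squared error, the decomposition above takes the following form. This assumes observations y = f(x) + ε with zero-mean noise of variance σ² that is independent of the training set; the symbols y, ε and σ are introduced here for the derivation and are not from the original notes. Expectations are over training sets and noise.

```latex
\begin{aligned}
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  &= \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
   + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
   + \underbrace{\sigma^2}_{\text{irreducible error}}
\end{aligned}
```

The cross terms vanish because the noise has zero mean and is independent of the fitted model, and because the deviation of the model from its own average has zero mean.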

Here we use simple sample code (bias_variance_demo.py) to show the bias-variance trade-off by fitting linear regression on polynomial features of different degrees. First, a selection of results for several degrees:

The plot below shows how "Bias" and "Variance" change as the degree increases:
(degree up -> bias down, variance up)

Finally, the optimal degree is located at the minimum of the test error curve:
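
The degree sweep described above can be sketched as follows. This is not the course's bias_variance_demo.py; it is a small stand-in that assumes a noisy sine wave as the data-generating process and uses `np.polyfit` in place of explicit polynomial features plus linear regression (the two are equivalent least-squares fits).

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n, noise_std=0.3):
    # Assumed data-generating process: noisy sine wave
    x = rng.uniform(0, 2 * np.pi, n)
    y = np.sin(x) + rng.normal(0, noise_std, n)
    return x, y

x_train, y_train = make_data(40)
x_test, y_test = make_data(1000)

def mse(coefs, x, y):
    # Mean squared error of the polynomial with the given coefficients
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

degrees = range(1, 9)
train_err, test_err = [], []
for d in degrees:
    coefs = np.polyfit(x_train, y_train, d)   # least-squares polynomial fit
    train_err.append(mse(coefs, x_train, y_train))
    test_err.append(mse(coefs, x_test, y_test))

# The "optimal" degree is the one at the bottom of the test error curve
best_degree = degrees[int(np.argmin(test_err))]
print("best degree by test error:", best_degree)
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones, so it is the test error that identifies a sensible degree.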

K-Nearest Neighbor and Decision Tree Demo (link)
This part uses sample code (knn_dt_demo.py) to show what "low bias & high variance" and "high bias & low variance" look like for decision tree and K-nearest neighbor models. First, let's look at the regression task:
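
A minimal sketch of the two regimes on a regression task, using scikit-learn's `KNeighborsRegressor` and `DecisionTreeRegressor`. The toy data (a noisy sine wave) and the specific settings of k and tree depth are assumptions for illustration and are not taken from knn_dt_demo.py.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Assumed toy data: noisy sine wave
X = np.sort(rng.uniform(0, 2 * np.pi, 100)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.3, 100)

models = {
    "knn_k1": KNeighborsRegressor(n_neighbors=1),        # low bias, high variance
    "knn_k50": KNeighborsRegressor(n_neighbors=50),      # high bias, low variance
    "tree_full": DecisionTreeRegressor(random_state=0),  # low bias, high variance
    "tree_depth1": DecisionTreeRegressor(max_depth=1, random_state=0),  # high bias, low variance
}

train_r2 = {}
for name, model in models.items():
    model.fit(X, y)
    train_r2[name] = model.score(X, y)   # R^2 on the training data
    print(f"{name}: train R^2 = {train_r2[name]:.3f}")
```

The flexible models (k=1, unlimited depth) reproduce every training point, noise included, while the constrained ones cannot follow the curve closely. A perfect training score here is a symptom of high variance, not evidence of a good model.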

Cross-Validation as a Method for Optimizing Model Complexity (link)
Cross-Validation
* Cross-validation can help us to optimize the bias-variance trade-off
* We've already looked at cross-validation as a way of choosing hyperparameters.
* Motivation: we didn't just want good training error, we wanted good generalization error too.
* In the polynomial regression example, we saw that the test error tracks bias^2 + variance (offset by the irreducible error).
* So by optimizing test error, we optimize bias-variance as well

Sample code of K-Fold Cross Validation:
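```python
import numpy as np

# X, Y, model, and K are assumed to be defined already
N = len(X)   # N is the number of records in the dataset
scores = []
sz = N // K  # fold size; integer division so it can be used as a slice index
for i in range(K):
    # Hold out the i-th fold for validation and train on the rest
    Xvalid, Yvalid = X[i*sz:(i+1)*sz], Y[i*sz:(i+1)*sz]
    Xtrain = np.concatenate((X[:i*sz], X[(i+1)*sz:]), axis=0)
    Ytrain = np.concatenate((Y[:i*sz], Y[(i+1)*sz:]), axis=0)
    model.fit(Xtrain, Ytrain)
    scores.append(model.score(Xvalid, Yvalid))
```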
For Scikit-learn (Obtaining predictions by cross-validation):
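```python
from sklearn import datasets, metrics
from sklearn.model_selection import cross_val_predict

# clf is assumed to be any scikit-learn classifier defined earlier
iris = datasets.load_iris()
predicted = cross_val_predict(clf, iris.data, iris.target, cv=10)
metrics.accuracy_score(iris.target, predicted)
```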
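
To tie cross-validation back to model complexity, here is a small sketch that compares several settings of one complexity knob and keeps the one with the best cross-validated score. The choice of `KNeighborsClassifier` on the iris data and the candidate values in `k_values` are assumptions for illustration; `cross_val_score` is scikit-learn's helper that returns one score per fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate complexity settings: smaller k means a more flexible model
k_values = [1, 3, 5, 11, 21, 51]
mean_scores = []
for k in k_values:
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, X, y, cv=5)   # accuracy on each of 5 folds
    mean_scores.append(float(scores.mean()))
    print(f"k={k}: mean CV accuracy = {mean_scores[-1]:.3f}")

best_k = k_values[int(np.argmax(mean_scores))]
print("best k by cross-validation:", best_k)
```

The same loop works for any hyperparameter that controls complexity, such as polynomial degree or tree depth.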

Supplement
ML In Action - Improving classification with the AdaBoost meta-algorithm
Bias, Variance, and Overfitting – Machine Learning Overview part 4 of 4
Scikit-learn - Selecting the best model in scikit-learn using cross-validation
Intro2ML - Ch6. Model Evaluation and Improvement - Cross Validation
