Tuesday, August 8, 2017

[Udemy] Ensemble Machine Learning in Python: Random Forest, AdaBoost - Part2

Source From Here
Bootstrap Estimates and Bagging

Bootstrap Estimation (link)
* I hinted earlier that combining several models could help us get lower bias and lower variance at the same time
* Key tool we need first: bootstrapping
* aka. resampling
* Fascinating result: same data, calculate the same thing several times, better results
* But first, let's look at bootstrap for simple parameter estimates like mean


Bootstrap Demo (link)
The sample code (bootstrap.py) shows that the bootstrap samples have mean and standard deviation values close to those of the original data set:
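Below is a minimal sketch of the idea (not the course's actual bootstrap.py; the normal data and B = 1000 resamples are my own assumptions): resample the data with replacement many times, compute the mean of each resample, and compare against the plain sample estimate.

import numpy as np

# generate some toy data (assumed normal purely for illustration)
np.random.seed(0)
X = np.random.randn(200) * 5 + 10   # true mean 10, true std 5

B = 1000  # number of bootstrap samples (assumption)
boot_means = np.empty(B)
for b in range(B):
    # draw N points with replacement from the original data
    Xb = np.random.choice(X, size=len(X), replace=True)
    boot_means[b] = Xb.mean()

print("sample mean:", X.mean())
print("bootstrap mean of means:", boot_means.mean())
print("bootstrap std of the mean (standard error):", boot_means.std())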



Bagging (link)
Bagging
* Bagging = bootstrap aggregating = application of bootstrap to ML models
* Looks exactly like bootstrapping, except instead of calculating a "theta_hat", we train a model

Training Pseudo Code
models = []
for b = 1..B:
    model = Model()
    Xb, Yb = resample(X, Y)
    model.fit(Xb, Yb)
    models.append(model)
Prediction Pseudo Code
Average if regression, vote if classification
# regression
def predict(X):
    # average over the B models (axis 0 of the stacked predictions)
    return np.mean([model.predict(X) for model in models], axis=0)
Classification is harder because we need to collect the votes. If the classifiers return class probabilities, we can just average them (a sketch follows after the voting code below).
# Naive classification
def predict_one(x):
    votes = {}
    for model in models:
        k = model.predict(x)
        votes[k] = votes.get(k, 0) + 1
    # don't sort, that's O(NlogN); just track the class with the most votes
    best_class, best_count = None, 0
    for k, v in votes.items():
        if v > best_count:
            best_count = v
            best_class = k
    return best_class
Another approach:
def predict(X):
    output = np.zeros((N, K))  # N samples, K classes
    for model in models:
        output[np.arange(N), model.predict(X)] += 1
    return output.argmax(axis=1)
For Binary Classification:
def predict(X):
    output = np.zeros(N)
    for model in models:
        output += model.predict(X)
    return np.round(output / B)
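If each base classifier exposes class probabilities (as sklearn models do via predict_proba; this is a minimal sketch under that assumption), the vote collection above can be replaced by simple averaging:

import numpy as np

def predict_proba_ensemble(models, X):
    # average the predicted class probabilities over all models
    return np.mean([model.predict_proba(X) for model in models], axis=0)

def predict(models, X):
    # the final class is the one with the highest averaged probability
    return predict_proba_ensemble(models, X).argmax(axis=1)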

Bagging Regression Trees (link)
For the sample code bagging_regression.py:
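A minimal sketch of the same idea (not the course's exact bagging_regression.py; the synthetic sine data and B = 200 trees are assumptions), bagging fully grown regression trees by hand:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(1)
N, B = 100, 200                                   # data points, number of bagged trees (assumptions)
X = np.linspace(0, 2 * np.pi, N).reshape(N, 1)
Y = np.sin(X).flatten() + np.random.randn(N) * 0.3

models = []
for b in range(B):
    idx = np.random.choice(N, size=N, replace=True)   # bootstrap sample
    model = DecisionTreeRegressor()                   # arbitrary depth -> low bias
    model.fit(X[idx], Y[idx])
    models.append(model)

# bagged prediction = average over trees
prediction = np.mean([model.predict(X) for model in models], axis=0)
print("bagged training MSE:", np.mean((prediction - Y) ** 2))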



Bagging Classification Trees (link)
For the sample code bagging_classification.py:
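Again a minimal sketch rather than the course's bagging_classification.py (the two-Gaussian toy data and B = 200 are assumptions), this time collecting binary votes as described above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

np.random.seed(2)
N, B = 500, 200
# two Gaussian blobs, labels 0 and 1 (assumed toy data)
X = np.vstack([np.random.randn(N // 2, 2) + 2, np.random.randn(N // 2, 2) - 2])
Y = np.array([0] * (N // 2) + [1] * (N // 2))

models = []
for b in range(B):
    idx = np.random.choice(N, size=N, replace=True)   # bootstrap sample
    model = DecisionTreeClassifier()
    model.fit(X[idx], Y[idx])
    models.append(model)

# binary bagged prediction: average the 0/1 votes and round
votes = np.mean([model.predict(X) for model in models], axis=0)
prediction = np.round(votes)
print("bagged training accuracy:", np.mean(prediction == Y))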



Stacking (link)
Stacking
Stacking is another way of combining models. We've assumed so far that each model's influence must be equal. How about weighting them?



Stacking is not the only way to find these weights; we'll explore another later. As usual, we want to minimize the MSE:
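A sketch of the objective in LaTeX (my own notation, with weights w_m and base models f_m, not necessarily the course's symbols): the stacked prediction is a weighted combination of the base models,

\hat{f}(x) = \sum_{m=1}^{M} w_m f_m(x)

and the weights are chosen to minimize the mean squared error on held-out predictions,

J(w) = \frac{1}{N} \sum_{n=1}^{N} \Big( y_n - \sum_{m=1}^{M} w_m f_m(x_n) \Big)^2

commonly with the constraint that the weights sum to 1.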



Random Forest

Random Forest Algorithm (link)
Recall bagging: by having many trees of arbitrary depth, we can ensure they overfit (0 bias) to their own training samples (and thus will probably be very different from each other). Is there anything else we can do to ensure decorrelation? (other than just letting each tree overfit)
* We can achieve low bias easily with trees simply by adding more nodes
* Suppose each tree in the ensemble has low bias
* If each tree has the same expected value, then the average of the trees also has that same expected value
* So the ensemble also has low bias (later we will see how we can combine trees with high bias)

How does random forest decorrelate the trees?
* Before: we randomly choose which samples to train on
* Now: We can also choose which features to train on!
* How many features do we choose? d << D


* Recommendations by inventors
* Classification: As low as 1
* Regression: As low as 5
* BEST: whatever works for your specific dataset (see the max_features sketch below)
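In sklearn, this per-split feature subsampling is exposed as the max_features parameter. A minimal sketch (the synthetic data from make_classification is an assumption, purely for illustration) of trying different values of d:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, Y = make_classification(n_samples=1000, n_features=20, random_state=0)

# d = sqrt(D) is a common choice for classification; try other values of d as well
for max_features in ['sqrt', 1, 5, None]:   # None means use all D features
    model = RandomForestClassifier(n_estimators=100, max_features=max_features)
    score = cross_val_score(model, X, Y, cv=5).mean()
    print("max_features={}: CV accuracy={:.4f}".format(max_features, score))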

Random Forest Training Pseudo Code
models = []
for b = 1..B:
    Xb, Yb = sample_with_replacement(X, Y)
    model = DecisionTree()
    while not at terminal node and not reached max_depth:
        select d features randomly
        choose best split from the d features (i.e. max information gain)
        add split to model
    models.append(model)
More about Random Forest
* Just like bagging, we need to get bootstrap sample
* Sometimes RF is called "Feature bagging"
* They are NOT ensembles of vanilla decision trees
* We've changed how they make splits
* So you can't build a random forest using the built-in decision tree class
* We won't build random forest in this course. But you already have all the skills you need and can give it a try.
* You can leverage an existing implementation such as sklearn.ensemble.RandomForestClassifier
* Big advantage: Requires very little tuning
* Can let all trees grow to arbitrary depth without incurring much penalty
* Performs well and is fast
* When people come to deep learning just in search of an easy API, I recommend random forest instead
* Neural networks have many more hyperparameters and are sensitive to those choices

Random Forest Regressor (link)
For the sample code rf_regression.py (dataset), the column names are defined first, since the raw data file doesn't contain a header:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

NUMERICAL_COLS = [
  'crim', # numerical
  'zn', # numerical
  'nonretail', # numerical
  'nox', # numerical
  'rooms', # numerical
  'age', # numerical
  'dis', # numerical
  'rad', # numerical
  'tax', # numerical
  'ptratio', # numerical
  'b', # numerical
  'lstat', # numerical
]

NO_TRANSFORM = ['river']
Then it defines a class DataTransformer for data normalization:
class DataTransformer:
  def fit(self, df):
    self.scalers = {}
    for col in NUMERICAL_COLS:
      scaler = StandardScaler()
      scaler.fit(df[col].as_matrix().reshape(-1, 1))
      self.scalers[col] = scaler

  def transform(self, df):
    N, D = df.shape
    X = np.zeros((N, D))
    i = 0
    for col, scaler in self.scalers.items():
      X[:,i] = scaler.transform(df[col].as_matrix().reshape(-1, 1)).flatten()
      i += 1
    for col in NO_TRANSFORM:
      X[:,i] = df[col]
      i += 1
    return X

  def fit_transform(self, df):
    self.fit(df)
    return self.transform(df)
Then it defines the function get_data:
def get_data():
  # regex allows arbitrary number of spaces in separator
  df = pd.read_csv('../large_files/housing.data', header=None, sep=r"\s*", engine='python')
  df.columns = [
    'crim', # numerical
    'zn', # numerical
    'nonretail', # numerical
    'river', # binary
    'nox', # numerical
    'rooms', # numerical
    'age', # numerical
    'dis', # numerical
    'rad', # numerical
    'tax', # numerical
    'ptratio', # numerical
    'b', # numerical
    'lstat', # numerical
    'medv', # numerical -- this is the target
  ]

  # transform the data
  transformer = DataTransformer()

  # shuffle the data
  N = len(df)
  train_idx = np.random.choice(N, size=int(0.7*N), replace=False)
  test_idx = [i for i in range(N) if i not in train_idx]
  df_train = df.loc[train_idx]
  df_test = df.loc[test_idx]

  Xtrain = transformer.fit_transform(df_train)
  Ytrain = np.log(df_train['medv'].as_matrix())
  Xtest = transformer.transform(df_test)
  Ytest = np.log(df_test['medv'].as_matrix())
  return Xtrain, Ytrain, Xtest, Ytest
which will:
* Load the data from the CSV file into a DataFrame
* Shuffle the data and split it into training/testing parts
* Normalize the feature values and take the log of the target (medv)

Finally, the main program:
if __name__ == '__main__':
  Xtrain, Ytrain, Xtest, Ytest = get_data()

  model = RandomForestRegressor(n_estimators=100) # try 10, 20, 50, 100, 200
  model.fit(Xtrain, Ytrain)
  predictions = model.predict(Xtest)

  if 'DISPLAY' in os.environ:
    # plot predictions vs targets
    plt.scatter(Ytest, predictions)
    plt.xlabel("target")
    plt.ylabel("prediction")
    ymin = np.round( min( min(Ytest), min(predictions) ) )
    ymax = np.ceil( max( max(Ytest), max(predictions) ) )
    print("ymin:{}; ymax:{}".format(ymin, ymax))
    r = range(int(ymin), int(ymax) + 1)
    plt.plot(r, r)
    plt.show()

    plt.plot(Ytest, label='targets')
    plt.plot(predictions, label='predictions')
    plt.legend()
    plt.show()

  # do a quick baseline test
  baseline = LinearRegression()
  single_tree = DecisionTreeRegressor()
  print("CV single tree: {}".format(cross_val_score(single_tree, Xtrain, Ytrain).mean()))
  print("CV baseline: {}".format(cross_val_score(baseline, Xtrain, Ytrain).mean()))
  print("CV forest: {}".format(cross_val_score(model, Xtrain, Ytrain).mean()))

  # test score
  single_tree.fit(Xtrain, Ytrain)
  baseline.fit(Xtrain, Ytrain)
  print("test score single tree: {}".format(single_tree.score(Xtest, Ytest)))
  print("test score baseline: {}".format(baseline.score(Xtest, Ytest)))
  print("test score forest: {}".format(model.score(Xtest, Ytest)))
The prediction result:



The corresponding cross validation and scores:
CV single tree: 0.647716136653706
CV baseline: 0.7317741767314697
CV forest: 0.7972932633216651
test score single tree: 0.7812057781690182
test score baseline: 0.8319750076781687
test score forest: 0.901392836040053


Random Forest Classifier (link)
For the sample code rf_classification.py (dataset), it first defines the indices of the categorical and numerical features:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

NUMERICAL_COLS = ()
# https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names
CATEGORICAL_COLS = np.arange(22) + 1 # 1..22 inclusive
Then comes the class for transformation/normalization:
# transforms data from dataframe to numerical matrix
# one-hot encodes categories and normalizes numerical columns
# we want to use the scales found in training when transforming the test set
# so only call fit() once
# call transform() for any subsequent data
class DataTransformer:
  def fit(self, df):
    self.labelEncoders = {}
    self.scalers = {}
    for col in NUMERICAL_COLS:
      scaler = StandardScaler()
      scaler.fit(df[col].as_matrix().reshape(-1, 1))
      self.scalers[col] = scaler

    for col in CATEGORICAL_COLS:
      encoder = LabelEncoder()
      # in case the train set does not have 'missing' value but test set does
      values = df[col].tolist()
      values.append('missing')
      encoder.fit(values)
      self.labelEncoders[col] = encoder

    # find dimensionality
    self.D = len(NUMERICAL_COLS)
    for col, encoder in self.labelEncoders.items():
      self.D += len(encoder.classes_)
    print("dimensionality: {}".format(self.D))

  def transform(self, df):
    N, _ = df.shape
    X = np.zeros((N, self.D))
    i = 0
    for col, scaler in self.scalers.items():
      X[:,i] = scaler.transform(df[col].as_matrix().reshape(-1, 1)).flatten()
      i += 1

    for col, encoder in self.labelEncoders.items():
      # print "transforming col:", col
      K = len(encoder.classes_)
      X[np.arange(N), encoder.transform(df[col]) + i] = 1
      i += K
    return X

  def fit_transform(self, df):
    self.fit(df)
    return self.transform(df)
Then it defines functions to load the data and replace missing values:
def replace_missing(df):
  # standard method of replacement for numerical columns is median
  for col in NUMERICAL_COLS:
    if np.any(df[col].isnull()):
      med = np.median(df[ col ][ df[col].notnull() ])
      df.loc[ df[col].isnull(), col ] = med

  # set a special value = 'missing'
  for col in CATEGORICAL_COLS:
    if np.any(df[col].isnull()):
      print("Column: {}".format(col))
      df.loc[ df[col].isnull(), col ] = 'missing'


def get_data():
  df = pd.read_csv('../large_files/mushroom.data', header=None)

  # replace label column: e/p --> 0/1
  # e = edible = 0, p = poisonous = 1
  df[0] = df.apply(lambda row: 0 if row[0] == 'e' else 1, axis=1)

  # check if there is missing data
  replace_missing(df)

  # transform the data
  transformer = DataTransformer()

  X = transformer.fit_transform(df)
  Y = df[0].as_matrix()
  return X, Y
Finally, the main program:
if __name__ == '__main__':
  X, Y = get_data()

  # do a quick baseline test
  baseline = LogisticRegression()
  print("CV baseline: {}".format(cross_val_score(baseline, X, Y, cv=8).mean()))

  # single tree
  tree = DecisionTreeClassifier()
  print("CV one tree: {}".format(cross_val_score(tree, X, Y, cv=8).mean()))

  model = RandomForestClassifier(n_estimators=20) # try 10, 20, 50, 100, 200
  print("CV forest: {}".format(cross_val_score(model, X, Y, cv=8).mean()))
The execution result:
dimensionality: 139
CV baseline: 0.9274806301152012
CV one tree: 0.9308194503704279
CV forest: 0.9354990108994996


Random Forest vs Bagging Trees (link)
Here we compare the performance of bagged trees and random forest (rf_vs_bag.py). First, the regression task:


(X axis is the number of trees; Y axis is the score of regression)

Then the classification task:


(X axis is the number of trees; Y axis is the accuracy of classification)
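A minimal sketch of such a comparison using sklearn's built-in ensembles (not the course's rf_vs_bag.py itself; the synthetic data and the tree counts are assumptions):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, Y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=0)

for n_trees in [10, 20, 50, 100, 200]:
    bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n_trees)
    rf = RandomForestClassifier(n_estimators=n_trees)
    bag_score = cross_val_score(bag, X, Y, cv=5).mean()
    rf_score = cross_val_score(rf, X, Y, cv=5).mean()
    print("T={}: bagging={:.4f}, random forest={:.4f}".format(n_trees, bag_score, rf_score))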

Implementing a "Not as Random" Forest (link)
Here we mix the ideas of bagging and random forest and come up with a class NotAsRandomForest, which trains multiple decision tree models on bootstrap samples (rf_vs_bag2.py); a sketch follows below.
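One way to implement it (my own reading of the idea, not necessarily identical to rf_vs_bag2.py): for each tree, pick a random subset of d features once, up front, instead of at every split, then train a fully grown tree on a bootstrap sample restricted to those columns.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class NotAsRandomForest:
    def __init__(self, n_estimators):
        self.B = n_estimators

    def fit(self, X, Y, n_features=None):
        N, D = X.shape
        d = n_features or int(np.sqrt(D))   # default d = sqrt(D) (assumption)
        self.models = []
        self.features = []
        for b in range(self.B):
            tree = DecisionTreeClassifier()
            # choose a random subset of features for this tree (not per split)
            feats = np.random.choice(D, size=d, replace=False)
            # bootstrap sample of the rows
            idx = np.random.choice(N, size=N, replace=True)
            tree.fit(X[idx][:, feats], Y[idx])
            self.models.append(tree)
            self.features.append(feats)

    def predict(self, X):
        # binary classification: average the 0/1 votes and round (majority vote)
        P = np.zeros(len(X))
        for feats, tree in zip(self.features, self.models):
            P += tree.predict(X[:, feats])
        return np.round(P / self.B)

    def score(self, X, Y):
        return np.mean(self.predict(X) == Y)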



Connection to Deep Learning: Dropout (link)
Dropout Regularization
* Each node can be "used" or "not used"
* 2 states of being for each node
* If the neural network has N nodes, then there are 2^N different possibilities


More
* Dropout "emulates" an ensemble of 2^N networks by randomly dropping nodes.
* During training: drop each node with probability p(drop)
* During prediction: don't drop anything; multiply every layer by 1 - p(drop)
* Allows you to make an ensemble without actually making an ensemble
* Similar to random forest: randomly selecting which features to look at
* If you're doing batch training, each sub-network also sees a random subset of the training samples (a small numpy sketch follows below)
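A minimal numpy sketch of the train-time vs. prediction-time difference for a single layer (p_drop and the layer sizes are assumptions; this is one hidden layer, not a full network):

import numpy as np

p_drop = 0.5                   # dropout probability (assumption)
W = np.random.randn(100, 50)   # weights of one hidden layer (assumed sizes)
x = np.random.randn(100)       # input to this layer

# training: each input node is kept with probability 1 - p_drop
mask = (np.random.rand(100) > p_drop).astype(np.float64)
train_activation = np.tanh(W.T.dot(mask * x))

# prediction: keep all nodes, but scale by 1 - p_drop so the
# expected input to the next layer matches what it saw in training
test_activation = np.tanh(W.T.dot((1 - p_drop) * x))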


Supplement
Ensemble Machine Learning in Python: Random Forest, AdaBoost - Part1
Intro2ML - Ch2. Supervised Learning - Ensembles of Decision Trees
Ensembles are methods that combine multiple machine learning models to create more powerful models. There are many models in the machine learning literature that belong to this category, but there are two ensemble models that have proven to be effective on a wide range of datasets for classification and regression, both of which use decision trees as their building block: Random Forests and Gradient Boosted Decision Trees.


