Source From Here
Bootstrap Estimates and Bagging
Bootstrap Estimation (link)
Bootstrap Demo (link)
Sample code (bootstrap.py) shows that a bootstrap sample has mean and standard deviation values close to those of the original data set:
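The full bootstrap.py isn't reproduced in this post; a minimal sketch of the idea (the sample size and distribution below are arbitrary choices of mine, not necessarily those in the course code) might look like:

    import numpy as np

    np.random.seed(0)

    # an "original" data set
    X = np.random.randn(500) * 5 + 10

    # one bootstrap sample: draw N points from X with replacement
    N = len(X)
    Xb = X[np.random.choice(N, size=N, replace=True)]

    print("original  mean: %.3f, std: %.3f" % (X.mean(), X.std()))
    print("bootstrap mean: %.3f, std: %.3f" % (Xb.mean(), Xb.std()))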
Bagging (link)
Bagging
Training Pseudo Code
    models = []
    for b in range(B):
        model = Model()
        Xb, Yb = resample(X, Y)   # bootstrap sample: draw N rows with replacement
        model.fit(Xb, Yb)
        models.append(model)
Prediction Pseudo Code
Average if regression, vote if classification:
    # regression: average the predictions of all models
    def predict(X):
        return np.mean([model.predict(X) for model in models], axis=0)
Classification is harder because we need to collect the votes (if the classifier returns class probabilities, we can just use averaging):

    # naive classification: tally the votes one sample at a time
    def predict_one(x):
        votes = {}
        for model in models:
            k = model.predict(x)
            votes[k] = votes.get(k, 0) + 1
        # find the class with the most votes (don't sort, that's O(K log K))
        best_class, best_count = None, 0
        for k, v in votes.items():
            if v > best_count:
                best_class, best_count = k, v
        return best_class
Another approach (vectorized voting over all N samples and K classes at once):

    def predict(X):
        N = len(X)
        output = np.zeros((N, K))   # K = number of classes
        for model in models:
            output[np.arange(N), model.predict(X)] += 1
        return output.argmax(axis=1)
For binary classification (labels 0/1), we can simply sum the predictions and round the average:

    def predict(X):
        output = np.zeros(len(X))
        for model in models:
            output += model.predict(X)
        return np.round(output / B)
Bagging Regression Trees (link)
For sample code, see bagging_regression.py.
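bagging_regression.py itself isn't reproduced in this post; below is a minimal sketch of a bagged tree regressor built directly from the pseudocode above (the class and parameter names are my own, not necessarily those used in the course code):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    class BaggedTreeRegressor:
        def __init__(self, n_estimators=200):
            self.n_estimators = n_estimators

        def fit(self, X, Y):
            N = len(X)
            self.models = []
            for b in range(self.n_estimators):
                # bootstrap sample: N rows drawn with replacement
                idx = np.random.choice(N, size=N, replace=True)
                model = DecisionTreeRegressor()   # full depth -> low bias, high variance
                model.fit(X[idx], Y[idx])
                self.models.append(model)
            return self

        def predict(self, X):
            # average the predictions of all trees
            return np.mean([model.predict(X) for model in self.models], axis=0)

        def score(self, X, Y):
            # R^2, to match the sklearn convention
            p = self.predict(X)
            return 1 - np.sum((Y - p)**2) / np.sum((Y - Y.mean())**2)

Each full-depth tree overfits its own bootstrap sample, and averaging the trees reduces the variance.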
Bagging Classification Trees (link)
For sample code, see bagging_classification.py.
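Similarly, a sketch of the voting version (again my own names; labels are assumed to be integers 0..K-1, as in the vectorized voting approach shown earlier):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class BaggedTreeClassifier:
        def __init__(self, n_estimators=200):
            self.n_estimators = n_estimators

        def fit(self, X, Y):
            N = len(X)
            self.K = len(set(Y))   # number of classes, assumes labels are 0..K-1
            self.models = []
            for b in range(self.n_estimators):
                idx = np.random.choice(N, size=N, replace=True)
                model = DecisionTreeClassifier()
                model.fit(X[idx], Y[idx])
                self.models.append(model)
            return self

        def predict(self, X):
            # accumulate one vote per tree, then take the majority class
            N = len(X)
            votes = np.zeros((N, self.K))
            for model in self.models:
                votes[np.arange(N), model.predict(X)] += 1
            return votes.argmax(axis=1)

        def score(self, X, Y):
            return np.mean(self.predict(X) == Y)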
Stacking (link)
Stacking
Stacking is another way of combining models. We've assumed so far that each model's influence must be equal. How about weighting them?
Stacking is not the only way to find these weights; we'll explore another later. As usual, we want to minimize the MSE:
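The formula itself didn't survive in this post; written out in my own notation, with M base models f_m and weights w_m, a common form of the objective is:

    J(w) = \sum_{n=1}^{N} \Big( y_n - \sum_{m=1}^{M} w_m f_m(x_n) \Big)^2
    \qquad \text{subject to} \quad \sum_{m=1}^{M} w_m = 1, \quad w_m \ge 0

Constraining the weights to be nonnegative and sum to 1 is one common choice; plain averaging (bagging) is recovered when all w_m = 1/M.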
Random Forest
Random Forest Algorithm (link)
Recall bagging: by growing many trees to arbitrary depth, we let each tree overfit (near-zero bias) its own bootstrap sample, so the trees will probably be quite different from each other. Is there anything else we can do to ensure decorrelation, other than just letting each tree overfit?
How does a random forest decorrelate the trees? At each split it considers only a small random subset of d features (a common default is d ≈ √D), so different trees are forced to split on different features.
Random Forest Training Pseudo Code
    models = []
    for b in range(B):
        Xb, Yb = sample_with_replacement(X, Y)
        model = DecisionTree()
        while not at terminal node and not reached max_depth:
            select d features randomly
            choose the best split from the d features (i.e. max information gain)
            add split to model
        models.append(model)
Random Forest Regressor (link)
For sample code rf_regression.py (dataset), it first imports the required packages and defines the numerical columns, since the raw data file doesn't contain column information:
    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    # all of these columns are numerical
    NUMERICAL_COLS = [
        'crim',
        'zn',
        'nonretail',
        'nox',
        'rooms',
        'age',
        'dis',
        'rad',
        'tax',
        'ptratio',
        'b',
        'lstat',
    ]

    NO_TRANSFORM = ['river']
Then it defines a class DataTransformer for data normalization:

    class DataTransformer:
        def fit(self, df):
            self.scalers = {}
            for col in NUMERICAL_COLS:
                scaler = StandardScaler()
                scaler.fit(df[col].values.reshape(-1, 1))
                self.scalers[col] = scaler

        def transform(self, df):
            N, _ = df.shape
            # 12 standardized numerical columns + 1 pass-through binary column
            D = len(NUMERICAL_COLS) + len(NO_TRANSFORM)
            X = np.zeros((N, D))
            i = 0
            for col, scaler in self.scalers.items():
                X[:, i] = scaler.transform(df[col].values.reshape(-1, 1)).flatten()
                i += 1
            for col in NO_TRANSFORM:
                X[:, i] = df[col]
                i += 1
            return X

        def fit_transform(self, df):
            self.fit(df)
            return self.transform(df)
Then it defines the API get_data, which loads the raw data, assigns the column names, shuffles and splits the rows into train/test sets, and transforms the features (the target medv is log-transformed):

    def get_data():
        # regex separator: one or more whitespace characters between columns
        df = pd.read_csv('../large_files/housing.data', header=None, sep=r"\s+", engine='python')
        df.columns = [
            'crim',       # numerical
            'zn',         # numerical
            'nonretail',  # numerical
            'river',      # binary
            'nox',        # numerical
            'rooms',      # numerical
            'age',        # numerical
            'dis',        # numerical
            'rad',        # numerical
            'tax',        # numerical
            'ptratio',    # numerical
            'b',          # numerical
            'lstat',      # numerical
            'medv',       # numerical -- this is the target
        ]

        # shuffle and split the data
        N = len(df)
        train_idx = np.random.choice(N, size=int(0.7*N), replace=False)
        test_idx = [i for i in range(N) if i not in train_idx]
        df_train = df.loc[train_idx]
        df_test = df.loc[test_idx]

        # transform the data
        transformer = DataTransformer()
        Xtrain = transformer.fit_transform(df_train)
        Ytrain = np.log(df_train['medv'].values)
        Xtest = transformer.transform(df_test)
        Ytest = np.log(df_test['medv'].values)
        return Xtrain, Ytrain, Xtest, Ytest
Finally, the main program:
    if __name__ == '__main__':
        Xtrain, Ytrain, Xtest, Ytest = get_data()

        model = RandomForestRegressor(n_estimators=100)  # try 10, 20, 50, 100, 200
        model.fit(Xtrain, Ytrain)
        predictions = model.predict(Xtest)

        if 'DISPLAY' in os.environ:
            # plot predictions vs targets
            plt.scatter(Ytest, predictions)
            plt.xlabel("target")
            plt.ylabel("prediction")
            ymin = np.round(min(min(Ytest), min(predictions)))
            ymax = np.ceil(max(max(Ytest), max(predictions)))
            print("ymin:{}; ymax:{}".format(ymin, ymax))
            r = range(int(ymin), int(ymax) + 1)
            plt.plot(r, r)  # the y = x reference line
            plt.show()

            plt.plot(Ytest, label='targets')
            plt.plot(predictions, label='predictions')
            plt.legend()
            plt.show()

        # do a quick baseline test
        baseline = LinearRegression()
        single_tree = DecisionTreeRegressor()
        print("CV single tree: {}".format(cross_val_score(single_tree, Xtrain, Ytrain).mean()))
        print("CV baseline: {}".format(cross_val_score(baseline, Xtrain, Ytrain).mean()))
        print("CV forest: {}".format(cross_val_score(model, Xtrain, Ytrain).mean()))

        # test score
        single_tree.fit(Xtrain, Ytrain)
        baseline.fit(Xtrain, Ytrain)
        print("test score single tree: {}".format(single_tree.score(Xtest, Ytest)))
        print("test score baseline: {}".format(baseline.score(Xtest, Ytest)))
        print("test score forest: {}".format(model.score(Xtest, Ytest)))
Running the script plots the predictions against the targets and prints the cross-validation and test-set scores of a single tree, a linear-regression baseline, and the forest.
Random Forest Classifier (link)
For sample code rf_classification.py (dataset), it first defines the indices of the categorical and numerical features (the mushroom data set has only categorical features):
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    NUMERICAL_COLS = ()
    # https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names
    CATEGORICAL_COLS = np.arange(22) + 1  # columns 1..22 inclusive (column 0 is the label)
Next is the class for transformation/normalization. It transforms the data from a DataFrame to a numerical matrix: it one-hot encodes the categories and normalizes the numerical columns. We want to use the scales found in training when transforming the test set, so call fit() only once, and transform() for any subsequent data:

    class DataTransformer:
        def fit(self, df):
            self.labelEncoders = {}
            self.scalers = {}
            for col in NUMERICAL_COLS:
                scaler = StandardScaler()
                scaler.fit(df[col].values.reshape(-1, 1))
                self.scalers[col] = scaler

            for col in CATEGORICAL_COLS:
                encoder = LabelEncoder()
                # in case the train set does not have the 'missing' value but the test set does
                values = df[col].tolist()
                values.append('missing')
                encoder.fit(values)
                self.labelEncoders[col] = encoder

            # find dimensionality
            self.D = len(NUMERICAL_COLS)
            for col, encoder in self.labelEncoders.items():
                self.D += len(encoder.classes_)
            print("dimensionality: {}".format(self.D))

        def transform(self, df):
            N, _ = df.shape
            X = np.zeros((N, self.D))
            i = 0
            for col, scaler in self.scalers.items():
                X[:, i] = scaler.transform(df[col].values.reshape(-1, 1)).flatten()
                i += 1
            for col, encoder in self.labelEncoders.items():
                K = len(encoder.classes_)
                X[np.arange(N), encoder.transform(df[col]) + i] = 1
                i += K
            return X

        def fit_transform(self, df):
            self.fit(df)
            return self.transform(df)
Then it defines APIs to replace missing values and load the data:

    def replace_missing(df):
        # the standard replacement for numerical columns is the median
        for col in NUMERICAL_COLS:
            if np.any(df[col].isnull()):
                med = np.median(df[col][df[col].notnull()])
                df.loc[df[col].isnull(), col] = med

        # for categorical columns, set a special value = 'missing'
        for col in CATEGORICAL_COLS:
            if np.any(df[col].isnull()):
                print("Column: {}".format(col))
                df.loc[df[col].isnull(), col] = 'missing'

    def get_data():
        df = pd.read_csv('../large_files/mushroom.data', header=None)

        # replace the label column: e/p --> 0/1
        # e = edible = 0, p = poisonous = 1
        df[0] = df.apply(lambda row: 0 if row[0] == 'e' else 1, axis=1)

        # check if there is missing data
        replace_missing(df)

        # transform the data
        transformer = DataTransformer()
        X = transformer.fit_transform(df)
        Y = df[0].values
        return X, Y
Finally, the main program compares an 8-fold cross-validated logistic-regression baseline, a single decision tree, and the random forest:

    if __name__ == '__main__':
        X, Y = get_data()

        # do a quick baseline test
        baseline = LogisticRegression()
        print("CV baseline: {}".format(cross_val_score(baseline, X, Y, cv=8).mean()))

        # single tree
        tree = DecisionTreeClassifier()
        print("CV one tree: {}".format(cross_val_score(tree, X, Y, cv=8).mean()))

        # forest
        model = RandomForestClassifier(n_estimators=20)  # try 10, 20, 50, 100, 200
        print("CV forest: {}".format(cross_val_score(model, X, Y, cv=8).mean()))
Random Forest vs Bagging Trees (link)
Here we are going to compare the performance of Bagging Trees and Random Forest (rf_vs_bag.py). First is the regression task:
(X axis: the number of trees; Y axis: the regression score)
Then the classification task:
(X axis: the number of trees; Y axis: the classification accuracy)
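rf_vs_bag.py isn't reproduced here; a rough sketch of this kind of comparison for the regression case, using scikit-learn's built-in BaggingRegressor and RandomForestRegressor on a synthetic data set as a stand-in, might look like:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_regression
    from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, Y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

    tree_counts = [5, 10, 20, 50, 100, 200]
    bag_scores, rf_scores = [], []
    for T in tree_counts:
        # BaggingRegressor uses a decision tree as its default base estimator
        bag = BaggingRegressor(n_estimators=T)
        rf = RandomForestRegressor(n_estimators=T)
        bag_scores.append(cross_val_score(bag, X, Y, cv=5).mean())
        rf_scores.append(cross_val_score(rf, X, Y, cv=5).mean())

    plt.plot(tree_counts, bag_scores, label='bagging')
    plt.plot(tree_counts, rf_scores, label='random forest')
    plt.xlabel('number of trees')
    plt.ylabel('cross-validated R^2')
    plt.legend()
    plt.show()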
Implementing a "Not as Random" Forest (link)
Here we mix the ideas of Bagging and Random Forest and come up with a class NotAsRandomForest, which trains multiple decision trees, each on its own bootstrap sample (rf_vs_bag2.py).
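A minimal sketch of what such a class might look like; the interpretation here (a random feature subset fixed per tree plus a bootstrap sample, which is "not as random" as re-drawing the features at every split, and binary labels assumed) is my own reading of the description above, not necessarily the exact code in rf_vs_bag2.py:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class NotAsRandomForest:
        def __init__(self, n_estimators=200):
            self.n_estimators = n_estimators

        def fit(self, X, Y, n_features=None):
            N, D = X.shape
            if n_features is None:
                n_features = int(np.sqrt(D))
            self.models = []
            self.features = []
            for b in range(self.n_estimators):
                # pick a random subset of features for this tree (fixed for the whole tree)
                f = np.random.choice(D, size=n_features, replace=False)
                # bootstrap sample
                idx = np.random.choice(N, size=N, replace=True)
                model = DecisionTreeClassifier()
                model.fit(X[idx][:, f], Y[idx])
                self.features.append(f)
                self.models.append(model)
            return self

        def predict(self, X):
            # average the per-tree predictions and round (binary 0/1 labels assumed)
            P = np.zeros(len(X))
            for f, model in zip(self.features, self.models):
                P += model.predict(X[:, f])
            return np.round(P / self.n_estimators)

        def score(self, X, Y):
            return np.mean(self.predict(X) == Y)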
Connection to Deep Learning: Dropout (link)
Dropout Regularization
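This part of the post has no further details, but the usual connection is that dropout randomly removes units during training, so the network behaves like an implicit ensemble of many "thinned" sub-networks whose predictions are effectively averaged at test time, which is loosely analogous to bagging. A tiny inverted-dropout sketch in plain numpy (my own example, not code from the course):

    import numpy as np

    def dropout_forward(Z, p_keep=0.8, training=True):
        # inverted dropout: zero out units at random and rescale by 1/p_keep,
        # so the expected activation is unchanged; at test time, do nothing
        if not training:
            return Z
        mask = (np.random.rand(*Z.shape) < p_keep) / p_keep
        return Z * mask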
More
Supplement
* Ensemble Machine Learning in Python: Random Forest, AdaBoost - Part1
* Intro2ML - Ch2. Supervised Learning - Ensembles of Decision Trees