Source From Here
Bootstrap Estimates and Bagging
Bootstrap Estimation (link)
Bootstrap Demo (link)
Sample code (bootstrap.py) shows that a bootstrap sample has mean and standard deviation values close to those of the original data set:
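The full bootstrap.py isn't reproduced in this post; a minimal sketch of the idea (the sample size and distribution below are arbitrary choices of mine, not necessarily those in the course code) might look like:

    import numpy as np

    np.random.seed(0)

    # an "original" data set
    X = np.random.randn(500) * 5 + 10

    # one bootstrap sample: draw N points from X with replacement
    N = len(X)
    Xb = X[np.random.choice(N, size=N, replace=True)]

    print("original  mean: %.3f, std: %.3f" % (X.mean(), X.std()))
    print("bootstrap mean: %.3f, std: %.3f" % (Xb.mean(), Xb.std()))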
Bagging (link)
Bagging
Training Pseudo Code
    models = []
    for b in range(B):
        model = Model()
        Xb, Yb = resample(X, Y)   # bootstrap sample: draw N rows with replacement
        model.fit(Xb, Yb)
        models.append(model)
Prediction Pseudo Code
Average if regression, vote if classification:
    # regression: average the predictions of all models
    def predict(X):
        return np.mean([model.predict(X) for model in models], axis=0)
Classification is harder because we need to collect the votes (if the classifier returns class probabilities, we can just use averaging):

    # naive classification: tally the votes one sample at a time
    def predict_one(x):
        votes = {}
        for model in models:
            k = model.predict(x)
            votes[k] = votes.get(k, 0) + 1
        # find the class with the most votes (don't sort, that's O(K log K))
        best_class, best_count = None, 0
        for k, v in votes.items():
            if v > best_count:
                best_class, best_count = k, v
        return best_class
Another approach (vectorized voting over all N samples and K classes at once):

    def predict(X):
        N = len(X)
        output = np.zeros((N, K))   # K = number of classes
        for model in models:
            output[np.arange(N), model.predict(X)] += 1
        return output.argmax(axis=1)
For binary classification (labels 0/1), we can simply sum the predictions and round the average:

    def predict(X):
        output = np.zeros(len(X))
        for model in models:
            output += model.predict(X)
        return np.round(output / B)
Bagging Regression Trees (link)
For sample code, see bagging_regression.py.
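bagging_regression.py itself isn't reproduced in this post; below is a minimal sketch of a bagged tree regressor built directly from the pseudocode above (the class and parameter names are my own, not necessarily those used in the course code):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    class BaggedTreeRegressor:
        def __init__(self, n_estimators=200):
            self.n_estimators = n_estimators

        def fit(self, X, Y):
            N = len(X)
            self.models = []
            for b in range(self.n_estimators):
                # bootstrap sample: N rows drawn with replacement
                idx = np.random.choice(N, size=N, replace=True)
                model = DecisionTreeRegressor()   # full depth -> low bias, high variance
                model.fit(X[idx], Y[idx])
                self.models.append(model)
            return self

        def predict(self, X):
            # average the predictions of all trees
            return np.mean([model.predict(X) for model in self.models], axis=0)

        def score(self, X, Y):
            # R^2, to match the sklearn convention
            p = self.predict(X)
            return 1 - np.sum((Y - p)**2) / np.sum((Y - Y.mean())**2)

Each full-depth tree overfits its own bootstrap sample, and averaging the trees reduces the variance.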
Bagging Classification Trees (link)
For sample code, see bagging_classification.py.
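Similarly, a sketch of the voting version (again my own names; labels are assumed to be integers 0..K-1, as in the vectorized voting approach shown earlier):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class BaggedTreeClassifier:
        def __init__(self, n_estimators=200):
            self.n_estimators = n_estimators

        def fit(self, X, Y):
            N = len(X)
            self.K = len(set(Y))   # number of classes, assumes labels are 0..K-1
            self.models = []
            for b in range(self.n_estimators):
                idx = np.random.choice(N, size=N, replace=True)
                model = DecisionTreeClassifier()
                model.fit(X[idx], Y[idx])
                self.models.append(model)
            return self

        def predict(self, X):
            # accumulate one vote per tree, then take the majority class
            N = len(X)
            votes = np.zeros((N, self.K))
            for model in self.models:
                votes[np.arange(N), model.predict(X)] += 1
            return votes.argmax(axis=1)

        def score(self, X, Y):
            return np.mean(self.predict(X) == Y)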
Stacking (link)
Stacking
Stacking is another way of combining models. We've assumed so far that each model's influence must be equal. How about weighting them?
Stacking is not the only way to find these weights; we'll explore another later. As usual, we want to minimize the MSE:
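The formula itself didn't survive in this post; written out in my own notation, with M base models f_m and weights w_m, a common form of the objective is:

    J(w) = \sum_{n=1}^{N} \Big( y_n - \sum_{m=1}^{M} w_m f_m(x_n) \Big)^2
    \qquad \text{subject to} \quad \sum_{m=1}^{M} w_m = 1, \quad w_m \ge 0

Constraining the weights to be nonnegative and sum to 1 is one common choice; plain averaging (bagging) is recovered when all w_m = 1/M.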
Random Forest
Random Forest Algorithm (link)
Recall bagging: by growing many trees to arbitrary depth, we let each tree overfit (near-zero bias) its own bootstrap sample, so the trees will probably be quite different from each other. Is there anything else we can do to ensure decorrelation, other than just letting each tree overfit?
How does a random forest decorrelate the trees? At each split it considers only a small random subset of d features (a common default is d ≈ √D), so different trees are forced to split on different features.
Random Forest Training Pseudo Code
    models = []
    for b in range(B):
        Xb, Yb = sample_with_replacement(X, Y)
        model = DecisionTree()
        while not at terminal node and not reached max_depth:
            select d features randomly
            choose the best split from the d features (i.e. max information gain)
            add split to model
        models.append(model)
Random Forest Regressor (link)
For sample code rf_regression.py (dataset), it first imports the required packages and defines the numerical columns, since the raw data file doesn't contain column information:
    import numpy as np
    import pandas as pd
    import os
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    # all of these columns are numerical
    NUMERICAL_COLS = [
        'crim',
        'zn',
        'nonretail',
        'nox',
        'rooms',
        'age',
        'dis',
        'rad',
        'tax',
        'ptratio',
        'b',
        'lstat',
    ]

    NO_TRANSFORM = ['river']
Then it defines a class DataTransformer for data normalization:

    class DataTransformer:
        def fit(self, df):
            self.scalers = {}
            for col in NUMERICAL_COLS:
                scaler = StandardScaler()
                scaler.fit(df[col].values.reshape(-1, 1))
                self.scalers[col] = scaler

        def transform(self, df):
            N, _ = df.shape
            # 12 standardized numerical columns + 1 pass-through binary column
            D = len(NUMERICAL_COLS) + len(NO_TRANSFORM)
            X = np.zeros((N, D))
            i = 0
            for col, scaler in self.scalers.items():
                X[:, i] = scaler.transform(df[col].values.reshape(-1, 1)).flatten()
                i += 1
            for col in NO_TRANSFORM:
                X[:, i] = df[col]
                i += 1
            return X

        def fit_transform(self, df):
            self.fit(df)
            return self.transform(df)
Then it defines the API get_data, which loads the raw data, assigns the column names, shuffles and splits the rows into train/test sets, and transforms the features (the target medv is log-transformed):

    def get_data():
        # regex separator: one or more whitespace characters between columns
        df = pd.read_csv('../large_files/housing.data', header=None, sep=r"\s+", engine='python')
        df.columns = [
            'crim',       # numerical
            'zn',         # numerical
            'nonretail',  # numerical
            'river',      # binary
            'nox',        # numerical
            'rooms',      # numerical
            'age',        # numerical
            'dis',        # numerical
            'rad',        # numerical
            'tax',        # numerical
            'ptratio',    # numerical
            'b',          # numerical
            'lstat',      # numerical
            'medv',       # numerical -- this is the target
        ]

        # shuffle and split the data
        N = len(df)
        train_idx = np.random.choice(N, size=int(0.7*N), replace=False)
        test_idx = [i for i in range(N) if i not in train_idx]
        df_train = df.loc[train_idx]
        df_test = df.loc[test_idx]

        # transform the data
        transformer = DataTransformer()
        Xtrain = transformer.fit_transform(df_train)
        Ytrain = np.log(df_train['medv'].values)
        Xtest = transformer.transform(df_test)
        Ytest = np.log(df_test['medv'].values)
        return Xtrain, Ytrain, Xtest, Ytest
Finally, the main program:
    if __name__ == '__main__':
        Xtrain, Ytrain, Xtest, Ytest = get_data()

        model = RandomForestRegressor(n_estimators=100)  # try 10, 20, 50, 100, 200
        model.fit(Xtrain, Ytrain)
        predictions = model.predict(Xtest)

        if 'DISPLAY' in os.environ:
            # plot predictions vs targets
            plt.scatter(Ytest, predictions)
            plt.xlabel("target")
            plt.ylabel("prediction")
            ymin = np.round(min(min(Ytest), min(predictions)))
            ymax = np.ceil(max(max(Ytest), max(predictions)))
            print("ymin:{}; ymax:{}".format(ymin, ymax))
            r = range(int(ymin), int(ymax) + 1)
            plt.plot(r, r)  # the y = x reference line
            plt.show()

            plt.plot(Ytest, label='targets')
            plt.plot(predictions, label='predictions')
            plt.legend()
            plt.show()

        # do a quick baseline test
        baseline = LinearRegression()
        single_tree = DecisionTreeRegressor()
        print("CV single tree: {}".format(cross_val_score(single_tree, Xtrain, Ytrain).mean()))
        print("CV baseline: {}".format(cross_val_score(baseline, Xtrain, Ytrain).mean()))
        print("CV forest: {}".format(cross_val_score(model, Xtrain, Ytrain).mean()))

        # test score
        single_tree.fit(Xtrain, Ytrain)
        baseline.fit(Xtrain, Ytrain)
        print("test score single tree: {}".format(single_tree.score(Xtest, Ytest)))
        print("test score baseline: {}".format(baseline.score(Xtest, Ytest)))
        print("test score forest: {}".format(model.score(Xtest, Ytest)))
Running the script plots the predictions against the targets and prints the cross-validation and test-set scores of a single tree, a linear-regression baseline, and the forest.
Random Forest Classifier (link)
For sample code rf_classification.py (dataset), it first defines the indices of the categorical and numerical features (the mushroom data set has only categorical features):
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    NUMERICAL_COLS = ()
    # https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names
    CATEGORICAL_COLS = np.arange(22) + 1  # columns 1..22 inclusive (column 0 is the label)
Next is the class for transformation/normalization. It transforms the data from a DataFrame to a numerical matrix: it one-hot encodes the categories and normalizes the numerical columns. We want to use the scales found in training when transforming the test set, so call fit() only once, and transform() for any subsequent data:

    class DataTransformer:
        def fit(self, df):
            self.labelEncoders = {}
            self.scalers = {}
            for col in NUMERICAL_COLS:
                scaler = StandardScaler()
                scaler.fit(df[col].values.reshape(-1, 1))
                self.scalers[col] = scaler

            for col in CATEGORICAL_COLS:
                encoder = LabelEncoder()
                # in case the train set does not have the 'missing' value but the test set does
                values = df[col].tolist()
                values.append('missing')
                encoder.fit(values)
                self.labelEncoders[col] = encoder

            # find dimensionality
            self.D = len(NUMERICAL_COLS)
            for col, encoder in self.labelEncoders.items():
                self.D += len(encoder.classes_)
            print("dimensionality: {}".format(self.D))

        def transform(self, df):
            N, _ = df.shape
            X = np.zeros((N, self.D))
            i = 0
            for col, scaler in self.scalers.items():
                X[:, i] = scaler.transform(df[col].values.reshape(-1, 1)).flatten()
                i += 1
            for col, encoder in self.labelEncoders.items():
                K = len(encoder.classes_)
                X[np.arange(N), encoder.transform(df[col]) + i] = 1
                i += K
            return X

        def fit_transform(self, df):
            self.fit(df)
            return self.transform(df)
Then it defines APIs to replace missing values and load the data:

    def replace_missing(df):
        # the standard replacement for numerical columns is the median
        for col in NUMERICAL_COLS:
            if np.any(df[col].isnull()):
                med = np.median(df[col][df[col].notnull()])
                df.loc[df[col].isnull(), col] = med

        # for categorical columns, set a special value = 'missing'
        for col in CATEGORICAL_COLS:
            if np.any(df[col].isnull()):
                print("Column: {}".format(col))
                df.loc[df[col].isnull(), col] = 'missing'

    def get_data():
        df = pd.read_csv('../large_files/mushroom.data', header=None)

        # replace the label column: e/p --> 0/1
        # e = edible = 0, p = poisonous = 1
        df[0] = df.apply(lambda row: 0 if row[0] == 'e' else 1, axis=1)

        # check if there is missing data
        replace_missing(df)

        # transform the data
        transformer = DataTransformer()
        X = transformer.fit_transform(df)
        Y = df[0].values
        return X, Y
Finally, the main program compares an 8-fold cross-validated logistic-regression baseline, a single decision tree, and the random forest:

    if __name__ == '__main__':
        X, Y = get_data()

        # do a quick baseline test
        baseline = LogisticRegression()
        print("CV baseline: {}".format(cross_val_score(baseline, X, Y, cv=8).mean()))

        # single tree
        tree = DecisionTreeClassifier()
        print("CV one tree: {}".format(cross_val_score(tree, X, Y, cv=8).mean()))

        # forest
        model = RandomForestClassifier(n_estimators=20)  # try 10, 20, 50, 100, 200
        print("CV forest: {}".format(cross_val_score(model, X, Y, cv=8).mean()))
Random Forest vs Bagging Trees (link)
Here we are going to compare the performance of Bagging Trees and Random Forest (rf_vs_bag.py). First is the regression task:
(X axis: the number of trees; Y axis: the regression score)
Then the classification task:
(X axis: the number of trees; Y axis: the classification accuracy)
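rf_vs_bag.py isn't reproduced here; a rough sketch of this kind of comparison for the regression case, using scikit-learn's built-in BaggingRegressor and RandomForestRegressor on a synthetic data set as a stand-in, might look like:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_regression
    from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    X, Y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

    tree_counts = [5, 10, 20, 50, 100, 200]
    bag_scores, rf_scores = [], []
    for T in tree_counts:
        # BaggingRegressor uses a decision tree as its default base estimator
        bag = BaggingRegressor(n_estimators=T)
        rf = RandomForestRegressor(n_estimators=T)
        bag_scores.append(cross_val_score(bag, X, Y, cv=5).mean())
        rf_scores.append(cross_val_score(rf, X, Y, cv=5).mean())

    plt.plot(tree_counts, bag_scores, label='bagging')
    plt.plot(tree_counts, rf_scores, label='random forest')
    plt.xlabel('number of trees')
    plt.ylabel('cross-validated R^2')
    plt.legend()
    plt.show()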
Implementing a "Not as Random" Forest (link)
Here we mix the ideas of Bagging and Random Forest and come up with a class NotAsRandomForest, which trains multiple decision trees, each on its own bootstrap sample (rf_vs_bag2.py).
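A minimal sketch of what such a class might look like; the interpretation here (a random feature subset fixed per tree plus a bootstrap sample, which is "not as random" as re-drawing the features at every split, and binary labels assumed) is my own reading of the description above, not necessarily the exact code in rf_vs_bag2.py:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class NotAsRandomForest:
        def __init__(self, n_estimators=200):
            self.n_estimators = n_estimators

        def fit(self, X, Y, n_features=None):
            N, D = X.shape
            if n_features is None:
                n_features = int(np.sqrt(D))
            self.models = []
            self.features = []
            for b in range(self.n_estimators):
                # pick a random subset of features for this tree (fixed for the whole tree)
                f = np.random.choice(D, size=n_features, replace=False)
                # bootstrap sample
                idx = np.random.choice(N, size=N, replace=True)
                model = DecisionTreeClassifier()
                model.fit(X[idx][:, f], Y[idx])
                self.features.append(f)
                self.models.append(model)
            return self

        def predict(self, X):
            # average the per-tree predictions and round (binary 0/1 labels assumed)
            P = np.zeros(len(X))
            for f, model in zip(self.features, self.models):
                P += model.predict(X[:, f])
            return np.round(P / self.n_estimators)

        def score(self, X, Y):
            return np.mean(self.predict(X) == Y)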
Connection to Deep Learning: Dropout (link)
Dropout Regularization
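This part of the post has no further details, but the usual connection is that dropout randomly removes units during training, so the network behaves like an implicit ensemble of many "thinned" sub-networks whose predictions are effectively averaged at test time, which is loosely analogous to bagging. A tiny inverted-dropout sketch in plain numpy (my own example, not code from the course):

    import numpy as np

    def dropout_forward(Z, p_keep=0.8, training=True):
        # inverted dropout: zero out units at random and rescale by 1/p_keep,
        # so the expected activation is unchanged; at test time, do nothing
        if not training:
            return Z
        mask = (np.random.rand(*Z.shape) < p_keep) / p_keep
        return Z * mask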
More
Supplement
* Ensemble Machine Learning in Python: Random Forest, AdaBoost - Part1
* Intro2ML - Ch2. Supervised Learning - Ensembles of Decision Trees