Sunday, February 5, 2017

[ Intro2ML ] Ch2. Supervised Learning - Ensembles of Decision Trees

Ensembles of Decision Trees 
Ensembles are methods that combine multiple machine learning models to create more powerful models. There are many models in the machine learning literature that belong to this category, but there are two ensemble models that have proven to be effective on a wide range of datasets for classification and regression, both of which use decision trees as their building block: Random Forests and Gradient Boosted Decision Trees. 

Random Forests 
As observed above, a main drawback of decision trees is that they tend to overfit the training data. Random forests are one way to address this problem. Random forests are essentially a collection of decision trees, where each tree is slightly different from the others. The idea of random forests is that each tree might do a relatively good job of predicting, but will likely overfit on part of the data. 

If we build many trees, all of which work well and overfit in different ways, we can reduce the amount of overfitting by averaging their results. This reduction in overfitting, while retaining the predictive power of the trees, can be shown using rigorous mathematics. 

To implement this strategy, we need to build many decision trees. Each tree should do an acceptable job of predicting the target, and should also be different from the other trees. Random forests get their name from injecting randomness into the tree building to ensure each tree is different. There are two ways in which the trees in a random forest are randomized: by selecting the data points used to build a tree, and by selecting the features in each split test. Let's go into this process in more detail. 

Building Random Forests 
To build a random forest model, you need to decide on the number of trees to build (the n_estimators parameter of RandomForestRegressor or RandomForestClassifier). Let's say we want to build ten trees. These trees will be built completely independently of each other, and the algorithm makes different random choices for each tree to make sure the trees are distinct. 

To build a tree, we first take what is called a bootstrap sample of our data. That is, from our n_samples data points, we repeatedly draw an example randomly with replacement (i.e. the same sample can be picked multiple times), n_samples times. This will create a dataset that is as big as the original dataset, but some data points will be missing from it, and some will be repeated. 

To illustrate, let's say we want to create a bootstrap sample of the list ['a', 'b', 'c', 'd']. A possible bootstrap sample would be ['b', 'd', 'd', 'c']. Another possible sample would be ['d', 'a', 'd', 'a']. 
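As a concrete illustration, here is a minimal sketch of bootstrap sampling with NumPy (for illustration only; scikit-learn performs this step internally when building each tree):
  import numpy as np

  rng = np.random.RandomState(0)
  data = np.array(['a', 'b', 'c', 'd'])
  # Draw len(data) items with replacement: some entries may repeat, others may be left out
  bootstrap_sample = rng.choice(data, size=len(data), replace=True)
  print(bootstrap_sample)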

Next, a decision tree is built based on this newly created dataset. However, the algorithm we described for the decision tree is slightly modified. Instead of looking for the best test for each node, in each node the algorithm randomly selects a subset of the features, and looks for the best possible test involving one of these features. The number of features that is selected is controlled by the max_features parameter. This selection of a subset of features is repeated separately in each node, so that each node in a tree can make a decision using a different subset of the features. 

The bootstrap sampling leads to each decision tree in the random forest being built on a slightly different dataset. Because of the selection of features in each node, each split in each tree operates on a different subset of features. Together, these two mechanisms ensure that all the trees in the random forest are different.

A critical parameter in this process is max_features. If we set max_features to n_features, that means that each split can look at all features in the dataset, and no randomness will be injected. If we set max_features to one, that means that the splits have no choice at all on which feature to test, and can only search over different thresholds for the feature that was selected randomly. 

Therefore, a high max_features means that the trees in the random forest will be quite similar, and they will be able to fit the data easily, using the most distinctive features. A low max_features means that the trees in the random forest will be quite different, and that each tree might need to be very deep in order to fit the data well. 
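To make the two extremes concrete, here is a hedged sketch of how max_features could be set when constructing the forest (the parameter names are scikit-learn's; the specific values are only examples):
  from sklearn.ensemble import RandomForestClassifier

  # max_features=None lets every split consider all features: similar trees, little extra randomness
  forest_all_features = RandomForestClassifier(n_estimators=10, max_features=None, random_state=0)
  # max_features=1 forces each split to test one randomly chosen feature: very different, deeper trees
  forest_one_feature = RandomForestClassifier(n_estimators=10, max_features=1, random_state=0)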

To make a prediction using the random forest, the algorithm first makes a prediction with every tree in the forest. For regression, we can average these results to get our final prediction. For classification, a “soft voting” strategy is used. This means each tree makes a “soft” prediction, providing a probability for each possible output label. The probabilities predicted by all the trees are averaged, and the class with the highest probability is predicted. 
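The averaging itself can be sketched by hand. The following snippet assumes a fitted RandomForestClassifier called forest and test data X_test (as in the examples below); it only illustrates what forest.predict already does internally:
  import numpy as np

  # Average the class probabilities predicted by the individual trees ...
  avg_proba = np.mean([tree.predict_proba(X_test) for tree in forest.estimators_], axis=0)
  # ... and predict the class with the highest averaged probability
  manual_prediction = forest.classes_[np.argmax(avg_proba, axis=1)]
  # This should agree with forest.predict(X_test)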

Analyzing Random Forests 
Let’s apply a random forest consisting of five trees to the two_moons data we studied earlier. 
- ch2_t21.py 
  import mglearn
  import matplotlib.pyplot as plt

  from sklearn.model_selection import train_test_split
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.datasets import make_moons

  X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
  X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
  forest = RandomForestClassifier(n_estimators=5, random_state=2)
  forest.fit(X_train, y_train)
  print("Accuracy on training set: %f" % forest.score(X_train, y_train))
  print("Accuracy on test set: %f" % forest.score(X_test, y_test))
Execution output: 
Accuracy on training set: 0.960000
Accuracy on test set: 0.920000

The trees that are built as part of the random forest are stored in the estimators_ attribute. Let’s visualize the decision boundaries learned by each tree, together with their aggregate prediction, as made by the forest: 
  # Appended to ch2_t21.py: plot the boundary of each tree and of the whole forest
  import os
  dmode = os.environ.get('DISPLAY', '')

  if dmode:
      import matplotlib.pyplot as plt
      import numpy as np
      fig, axes = plt.subplots(2, 3, figsize=(20, 10))
      for i, (ax, tree) in enumerate(zip(axes.ravel(), forest.estimators_)):
          ax.set_title("tree %d" % i)
          mglearn.plots.plot_tree_partition(X_train, y_train, tree, ax=ax)
      # The last subplot shows the aggregate prediction of the whole forest
      mglearn.plots.plot_2d_separator(forest, X_train, fill=True, ax=axes[-1, -1], alpha=.4)
      axes[-1, -1].set_title("random forest")
      plt.scatter(X_train[:, 0], X_train[:, 1], c=np.array(['b', 'r'])[y_train], s=60)
      plt.show()

You can clearly see that the decisions learned by the five trees are quite different. Each of them makes some mistakes, as some of the training points that are plotted here were not actually included in the training set of the tree, due to the bootstrap sampling. The random forest overfits less than any of the trees individually, and provides a much more intuitive decision boundary. In any real application, we would use many more trees (often hundreds or thousands), leading to even smoother boundaries. 

Let’s apply a random forest consisting of 100 trees on the breast cancer dataset: 
- ch2_t22.py 
  import mglearn
  import matplotlib.pyplot as plt

  from sklearn.model_selection import train_test_split
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.datasets import load_breast_cancer

  cancer = load_breast_cancer()
  X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

  forest = RandomForestClassifier(n_estimators=100, random_state=0)
  forest.fit(X_train, y_train)
  print("Accuracy on training set: %f" % forest.score(X_train, y_train))
  print("Accuracy on test set: %f" % forest.score(X_test, y_test))
Execution output 
Accuracy on training set: 1.000000
Accuracy on test set: 0.972028

The random forest gives us an accuracy of about 97%, better than the linear models or a single decision tree, without tuning any parameters. We could adjust the max_features setting, or apply pre-pruning as we did for the single decision tree. However, often the default parameters of the random forest already work quite well. 

Similarly to the decision tree, the random forest provides feature importances, which are computed by aggregating the feature importances over the trees in the forest. Typically the feature importances provided by the random forest are more reliable than the ones provided by a single tree. 
  # Appended to ch2_t22.py: plot the feature importances of the forest
  import os
  dmode = os.environ.get('DISPLAY', '')

  if dmode:
      import matplotlib.pyplot as plt
      import numpy as np
      plt.plot(forest.feature_importances_, 'o')
      plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
      plt.show()

As you can see, the random forest gives non-zero importance to many more features than the single tree. Similarly to the single decision tree, the random forest also gives a lot of importance to the “worst radius” feature, but it actually chooses “worst perimeter” to be the most informative feature overall. The randomness in building the random forest forces the algorithm to consider many possible explanations, with the result that the random forest captures a much broader picture of the data than a single tree. 

Strengths, weaknesses and parameters 
Random forests for regression and classification are currently among the most widely used machine learning methods. They are very powerful, often work well without heavy tuning of the parameters, and don’t require scaling of the data. Essentially, random forests share all of the benefits of decision trees, while making up for some of their deficiencies. 

One reason to still use decision trees is if you need a compact representation of the decision making process. It is basically impossible to interpret tens or hundreds of trees in detail, and trees in random forests tend to be deeper than decision trees (because of the use of feature subsets). Therefore, if you need to summarize the decision making in a visual way for non-experts, a single decision tree might be a better choice. 

While building random forests on large datasets might be somewhat time-consuming, it can easily be parallelized across multiple CPU cores within a computer. If you are using a multi-core processor (as nearly all modern computers do), you can use the n_jobs parameter to adjust the number of cores to use. Using more CPU cores will result in linear speed-ups (using two cores, the training of the random forest will be twice as fast), but specifying n_jobs larger than the number of cores will not help. You can set n_jobs=-1 to use all the cores in your computer.
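For example, a hedged sketch of parallel training, reusing X_train and y_train from ch2_t22.py:
  from sklearn.ensemble import RandomForestClassifier

  # n_jobs=-1 uses all available CPU cores; n_jobs=2 would use exactly two
  forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
  forest.fit(X_train, y_train)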

You should keep in mind that random forests, by their nature, are random, and setting different random states (or not setting the random_state at all) can drastically change the model that is built. The more trees there are in the forest, the more robust it will be against the choice of random state. If you want to have reproducible results, it is important to fix the random_state.

Random forests don’t tend to perform well on very high dimensional, sparse data, such as text data. For this kind of data, linear models might be more appropriate. Random forests usually work well even on very large datasets, and training can easily be parallelized over many CPU cores within a powerful computer. However, random forests require more memory and are slower to train and to predict than linear models. If time and memory are important in an application, it might make sense to use a linear model instead. 

The important parameters to adjust are n_estimators, max_features, and possibly pre-pruning options like max_depth. For n_estimators, larger is always better. Averaging more trees will yield a more robust ensemble. However, there are diminishing returns, and more trees need more memory and more time to train. A common rule of thumb is to build “as many as you have time / memory for”. 

As described above, max_features determines how random each tree is, and a smaller max_features reduces overfitting. The default values, and a good rule of thumb, are max_features=sqrt(n_features) for classification and max_features=log2(n_features) for regression. Adding max_features or max_leaf_nodes might sometimes improve performance. It can also drastically reduce space and time requirements for training and prediction. 

Gradient Boosted Regression Trees (Gradient Boosting Machines) 
Gradient boosted regression trees is another ensemble method that combines multiple decision trees to create a more powerful model. Despite the “regression” in the name, these models can be used for regression and classification. In contrast to random forests, gradient boosting works by building trees in a serial manner, where each tree tries to correct the mistakes of the previous one. There is no randomization in gradient boosted regression trees; instead, strong pre-pruning is used. Gradient boosted trees often use very shallow trees, of depth one to five, which often makes the model smaller in terms of memory and makes predictions faster. 

The main idea behind gradient boosting is to combine many simple models (in this context known as weak learners), like shallow trees. Each tree can only provide good predictions on part of the data, and so more and more trees are added to iteratively improve performance. Gradient boosted trees are frequently the winning entries in machine learning competitions, and are widely used in industry. They are generally a bit more sensitive to parameter settings than random forests, but can provide better accuracy if the parameters are set correctly. 

Apart from the pre-pruning and the number of trees in the ensemble, another important parameter of gradient boosting is the learning_rate, which controls how strongly each tree tries to correct the mistakes of the previous trees. A higher learning rate means each tree can make stronger corrections, allowing for more complex models. Similarly, adding more trees to the ensemble, which can be done by increasing n_estimators, also increases the model complexity, as the model has more chances to correct mistakes on the training set. 

Here is an example of using GradientBoostingClassifier on the breast cancer dataset. By default, 100 trees of maximum depth three are used, with a learning rate of 0.1. 
- ch2_t23.py 
  import mglearn
  import matplotlib.pyplot as plt

  from sklearn.model_selection import train_test_split
  from sklearn.ensemble import GradientBoostingClassifier
  from sklearn.datasets import load_breast_cancer

  cancer = load_breast_cancer()
  X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

  gbrt = GradientBoostingClassifier(random_state=0)
  gbrt.fit(X_train, y_train)
  print("Accuracy on training set: %f" % gbrt.score(X_train, y_train))
  print("Accuracy on test set: %f" % gbrt.score(X_test, y_test))
Execution output: 
Accuracy on training set: 1.000000
Accuracy on test set: 0.958042

As the training set accuracy is 100%, we are likely to be overfitting. To reduce overfitting, we could either apply stronger pre-pruning by limiting the maximum depth or lower the learning rate: 
>>> from ch2_t23 import *
>>> gbrt = GradientBoostingClassifier(random_state=0, max_depth=1)
>>> gbrt.fit(X_train, y_train)
>>> print("accuracy on training set: %f" % gbrt.score(X_train, y_train))
accuracy on training set: 0.990610
>>> print("accuracy on test set: %f" % gbrt.score(X_test, y_test))
accuracy on test set: 0.972028  # lowering max_depth reduces overfitting and improves the test accuracy
>>> gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01)  # a lower learning_rate also reduces overfitting
>>> gbrt.fit(X_train, y_train)
>>> print("accuracy on training set: %f" % gbrt.score(X_train, y_train))
accuracy on training set: 0.988263
>>> print("accuracy on test set: %f" % gbrt.score(X_test, y_test))
accuracy on test set: 0.965035 

Both methods of decreasing the model complexity decreased the training set accuracy as expected. In this case, lowering the maximum depth of the trees provided a significant improvement of the model, while lowering the learning rate only increased the generalization performance slightly. 

As with the other decision tree-based models, we can again visualize the feature importances to get more insight into our model. As we used 100 trees, it is impractical to inspect them all, even if they are all of depth 1. 
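The importances can be plotted the same way as for the random forest. This is only a sketch, assuming the gbrt model with max_depth=1 fitted in the session above and the variables from ch2_t23.py:
  import matplotlib.pyplot as plt

  plt.plot(gbrt.feature_importances_, 'o')
  plt.xticks(range(cancer.data.shape[1]), cancer.feature_names, rotation=90)
  plt.show()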

We can see that the feature importances of the gradient boosted trees are somewhat similar to the feature importances of the random forests, though the gradient boosting completely ignored some of the features. As gradient boosting and random forest perform well on similar kinds of data, a common approach is to first try random forests, which work quite robustly. If random forests work well, but prediction time is at a premium, or it is important to squeeze out the last percentage of accuracy from the machine learning model, moving to gradient boosting often helps. 

If you want to apply gradient boosting to a large scale problem, it might be worth looking into the xgboost package and its Python interface, which at the time of writing is faster (and sometimes easier to tune) than the scikit-learn implementation of gradient boosting on many datasets. 
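A minimal sketch, assuming the xgboost package is installed; XGBClassifier is its scikit-learn-style wrapper, and the parameter values below are just examples mirroring the scikit-learn defaults:
  from xgboost import XGBClassifier

  xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
  xgb.fit(X_train, y_train)   # X_train, y_train as in ch2_t23.py
  print("Accuracy on test set: %f" % xgb.score(X_test, y_test))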

Strengths, weaknesses and parameters 
Gradient boosted decision trees are among the most powerful and widely used models for supervised learning. Their main drawback is that they require careful tuning of the parameters, and may take a long time to train. Similarly to other tree-based models, the algorithm works well without scaling and on a mixture of binary and continuous features. As other tree-based models, it also often does not work well on high-dimensional sparse data. 

The main parameters of the gradient boosted tree models are the number of trees n_estimators, and the learning_rate, which controls how much each tree is allowed to correct the mistakes of the previous trees. 

These two parameters are highly interconnected, as a lower learning_rate means that more trees are needed to build a model of similar complexity. In contrast to random forests, where a higher n_estimators is always better, increasing n_estimators in gradient boosting leads to a more complex model, which may lead to overfitting. A common practice is to fit n_estimators depending on the time and memory budget, and then search over different values of learning_rate. Another important parameter is max_depth, which is usually very low for gradient boosted models, often not deeper than five splits. 
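A hedged sketch of that workflow, fixing n_estimators and searching over learning_rate with GridSearchCV (the grid values are only examples):
  from sklearn.model_selection import GridSearchCV
  from sklearn.ensemble import GradientBoostingClassifier

  param_grid = {'learning_rate': [0.001, 0.01, 0.1, 1.0]}
  grid = GridSearchCV(GradientBoostingClassifier(n_estimators=100, random_state=0),
                      param_grid, cv=5)
  grid.fit(X_train, y_train)   # X_train, y_train as in ch2_t23.py
  print(grid.best_params_)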

Supplement 
[ ML In Action ] Improving classification with the AdaBoost meta-algorithm
