Sunday, January 5, 2020

[Py DS] Ch5 - Machine Learning (Part8)

In-Depth: Decision Trees and Random Forests
Previously we have looked in depth at a simple generative classifier (naive Bayes; see “In Depth: Naive Bayes Classification”) and a powerful discriminative classifier (support vector machines; see “In-Depth: Support Vector Machines”). Here we’ll take a look at motivating another powerful algorithm—a nonparametric algorithm called random forests. Random forests are an example of an ensemble method, a method that relies on aggregating the results of an ensemble of simpler estimators. The somewhat surprising result with such ensemble methods is that the sum can be greater than the parts; that is, a majority vote among a number of estimators can end up being better than any of the individual estimators doing the voting! We will see examples of this in the following sections. We begin with the standard imports:
    %matplotlib inline

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()
Motivating Random Forests: Decision Trees
Random forests are an example of an ensemble learner built on decision trees. For this reason we’ll start by discussing decision trees themselves.

Decision trees are extremely intuitive ways to classify or label objects: you simply ask a series of questions designed to zero in on the classification. For example, if you wanted to build a decision tree to classify an animal you come across while on a hike, you might construct the one shown in Figure 5-67.


Figure 5-67. An example of a binary decision tree

The binary splitting makes this extremely efficient: in a well-constructed tree, each question will cut the number of options by approximately half, very quickly narrowing the options even among a large number of classes. The trick, of course, comes in deciding which questions to ask at each step. In machine learning implementations of decision trees, the questions generally take the form of axis-aligned splits in the data; that is, each node in the tree splits the data into two groups using a cutoff value within one of the features. Let’s now take a look at an example.
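
To make the idea of an axis-aligned split concrete, here is a small toy sketch (my own, not from the book); the feature index and the cutoff value of 2.0 are arbitrary illustrative choices:

    import numpy as np

    def axis_aligned_split(X, feature=0, threshold=2.0):
        """Toy example: send each sample left or right by comparing one feature to a cutoff."""
        mask = X[:, feature] < threshold      # True -> left child, False -> right child
        return X[mask], X[~mask]

    # Split a few 2D points on feature 0 at the (arbitrary) threshold 2.0
    X_demo = np.array([[1.0, 5.0], [3.0, 1.0], [0.5, 2.2]])
    left, right = axis_aligned_split(X_demo)

A real decision tree chooses the feature and threshold at each node by optimizing a purity criterion such as Gini impurity or entropy; the toy function above only shows the mechanics of a single split.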

Creating a decision tree
Consider the following two-dimensional data, which has one of four class labels (Figure 5-68):
    from sklearn.datasets import make_blobs
    X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='rainbow');

Figure 5-68. Data for the decision tree classifier


A simple decision tree built on this data will iteratively split the data along one or the other axis according to some quantitative criterion, and at each level assign the label of the new region according to a majority vote of points within it. Figure 5-69 presents a visualization of the first four levels of a decision tree classifier for this data.
Figure 5-69. Visualization of how the decision tree splits the data


Notice that after the first split, every point in the upper branch remains unchanged, so there is no need to further subdivide this branch. Except for nodes that contain all of one color, at each level every region is again split along one of the two features. This process of fitting a decision tree to our data can be done in Scikit-Learn with the DecisionTreeClassifier estimator:
    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier().fit(X, y)
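
As a quick sanity check (not part of the book's walkthrough), newer Scikit-Learn releases let you inspect the fitted tree directly, for example with sklearn.tree.export_text and the get_depth()/get_n_leaves() helpers:

    from sklearn.tree import export_text  # available in newer Scikit-Learn releases

    print('depth:', tree.get_depth(), ' leaves:', tree.get_n_leaves())
    print(export_text(tree, feature_names=['x0', 'x1']))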
Let’s write a quick utility function to help us visualize the output of the classifier:
    def visualize_classifier(model, X, y, ax=None, cmap='rainbow'):
        ax = ax or plt.gca()

        # Plot the training points
        ax.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=cmap,
                   clim=(y.min(), y.max()), zorder=3)
        ax.axis('tight')
        ax.axis('off')
        xlim = ax.get_xlim()
        ylim = ax.get_ylim()

        # Fit the estimator
        model.fit(X, y)
        xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                             np.linspace(*ylim, num=200))
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

        # Create a color plot with the results
        n_classes = len(np.unique(y))
        contours = ax.contourf(xx, yy, Z, alpha=0.3,
                               levels=np.arange(n_classes + 1) - 0.5,
                               cmap=cmap, clim=(y.min(), y.max()), zorder=1)

        ax.set(xlim=xlim, ylim=ylim)
Now we can examine what the decision tree classification looks like (Figure 5-70):
    visualize_classifier(DecisionTreeClassifier(), X, y)

Figure 5-70. Visualization of a decision tree classification


If you’re running this notebook live, you can use the helpers script included in the online appendix to bring up an interactive visualization of the decision tree building process (Figure 5-71):
    import helpers_05_08
    helpers_05_08.plot_tree_interactive(X, y);

Figure 5-71. First frame of the interactive decision tree widget; for the full version, see the online appendix


Notice that as the depth increases, we tend to get very strangely shaped classification regions; for example, at a depth of five, there is a tall and skinny purple region between the yellow and blue regions. It’s clear that this is less a result of the true, intrinsic data distribution, and more a result of the particular sampling or noise properties of the data. That is, this decision tree, even at only five levels deep, is clearly overfitting our data.
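
One way to see (and control) this effect, beyond the interactive widget, is to cap the tree depth with the max_depth parameter and compare the resulting boundaries side by side; this is my own quick sketch using the visualize_classifier helper defined above:

    # Compare a depth-limited tree with an unconstrained one (illustrative only)
    fig, ax = plt.subplots(1, 2, figsize=(16, 6))
    visualize_classifier(DecisionTreeClassifier(max_depth=3), X, y, ax=ax[0])
    visualize_classifier(DecisionTreeClassifier(), X, y, ax=ax[1])
    ax[0].set_title('max_depth = 3')
    ax[1].set_title('unlimited depth')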

Decision trees and overfitting
Such overfitting turns out to be a general property of decision trees; it is very easy to go too deep in the tree, and thus to fit details of the particular data rather than the overall properties of the distributions they are drawn from. Another way to see this overfitting is to look at models trained on different subsets of the data—for example, in Figure 5-72 we train two different trees, each on half of the original data.

Figure 5-72. An example of two randomized decision trees

It is clear that in some places, the two trees produce consistent results (e.g., in the four corners), while in other places, the two trees give very different classifications (e.g., in the regions between any two clusters). The key observation is that the inconsistencies tend to happen where the classification is less certain, and thus by using information from both of these trees, we might come up with a better result!
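
The book builds Figure 5-72 with a helper from the online appendix; if you want to reproduce something similar by hand, a minimal sketch (my own, assuming a simple random 50/50 split of the points) looks like this:

    # Fit one tree to each random half of the data and compare their boundaries
    split_rng = np.random.RandomState(0)
    idx = split_rng.permutation(len(X))
    half = len(X) // 2

    fig, ax = plt.subplots(1, 2, figsize=(16, 6))
    for axi, subset in zip(ax, [idx[:half], idx[half:]]):
        visualize_classifier(DecisionTreeClassifier(), X[subset], y[subset], ax=axi)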

If you are running this notebook live, the following function will allow you to interactively display the fits of trees trained on a random subset of the data (Figure 5-73):
    # helpers_05_08 is found in the online appendix
    # (https://github.com/jakevdp/PythonDataScienceHandbook)
    import helpers_05_08
    helpers_05_08.randomized_tree_interactive(X, y)

Figure 5-73. First frame of the interactive randomized decision tree widget; for the full version, see the online appendix


Just as using information from two trees improves our results, we might expect that using information from many trees would improve our results even further.

Ensembles of Estimators: Random Forests
This notion—that multiple overfitting estimators can be combined to reduce the effect of this overfitting—is what underlies an ensemble method called bagging. Bagging makes use of an ensemble (a grab bag, perhaps) of parallel estimators, each of which overfits the data, and averages the results to find a better classification. An ensemble of randomized decision trees is known as a random forest.

We can do this type of bagging classification manually using Scikit-Learn’s BaggingClassifier meta-estimator as shown here (Figure 5-74):
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import BaggingClassifier

    tree = DecisionTreeClassifier()
    bag = BaggingClassifier(tree, n_estimators=100, max_samples=0.8,
                            random_state=1)
    bag.fit(X, y)
    visualize_classifier(bag, X, y)

Figure 5-74. Decision boundaries for an ensemble of random decision trees


In this example, we have randomized the data by fitting each estimator with a random subset of 80% of the training points. In practice, decision trees are more effectively randomized when some stochasticity is injected in how the splits are chosen; this way, all the data contributes to the fit each time, but the results of the fit still have the desired randomness. For example, when determining which feature to split on, the randomized tree might select from among the top several features. You can read more technical details about these randomization strategies in the Scikit-Learn documentation and references within.
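
For instance, Scikit-Learn's ExtraTreesClassifier ("extremely randomized trees") takes this idea further by also drawing the candidate split thresholds at random rather than exhaustively searching for the optimal cutoff; a quick illustrative comparison using the same helper:

    from sklearn.ensemble import ExtraTreesClassifier

    extra_trees = ExtraTreesClassifier(n_estimators=100, random_state=0)
    visualize_classifier(extra_trees, X, y)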

In Scikit-Learn, such an optimized ensemble of randomized decision trees is implemented in the RandomForestClassifier estimator, which takes care of all the randomization automatically. All you need to do is select a number of estimators, and it will very quickly (in parallel, if desired) fit the ensemble of trees (Figure 5-75):
    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=100, random_state=0)
    visualize_classifier(model, X, y);

Figure 5-75. Decision boundaries for a random forest, which is an optimized ensemble of decision trees


We see that by averaging over 100 randomly perturbed models, we end up with an overall model that is much closer to our intuition about how the parameter space should be split.

Random Forest Regression
In the previous section we considered random forests within the context of classification. Random forests can also be made to work in the case of regression (that is, continuous rather than categorical variables). The estimator to use for this is the RandomForestRegressor, and the syntax is very similar to what we saw earlier.

Consider the following data, drawn from the combination of a fast and slow oscillation (Figure 5-76):
    rng = np.random.RandomState(42)
    x = 10 * rng.rand(200)

    def model(x, sigma=0.3):
        fast_oscillation = np.sin(5 * x)
        slow_oscillation = np.sin(0.5 * x)
        noise = sigma * rng.randn(len(x))
        return slow_oscillation + fast_oscillation + noise

    y = model(x)
    plt.errorbar(x, y, 0.3, fmt='o');

Figure 5-76. Data for random forest regression


Using the random forest regressor, we can find the best-fit curve as follows (Figure 5-77):
    from sklearn.ensemble import RandomForestRegressor

    plt.rcParams["figure.figsize"] = (15, 7)
    forest = RandomForestRegressor(200)
    forest.fit(x[:, None], y)

    xfit = np.linspace(0, 10, 1000)
    yfit = forest.predict(xfit[:, None])

    ytrue = model(xfit, sigma=0)

    plt.errorbar(x, y, 0.3, fmt='o', alpha=0.5)
    plt.plot(xfit, yfit, '-r')
    plt.plot(xfit, ytrue, '-k', alpha=0.5);


Figure 5-77. Random forest model fit to the data

Here the true model is shown by the smooth curve, while the random forest model is shown by the jagged curve. As you can see, the nonparametric random forest model is flexible enough to fit the multiperiod data, without us needing to specify a multiperiod model!
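
The book judges this fit visually; if you also want a rough quantitative check (my own addition, with arbitrary split and seed choices), you can hold out part of the data and use the regressor's default R² score:

    from sklearn.model_selection import train_test_split

    xr_train, xr_test, yr_train, yr_test = train_test_split(x[:, None], y, random_state=1)
    forest_eval = RandomForestRegressor(200, random_state=1)
    forest_eval.fit(xr_train, yr_train)
    print('held-out R^2:', forest_eval.score(xr_test, yr_test))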

Example: Random Forest for Classifying Digits
Earlier we took a quick look at the handwritten digits data (see “Introducing Scikit-Learn” on page 343). Let’s use that again here to see how the random forest classifier can be used in this context.
    from sklearn.datasets import load_digits
    digits = load_digits()
    digits.keys()
Output:
dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])

To remind us what we’re looking at, we’ll visualize the first few data points (Figure 5-78):
    # set up the figure
    fig = plt.figure(figsize=(6, 6))  # figure size in inches
    fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

    # plot the digits: each image is 8x8 pixels
    for i in range(64):
        ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
        ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
        # label the image with the target value
        ax.text(0, 7, str(digits.target[i]))


Figure 5-78. Representation of the digits data

We can quickly classify the digits using a random forest as follows:
    from sklearn.model_selection import train_test_split

    Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target, random_state=0)
    model = RandomForestClassifier(n_estimators=1000)
    model.fit(Xtrain, ytrain)
    ypred = model.predict(Xtest)
We can take a look at the classification report for this classifier:
    from sklearn import metrics
    print(metrics.classification_report(ytest, ypred))
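
If you just want a single summary number at this point (not shown in the book here), the overall test-set accuracy can be computed directly:

    from sklearn.metrics import accuracy_score

    print('accuracy:', accuracy_score(ytest, ypred))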



And for good measure, plot the confusion matrix (Figure 5-79):
    from sklearn.metrics import confusion_matrix

    mat = confusion_matrix(ytest, ypred)
    sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False)
    plt.xlabel('true label')
    plt.ylabel('predicted label');

Figure 5-79. Confusion matrix for digit classification with random forests

We find that a simple, untuned random forest results in a very accurate classification of the digits data.

Summary of Random Forests
This section contained a brief introduction to the concept of ensemble estimators, and in particular the random forest model—an ensemble of randomized decision trees. Random forests are a powerful method with several advantages:
* Both training and prediction are very fast, because of the simplicity of the underlying decision trees. In addition, both tasks can be straightforwardly parallelized, because the individual trees are entirely independent entities.
* The multiple trees allow for a probabilistic classification: a majority vote among estimators gives an estimate of the probability (accessed in Scikit-Learn with the predict_proba() method; see the short sketch after this list).
* The nonparametric model is extremely flexible, and can thus perform well on tasks that are underfit by other estimators.
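
A minimal sketch of that probabilistic output, reusing the digits model fit earlier in this section; the returned values are the averaged per-tree class probabilities:

    # Per-class probabilities for the first few held-out digits
    proba = model.predict_proba(Xtest[:5])
    print(proba.round(2))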

A primary disadvantage of random forests is that the results are not easily interpretable; that is, if you would like to draw conclusions about the meaning of the classification model, random forests may not be the best choice.

Supplement
Matplotlib - pylab_examples example code: errorbar_demo.py
FAQ - Micro Average vs Macro average Performance in a Multiclass classification setting
