Saturday, February 11, 2017

[ Intro2ML ] Ch2. Supervised Learning - Neural Networks (Deep Learning)

Neural Networks (Deep Learning) 
A family of algorithms known as neural networks has recently seen a revival under the name “deep learning”. 

While deep learning shows great promise in many machine learning applications, many deep learning algorithms are tailored very carefully to a specific use case. Here, we will only discuss some relatively simple methods, namely multilayer perceptrons for classification and regression, that can serve as a starting point for more involved deep learning methods. Multilayer perceptrons (MLPs) are also known as (vanilla) feed-forward neural networks, or sometimes just neural networks. 

The Neural Network Model 
MLPs can be viewed as generalizations of linear models which perform multiple stages of processing to come to a decision. Remember that the prediction by a linear regressor is given as: 

y = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b

In words, y is a weighted sum of the input features x[0] to x[p], weighted by the learned coefficients w[0] to w[p], plus an intercept b. Graphically, this can be drawn as a graph where each node on the left represents an input feature, the connecting lines represent the learned coefficients, and the node on the right represents the output, which is a weighted sum of the inputs. 
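To make this concrete, here is a tiny NumPy illustration of that weighted sum (the numbers for w, b, and x are made up for illustration, not learned from data): 
  import numpy as np

  # made-up coefficients, intercept, and a single sample with three features
  w = np.array([0.5, -1.2, 3.0])
  b = 0.1
  x = np.array([1.0, 2.0, 0.5])

  # y = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b
  y = np.dot(w, x) + b
  print(y)   # 0.5*1.0 - 1.2*2.0 + 3.0*0.5 + 0.1 = -0.3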

In an MLP, this process of computing weighted sums is repeated multiple times, first computing hidden units that represent an intermediate processing step, which are again combined using weighted sums, to yield the final result: 
  1. print("Figure single_hidden_layer")  
  2. mglearn.plots.plot_single_hidden_layer_graph()  
Figure single_hidden_layer 

This model has a lot more coefficients (also called weights) to learn: there is one between every input and every hidden unit (which make up the hidden layer), and one between every unit in the hidden layer and the output. Computing a series of weighted sums is mathematically the same as computing just one weighted sum, so to make this model truly more powerful than a linear model, we need one extra trick. After computing a weighted sum for each hidden unit, a non-linear function is applied to the result, usually the rectifying nonlinearity (also known as rectified linear unit or relu) or the tangens hyperbolicus (tanh). The result of this function is then used in the weighted sum that computes the output y. 

The two functions are visualized in Figure activation_functions. The relu cuts off values below zero, while tanh saturates to -1 for low input values and +1 for high input values. Either non-linear function allows the neural network to learn much more complicated functions than a linear model could. 
  line = np.linspace(-3, 3, 100)
  plt.plot(line, np.tanh(line), label="tanh")
  plt.plot(line, np.maximum(line, 0), label="relu")
  plt.legend(loc="best")
  plt.title("activation_functions")

For the small neural network pictured in Figure single_hidden_layer above, the full formula for computing y in the case of regression would be (when using a tanh nonlinearity): 

h[0] = tanh(w[0, 0] * x[0] + w[1, 0] * x[1] + w[2, 0] * x[2] + w[3, 0] * x[3])
h[1] = tanh(w[0, 1] * x[0] + w[1, 1] * x[1] + w[2, 1] * x[2] + w[3, 1] * x[3])
h[2] = tanh(w[0, 2] * x[0] + w[1, 2] * x[1] + w[2, 2] * x[2] + w[3, 2] * x[3])
y = v[0] * h[0] + v[1] * h[1] + v[2] * h[2]

Here, w are the weights between the input x and the hidden layer h, and v are the weights between the hidden layer h and the output y. The weights v and w are learned from data, x are the input features, y is the computed output, and h are intermediate computations. An important parameter that needs to be set by the user is the number of nodes in the hidden layer. It can be as small as 10 for very small or simple datasets, and as big as 10,000 for very complex data. 
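For illustration, here is a minimal NumPy sketch of this forward pass with made-up weights (bias terms are omitted, matching the formula above; the numbers are random, not learned): 
  import numpy as np

  rng = np.random.RandomState(0)
  x = rng.randn(4)       # input features x[0]..x[3]
  w = rng.randn(4, 3)    # weights between the input and the hidden layer
  v = rng.randn(3)       # weights between the hidden layer and the output

  h = np.tanh(np.dot(x, w))   # hidden units: tanh of weighted sums of the inputs
  y = np.dot(h, v)            # output: weighted sum of the hidden units
  print(y)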

It is also possible to add additional hidden layers, as in Figure two_hidden_layers below. Having large neural networks made up of many of these layers of computation is what inspired the term “deep learning”. 
  1. print("Figure two_hidden_layers")  
  2. mglearn.plots.plot_two_hidden_layer_graph()  

Figure two_hidden_layers 

Tuning Neural Networks 
Let’s look into the workings of the MLP by applying the MLPClassifier to the two_moons dataset we saw above. 
- ch2_t31.py 
  import mglearn
  import matplotlib.pyplot as plt
  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.neural_network import MLPClassifier
  from sklearn.datasets import make_moons

  X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
  X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

  mlp = MLPClassifier(random_state=0).fit(X_train, y_train)
  mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3)
  plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=60, cmap=mglearn.cm2)
  plt.show()

As you can see, the neural network learned a very nonlinear but relatively smooth decision boundary. By default, the MLP uses a single hidden layer with 100 hidden units, which is quite a lot for this small dataset. We can reduce the number (which reduces the complexity of the model) and still get a good result: 
hidden_layer_sizes : tuple, length = n_layers - 2, default (100,): The ith element represents the number of neurons in the ith hidden layer.
>>> mlp = MLPClassifier(random_state=0, hidden_layer_sizes=[10])
>>> mlp.fit(X_train, y_train)
...
>>> mlp.n_layers_    # input layer + one hidden layer + output layer
3
>>> mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3)
>>> plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=60, cmap=mglearn.cm2)
>>> plt.show()


With only 10 hidden units, the decision boundary looks somewhat more ragged. The default nonlinearity is 'relu', shown in Figure activation_functions. With a single hidden layer, this means the decision function will be made up of 10 straight line segments. If we want a smoother decision boundary, we could either add more hidden units (as in the figure above), add a second hidden layer, or use the 'tanh' nonlinearity, as in the two examples below: 
  # using two hidden layers, with 10 units each
  mlp = MLPClassifier(solver='lbfgs', random_state=0, hidden_layer_sizes=[10, 10])
  mlp.fit(X_train, y_train)
  mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3)
  plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=60, cmap=mglearn.cm2)

  # using two hidden layers, with 10 units each, now with tanh nonlinearity
  mlp = MLPClassifier(solver='lbfgs', activation='tanh',
                      random_state=0, hidden_layer_sizes=[10, 10])
  mlp.fit(X_train, y_train)
  mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3)
  plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=60, cmap=mglearn.cm2)
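Beyond the visual impression, one can also compare these variants numerically on the held-out split (a minimal sketch; it reuses the fitted mlp and the train/test split from ch2_t31.py above, and the exact numbers depend on those settings): 
  # assumes mlp, X_train, X_test, y_train, y_test from the examples above
  print("accuracy on training set: %f" % mlp.score(X_train, y_train))
  print("accuracy on test set: %f" % mlp.score(X_test, y_test))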

Finally, we can also control the complexity of a neural network by using an “l2” penalty to shrink the weights towards zero, as we did in ridge regression and the linear classifiers. The parameter for this in the MLPClassifier is alpha (as in the linear regression models), and is set to a very low value (little regularization) by default. 

Here is the effect of different values of alpha on the two_moons dataset, using two hidden layers of 10 or 100 units each: 
- ch2_t34.py 
  import mglearn
  import matplotlib.pyplot as plt
  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.neural_network import MLPClassifier
  from sklearn.datasets import make_moons

  X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
  X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

  #mlp = MLPClassifier(random_state=0).fit(X_train, y_train)
  #mlp = MLPClassifier(random_state=0, hidden_layer_sizes=[10, 10]).fit(X_train, y_train)
  fig, axes = plt.subplots(2, 4, figsize=(20, 8))
  for ax, n_hidden_nodes in zip(axes, [10, 100]):
      for axx, alpha in zip(ax, [0.0001, 0.01, 0.1, 1]):
          mlp = MLPClassifier(random_state=0, hidden_layer_sizes=[n_hidden_nodes, n_hidden_nodes], alpha=alpha)
          mlp.fit(X_train, y_train)
          mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3, ax=axx)
          axx.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=60, cmap=mglearn.cm2)
          axx.set_title("n_hidden=[%d, %d]\nalpha=%.4f" % (n_hidden_nodes, n_hidden_nodes, alpha))
  plt.show()

As you probably have realized by now, there are many ways to control the complexity of a neural network: the number of hidden layers, the number of units in each hidden layer, and the regularization (alpha). There are actually even more, which we won’t go into here. An important property of neural networks is that their weights are set randomly before learning is started, and this random initialization affects the model that is learned. That means that even when using exactly the same parameters, we can obtain very different models when using different random seeds. 

If the networks are large, and their complexity is chosen properly, this should not affect accuracy too much, but it is worth keeping in mind (particularly for smaller networks). Here are plots of several models, all learned with the same settings of the parameters: 
- ch2_t35.py 
  import mglearn
  import matplotlib.pyplot as plt
  import numpy as np
  from sklearn.model_selection import train_test_split
  from sklearn.neural_network import MLPClassifier
  from sklearn.datasets import make_moons

  X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
  X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

  #mlp = MLPClassifier(random_state=0).fit(X_train, y_train)
  #mlp = MLPClassifier(random_state=0, hidden_layer_sizes=[10, 10]).fit(X_train, y_train)
  fig, axes = plt.subplots(2, 4, figsize=(20, 8))
  for i, ax in enumerate(axes.ravel()):
      mlp = MLPClassifier(random_state=i, hidden_layer_sizes=[100, 100])
      mlp.fit(X_train, y_train)
      mglearn.plots.plot_2d_separator(mlp, X_train, fill=True, alpha=.3, ax=ax)
      ax.scatter(X_train[:, 0], X_train[:, 1], c=y_train, s=60, cmap=mglearn.cm2)
  plt.show()

To get a better understanding of neural networks on real-world data, let’s apply the MLPClassifier to the breast cancer dataset. We start with the default parameters: 
- ch2_t36.py 
  import mglearn
  import matplotlib.pyplot as plt
  import numpy as np
  from sklearn.model_selection import train_test_split

  from sklearn.neural_network import MLPClassifier
  from sklearn.datasets import load_breast_cancer

  cancer = load_breast_cancer()
  X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

  mlp = MLPClassifier(solver='lbfgs', shuffle=False)
  mlp.fit(X_train, y_train)
  print("accuracy on training set: %f" % mlp.score(X_train, y_train))
  print("accuracy on test set: %f" % mlp.score(X_test, y_test))
Execution output: 
accuracy on training set: 0.626761
accuracy on test set: 0.629371

As you can see, the result on both the training and the test set is devastatingly bad. As in the SVC example above, this is likely due to the scaling of the data. Neural networks also expect all input features to vary in a similar way, and ideally they should have a mean of zero and a variance of one. We must rescale our data so that it fulfills these requirements. Again, we will do this “by hand” here, but introduce the StandardScaler to do this automatically in Chapter 3 (Unsupervised Learning). 
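(As a preview of that class, a minimal StandardScaler sketch is shown here; the script that follows does the same rescaling by hand.) 
  from sklearn.preprocessing import StandardScaler

  # fit the scaler on the training data only, then apply the same
  # transformation to both splits (assumes X_train / X_test from above)
  scaler = StandardScaler().fit(X_train)
  X_train_scaled = scaler.transform(X_train)
  X_test_scaled = scaler.transform(X_test)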
- ch2_t37.py 
  import mglearn
  import matplotlib.pyplot as plt
  import numpy as np
  from sklearn.model_selection import train_test_split

  from sklearn.neural_network import MLPClassifier
  from sklearn.datasets import load_breast_cancer

  cancer = load_breast_cancer()
  X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)

  # compute the mean value per feature on the training set
  mean_on_train = X_train.mean(axis=0)
  # compute the standard deviation of each feature on the training set
  std_on_train = X_train.std(axis=0)
  # subtract the mean, scale by inverse standard deviation
  # afterwards, mean=0 and std=1
  X_train_scaled = (X_train - mean_on_train) / std_on_train
  # use THE SAME transformation (using training mean and std) on the test set
  X_test_scaled = (X_test - mean_on_train) / std_on_train

  mlp = MLPClassifier(solver='adam', shuffle=False, random_state=0)
  mlp.fit(X_train_scaled, y_train)
  print("accuracy on training set: %f" % mlp.score(X_train_scaled, y_train))
  print("accuracy on test set: %f" % mlp.score(X_test_scaled, y_test))
Execution output: 
...ConvergenceWarning: Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet...
accuracy on training set: 0.990610
accuracy on test set: 0.965035

The results are much better after scaling, and already quite competitive. We got a warning from the model, though, telling us that the maximum number of iterations has been reached. This is part of the adam algorithm for learning the model, and tells us that we should increase the number of iterations: 
>>> from ch2_t37 import *
max_iter : int, optional, default 200 - Maximum number of iterations.
>>> mlp = MLPClassifier(max_iter=1000, random_state=0)
>>> mlp.fit(X_train_scaled, y_train)
>>> print("accuracy on training set: %f" % mlp.score(X_train_scaled, y_train))
accuracy on training set: 0.992958
>>> print("accuracy on test set: %f" % mlp.score(X_test_scaled, y_test))
accuracy on test set: 0.952028

Increasing the number of iterations only increased the training set performance, but not the generalization performance. Still, the model is performing quite well. As there is some gap between the training and the test performance, we might try to decrease the model complexity to get better generalization performance. Here, we choose to increase the alpha parameter (quite aggressively, from 0.0001 to 1), to add stronger regularization of the weights. 
>>> mlp = MLPClassifier(max_iter=1000, alpha=1, random_state=0)
>>> mlp.fit(X_train_scaled, y_train)
>>> print("accuracy on training set: %f" % mlp.score(X_train_scaled, y_train))
accuracy on training set: 0.988263
>>> print("accuracy on test set: %f" % mlp.score(X_test_scaled, y_test))
accuracy on test set: 0.972028

This leads to a performance on par with the best models so far. (Footnote: You might have noticed at this point that many of the well-performing models achieved exactly the same accuracy of 0.972. This means that all of the models make exactly the same number of mistakes, which is four. If you compare the actual predictions, you can even see that they make exactly the same mistakes! This might be either a consequence of the dataset being very small, or it may be because these points are really different from the rest.)

While it is possible to analyze what a neural network has learned, this is usually much trickier than analyzing a linear model or a tree-based model. One way to introspect what was learned is to look at the weights in the model. You can see an example of this in the scikit-learn example gallery on the website. For the breast cancer dataset, this might be a bit hard to understand. The plot below shows the weights that were learned connecting the input to the first hidden layer. 
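(The original figure is not reproduced here; the following is a minimal sketch of how such a heatmap can be generated, assuming the fitted mlp and the cancer dataset from ch2_t37.py above. mlp.coefs_[0] holds the weight matrix between the input and the first hidden layer.) 
  import matplotlib.pyplot as plt

  # rows: the 30 input features, columns: the hidden units
  plt.figure(figsize=(20, 5))
  plt.imshow(mlp.coefs_[0], interpolation='none', cmap='viridis')
  plt.yticks(range(30), cancer.feature_names)
  plt.xlabel("Columns in weight matrix")
  plt.ylabel("Input feature")
  plt.colorbar()
  plt.show()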


One possible inference we can make is that features that have very small weights for all of the hidden units are “less important” to the model. We can see that “mean smoothness” and “mean compactness” in addition to the features found between “smoothness error” and “fractal dimension error” have relatively low weights compared to other features. This could mean that these are less important features, or, possibly, that we didn’t represent them in a way that the neural network could use. 

While the MLPClassifier and MLPRegressor provide easy-to-use interfaces for the most common neural network architectures, they only capture a small subset of what is possible with neural networks. If you are interested in working with more flexible or larger models, we encourage you to look beyond scikit-learn into the fantastic deep learning libraries that are out there. For Python users, the most well-established are Keras, Lasagne, and TensorFlow. Keras and Lasagne both build on the Theano library. 

These libraries provide a much more flexible interface to build neural networks, and track the rapid progress in deep learning research. All of the popular deep learning libraries also allow the use of high-performance graphics processing units (GPUs), which scikit-learn does not support. Using GPUs allows computations to be accelerated by factors of 10x to 100x, and they are essential for applying deep learning methods to large-scale datasets. 

Strengths, weaknesses and parameters 
Neural networks have re-emerged as state of the art models in many applications of machine learning. One of their main advantages is that they are able to capture information contained in large amounts of data and build incredibly complex models. Given enough computation time, data, and careful tuning of the parameters, neural networks often beat other machine learning algorithms (for classification and regression tasks). 

This brings us to the downsides; neural networks, in particular the large and powerful ones, often take a long time to train. They also require careful preprocessing of the data, as we saw above. Similarly to SVMs, they work best with “homogeneous” data, where all the features have similar meanings. For data that has very different kinds of features, tree-based models might work better. 

Tuning neural network parameters is also an art unto itself. In our experiments above, we barely scratched the surface of possible ways to adjust neural network models, and of how to train them. 

Estimating complexity in neural networks 
The most important parameters are the number of layers and the number of hidden units per layer. You should start with one or two hidden layers, and possibly expand from there. The number of nodes per hidden layer is often around the number of input features, but rarely higher than the low to mid thousands. 

A helpful measure when thinking about model complexity of a neural network is the number of weights or coefficients that are learned. If you have a binary classification dataset with 100 features, and you have 100 hidden units, then there are 100 * 100 = 10,000 weights between the input and the first hidden layer. There are also 100 * 1 = 100 weights between the hidden layer and the output layer, for a total of around 10,100 weights. If you add a second hidden layer with 100 hidden units, there will be another 100 * 100 = 10,000 weights from the first hidden layer to the second hidden layer, resulting in a total of 20,100 weights. 

If instead you use one layer with 1,000 hidden units, you are learning 100 * 1000 = 100,000 weights from the input to the hidden layer, and 1000 * 1 = 1,000 weights from the hidden layer to the output layer, for a total of 101,000. If you add a second hidden layer with 1,000 units, you add 1000 * 1000 = 1,000,000 weights, for a whopping 1,101,000, which is roughly 50 times larger than the model with two hidden layers of size 100. 
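A quick way to redo this arithmetic for any architecture is a small helper like the one below (a sketch, not part of scikit-learn; bias terms are ignored, as in the counts above): 
  def n_weights(n_features, hidden_layer_sizes, n_outputs=1):
      # multiply the sizes of consecutive layers and sum the products
      sizes = [n_features] + list(hidden_layer_sizes) + [n_outputs]
      return sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))

  print(n_weights(100, [100]))          # 10100
  print(n_weights(100, [100, 100]))     # 20100
  print(n_weights(100, [1000]))         # 101000
  print(n_weights(100, [1000, 1000]))   # 1101000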

A common way to adjust parameters in a neural network is to first create a network that is large enough to overfit, making sure that the task can actually be learned by the network. Once you know the training data can be learned, either shrink the network or increase alpha to add regularization, which will improve generalization performance. During our experiments above, we focused mostly on the definition of the model: the number of layers and nodes per layer, the regularization, and the nonlinearity. These define the model we want to learn. There is also the question of how to learn the model, or the algorithm that is used for learning of the parameters, which is set using the solver parameter. 

There are two easy-to-use choices for the algorithm. The default is 'adam', which works well in most situations but is quite sensitive to the scaling of the data (so it is important to always scale your data to zero mean and unit variance). The other one is 'lbfgs', which is quite robust, but might take a long time on larger models or larger datasets. There is also the more advanced 'sgd' option, which is what many deep learning researchers use. The 'sgd' option comes with many additional parameters that need to be tuned for best results. You can find all of these parameters and their definitions in the user guide. When starting to work with MLPs, we recommend sticking to 'adam' and 'lbfgs'. 
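For reference, here is a hedged sketch of what reaching for 'sgd' might look like; learning_rate_init and momentum are real MLPClassifier parameters, but the values below are arbitrary starting points, not recommendations: 
  from sklearn.neural_network import MLPClassifier

  # illustrative only: 'sgd' exposes extra knobs such as the initial
  # learning rate and the momentum term
  mlp_sgd = MLPClassifier(solver='sgd', learning_rate_init=0.01, momentum=0.9,
                          max_iter=1000, random_state=0)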

Supplement 
[TensorFlow] Tutorials 01 - Simple Linear Model 
[ NNF For Java ] Constructing Neural Networks in Java (Ch4)

