The next type of supervised model we will discuss is kernelized support vector machines (SVMs). We already saw linear support vector machines for classification in the linear model section. Kernelized support vector machines (often just referred to as SVMs) are an extension that allows for more complex models that are not defined simply by hyperplanes in the input space. While there are support vector machines for classification and regression, we will restrict ourselves to the classification case, as implemented in SVC. Similar concepts apply to support vector regression, as implemented in SVR.
The math behind kernelized support vector machines is a bit involved, and is mostly beyond the scope of this book. However, we will try to give you some intuitions about the idea behind the method.
Linear Models and Non-linear Features
As you saw in Figure linear_classifiers, linear models can be quite limiting in low-dimensional spaces, as lines or hyperplanes have limited flexibility. One way to make a linear model more flexible is by adding more features, for example by adding interactions or polynomials of the input features. Let’s look at the synthetic dataset we used in Figure tree_not_monotone:
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_blobs
- X, y = make_blobs(centers=4, random_state=8)
- # re-label the four blobs into two classes
- y = y % 2
- plt.scatter(X[:, 0], X[:, 1], c=y, s=60, cmap=mglearn.cm2)
- plt.xlabel("feature1")
- plt.ylabel("feature2")
A linear model for classification can only separate points using a line, and will not be able to do a very good job on this dataset:
- ch2_t24.py
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.datasets import make_blobs
- from sklearn.svm import LinearSVC
- X, y = make_blobs(centers=4, random_state=8)
- y = y % 2
- # fit a linear SVM to the two-class blob data
- linear_svm = LinearSVC().fit(X, y)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     mglearn.plots.plot_2d_separator(linear_svm, X)
-     plt.scatter(X[:, 0], X[:, 1], c=y, s=60, cmap=mglearn.cm2)
-     plt.xlabel("feature1")
-     plt.ylabel("feature2")
-     plt.show()
Now, let’s expand the set of input features, say by also adding feature2 ** 2, the square of the second feature, as a new feature. Instead of representing each data point as a two-dimensional point (feature1, feature2), we now represent it as a three-dimensional point (feature1, feature2, feature2 ** 2) (Footnote: We picked this particular feature to add for illustration purposes. The choice is not particularly important.). This new representation is illustrated below in a three-dimensional scatter plot:
- # add the squared second feature
- X_new = np.hstack([X, X[:, 1:] ** 2])
- from mpl_toolkits.mplot3d import Axes3D, axes3d
- figure = plt.figure()
- # visualize in 3D
- ax = Axes3D(figure, elev=-152, azim=-26)
- ax.scatter(X_new[:, 0], X_new[:, 1], X_new[:, 2], c=y, cmap=mglearn.cm2, s=60)
- ax.set_xlabel("feature1")
- ax.set_ylabel("feature2")
- ax.set_zlabel("feature2 ** 2")
In the new, three-dimensional representation of the data, it is now possible to separate the red and the blue points using a linear model. We can confirm this by fitting a linear model to the augmented data:
- ch2_t25.py
- import mglearn
- import matplotlib.pyplot as plt
- from mpl_toolkits.mplot3d import Axes3D, axes3d
- import numpy as np
- from sklearn.datasets import make_blobs
- from sklearn.svm import LinearSVC
- X, y = make_blobs(centers=4, random_state=8)
- y = y % 2
- # add the squared second feature as a third dimension
- X_new = np.hstack([X, X[:, 1:] ** 2])
- # fit a linear SVM to the augmented, three-dimensional data
- linear_svm_3d = LinearSVC().fit(X_new, y)
- coef, intercept = linear_svm_3d.coef_.ravel(), linear_svm_3d.intercept_
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # show the linear decision boundary (a plane) in 3D
-     figure = plt.figure()
-     ax = Axes3D(figure, elev=-152, azim=-26)
-     xx = np.linspace(X_new[:, 0].min(), X_new[:, 0].max(), 50)
-     yy = np.linspace(X_new[:, 1].min(), X_new[:, 1].max(), 50)
-     XX, YY = np.meshgrid(xx, yy)
-     ZZ = (coef[0] * XX + coef[1] * YY + intercept) / -coef[2]
-     ax.scatter(X_new[:, 0], X_new[:, 1], X_new[:, 2], c=y, cmap=mglearn.cm2, s=60)
-     ax.plot_surface(XX, YY, ZZ, rstride=8, cstride=8, alpha=0.3)
-     ax.set_xlabel("feature1")
-     ax.set_ylabel("feature2")
-     ax.set_zlabel("feature2 ** 2")
-     plt.show()
As a function of the original two features, the linear SVM model is actually not linear anymore: its decision boundary is not a line, but rather an ellipse, as you can see when projecting it back down onto the original two features:
- ch2_t26.py
- import mglearn
- import matplotlib.pyplot as plt
- import numpy as np
- from sklearn.datasets import make_blobs
- from sklearn.svm import LinearSVC
- X, y = make_blobs(centers=4, random_state=8)
- y = y % 2
- X_new = np.hstack([X, X[:, 1:] ** 2])
- linear_svm_3d = LinearSVC().fit(X_new, y)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # show the decision boundary as a function of the original two features
-     xx = np.linspace(X_new[:, 0].min(), X_new[:, 0].max(), 50)
-     yy = np.linspace(X_new[:, 1].min(), X_new[:, 1].max(), 50)
-     XX, YY = np.meshgrid(xx, yy)
-     # the third feature is the square of the second one
-     ZZ = YY ** 2
-     dec = linear_svm_3d.decision_function(np.c_[XX.ravel(), YY.ravel(), ZZ.ravel()])
-     plt.contourf(XX, YY, dec.reshape(XX.shape), levels=[dec.min(), 0, dec.max()],
-                  cmap=mglearn.cm2, alpha=0.5)
-     plt.scatter(X[:, 0], X[:, 1], c=y, s=60, cmap=mglearn.cm2)
-     plt.xlabel("feature1")
-     plt.ylabel("feature2")
-     plt.show()
The Kernel Trick
The lesson here is that adding non-linear features to the representation of our data can make linear models much more powerful. However, often we don’t know which features to add, and adding many features (like all possible interactions in a 100-dimensional feature space) might make computation very expensive. Luckily, there is a clever mathematical trick that allows us to learn a classifier in a higher-dimensional space without actually computing the new, possibly very large representation. This trick is known as the kernel trick.
The kernel trick works by directly computing the distance (more precisely, the scalar products) of the data points for the expanded feature representation, without ever actually computing the expansion. There are two ways to map your data into a higher dimensional space that are commonly used with support vector machines: the polynomial kernel, which computes all possible polynomials up to a certain degree of the original features (like feature1 ** 2 * feature2 ** 5), and the radial basis function (rbf) kernel, also known as Gaussian kernel.
The Gaussian kernel is a bit harder to explain, as it corresponds to an infinite dimensional feature space. One way to explain the Gaussian kernel is that it considers all possible polynomials of all degrees, but the importance of the features decreases for higher degrees. (Footnote: this follows from the Taylor expansion of the exponential map).
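To make this idea more concrete, here is a small numerical sketch (not from the book’s code) comparing an explicit degree-2 polynomial feature expansion with the equivalent kernel computation. The particular feature map phi used here is an assumption chosen purely for illustration of how a scalar product in the expanded space can be computed directly from the original points:

- import numpy as np
- def phi(x):
-     # explicit degree-2 polynomial feature map for a 2d point
-     # (this particular map is chosen for illustration only)
-     x1, x2 = x
-     return np.array([x1 ** 2, x2 ** 2,
-                      np.sqrt(2) * x1 * x2,
-                      np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])
- x = np.array([1.0, 2.0])
- z = np.array([3.0, 0.5])
- # scalar product computed in the six-dimensional expanded space ...
- explicit = np.dot(phi(x), phi(z))
- # ... equals the degree-2 polynomial kernel evaluated on the original 2d points
- kernel = (np.dot(x, z) + 1) ** 2
- print(explicit, kernel)  # both are 25.0

The kernel evaluation never forms the six-dimensional vectors, which is exactly what makes very high (or infinite) dimensional feature spaces tractable.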
If all of this is too much math talk for you, don’t worry. You can still use SVMs without trying to imagine infinite-dimensional feature spaces. In practice, how an SVM with an rbf kernel makes a decision can be summarized quite easily.
Understanding SVMs
During training, the SVM learns how important each of the training data points is to represent the decision boundary between the two classes. Typically only a subset of the training points matter for defining the decision boundary: the ones that lie on the border between the classes. These are called support vectors and give the support vector machine its name.
To make a prediction for a new point, the distance to each of the support vectors is measured. A classification decision is made based on the distances to the support vectors, and the importance of the support vectors that was learned during training (stored in the dual_coef_ attribute of SVC). The distance between data points is measured by the Gaussian kernel:

k_rbf(x1, x2) = exp(-gamma * ||x1 - x2||^2)

Here, x1 and x2 are data points, ||x1 - x2|| denotes their Euclidean distance, and gamma is a parameter that controls the width of the Gaussian kernel.
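The following sketch (not part of the book’s code; dataset and parameter values are chosen arbitrarily) illustrates this prediction rule by recomputing the decision value for one point from the fitted support vectors, their learned weights in dual_coef_, and the intercept, and comparing it to SVC’s own decision_function:

- import numpy as np
- from sklearn.datasets import make_blobs
- from sklearn.svm import SVC
- X, y = make_blobs(centers=2, random_state=8)
- svm = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)
- new_point = X[0]
- # Gaussian kernel similarity of the new point to each support vector,
- # using the same gamma=0.1 that was passed to SVC above
- sq_dist = np.sum((svm.support_vectors_ - new_point) ** 2, axis=1)
- similarities = np.exp(-0.1 * sq_dist)
- # weighted sum of similarities plus offset gives the decision value
- manual = np.dot(svm.dual_coef_[0], similarities) + svm.intercept_[0]
- print(manual, svm.decision_function([new_point])[0])  # the two values agree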
Below is the result of training a support vector machine on a two-dimensional, two-class dataset. The decision boundary is shown in black, and the support vectors are the points with wide black circles.
In this case, the SVM yields a very smooth and non-linear (not a straight line) boundary. There are two parameters we adjusted here: the C parameter and the gamma parameter, which we will now discuss in detail.
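A figure like this can be produced with a few lines of code. The following is a sketch, assuming the mglearn helpers used elsewhere in this chapter; it is not necessarily the exact script behind the printed figure:

- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.svm import SVC
- X, y = mglearn.tools.make_handcrafted_dataset()
- svm = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)
- mglearn.plots.plot_2d_separator(svm, X, eps=.5)
- plt.scatter(X[:, 0], X[:, 1], c=y, s=60, cmap=mglearn.cm2)
- # the support vectors are stored in the support_vectors_ attribute;
- # draw them with wide circles, as in the figure described above
- sv = svm.support_vectors_
- plt.scatter(sv[:, 0], sv[:, 1], s=200, facecolors='none',
-             edgecolors='k', linewidths=3, zorder=10)
- plt.xlabel("feature1")
- plt.ylabel("feature2")
- plt.show()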
Tuning SVM parameters
The gamma (γ) parameter is the one shown in the formula above, and it controls the width of the Gaussian kernel. It determines the scale of what it means for points to be close together. The C parameter is a regularization parameter, similar to the one used in the linear models. It limits the importance of each point (or more precisely, their dual_coef_).
Let’s have a look at what happens when we vary these parameters:
- ch2_t28.py
- import mglearn
- import matplotlib.pyplot as plt
- from sklearn.svm import SVC
- # fit an rbf kernel SVM on the small handcrafted dataset from above
- X, y = mglearn.tools.make_handcrafted_dataset()
- svm = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)
- import os
- dmode = os.environ.get('DISPLAY', '')
- if dmode:
-     # plot a 3x3 grid of decision boundaries for varying C and gamma
-     # (the values passed are log10 of the actual parameter values)
-     fig, axes = plt.subplots(3, 3, figsize=(15, 10))
-     for ax, C in zip(axes, [-1, 0, 3]):
-         for a, gamma in zip(ax, range(-1, 2)):
-             mglearn.plots.plot_svm(log_C=C, log_gamma=gamma, ax=a)
-     plt.show()
Going from left to right, we increase the parameter gamma from 0.1 to 10. A small gamma means a large radius for the Gaussian kernel, which means that many points are considered close by. This is reflected in very smooth decision boundaries on the left, and boundaries that focus more on single points further to the right. A low value of gamma means that the decision boundary will vary slowly, which yields a model of low complexity, while a high value of gamma yields a more complex model.
Going from top to bottom, we increase the C parameter from 0.1 to 1000. As with the linear models, a small C means a very restricted model, where each data point can only have very limited influence. You can see that in the top left the decision boundary looks nearly linear, with the misclassified red and blue points barely changing the line. Increasing C, as shown on the bottom right, allows these points to have a stronger influence on the model, and makes the decision boundary bend to correctly classify them.
Let’s apply the rbf kernel SVM to the breast cancer dataset. By default, C=1 and gamma=1./n_features.
- ch2_t29.py
- from sklearn.model_selection import train_test_split
- from sklearn.svm import SVC
- from sklearn.datasets import load_breast_cancer
- cancer = load_breast_cancer()
- X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
- # fit an rbf kernel SVC with default parameters
- svc = SVC().fit(X_train, y_train)
- print("accuracy on training set: %f" % svc.score(X_train, y_train))
- print("accuracy on test set: %f" % svc.score(X_test, y_test))
The model overfits quite substantially, with a perfect score on the training set and only 62% accuracy on the test set. While SVMs often perform quite well, they are very sensitive to the settings of the parameters, and to the scaling of the data. In particular, they require all the features to vary on a similar scale. Let’s look at the minimum and maximum values for each feature, plotted in log-space:
- # plot the per-feature minimum and maximum of the training set on a log scale
- plt.plot(X_train.min(axis=0), 'o', label="min")
- plt.plot(X_train.max(axis=0), 'o', label="max")
- plt.legend(loc="best")
- plt.xlabel("feature index")
- plt.ylabel("feature magnitude")
- plt.yscale("log")
From this plot we can see that the features in the breast cancer dataset are of completely different orders of magnitude. This can be somewhat of a problem for other models (like linear models), but it has devastating effects for the kernel SVM.
Preprocessing Data for SVMs
One way to resolve this problem is by rescaling each feature, so that they are all approximately on the same scale. A common rescaling method for kernel SVMs is to scale the data such that all features are between zero and one. We will see how to do this using the MinMaxScaler preprocessing method in Chapter 3 (Unsupervised Learning), where we’ll give more details.
For now, let’s do this “by hand”:
- ch2_t30.py
- from sklearn.model_selection import train_test_split
- from sklearn.svm import SVC
- from sklearn.datasets import load_breast_cancer
- cancer = load_breast_cancer()
- X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0)
- # Compute the minimum value per feature on the training set
- min_on_training = X_train.min(axis=0)
- # Compute the range of each feature (max - min) on the training set
- range_on_training = (X_train - min_on_training).max(axis=0)
- # subtract the min, divide by range
- # afterwards min=0 and max=1 for each feature
- X_train_scaled = (X_train - min_on_training) / range_on_training
- print("Minimum for each feature\n%s" % X_train_scaled.min(axis=0))
- print("Maximum for each feature\n%s" % X_train_scaled.max(axis=0))
- # use THE SAME transformation on the test set,
- # using min and range computed on the training set
- X_test_scaled = (X_test - min_on_training) / range_on_training
- # refit the SVC with default parameters on the rescaled data
- svc = SVC().fit(X_train_scaled, y_train)
- print("accuracy on training set: %f" % svc.score(X_train_scaled, y_train))
- print("accuracy on test set: %f" % svc.score(X_test_scaled, y_test))
Scaling the data made a huge difference! Now we are actually in an underfitting regime, where training and test set performance are quite similar. From here, we can try increasing either C or gamma to fit a more complex model:
- # increase C to fit a more complex model on the rescaled data
- svc = SVC(C=1000)
- svc.fit(X_train_scaled, y_train)
- print("accuracy on training set: %f" % svc.score(X_train_scaled, y_train))
- print("accuracy on test set: %f" % svc.score(X_test_scaled, y_test))
Here, increasing C allows us to improve the model significantly, resulting in 97.2% accuracy.
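The manual rescaling above computes exactly what the MinMaxScaler mentioned earlier does. As a sketch of the equivalent workflow (the scaler itself is covered in detail in Chapter 3), the whole procedure could also be written like this:

- from sklearn.preprocessing import MinMaxScaler
- from sklearn.model_selection import train_test_split
- from sklearn.svm import SVC
- from sklearn.datasets import load_breast_cancer
- cancer = load_breast_cancer()
- X_train, X_test, y_train, y_test = train_test_split(
-     cancer.data, cancer.target, random_state=0)
- # fit the scaler on the training data only, then apply the same
- # transformation to both training and test set
- scaler = MinMaxScaler().fit(X_train)
- X_train_scaled = scaler.transform(X_train)
- X_test_scaled = scaler.transform(X_test)
- svc = SVC(C=1000).fit(X_train_scaled, y_train)
- print("accuracy on training set: %f" % svc.score(X_train_scaled, y_train))
- print("accuracy on test set: %f" % svc.score(X_test_scaled, y_test))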
Strengths, weaknesses and parameters
Kernelized support vector machines are very powerful models and perform very well on a variety of datasets.
SVMs allow for very complex decision boundaries, even if the data has only a few features. SVMs work well on low-dimensional and high-dimensional data (i.e. few and many features), but don’t scale very well with the number of samples. Running on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage.
Another downside of SVMs is that they require careful preprocessing of the data and tuning of the parameters.
For this reason, SVMs have been replaced by tree-based models such as random forests (that require little or no preprocessing) in many applications. Furthermore, SVM models are hard to inspect; it can be difficult to understand why a particular prediction was made, and it might be tricky to explain the model to a non-expert.
Still, it might be worth trying SVMs, particularly if all of your features represent measurements in similar units (e.g. all pixel intensities) and are on similar scales.
The important parameters in kernel SVMs are the regularization parameter C, the choice of the kernel, and the kernel-specific parameters. We only talked about the rbf kernel in any depth above, but other choices are available in scikit-learn. The rbf kernel has only one parameter, gamma, which is the inverse of the width of the Gaussian kernel. gamma and C both control the complexity of the model, with large values in either resulting in a more complex model. Therefore, good settings for the two parameters are usually strongly correlated, and C and gamma should be adjusted together.
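As a small illustration of how the kernel and its kernel-specific parameters are specified in scikit-learn (the particular values below are arbitrary examples, not recommendations):

- from sklearn.svm import SVC
- # rbf kernel: complexity controlled by C and gamma together
- rbf_svc = SVC(kernel='rbf', C=10, gamma=0.1)
- # polynomial kernel: degree is the kernel-specific parameter
- poly_svc = SVC(kernel='poly', degree=3, C=10)
- # linear kernel: no kernel-specific parameter besides C
- lin_svc = SVC(kernel='linear', C=10)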