Another way to enrich a feature representation, in particular for linear models, is to add interaction features and polynomial features of the original data. This kind of feature engineering is often used in statistical modeling, but it is also common in many practical machine learning applications. As a first example, look again at the linear_binning figure above. The linear model learned a constant value for each bin on the wave dataset. We know, however, that linear models can learn not only offsets, but also slopes. One way to add a slope to the linear model on the binned data is to add the original feature (the x-axis in the plot) back in.
This leads to an 11-dimensional dataset, as seen in Figure 4-3:
- ch5_t05.py
- #!/usr/bin/env python
- import numpy as np
- import matplotlib.pyplot as plt
- import mglearn
- from sklearn.linear_model import LinearRegression
- from sklearn.preprocessing import OneHotEncoder
- X, y = mglearn.datasets.make_wave(n_samples=100)
- line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
- bins = np.linspace(-3, 3, 11)
- print("bins: {}".format(bins))
- which_bin = np.digitize(X, bins=bins)
- print("\nData points:\n{}\n".format(X[:5]))
- print("\nBin membership for data points:\n{}\n".format(which_bin[:5]))
- # Transform using the OneHotEncoder
- encoder = OneHotEncoder(sparse=False)
- # encoder.fit finds the unique values that appear in which_bin
- encoder.fit(which_bin)
- # transform creates the one-hot encoding
- X_binned = encoder.transform(which_bin)
- print("Top 5 rows in X_binned:\n{}\n".format(X_binned[:5]))
- print("X_binned.shape = {}\n".format(X_binned.shape))
- line_binned = encoder.transform(np.digitize(line, bins=bins))
- print("line_binned:\n{}\n".format(line_binned[:10]))
- # Add the original feature back in, next to the one-hot bin indicators
- X_combined = np.hstack([X, X_binned])
- print(X_combined.shape)
- reg = LinearRegression().fit(X_combined, y)
- line_combined = np.hstack([line, line_binned])
- plt.plot(line, reg.predict(line_combined), label='linear regression binned')
- for bin in bins:
-     plt.plot([bin, bin], [-3, 3], ':', c='k')
- plt.plot(X[:, 0], y, 'o', c='k')
- plt.legend(loc="best")
- plt.ylabel("Regression output")
- plt.xlabel("Input feature")
- plt.show()
In this example, the model learned an offset for each bin, together with a slope. The learned slope is downward and shared across all the bins - there is a single x-axis feature, which has a single slope. Because the slope is shared across all bins, it doesn't seem to be very helpful. We would rather have a separate slope for each bin! We can achieve this by adding an interaction or product feature that indicates which bin a data point is in and where it lies on the x-axis. This feature is a product of the bin indicator and the original feature. Let's create this dataset:
- X_product = np.hstack([X_binned, X * X_binned])
- print(X_product.shape)
The dataset now has 20 features: the indicators for which bin a data point is in, and the products of the original feature and the bin indicators. You can think of each product feature as a separate copy of the x-axis feature for one bin: it is the original feature within that bin, and zero everywhere else. Figure 4-4 shows the result of the linear model on this new representation:
- ch5_t06.py
- #!/usr/bin/env python
- import numpy as np
- import matplotlib.pyplot as plt
- import mglearn
- from sklearn.linear_model import LinearRegression
- from sklearn.preprocessing import OneHotEncoder
- X, y = mglearn.datasets.make_wave(n_samples=100)
- line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
- bins = np.linspace(-3, 3, 11)
- which_bin = np.digitize(X, bins=bins)
- # Transform using the OneHotEncoder
- encoder = OneHotEncoder(sparse=False)
- # encoder.fit finds the unique values that appear in which_bin
- encoder.fit(which_bin)
- # transform creates the one-hot encoding
- X_binned = encoder.transform(which_bin)
- line_binned = encoder.transform(np.digitize(line, bins=bins))
- # Product features: the original feature multiplied by each bin indicator
- X_product = np.hstack([X_binned, X * X_binned])
- print(X_product.shape)
- reg = LinearRegression().fit(X_product, y)
- line_product = np.hstack([line_binned, line * line_binned])
- plt.plot(line, reg.predict(line_product), label='linear regression product')
- for bin in bins:
-     plt.plot([bin, bin], [-3, 3], ':', c='k')
- plt.plot(X[:, 0], y, 'o', c='k')
- plt.legend(loc="best")
- plt.ylabel("Regression output")
- plt.xlabel("Input feature")
- plt.show()
As you can see, now each bin has its own offset and slope in this model. Using binning is one way to expand a continuous feature. Another one is to use polynomials of the original features. For a given feature x, we might want to consider x**2, x**3, x**4 and so on. This is implemented in PolynomialFeatures in the preprocessing module:
- from sklearn.preprocessing import PolynomialFeatures
- # Include polynomials up to x**10
- # the default "include_bias=True" adds a feature that's constantly 1
- poly = PolynomialFeatures(degree=10, include_bias=False)
- poly.fit(X)
- X_poly = poly.transform(X)
- print("X_poly.shape: {}".format(X_poly.shape))
Let's compare the entries of X_poly to those of X:
- print("Top 5 entries of X:\n{}\n".format(X[:5]))
- print("Top 5 entries of X_poly:\n{}\n".format(X_poly[:5]))
You can obtain the semantics of the features by calling the get_feature_names method, which provides the exponent for each feature:
- print("Polynomial feature names:\n{}\n".format(poly.get_feature_names()))
You can see that the first column of X_poly corresponds exactly to X, while the other columns are powers of it. It's interesting to see how large some of the values can get: the last column (x ** 10) has entries above 20,000, orders of magnitude different from the rest. Using polynomial features together with a linear regression model yields the classic model of polynomial regression (see Figure 4-5):
- ch5_t07.py
- #!/usr/bin/env python
- import numpy as np
- import matplotlib.pyplot as plt
- import mglearn
- from sklearn.linear_model import LinearRegression
- from sklearn.preprocessing import PolynomialFeatures
- X, y = mglearn.datasets.make_wave(n_samples=100)
- line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
- # Include polynomials up to x ** 10:
- # the default "include_bias=True" adds a feature that's constantly 1
- poly = PolynomialFeatures(degree=10, include_bias=False)
- poly.fit(X)
- X_poly = poly.transform(X)
- print("X_poly.shape: {}".format(X_poly.shape))
- print("Top 5 entries of X:\n{}\n".format(X[:5]))
- print("Top 5 entries of X_poly:\n{}\n".format(X_poly[:5]))
- print("Polynomial feature names:\n{}\n".format(poly.get_feature_names()))
- reg = LinearRegression().fit(X_poly, y)
- line_poly = poly.transform(line)
- plt.plot(line, reg.predict(line_poly), label='polynomial linear regression')
- plt.plot(X[:, 0], y, 'o', c='k')
- plt.ylabel("Regression output")
- plt.xlabel("Input feature")
- plt.legend(loc='best')
- plt.show()
As you can see, polynomial features yield a very smooth fit on this one-dimensional data. However, polynomials of high degree tend to behave in extreme ways on the boundaries or in regions with little data. As a comparison, here is a kernel SVM model (svm.SVR) learned on the original data, without any transformation (see Figure 4-6):
- ch5_t08.py
- #!/usr/bin/env python
- import numpy as np
- import matplotlib.pyplot as plt
- import mglearn
- from sklearn.svm import SVR
- X, y = mglearn.datasets.make_wave(n_samples=100)
- line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
- # Fit a kernel SVM directly on the original, untransformed feature
- for gamma in [1, 10]:
-     svr = SVR(gamma=gamma).fit(X, y)
-     plt.plot(line, svr.predict(line), label='SVR gamma={}'.format(gamma))
- plt.plot(X[:, 0], y, 'o', c='k')
- plt.ylabel("SVR output")
- plt.xlabel("Input feature")
- plt.legend(loc='best')
- plt.show()
Using a more complex model, a kernel SVM, we are able to learn a prediction of similar complexity to the polynomial regression without an explicit transformation of the features. As a more realistic application of interactions and polynomials, let's look again at the Boston Housing dataset. We already used polynomial features on this dataset in Chapter 2 (linear models). Now let's have a look at how these features were constructed, and at how much the polynomial features help. First we load the data and rescale it to be between 0 and 1 using MinMaxScaler:
- import numpy as np
- from sklearn.datasets import load_boston
- from sklearn.model_selection import train_test_split
- from sklearn.preprocessing import MinMaxScaler
- from sklearn.preprocessing import PolynomialFeatures
- boston = load_boston()
- X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)
- # Rescale data to [0, 1]; fit the scaler on the training set only
- scaler = MinMaxScaler()
- X_train_scaled = scaler.fit_transform(X_train)
- X_test_scaled = scaler.transform(X_test)
- poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
- X_train_poly = poly.transform(X_train_scaled)
- X_test_poly = poly.transform(X_test_scaled)
- print("X_train.shape: {}".format(X_train.shape))
- print("X_train_poly.shape: {}".format(X_train_poly.shape))
The data originally had 13 features, which were expanded to 105 interaction features. These new features represent all possible interactions between two different original features, as well as the square of each original feature. degree=2 here means that we look at all features that are products of up to two original features. The exact correspondence between input and output features can be found using the get_feature_names method:
- print("Polynomial feature names:\n{}\n".format(poly.get_feature_names()))
The first new feature is a constant feature, called "1" here. The next 13 features are the original features (called "x0" to "x12"). Then follows the first feature squared ("x0^2") and combinations of the first and the other features. Let's compare the performance using Ridge on the data with and without interactions:
- from sklearn.linear_model import Ridge
- ridge = Ridge().fit(X_train_scaled, y_train)
- print("Score without interactions: {:.3f}".format(ridge.score(X_test_scaled, y_test)))
- ridge = Ridge().fit(X_train_poly, y_train)
- print("Score with interactions: {:.3f}".format(ridge.score(X_test_poly, y_test)))
Clearly, the interactions and polynomial features gave us a good boost in performance when using Ridge. When using a more complex model like a random forest (RandomForestRegressor), the story is a bit different, though:
- from sklearn.ensemble import RandomForestRegressor
- rf = RandomForestRegressor(n_estimators=100).fit(X_train_scaled, y_train)
- print("Score without interactions: {:.3f}".format(rf.score(X_test_scaled, y_test)))
- rf = RandomForestRegressor(n_estimators=100).fit(X_train_poly, y_train)
- print("Score with interactions: {:.3f}".format(rf.score(X_test_poly, y_test)))
You can see that even without additional features, the random forest beats the performance of Ridge. Adding interactions and polynomials actually decreases performance slightly.
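As an aside, if you want only the interaction terms and not the squared features, PolynomialFeatures accepts an interaction_only option. Here is a minimal sketch, reusing X_train_scaled from above:
- from sklearn.preprocessing import PolynomialFeatures
- # degree=2 with interaction_only=True keeps products of distinct features only
- poly_inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
- X_train_inter = poly_inter.fit_transform(X_train_scaled)
- # 13 original features plus 13 * 12 / 2 = 78 pairwise products = 91 columns
- print("X_train_inter.shape: {}".format(X_train_inter.shape))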
Univariate Nonlinear Transformations
We just saw that adding squared or cubed features can help linear models for regression. There are other transformations that often prove useful for certain features: in particular, applying mathematical functions like log, exp, or sin. While tree-based models only care about the ordering of the features, linear models and neural networks are very tied to the scale and distribution of each feature, and if there is a nonlinear relation between the feature and the target, that becomes hard to model - particularly in regression. The functions log and exp can help by adjusting the relative scales in the data so that they can be captured better by a linear model or neural network. We saw an application of that in Chapter 2 with the memory price data. The sin and cos functions can come in handy when dealing with data that encodes periodic patterns.
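As a small illustration of the periodic case, a sine/cosine pair can encode a cyclic feature so that a linear model treats the two ends of the cycle as close together. The hour-of-day feature below is hypothetical and not part of any dataset in this chapter; this is just a minimal sketch:
- import numpy as np
- # Hypothetical cyclic feature: hour of day, 0 through 23
- hours = np.arange(24)
- # Map each hour onto the unit circle, so 23:00 and 00:00 end up close together
- hour_sin = np.sin(2 * np.pi * hours / 24)
- hour_cos = np.cos(2 * np.pi * hours / 24)
- hour_features = np.column_stack([hour_sin, hour_cos])
- print(hour_features[:3])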
Most models work best when each feature (and in regression also the target) is loosely Gaussian distributed - that is, a histogram of each feature should have something resembling the familiar "bell curve" shape. Using transformations like log and exp is a hacky but simple and efficient way to achieve this. A particularly common case where such a transformation is helpful is when dealing with integer count data. By count data, we mean features like "how often did user A log in?" Counts are never negative and often follow particular statistical patterns. We will use a synthetic dataset of counts here that has properties similar to those you can find in the wild. The features are all integer-valued, while the response is continuous.
- import numpy as np
- from sklearn.model_selection import train_test_split
- rnd = np.random.RandomState(0)
- X_org = rnd.normal(size=(1000, 3))
- w = rnd.normal(size=3)
- # Integer count features, Poisson-distributed; the target is a linear function of X_org
- X = rnd.poisson(10 * np.exp(X_org))
- y = np.dot(X_org, w)
- print("Number of feature appearances:\n{}\n".format(np.bincount(X[:, 0])))
The value 2 seems to be the most common, with 6 appearances (bincount always starts at 0), and the counts for higher values fall quickly. However, there are some very high values, like 84 and 85, which appear twice. We visualize the counts in Figure 4-7:
- bins = np.bincount(X[:,0])
- import matplotlib.pyplot as plt
- plt.bar(range(len(bins)), bins, color='w')
- plt.ylabel("Number of appearance")
- plt.xlabel("Value")
- plt.show()
Features X[:, 1] and X[:, 2] have similar properties. This kind of distribution of values (many small ones and a few very large ones) is very common in practice. However, it is something most linear models can't handle very well. Let's try to fit a ridge regression to this data:
- from sklearn.linear_model import Ridge
- X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
- ridge = Ridge().fit(X_train, y_train)
- print("Score of Ridge: {:.3f}\n".format(ridge.score(X_test, y_test)))
As you can see from the relatively low R^2 score, Ridge was not able to really capture the relationship between X and y. Applying a logarithmic transformation can help, though. Because the value 0 appears in the data (and the logarithm is not defined at 0), we can't simply apply log, but have to compute log(X + 1):
- X_train_log = np.log(X_train + 1)
- X_test_log = np.log(X_test + 1)
- import matplotlib.pyplot as plt
- plt.hist(X_train_log[:,0], bins=25, color='gray')
- plt.ylabel("Number of appearance")
- plt.xlabel("Value")
- plt.show()
Building a Ridge model on the new data provides a much better fit:
- ridge = Ridge().fit(X_train_log, y_train)
- print("Score of Ridge(log version):{:.3f}\n".format(ridge.score(X_test_log, y_test)))
Finding the transformation that works best for each combination of dataset and model is somewhat of an art. In this example, all the features had the same properties. This is rarely the case in practice, and usually only a subset of the features should be transformed, or sometimes each feature needs to be transformed in a different way. As we mentioned earlier, these kinds of transformations are irrelevant for tree-based models but might be essential for linear models. Sometimes it is also a good idea to transform the target variable y in regression. Trying to predict counts (say, number of orders) is a fairly common task, and using the log(y+1) transformation often helps.
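As a minimal sketch of transforming the target, you can fit on log(y + 1) and map predictions back with the inverse transformation. The count-valued target below is synthetic and only for illustration; it is not part of the dataset used above:
- import numpy as np
- from sklearn.linear_model import Ridge
- # Synthetic count-valued target, only to illustrate the idea
- rng = np.random.RandomState(0)
- X_demo = rng.normal(size=(200, 3))
- y_counts = rng.poisson(np.exp(X_demo[:, 0] + 1))
- # Fit on log(y + 1), then undo the transformation with expm1 when predicting
- ridge_log = Ridge().fit(X_demo, np.log1p(y_counts))
- y_pred = np.expm1(ridge_log.predict(X_demo))
- print(y_pred[:5])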
As you saw in the previous examples, binning, polynomials, and interactions can have a huge influence on how models perform on a given dataset. This is particularly true for less complex models like linear models and naive Bayes models. Tree-based models, on the other hand, are often able to discover important interactions themselves, and don't require transforming the data explicitly most of the time. Other models, like SVMs, nearest neighbors, and neural networks, might sometimes benefit from using binning, interactions, or polynomials, but the implications there are usually much less clear than in the case of linear models.