Tuesday, March 21, 2017

[ Intro2ML ] Ch5. Representing Data and Engineering Features - Part2

Interactions and Polynomials 
Another way to enrich a feature representation, particularly for linear models, is adding interaction features and polynomial features of the original data. This kind of feature engineering is often used in statistical modeling, but it is also common in many practical machine learning applications. As a first example, look again at Figure linear_binning above. The linear model learned a constant value for each bin on the wave dataset. We know, however, that linear models can learn not only offsets but also slopes. One way to add a slope to the linear model on the binned data is to add the original feature (the x axis in the plot) back in. 

This leads to an 11-dimensional dataset, as seen in Figure 4-3: 
>>> from ch5_t04 import *
>>> X_combined = np.hstack([X, X_binned]) # API: np.hstack stacks arrays in sequence horizontally (column-wise)
>>> print(X_combined.shape)
(100, 11)
>>> X.shape
(100, 1)
>>> X_binned.shape
(100, 10)

- ch5_t05.py 
  #!/usr/bin/env python
  import numpy as np
  import mglearn

  from sklearn.linear_model import LinearRegression

  X, y = mglearn.datasets.make_wave(n_samples=100)
  line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)

  bins = np.linspace(-3, 3, 11)
  print("bins: {}".format(bins))

  which_bin = np.digitize(X, bins=bins)
  print("\nData points:\n{}\n".format(X[:5]))
  print("\nBin membership for data points:\n{}\n".format(which_bin[:5]))

  from sklearn.preprocessing import OneHotEncoder

  # Transform using the OneHotEncoder
  encoder = OneHotEncoder(sparse=False)

  # encoder.fit finds the unique values that appear in which_bin
  encoder.fit(which_bin)

  # transform creates the one-hot encoding
  X_binned = encoder.transform(which_bin)
  print("Top 5 rows in X_binned:\n{}\n".format(X_binned[:5]))

  print("X_binned.shape = {}\n".format(X_binned.shape))

  line_binned = encoder.transform(np.digitize(line, bins=bins))
  print("line_binned:\n{}\n".format(line_binned[:10]))

  # Add the original feature back in, so the binned model can also learn a slope
  X_combined = np.hstack([X, X_binned])
  print(X_combined.shape)
  reg = LinearRegression().fit(X_combined, y)

  import matplotlib.pyplot as plt
  line_combined = np.hstack([line, line_binned])
  plt.plot(line, reg.predict(line_combined), label='linear regression combined')

  for bin in bins:
      plt.plot([bin, bin], [-3, 3], ':', c='k')

  plt.plot(X[:, 0], y, 'o', c='k')
  plt.legend(loc="best")
  plt.ylabel("Regression output")
  plt.xlabel("Input feature")
  plt.show()
Figure 4-3. Linear regression using binned features and a single global slope 

In this example, the model learned an offset for each bin, together with a slope. The learned slope is downward and shared across all the bins - there is a single x-axis feature, which has a single slope. Because the slope is shared across all bins, it doesn't seem to be very helpful. We would rather have a separate slope for each bin! We can achieve this by adding an interaction or product feature that indicates which bin a data point is in and where it lies on the x-axis. This feature is a product of the bin indicator and the original feature. Let's create this dataset: 
  X_product = np.hstack([X_binned, X * X_binned])
  print(X_product.shape)
Execution output: 
(100, 20)

The dataset now has 20 features: the indicators for which bin a data point is in, and the products of the original feature and the bin indicators. You can think of each product feature as a separate copy of the x-axis feature for that bin. It is the original feature within the bin, and zero everywhere else. Figure 4-4 shows the result of the linear model on this new representation: 
- ch5_t06.py 
  #!/usr/bin/env python
  import numpy as np
  import mglearn

  from sklearn.linear_model import LinearRegression

  X, y = mglearn.datasets.make_wave(n_samples=100)
  line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)

  bins = np.linspace(-3, 3, 11)
  print("bins: {}".format(bins))

  which_bin = np.digitize(X, bins=bins)
  print("\nData points:\n{}\n".format(X[:5]))
  print("\nBin membership for data points:\n{}\n".format(which_bin[:5]))

  from sklearn.preprocessing import OneHotEncoder

  # Transform using the OneHotEncoder
  encoder = OneHotEncoder(sparse=False)

  # encoder.fit finds the unique values that appear in which_bin
  encoder.fit(which_bin)

  # transform creates the one-hot encoding
  X_binned = encoder.transform(which_bin)
  print("Top 5 rows in X_binned:\n{}\n".format(X_binned[:5]))

  print("X_binned.shape = {}\n".format(X_binned.shape))

  line_binned = encoder.transform(np.digitize(line, bins=bins))
  print("line_binned:\n{}\n".format(line_binned[:10]))

  # Product of the bin indicators and the original feature: one slope per bin
  X_product = np.hstack([X_binned, X * X_binned])
  print(X_product.shape)

  reg = LinearRegression().fit(X_product, y)

  import matplotlib.pyplot as plt
  line_product = np.hstack([line_binned, line * line_binned])
  plt.plot(line, reg.predict(line_product), label='linear regression product')

  for bin in bins:
      plt.plot([bin, bin], [-3, 3], ':', c='k')

  plt.plot(X[:, 0], y, 'o', c='k')
  plt.legend(loc="best")
  plt.ylabel("Regression output")
  plt.xlabel("Input feature")
  plt.show()
Figure 4-4. Linear regression with a separate slope per bin 

As you can see, now each bin has its own offset and slope in this model. Using binning is one way to expand a continuous feature. Another one is to use polynomials of the original features. For a given feature x, we might want to consider x**2, x**3, x**4 and so on. This is implemented in PolynomialFeatures in the preprocessing module: 
  from sklearn.preprocessing import PolynomialFeatures

  # Include polynomials up to x**10:
  # the default "include_bias=True" adds a feature that's constantly 1
  poly = PolynomialFeatures(degree=10, include_bias=False)
  poly.fit(X)
  X_poly = poly.transform(X)
  print("X_poly.shape: {}".format(X_poly.shape))
Output: 
X_poly.shape: (100, 10)

Let's compare the entries of X_poly to those of X: 
  1. print("Top 5 entries of X:\n{}\n".format(X[:5]))  
  2. print("Top 5 entries of X_poly:\n{}\n".format(X_poly[:5]))  
Output: 

You can obtain the semantics of the features by calling the get_feature_names method, which provides the exponent for each feature: 
  1. print("Polynomial feature names:\n{}\n".format(poly.get_feature_names()))  
Output: 
Polynomial feature names:
['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10']

You can see that the first column of X_poly corresponds exactly to X, while the other columns are powers of the first column. It's interesting to see how large some of the values can get. The second column has entries above 20,000, orders of magnitude different from the rest. Using polynomial features together with a linear regression model yields the classic model of polynomial regression (see Figure 4-5): 
- ch5_t07.py 
  #!/usr/bin/env python
  import numpy as np
  import mglearn

  from sklearn.linear_model import LinearRegression

  X, y = mglearn.datasets.make_wave(n_samples=100)
  line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)

  from sklearn.preprocessing import PolynomialFeatures

  # Include polynomials up to x**10:
  # the default "include_bias=True" adds a feature that's constantly 1
  poly = PolynomialFeatures(degree=10, include_bias=False)
  poly.fit(X)
  X_poly = poly.transform(X)
  print("X_poly.shape: {}".format(X_poly.shape))

  print("Top 5 entries of X:\n{}\n".format(X[:5]))
  print("Top 5 entries of X_poly:\n{}\n".format(X_poly[:5]))

  print("Polynomial feature names:\n{}\n".format(poly.get_feature_names()))

  reg = LinearRegression().fit(X_poly, y)
  line_poly = poly.transform(line)

  import matplotlib.pyplot as plt
  plt.plot(line, reg.predict(line_poly), label='Polynomial linear regression')
  plt.plot(X[:, 0], y, 'o', c='k')
  plt.ylabel("Regression output")
  plt.xlabel("Input feature")
  plt.legend(loc='best')
  plt.show()
Figure 4-5. Linear regression with tenth-degree polynomial features 

As you can see, polynomial features yield a very smooth fit on this one-dimensional data. However, polynomials of high degree tend to behave in extreme ways on the boundaries or in regions with little data. As a comparison, here is a kernel SVM model (svm.SVR) learned on the original data, without any transformation (see Figure 4-6): 
- ch5_t08.py 
  #!/usr/bin/env python
  import numpy as np
  import mglearn

  X, y = mglearn.datasets.make_wave(n_samples=100)
  line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)

  from sklearn.svm import SVR
  import matplotlib.pyplot as plt

  # Learn a kernel SVM directly on the original, untransformed feature
  for gamma in [1, 10]:
      svr = SVR(gamma=gamma).fit(X, y)
      plt.plot(line, svr.predict(line), label='SVR gamma={}'.format(gamma))
  plt.plot(X[:, 0], y, 'o', c='k')
  plt.ylabel("SVR output")
  plt.xlabel("Input feature")
  plt.legend(loc='best')
  plt.show()
Figure 4-6. Comparison of different gamma parameters for an SVM with RBF kernel 

Using a more complex model, a kernel SVM, we are able to learn a prediction of similar complexity to the polynomial regression without an explicit transformation of the features. As a more realistic application of interactions and polynomials, let's look again at the Boston Housing dataset. We already used polynomial features on this dataset before (Chapter 2, linear models). Now let's have a look at how these features were constructed, and at how much the polynomial features help. First we load the data and rescale it to be between 0 and 1 using MinMaxScaler: 
  import numpy as np
  import mglearn
  from sklearn.datasets import load_boston
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import MinMaxScaler

  boston = load_boston()
  X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=0)

  # Rescale the data to the range [0, 1]; fit the scaler on the training set only
  scaler = MinMaxScaler()
  X_train_scaled = scaler.fit_transform(X_train)
  X_test_scaled = scaler.transform(X_test)
Now, we extract polynomial features and interactions up to a degree of 2: 
  from sklearn.preprocessing import PolynomialFeatures

  poly = PolynomialFeatures(degree=2).fit(X_train_scaled)
  X_train_poly = poly.transform(X_train_scaled)
  X_test_poly = poly.transform(X_test_scaled)
  print("X_train.shape={}".format(X_train.shape))
  print("X_train_poly.shape={}".format(X_train_poly.shape))
Output: 
X_train.shape=(379, 13)
X_train_poly.shape=(379, 105)

The data originally had 13 features, which were expanded to 105 interaction features. These new features represent all possible interactions between two different original features, as well as the square of each original feature. degree=2 here means that we look at all features that are products of up to two original features: one constant feature, the 13 original features, their 13 squares, and 78 pairwise products, for 105 features in total. The exact correspondence between input and output features can be found using the get_feature_names method: 
  1. print("Polynomial feature names:\n{}\n".format(poly.get_feature_names()))  
Output: 
Polynomial feature names:
['1', 'x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10', 'x11', 'x12', 'x0^2', 'x0 x1', 'x0 x2'... 'x9 x12', 'x10^2', 'x10 x11', 'x10 x12', 'x11^2', 'x11 x12', 'x12^2']

The first new feature is a constant feature, called "1" here. The next 13 features are the original features (called "x0" to "x12"). Then follows the first feature squared ("x0^2") and combinations of the first and the other features. Let's compare the performance using Ridge on the data with and without interactions: 
  from sklearn.linear_model import Ridge
  ridge = Ridge().fit(X_train_scaled, y_train)
  print("Score without interactions: {:.3f}".format(ridge.score(X_test_scaled, y_test)))
  ridge = Ridge().fit(X_train_poly, y_train)
  print("Score with interactions: {:.3f}".format(ridge.score(X_test_poly, y_test)))
Output: 
Score without interactions: 0.577
Score with interactions: 0.741

Clearly, the interactions and polynomial features gave us a good boost in performance when using Ridge. When using a more complex model like a random forest (RandomForestRegressor), the story is a bit different, though: 
  from sklearn.ensemble import RandomForestRegressor
  rf = RandomForestRegressor(n_estimators=100).fit(X_train_scaled, y_train)
  print("Score without interactions: {:.3f}".format(rf.score(X_test_scaled, y_test)))
  rf = RandomForestRegressor(n_estimators=100).fit(X_train_poly, y_train)
  print("Score with interactions: {:.3f}".format(rf.score(X_test_poly, y_test)))
Output: 
Score without interactions: 0.799
Score with interactions: 0.761

You can see that even without additional features, the random forest beats the performance of Ridge. Adding interactions and polynomials actually decreases performance slightly. 
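Going back to the Ridge model that did benefit from the expanded features, one optional check (not in the book, just a sketch) is to look at which expanded features receive the largest absolute coefficients. This assumes ridge is the model fit on X_train_poly and poly is the fitted PolynomialFeatures from above:
  import numpy as np

  # Rank the expanded features by the magnitude of their Ridge coefficients
  feature_names = np.array(poly.get_feature_names())
  top = np.argsort(np.abs(ridge.coef_))[::-1][:10]
  for name, coef in zip(feature_names[top], ridge.coef_[top]):
      print("{:>10}: {:.3f}".format(name, coef))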

Univariate Nonlinear Transformations 
We just saw that adding squared or cubed features can help linear models for regression. There are other transformations that often prove useful for certain features: in particular, applying mathematical functions like log, exp, or sin. While tree-based models only care about the ordering of the features, linear models and neural networks are very tied to the scale and distribution of each feature, and if there is a nonlinear relation between the feature and the target, that becomes hard to model - particularly in regression. The functions log and exp can help by adjusting the relative scales in the data so that they can be captured better by a linear model or neural network. We saw an application of that in Chapter 2 with the memory price data. The sin and cos functions can come in handy when dealing with data that encodes periodic patterns, as sketched below. 
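As a small, hedged illustration of the periodic case (not from the book; the hour-of-day feature and the data are hypothetical), sin and cos can map a cyclic value onto the unit circle so that a linear model sees hour 23 and hour 0 as neighbors:
  import numpy as np

  # Hypothetical periodic feature: hour of the day, values 0..23
  rng = np.random.RandomState(0)
  hour = rng.randint(0, 24, size=100)

  # Map the hour onto the unit circle; the two columns (sin, cos) replace
  # the single raw "hour" column, so 23:00 and 0:00 end up close together
  hour_sin = np.sin(2 * np.pi * hour / 24.0)
  hour_cos = np.cos(2 * np.pi * hour / 24.0)
  X_periodic = np.column_stack([hour_sin, hour_cos])
  print("X_periodic.shape: {}".format(X_periodic.shape))  # (100, 2)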

Most models work best when each feature (and in regression also the target) is loosely Gaussian distributed - that is, a histogram of each feature should have something resembling the familiar "bell curve" shape. Using transformations like log and exp is a hacky but simple and efficient way to achieve this. A particularly common case in which such a transformation can be helpful is when dealing with integer count data. By count data, we mean features like "how often did user A log in?" Counts are never negative and often follow particular statistical patterns. We will use a synthetic dataset of counts here that has properties similar to those you can find in the wild. The features are all integer-valued, while the response is continuous: 
  import numpy as np
  from sklearn.model_selection import train_test_split

  rnd = np.random.RandomState(0)
  X_org = rnd.normal(size=(1000, 3))
  w = rnd.normal(size=3)

  # Integer count features: Poisson-distributed, with rates driven by the Gaussian features
  X = rnd.poisson(10 * np.exp(X_org))
  y = np.dot(X_org, w)
Let's take a look at the first 10 entries of the first feature. All are integer values and positive, but apart from that it's hard to make out a particular pattern. If we count the appearance of each value, the distribution of values becomes clear: 
  1. print("Number of eature appearance:\n{}\n".format(np.bincount(X[:,0])))  

The value 2 seems to be the most common, with 6 appearances (bincount always starts at 0), and the counts for higher values fall quickly. However, there are some very high values, like 84 and 85, that appear only twice. We visualize the counts in Figure 4-7: 
  bins = np.bincount(X[:, 0])
  import matplotlib.pyplot as plt
  plt.bar(range(len(bins)), bins, color='gray')
  plt.ylabel("Number of appearances")
  plt.xlabel("Value")
  plt.show()
Figure 4-7. Histogram of feature values for X[0] 

Features X[:, 1] and X[:, 2] have similar properties. This kind of distribution of values (many small ones and a few very large ones) is very common in practice. However, it is something most linear models can't handle very well. Let's try to fit a ridge regression to this data: 
  from sklearn.linear_model import Ridge
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
  ridge = Ridge().fit(X_train, y_train)
  print("Score of Ridge:{:.3f}\n".format(ridge.score(X_test, y_test)))
Output: 
Score of Ridge:0.622

As you can see from the relatively low R^2 score, Ridge was not able to really capture the relationship between X and y. Applying a logarithmic transformation can help, though. Because the value 0 appears in the data (and the logarithm is not defined at 0), we can't simply apply log; we have to compute log(X + 1): 
  X_train_log = np.log(X_train + 1)
  X_test_log = np.log(X_test + 1)
After the transformation, the distribution of the data is less asymmetrical and doesn't have very large outliers anymore (see Figure 4-8): 
  import matplotlib.pyplot as plt
  plt.hist(X_train_log[:, 0], bins=25, color='gray')
  plt.ylabel("Number of appearances")
  plt.xlabel("Value")
  plt.show()
Figure 4-8. Histogram of feature values for X[0] after logarithm transformation 

Building a Ridge model on the new data provides a much better fit: 
  ridge = Ridge().fit(X_train_log, y_train)
  print("Score of Ridge(log version):{:.3f}\n".format(ridge.score(X_test_log, y_test)))
Output: 
Score of Ridge(log version):0.875

Finding the transformation that works best for each combination of dataset and model is somewhat of an art. In this example, all the features had the same properties. This is rarely the case in practice, and usually only a subset of the features should be transformed, or sometimes each feature needs to be transformed in a different way. As we mentioned earlier, these kinds of transformations are irrelevant for tree-based models but might be essential for linear models. Sometimes it is also a good idea to transform the target variable y in regression. Trying to predict counts (say, number of orders) is a fairly common task, and using the log(y + 1) transformation often helps, as sketched below. 
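Here is a minimal sketch of that target transformation, assuming a hypothetical count-valued target y_counts generated the same way as the synthetic count features above: train on log(y + 1) via np.log1p and map predictions back with np.expm1:
  import numpy as np
  from sklearn.linear_model import Ridge
  from sklearn.model_selection import train_test_split

  # Hypothetical count-valued target, built like the synthetic count features above
  rnd = np.random.RandomState(0)
  X_demo = rnd.normal(size=(1000, 3))
  y_counts = rnd.poisson(10 * np.exp(X_demo[:, 0]))

  X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_counts, random_state=0)

  # Fit on the log-transformed target: np.log1p(y) == log(y + 1)
  ridge = Ridge().fit(X_tr, np.log1p(y_tr))

  # Invert the transformation so predictions come back on the original count scale
  y_pred = np.expm1(ridge.predict(X_te))
  print("First five predicted counts: {}".format(y_pred[:5]))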

As you saw in the previous examples, binning, polynomials, and interactions can have a huge influence on how models perform on a given dataset. This is particularly true for less complex models like linear models and naive Bayes models. Tree-based models, on the other hand, are often able to discover important interactions themselves, and don't require transforming the data explicitly most of the time. Other models, like SVMs, nearest neighbors, and neural networks, might sometimes benefit from using binning, interactions, or polynomials, but the implications there are usually much less clear than in the case of linear models.
