2017年3月26日 星期日

[ Intro2ML ] Ch5. Representing Data and Engineering Features - Part3

Automatic Feature Selection 
With so many ways to create new features, you might be tempted to increase the dimensionality of the data way beyond the number of original features. However, adding more features makes all models more complex, and so increases the chance of overfitting. When adding new features, or with high-dimensional datasets in general, it can be a good idea to reduce the number of features to only the most useful ones, and discard the rest. This can lead to simpler models that generalize better.

But how can you know how good each feature is? There are three basic strategies: univariate statistics, model-based selection, and iterative selection. We will discuss all three of them in detail. All three of these methods are supervised, meaning they need the target for fitting the model. This means we do need to split the data into a training and a test set, and fit the feature selection only on the training part of the data.
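One practical way to enforce this (a minimal sketch, not part of the original example) is to wrap the selector and the final estimator in a scikit-learn Pipeline, so that the selector is re-fit only on the training portion of every split during cross-validation; the percentile and max_iter values below are arbitrary choices:

  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import SelectPercentile
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import make_pipeline

  cancer = load_breast_cancer()
  # the SelectPercentile step is fit on the training fold only, inside each CV split
  pipe = make_pipeline(SelectPercentile(percentile=50),
                       LogisticRegression(max_iter=5000))
  scores = cross_val_score(pipe, cancer.data, cancer.target, cv=5)
  print("Mean cross-validation accuracy: {:.3f}".format(scores.mean()))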

Univariate statistics 
In univariate statistics, we compute whether there is a statistically significant relationship between each feature and the target. Then the features that are related with the highest confidence are selected. In the case of classification, this is also known as analysis of variance (ANOVA). A key property of these tests is that they are univariate, meaning that they only consider each feature individually. Consequently, a feature will be discarded if it is only informative when combined with another feature. Univariate tests are often very fast to compute, and don’t require building a model. On the other hand, they are completely independent of the model that you might want to apply after the feature selection.

To use univariate feature selection in scikit-learn, you need to choose a test, usually either f_classif (the default) for classification or f_regression for regression, and a method to discard features based on the p-values determined in the test. All methods for discarding features use a threshold to discard all features with too high a p-value (which means they are unlikely to be related to the target). The methods differ in how they compute this threshold, with the simplest ones being SelectKBest, which selects a fixed number k of features, and SelectPercentile, which selects a fixed percentage of features.
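As a small illustration (a sketch added here, not from the original example), selecting a fixed number of features with SelectKBest and the ANOVA F-test would look roughly like this; k=20 is an arbitrary choice:

  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import SelectKBest, f_classif

  cancer = load_breast_cancer()
  # keep the 20 features with the highest ANOVA F-scores (lowest p-values)
  select_k = SelectKBest(score_func=f_classif, k=20)
  X_selected = select_k.fit_transform(cancer.data, cancer.target)
  print(cancer.data.shape, X_selected.shape)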

Let’s apply the feature selection for classification on the cancer dataset. To make the task a bit harder, we add some non-informative noise features to the data. We expect the feature selection to be able to identify the features that are non-informative and remove them. 
- ch5_t11.py 
  import numpy as np
  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import SelectPercentile
  from sklearn.model_selection import train_test_split

  cancer = load_breast_cancer()
  # get deterministic random numbers
  rng = np.random.RandomState(42)
  noise = rng.normal(size=(len(cancer.data), 50))

  # add noise features to the data
  # the first 30 features are from the dataset, the next 50 are noise
  X_w_noise = np.hstack([cancer.data, noise])

  X_train, X_test, y_train, y_test = train_test_split(X_w_noise, cancer.target, random_state=0, test_size=.5)

  # use f_classif (the default) and SelectPercentile to select 50% of features:
  select = SelectPercentile(percentile=50)
  select.fit(X_train, y_train)

  # transform training set:
  X_train_selected = select.transform(X_train)
  print(X_train.shape)
  print(X_train_selected.shape)
Output: 
(284, 80)
(284, 40)

As you can see, the number of features was reduced from 80 to 40 (50 percent of the original number of features). We can find out which features have been selected using the get_support method, which returns a boolean mask of the selected features: 
  import matplotlib.pyplot as plt
  mask = select.get_support()
  print(mask)
  # visualize the mask. black is True, white is False
  plt.matshow(mask.reshape(1, -1), cmap='gray_r')
  plt.show()

As you can see from the visualization of the mask above, most of the selected features are the original features, and most of the noise features were removed. However, the recovery of the original features is not perfect. Let’s compare the performance of logistic regression on all features against the performance using only the selected features: 
  from sklearn.linear_model import LogisticRegression

  # transform test data:
  X_test_selected = select.transform(X_test)
  X_otrain, X_otest, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0, test_size=.5)

  lr = LogisticRegression()
  lr.fit(X_otrain, y_train)
  print("Score with original features: %f" % lr.score(X_otest, y_test))
  lr.fit(X_train, y_train)
  print("Score with all features: %f" % lr.score(X_test, y_test))
  lr.fit(X_train_selected, y_train)
  print("Score with only selected features: %f" % lr.score(X_test_selected, y_test))
Output: 
Score with original features: 0.954386
Score with all features: 0.929825
Score with only selected features: 0.940351

In this case, removing the noise features improved performance, even though some of the original features were lost. This was a very simple synthetic example, though, and outcomes on real data are usually mixed. Univariate feature selection can still be very helpful if there is such a large number of features that building a model on them is infeasible, or if you suspect that many features are completely uninformative.

Model-based Feature Selection 
Model based feature selection uses a supervised machine learning model to judge the importance of each feature, and keeps only the most important ones. The supervised model that is used for feature selection doesn’t need to be the same model that is used for the final supervised modeling. 

The model that is used for feature selection needs to provide some measure of importance for each feature, so that they can be ranked by this measure. Decision trees and decision tree-based models provide feature importances, which can be used; linear models have coefficients, which can be used by considering their absolute values. As we saw in Chapter 2, linear models with an L1 penalty (Lasso) learn sparse coefficients, which only use a small subset of features. This can be viewed as a form of feature selection for the model itself, but it can also be used as a preprocessing step to select features for another model.
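As a hedged aside (this L1 variant is not used in the example below), feature selection with an L1-penalized model could look roughly like this; the C and threshold values are arbitrary illustration choices:

  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import SelectFromModel
  from sklearn.linear_model import LogisticRegression

  cancer = load_breast_cancer()
  # an L1 penalty drives many coefficients to exactly zero
  l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
  # keep only the features with a non-negligible coefficient
  select_l1 = SelectFromModel(l1_model, threshold=1e-5)
  X_l1 = select_l1.fit_transform(cancer.data, cancer.target)
  print(cancer.data.shape, X_l1.shape)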

In contrast to univariate selection, model-based selection considers all features at once, and so can capture interactions (if the model can capture them). To use model based feature selection, we need to use the SelectFromModel transformer: 
  import numpy as np
  from sklearn.feature_selection import SelectFromModel
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split
  from sklearn.datasets import load_breast_cancer
The SelectFromModel class selects all features whose importance measure (as provided by the supervised model) is greater than the provided threshold. To get a result comparable to what we got with univariate feature selection, we use the median as the threshold, so that half of the features will be selected. We use a random forest classifier with 100 trees to compute the feature importances. This is a quite complex model and much more powerful than using univariate tests. Now let’s actually fit the model:
  select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42), threshold="median")
  cancer = load_breast_cancer()
  # get deterministic random numbers
  rng = np.random.RandomState(42)
  noise = rng.normal(size=(len(cancer.data), 50))

  # add noise features to the data
  # the first 30 features are from the dataset, the next 50 are noise
  X_w_noise = np.hstack([cancer.data, noise])

  X_train, X_test, y_train, y_test = train_test_split(X_w_noise, cancer.target, random_state=0, test_size=.5)

  # fit the SelectFromModel transformer and keep half of the features
  select.fit(X_train, y_train)
  X_train_l1 = select.transform(X_train)
  print(X_train.shape)
  print(X_train_l1.shape)
Output: 
(284, 80)
(284, 40)

Again, we can have a look at the features that were selected: 
  import matplotlib.pyplot as plt
  mask = select.get_support()
  print(mask)
  # visualize the mask. black is True, white is False
  plt.matshow(mask.reshape(1, -1), cmap='gray_r')
  plt.show()

Now let's check the scores: 
  from sklearn.linear_model import LogisticRegression

  # transform test data:
  X_test_l1 = select.transform(X_test)
  X_otrain, X_otest, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0, test_size=.5)

  lr = LogisticRegression()
  lr.fit(X_otrain, y_train)
  print("Score with original features: %f" % lr.score(X_otest, y_test))
  lr.fit(X_train, y_train)
  print("Score with all features: %f" % lr.score(X_test, y_test))
  lr.fit(X_train_l1, y_train)
  print("Score with only selected features: %f" % lr.score(X_test_l1, y_test))
Output: 
Score with original features: 0.954386
Score with all features: 0.929825
Score with only selected features: 0.950877

With the better feature selection, we also gained some improvements in performance. 

Iterative feature selection 
In univariate testing we used no model, while in model-based selection we used a single model to select features. In iterative feature selection, a series of models is built, with varying numbers of features. There are two basic methods: starting with no features and adding features one by one until some stopping criterion is reached, or starting with all features and removing features one by one until some stopping criterion is reached. Because a series of models is built, these methods are much more computationally expensive than the methods we discussed above. One particular method of this kind is recursive feature elimination (RFE), which starts with all features, builds a model, and discards the least important feature according to the model. Then a new model is built, using all but the discarded feature, and so on, until only a pre-specified number of features is left. For this to work, the model used for selection needs to provide some way to determine feature importance, as was the case for model-based selection.
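The forward variant (starting with no features and adding them one by one) is not demonstrated in this chapter; as a hedged sketch, newer scikit-learn versions (0.24 and later) provide SequentialFeatureSelector for this kind of greedy search, where n_features_to_select=10 below is an arbitrary choice:

  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import SequentialFeatureSelector
  from sklearn.linear_model import LogisticRegression

  cancer = load_breast_cancer()
  # greedily add the single feature that improves the cross-validated score most
  sfs = SequentialFeatureSelector(LogisticRegression(max_iter=5000),
                                  n_features_to_select=10,
                                  direction="forward", cv=5)
  sfs.fit(cancer.data, cancer.target)
  print(sfs.get_support())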

We use the same random forest model (RandomForestClassifier) that we used above:
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.model_selection import train_test_split
  from sklearn.datasets import load_breast_cancer
  from sklearn.feature_selection import RFE

  select = RFE(RandomForestClassifier(n_estimators=100, random_state=42), n_features_to_select=40)
  cancer = load_breast_cancer()
  # get deterministic random numbers
  rng = np.random.RandomState(42)
  noise = rng.normal(size=(len(cancer.data), 50))

  # add noise features to the data
  # the first 30 features are from the dataset, the next 50 are noise
  X_w_noise = np.hstack([cancer.data, noise])

  X_train, X_test, y_train, y_test = train_test_split(X_w_noise, cancer.target, random_state=0, test_size=.5)

  # run recursive feature elimination down to 40 features
  select.fit(X_train, y_train)
  X_train_rfe = select.transform(X_train)
  print(X_train.shape)
  print(X_train_rfe.shape)


  import matplotlib.pyplot as plt
  mask = select.get_support()
  print(mask)
  # visualize the mask. black is True, white is False
  plt.matshow(mask.reshape(1, -1), cmap='gray_r')
  plt.show()

The feature selection got better compared to the univariate and model-based selection, but a few of the original features were still missed. Running the above code also takes significantly longer than the model-based selection, because a random forest model is trained 40 times, once for each feature that is dropped.
  from sklearn.linear_model import LogisticRegression

  # transform test data:
  X_test_rfe = select.transform(X_test)
  X_otrain, X_otest, y_train, y_test = train_test_split(cancer.data, cancer.target, random_state=0, test_size=.5)

  lr = LogisticRegression()
  lr.fit(X_otrain, y_train)
  print("Score with original features: %f" % lr.score(X_otest, y_test))
  lr.fit(X_train, y_train)
  print("Score with all features: %f" % lr.score(X_test, y_test))
  lr.fit(X_train_rfe, y_train)
  print("Score with only selected features: %f" % lr.score(X_test_rfe, y_test))
Output: 
Score with original features: 0.954386
Score with all features: 0.929825
Score with only selected features: 0.950877
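One more hedged note, continuing the snippet above: the RFE object also keeps the estimator that was fit on the final selected feature subset, so it can be used for prediction directly. Something along these lines should report the score of that internal random forest:

  # score of the random forest used inside RFE, trained only on the selected features
  print("RFE test score: %f" % select.score(X_test, y_test))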

If you are unsure which features to use as input to your machine learning algorithms, automatic feature selection can be quite helpful. It is also great for reducing the number of features needed, for example to speed up prediction or to allow for more interpretable models. In most real-world cases, applying feature selection is unlikely to provide large gains in performance. However, it is still a valuable tool in the toolbox of the feature engineer.

Utilizing Expert Knowledge 
Feature engineering is often an important place to use expert knowledge for a particular application. While the purpose of machine learning often is to avoid having to create a set of expert-designed rules, that doesn’t mean that prior knowledge of the application or domain should be discarded. Often, domain experts can help in identifying useful features that are much more informative than the initial representation of the data. 

Imagine you are a travel agency and want to predict flight prices. Let’s say we have a record of prices together with date, airline, start location, and destination. A machine learning model might be able to build a decent model from that. Some important factors in flight prices, however, cannot be learned. For example, flights are usually more expensive during school holidays or around public holidays. While some holidays can potentially be learned from the dates, like Christmas, others might depend on the phases of the moon (like Hanukkah and Easter) or be set by authorities (like school holidays). These events cannot be learned from the data if each flight is only recorded using the (Gregorian) date. It is easy to add a feature that encodes whether a flight was on, preceding, or following a public or school holiday. In this way, prior knowledge about the nature of the task can be encoded in the features to aid a machine learning algorithm. Adding a feature does not force a machine learning algorithm to use it, and even if the holiday information turns out to be non-informative for flight prices, augmenting the data with this information doesn’t hurt.
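As a rough, purely hypothetical sketch of what such a feature could look like (the flight records, dates, and column names below are invented for illustration), pandas makes it easy to flag departures that fall on or next to a holiday:

  import pandas as pd

  # hypothetical flight records and a hypothetical list of public holidays
  flights = pd.DataFrame({"date": pd.to_datetime(["2016-12-23", "2016-12-24", "2017-01-10"]),
                          "price": [420, 510, 199]})
  holidays = pd.to_datetime(["2016-12-25", "2017-01-01"])

  # 1 if the flight departs on a holiday or on the day before/after one, else 0
  flights["near_holiday"] = flights["date"].apply(
      lambda d: int(any(abs((d - h).days) <= 1 for h in holidays)))
  print(flights)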

We’ll now go through one particular case of using expert knowledge - though in this case it might more rightfully be called “common sense”. The task is predicting Citi Bike rentals in front of Andreas’ house. In New York, there is a network of bicycle rental stations with a subscription system. The stations are all over the city and provide a convenient way to get around. Bike rental data is made public in an anonymized form and has been analyzed in various ways.

The task we want to solve is to predict, for a given time and day, how many people will rent a bike in front of Andreas’ house - so he knows if any bikes will be left for him. We first load the data for August 2015 of this particular station as a pandas Series. The data is resampled into 3-hour intervals to obtain the main trends for each day.
>>> import mglearn.datasets
>>> citibike = mglearn.datasets.load_citibike()
>>> citibike.__class__
<class 'pandas.core.series.Series'>
>>> citibike.head() // Show top 5 records with period as 3 hours
starttime
2015-08-01 00:00:00 3.0
2015-08-01 03:00:00 0.0
2015-08-01 06:00:00 9.0
2015-08-01 09:00:00 41.0
2015-08-01 12:00:00 39.0
Freq: 3H, Name: one, dtype: float64
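For reference, load_citibike already returns the data in this resampled form; a hedged sketch of how such a 3-hour resampling could be produced from hypothetical raw trip records looks like this (the timestamps below are made up):

  import pandas as pd

  # hypothetical raw trip records, one row per rental
  trips = pd.DataFrame({"starttime": pd.to_datetime(
      ["2015-08-01 00:12", "2015-08-01 00:47", "2015-08-01 06:05"])})
  trips["one"] = 1
  # count rentals in 3-hour bins; empty bins become 0
  rentals = trips.set_index("starttime")["one"].resample("3h").sum()
  print(rentals)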

Below is a visualization of the rental frequencies for the whole month: 
- ch5_t14.py 
  import numpy as np
  import pandas as pd
  import mglearn.datasets

  citibike = mglearn.datasets.load_citibike()

  import matplotlib.pyplot as plt
  plt.figure(figsize=(10, 3))
  xticks = pd.date_range(start=citibike.index.min(), end=citibike.index.max(), freq='D')
  plt.xticks(xticks, xticks.strftime("%a %m-%d"), rotation=90, ha="left")
  plt.plot(citibike, linewidth=2)
  plt.show()
 Figure 5-12. Number of bike rentals over time for a selected Citi Bike station 

Looking at the data, we can clearly distinguish day and night for each 24-hour interval. The patterns for weekdays and weekends also seem to be quite different. When evaluating a prediction task on a time series like this, we usually want to learn from the past and predict for the future. This means when doing a split into a training and a test set, we want to use all the data up to a certain date as the training set and all the data past that date as the test set. This is how we would usually use time series prediction: given everything that we know about rentals in the past, what do we think will happen tomorrow? We will use the first 184 data points, corresponding to the first 23 days, as our training set, and the remaining 64 data points, corresponding to the remaining 8 days, as our test set.

The only feature that we are using in our prediction task is the date and time when a particular number of rentals occurred. So, the input feature is the date and time - say, 2015-08-01 00:00:00 - and the output is the number of rentals in the following three hours (three in this case, according to our DataFrame). A (surprisingly) common way that dates are stored on computers is using POSIX time, which is the number of seconds since January 1st, 1970 00:00:00 (aka the beginning of Unix time). As a first try, we can use this single integer feature as our data representation:
>>> from ch5_t14 import *
>>> citibike.values // number of rentals
array([ 3., 0., 9., 41., 39., 27., 12., 4., 3., 4., 6., ...])
>>> citibike.index // Date/time
DatetimeIndex(['2015-08-01 00:00:00', '2015-08-01 03:00:00', ...])
>>> np.array(citibike.index.astype("int64")).reshape(-1,1).shape
(248, 1)
>>> np.array(citibike.index.astype("int64")).shape
(248,)

  # extract the target values (number of rentals)
  y = citibike.values
  # convert to POSIX time by dividing by 10**9
  X = np.array(citibike.index.astype("int64")).reshape(-1, 1) // 10**9
We first define a function to split the data into training and test sets,build the model, and visualize the result: 
  # use the first 184 data points for training, and the rest for testing
  n_train = 184

  # function to evaluate and plot a regressor on a given feature set
  def eval_on_features(features, target, regressor):
      # split the given features into a training and a test set
      X_train, X_test = features[:n_train], features[n_train:]
      # also split the target array
      y_train, y_test = target[:n_train], target[n_train:]
      regressor.fit(X_train, y_train)
      print("Test-set R^2: {:.2f}".format(regressor.score(X_test, y_test)))
      y_pred = regressor.predict(X_test)
      y_pred_train = regressor.predict(X_train)
      import matplotlib.pyplot as plt
      plt.figure(figsize=(10, 3))

      # one tick per day of the month
      xticks = pd.date_range(start=citibike.index.min(), end=citibike.index.max(), freq='D')
      plt.xticks(range(0, len(X), 8), xticks.strftime("%a %m-%d"), rotation=90,
                 ha="left")

      plt.plot(range(n_train), y_train, label="train")
      plt.plot(range(n_train, len(y_test) + n_train), y_test, '-', label="test")
      plt.plot(range(n_train), y_pred_train, '--', label="prediction train")

      plt.plot(range(n_train, len(y_test) + n_train), y_pred, '--',
               label="prediction test")
      plt.legend(loc=(1.01, 0))
      plt.xlabel("Date")
      plt.ylabel("Rentals")
We saw earlier that random forests require very little preprocessing of the data, which makes this seem like a good model to start with. We use the POSIX time feature X and pass a random forest regressor to our eval_on_features function. Figure 5-13 shows the result:
- ch5-15.py 
  import numpy as np
  import pandas as pd
  import mglearn.datasets

  citibike = mglearn.datasets.load_citibike()

  # extract the target values (number of rentals)
  y = citibike.values
  # convert to POSIX time by dividing by 10**9
  X = np.array(citibike.index.astype("int64")).reshape(-1, 1) // 10**9

  # use the first 184 data points for training, and the rest for testing
  n_train = 184

  # function to evaluate and plot a regressor on a given feature set
  def eval_on_features(features, target, regressor):
      # split the given features into a training and a test set
      X_train, X_test = features[:n_train], features[n_train:]
      # also split the target array
      y_train, y_test = target[:n_train], target[n_train:]
      regressor.fit(X_train, y_train)
      print("Test-set R^2: {:.2f}".format(regressor.score(X_test, y_test)))
      y_pred = regressor.predict(X_test)
      y_pred_train = regressor.predict(X_train)
      import matplotlib.pyplot as plt
      plt.figure(figsize=(10, 3))

      xticks = pd.date_range(start=citibike.index.min(), end=citibike.index.max(), freq='D')
      plt.xticks(range(0, len(X), 8), xticks.strftime("%a %m-%d"), rotation=90, ha="left")

      plt.plot(range(n_train), y_train, label="train")
      plt.plot(range(n_train, len(y_test) + n_train), y_test, '-', label="test")
      plt.plot(range(n_train), y_pred_train, '--', label="prediction train")

      plt.plot(range(n_train, len(y_test) + n_train), y_pred, '--',
               label="prediction test")
      plt.legend(loc=(1.01, 0))
      plt.xlabel("Date")
      plt.ylabel("Rentals")
      plt.show()

  from sklearn.ensemble import RandomForestRegressor
  regressor = RandomForestRegressor(n_estimators=100, random_state=0)
  eval_on_features(X, y, regressor)
Figure 5-13. Predictions made by a random forest using only the POSIX time 

The predictions on the training set are quite good, as is usual for random forests. However, for the test set, a constant line is predicted. The R^2 is -0.03, which means that we learned nothing. What happened? The problem lies in the combination of our feature and the random forest. The value of the POSIX time feature for the test set is outside of the range of the feature values in the training set: the points in the test set have timestamps that are later than all the points in the training set. Trees, and therefore random forests, cannot extrapolate to feature ranges outside the training set. The result is that the model simply predicts the target value of the closest point in the training set - which is the last time it observed any data.
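To see the extrapolation problem in isolation (a small sketch, independent of the Citi Bike data), a tree-based regressor fit on inputs from a limited range simply repeats the prediction of the nearest training point for inputs outside that range:

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  # train on x in [0, 10] with target y = 2 * x
  X_in = np.linspace(0, 10, 50).reshape(-1, 1)
  y_in = 2 * X_in.ravel()
  tree = DecisionTreeRegressor().fit(X_in, y_in)

  # inside the training range the fit is fine; beyond it the prediction stays flat
  print(tree.predict([[5.0], [20.0], [100.0]]))  # roughly [10., 20., 20.]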


Clearly we can do better than this. This is where our “expert knowledge” comes in. From looking at the rental figures in the training data, two factors seem to be very important: the time of day and the day of the week. So, let’s add these two features. We can’t really learn anything from the POSIX time, so we drop that feature. First, let’s use only the hour of the day. As Figure 5-14 shows, now the predictions have the same pattern for each day of the week:
- ch5_t16.py 
  ...
  from sklearn.ensemble import RandomForestRegressor
  regressor = RandomForestRegressor(n_estimators=100, random_state=0)
  X_hour = np.array(citibike.index.hour).reshape(-1, 1)
  eval_on_features(X_hour, y, regressor)
Figure 5-14. Predictions made by a random forest using only the hour of the day 

The R^2 is already much better, but the predictions clearly miss the weekly pattern. Now let’s also add the day of the week (see Figure 5-15): 
>>> from ch5_t14 import *
>>> X_hour = np.array(citibike.index.hour).reshape(-1, 1)
>>> X_week = np.array(citibike.index.dayofweek).reshape(-1,1)
>>> X_hour_week = np.hstack([X_hour, X_week])
>>> X_hour_week.shape // Total 248 records; each record has two features.
(248, 2)
>>> X_hour_week[:10] // Show top 10 records. Each record with features [hour, dayofweek]
array([[ 0, 5],
[ 3, 5],
[ 6, 5],
[ 9, 5],
[12, 5],
[15, 5],
[18, 5],
[21, 5],
[ 0, 6],
[ 3, 6]], dtype=int32)

- ch5_t17.py 
  ...
  from sklearn.ensemble import RandomForestRegressor
  regressor = RandomForestRegressor(n_estimators=100, random_state=0)
  X_hour = np.array(citibike.index.hour).reshape(-1, 1)
  X_week = np.array(citibike.index.dayofweek).reshape(-1, 1)
  X_hour_week = np.hstack([X_hour, X_week])
  eval_on_features(X_hour_week, y, regressor)


Figure 5-15. Predictions with a random forest using day of week and hour of day as features 

Now we have a model that captures the periodic behavior by considering the day of week and time of day. It has an R^2 of 0.84, and shows pretty good predictive performance. What this model is likely learning is the mean number of rentals for each combination of weekday and time of day from the first 23 days of August. This actually does not require a complex model like a random forest, so let’s try a simpler model, LinearRegression (see Figure 5-16):
- ch5_t18.py 
  ...
  X_hour = np.array(citibike.index.hour).reshape(-1, 1)
  X_week = np.array(citibike.index.dayofweek).reshape(-1, 1)
  X_hour_week = np.hstack([X_hour, X_week])
  from sklearn.linear_model import LinearRegression
  eval_on_features(X_hour_week, y, LinearRegression())
Figure 5-16. Predictions made by linear regression using day of week and hour of day as features 

LinearRegression works much worse, and the periodic pattern looks odd. The reason for this is that we encoded day of week and time of day using integers, which are interpreted as continuous variables. Therefore, the linear model can only learn a linear function of the time of day - and it learned that later in the day, there are more rentals. However, the patterns are much more complex than that. We can capture this by interpreting the integers as categorical variables, by transforming them using OneHotEncoder (see Figure 5-17):
- ch5_t19.py 
  ...
  from sklearn.preprocessing import OneHotEncoder
  X_hour = np.array(citibike.index.hour).reshape(-1, 1)
  X_week = np.array(citibike.index.dayofweek).reshape(-1, 1)
  X_hour_week = np.hstack([X_hour, X_week])
  enc = OneHotEncoder()
  X_hour_week_onehot = enc.fit_transform(X_hour_week).toarray()
  from sklearn.linear_model import LinearRegression
  eval_on_features(X_hour_week_onehot, y, LinearRegression())
Figure 5-17. Predictions made by linear regression using a one-hot encoding of hour of day and day of week 

This gives us a much better match than the continuous feature encoding. Now the linear model learns one coefficient for each day of the week, and one coefficient for each time of the day. That means that the “time of day” pattern is shared over all days of the week, though. Using interaction features, we can allow the model to learn one coefficient for each combination of day and time of day (see Figure 5-18):
- ch5_t20.py 
  ...
  from sklearn.preprocessing import OneHotEncoder
  from sklearn.preprocessing import PolynomialFeatures
  X_hour = np.array(citibike.index.hour).reshape(-1, 1)
  X_week = np.array(citibike.index.dayofweek).reshape(-1, 1)
  X_hour_week = np.hstack([X_hour, X_week])
  enc = OneHotEncoder()
  poly_transformer = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
  X_hour_week_onehot = enc.fit_transform(X_hour_week).toarray()
  X_hour_week_onehot_poly = poly_transformer.fit_transform(X_hour_week_onehot)

  from sklearn.linear_model import Ridge
  lr = Ridge()
  eval_on_features(X_hour_week_onehot_poly, y, lr)
Figure 5-18. Predictions made by linear regression using a product of the day of week and hour of day features 

This transformation finally yields a model that performs similarly well to the random forest. A big benefit of this model is that it is very clear what is learned: one coefficient for each day and time. We can simply plot the coefficients learned by the model, something that would not be possible for the random forest. First, we create feature names for the hour and day features:
  hour = ["%02d:00" % i for i in range(0, 24, 3)]
  day = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
  features = day + hour
Then we name all the interaction features extracted by PolynomialFeatures, using the get_feature_names method, and keep only the features with nonzero coefficients: 
  features_poly = poly_transformer.get_feature_names(features)
  features_nonzero = np.array(features_poly)[lr.coef_ != 0]
  coef_nonzero = lr.coef_[lr.coef_ != 0]
Now we can visualize the coefficients learned by the linear model, as seen in Figure 5-19: 
  import matplotlib.pyplot as plt
  plt.figure(figsize=(15, 2))
  plt.plot(coef_nonzero, 'o')
  plt.xticks(np.arange(len(coef_nonzero)), features_nonzero, rotation=90)
  plt.xlabel("Feature name")
  plt.ylabel("Feature magnitude")
  plt.show()



Figure 5-19. Coefficients of the linear regression model using a product of hour and day
