So far, we have assumed that our data comes in as a two-dimensional array of floating-point numbers, where each column is a continuous feature that describes the data points. For many applications, this is not how the data is collected. A particularly common type of feature is the categorical feature, also known as a discrete feature, which is usually not numeric. The distinction between categorical features and continuous features is analogous to the distinction between classification and regression, only on the input side rather than the output side.
Examples of continuous features that we have seen are pixel brightnesses and size measurements of plant flowers. Examples of categorical features are the brand of a product, the color of a product, or the department (books, clothing, hardware) it is sold in. These are all properties that can describe a product, but they don't vary in a continuous way. A product belongs either in the clothing department or in the books department. There is no middle ground between books and clothing, and no natural order for the different categories (books is not greater or smaller than clothing, hardware is not between books and clothing, and so on).
Regardless of the type of features your data consists of, how you represent them can have an enormous effect on the performance of machine learning models. We saw in Chapter 2 and Chapter 3 that scaling of the data is important. In other words, if you don’t rescale your data (say, to unit variance), then it makes a difference whether you represent a measurement in centimeters or inches. We also saw in Chapter 2 that it can be helpful to augment your data with additional features, like adding interactions (products) of features or more general polynomials.
The question of how to represent your data best for a particular application is known as feature engineering, and it is one of the main tasks of data scientists and machine learning practitioners trying to solve real-world problems. Representing your data in the right way can have a bigger influence on the performance of a supervised model than the exact parameters you choose.
We will first go over the important and very common case of categorical features, and then give some examples of helpful transformations for specific combinations of features and models.
Categorical Variables
As an example, we will use the dataset of adult incomes in the United States, derived from the 1994 census database (Data Source). The task of the adult dataset is to predict whether a worker has an income of over $50,000 or under $50,000. The features in this dataset include the worker's age, how they are employed (self-employed, private industry employee, government employee, ...), their education, their gender, their working hours per week, their occupation, and more. The first few entries in the dataset are printed by the loading code below.
The task is phrased as a classification task with the two classes being income <=50K and >50K. It would also be possible to predict the exact income and make this a regression task. However, that would be much more difficult, and the 50K division is interesting to understand on its own. In this dataset, age and hours-per-week are continuous features, which we know how to treat. The workclass, education, gender, and occupation features are categorical, however. All of them come from a fixed list of possible values, as opposed to a range, and denote a qualitative property, as opposed to a quantity.
One-Hot-Encoding (Dummy variables)
By far the most common way to represent categorical variables is using the one-hot encoding or one-out-of-N encoding, also known as dummy variables. The idea behind dummy variables is to replace a categorical variable with one or more new features that can have the values 0 and 1. The values 0 and 1 make sense in the formula for linear binary classification (and for all other models in scikit-learn), and we can represent any number of categories by introducing one new feature per category, as follows.
Let's say for the workclass feature we have the possible values "Government Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated". To encode these four possible values, we create four new features, called "Government Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated". A feature is 1 if workclass for this person has the corresponding value, and 0 otherwise. So exactly one of the four new features will be 1 for each data point. This is why this is called one-hot or one-out-of-N encoding.
The principle is illustrated here. A single feature is encoded using four new features. When using this data in a machine learning algorithm, we would drop the original workclass feature and only keep the 0-1 features:
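As a minimal sketch of the principle (using a hypothetical toy column, not the actual adult dataset), pandas can produce exactly this kind of encoding:
- import pandas as pd
- # A toy column containing the four workclass values (hypothetical example data)
- workclass = pd.Series(["Government Employee", "Private Employee",
-                        "Self Employed", "Self Employed Incorporated"])
- # Each distinct value becomes its own 0-1 column; exactly one column is 1 per row
- print(pd.get_dummies(workclass))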
There are two ways to convert your data to a one-hot encoding of categorical variables: using pandas or using scikit-learn. At the time of writing, using pandas for this is slightly easier, so let's go that route. First we load the data using pandas from a comma-separated values (CSV) file:
- ch5_t10.py
- #!/usr/bin/env python
- import pandas as pd
- # The file has no header row naming the columns, so we pass header=None
- # and provide the column names explicitly
- data = pd.read_csv("adult.data", header=None, index_col=False,
-     names=['age', 'workclass', 'fnlwgt', 'education', 'education-num',
-            'marital-status', 'occupation', 'relationship', 'race', 'gender',
-            'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income'])
- # For illustration purposes, we only select some of the columns
- data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']]
- # Print the first five rows
- print(data.head())
Checking string-encoded categorical data
After reading a dataset like this, it is often good to first check if a column actually contains meaningful categorical data. When working with data that was input by humans (say users on a website), there might not be a fixed set of categories, and differences in spelling and capitalization might require preprocessing. For example, it might be that some people specified gender as “male” and some as “man”, and we might want to represent these two inputs using the same category.
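For instance, a minimal sketch of such a cleanup (the free-text responses below are assumed for illustration and are not part of the adult dataset) could normalize case and merge spelling variants before encoding:
- import pandas as pd
- # Hypothetical free-text answers with inconsistent spelling and capitalization
- responses = pd.Series(['male', 'Man', 'female', 'male', 'MALE'])
- # Lowercase everything and map the variant "man" onto "male"
- cleaned = responses.str.lower().replace({'man': 'male'})
- print(cleaned.value_counts())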
A good way to check the contents of a column is using the value_counts method of a pandas Series (the type of a single column in a DataFrame), to show us what the unique values are and how often they appear:
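Applied to the gender column of the DataFrame we loaded above:
- # Count how often each unique value occurs in the gender column
- print(data.gender.value_counts())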
We can see that there are exactly two values for gender in this dataset, Male and Female, meaning the data is already in a good format to be represented using one-hot encoding. In a real application, you should look at all columns and check their values. We will skip this here for brevity's sake. There is a very simple way to encode the data in pandas, using the get_dummies function:
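A minimal sketch, assuming the DataFrame data from the earlier listing:
- print("Original features:\n{}\n".format(list(data.columns)))
- # get_dummies automatically one-hot-encodes all columns that contain strings
- data_dummies = pd.get_dummies(data)
- print("Features after get_dummies:\n{}".format(list(data_dummies.columns)))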
You can see that the continuous features age and hours-per-week were not touched, while the categorical features were expanded into one new feature for each possible value:
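For example, the first few rows of the expanded DataFrame can be inspected like this:
- # Inspect the first rows of the one-hot-encoded data
- print(data_dummies.head())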
We can now use the values attribute to convert the data_dummies DataFrame into a NumPy array, and then train a machine learning model on it. Be careful to separate the target variable (which is now encoded in two income columns) from the data before training a model. Including the output variable, or some derived property of the output variable, in the feature representation is a very common mistake when building supervised machine learning models.
WARNING
In this case, we extract only the columns containing features - that is, all columns from age to occupation_Transport-moving. This range contains all the features but not the target:
- features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
- # Extract NumPy arrays
- X = features.values
- y = data_dummies['income_ >50K'].values
- print("X.shape: {}; y.shape: {}".format(X.shape, y.shape))
- from sklearn.linear_model import LogisticRegression
- from sklearn.model_selection import train_test_split
- # Split the data and fit a logistic regression model on the one-hot-encoded features
- X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
- logreg = LogisticRegression()
- logreg.fit(X_train, y_train)
- print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))
Numbers Can Encode Categories
In the example of the adult dataset, the categorical variables were encoded as strings. On the one hand, that opens up the possibility of spelling errors, but on the other hand, it clearly marks a variable as categorical. Often, whether for ease of storage or because of the way the data is collected, categorical variables are encoded as integers. For example, imagine the census data in the adult dataset was collected using a questionnaire, and the answers for workclass were recorded as 0 (first box ticked), 1 (second box ticked), 2 (third box ticked), and so on. Now the column contains numbers from 0 to 8, instead of strings like "Private", and it won't be immediately obvious to someone looking at the table representing the dataset whether they should treat this variable as continuous or categorical. Knowing that the numbers indicate employment status, however, it is clear that these are very distinct states and should not be modeled by a single continuous variable.
WARNING
The get_dummies function in pandas treats all numbers as continuous and will not create dummy variables for them. To get around this, you can either use scikit-learn's OneHotEncoder, for which you can specify which variables are continuous and which are discrete, or convert numeric columns in the DataFrame to strings. To illustrate, let's create a DataFrame object with two columns, one containing strings and one containing integers:
- ch5_t02.py
- #!/usr/bin/env python
- import pandas as pd
- # Create a DataFrame with an integer feature and a categorical string feature
- demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
-                         'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
- print(demo_df)
Using get_dummies will only encode the string feature and will not change the integer feature, as you can see below:
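A short sketch of this, continuing with demo_df from above:
- # Only the string column is expanded; the integer column is left untouched
- print(pd.get_dummies(demo_df))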
If you want dummy variables to be created for the "Integer Feature" column, you can explicitly list the columns you want to encode using the columns parameter. Then, both features will be treated as categorical:
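A minimal sketch, again using demo_df (converting the integer column to strings first, as described above):
- # Convert the integer column to strings so its values are treated as categories
- demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
- # Explicitly list the columns to encode; now both are one-hot-encoded
- print(pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature']))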
Binning, Discretization, Linear Models, and Trees
The best way to represent data depends not only on the semantics of the data, but also on the kind of model you are using. Linear models and tree-based models (such as decision trees, gradient boosted trees, and random forests), two large and very commonly used families, have very different properties when it comes to how they work with different feature representations. Let’s go back to the wave regression dataset that we used in Chapter 2. It has only a single input feature. Here is a comparison of a linear regression model and a decision tree regressor on this dataset (see Figure 4-1):
- ch5_t03.py
- #!/usr/bin/env python
- import numpy as np
- import matplotlib.pyplot as plt
- import mglearn
- from sklearn.linear_model import LinearRegression
- from sklearn.tree import DecisionTreeRegressor
- X, y = mglearn.datasets.make_wave(n_samples=100)
- # 1,000 evenly spaced points on which to visualize the predictions
- line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
- reg = DecisionTreeRegressor(min_samples_split=3).fit(X, y)
- plt.plot(line, reg.predict(line), label="decision tree")
- reg = LinearRegression().fit(X, y)
- plt.plot(line, reg.predict(line), label="linear regression")
- plt.plot(X[:, 0], y, 'o', c='k')
- plt.ylabel("Regression output")
- plt.xlabel("Input feature")
- plt.legend(loc="best")
- plt.show()
As you know, linear models can only model linear relationships, which are lines in the case of a single feature. The decision tree can build a much more complex model of the data. However, this is strongly dependent on the representation of the data. One way to make linear models more powerful on continuous data is to use binning (also known as discretization) of the feature to split it up into multiple features, as described here.
We imagine a partition of the input range for the feature (in this case, the numbers from -3 to 3) into a fixed number of bins, say 10. A data point will then be represented by which bin it falls into. To determine this, we first have to define the bins. In this case, we'll define 10 bins equally spaced between -3 and 3. We use the np.linspace function for this, creating 11 entries, which will create 10 bins; they are the spaces between two consecutive boundaries:
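For instance, the bin edges can be computed like this (the same call also appears in the listings below):
- import numpy as np
- # 11 equally spaced boundary values define 10 bins between -3 and 3
- bins = np.linspace(-3, 3, 11)
- print("bins: {}".format(bins))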
Here, the first bin contains all data points with feature values from -3 to -2.4, the second bin contains all points with feature values from -2.4 to -1.8, and so on. Next, we record for each data point which bin it falls into. This can easily be computed using the np.digitize function, as shown below:
- bins = np.linspace(-3, 3, 11)
- # For each data point, np.digitize returns the index of the bin it falls into
- which_bin = np.digitize(X, bins=bins)
- print("\nData points:\n{}".format(X[:5]))
- print("\nBin membership for data points:\n{}".format(which_bin[:5]))
What we did here is transform the single continuous input feature in the wave dataset into a categorical feature that encodes which bin a data point is in. To use a scikit-learn model on this data, we transform this discrete feature to a one-hot encoding using the OneHotEncoder from the preprocessing module. The OneHotEncoder does the same encoding as pandas.get_dummies, though it currently only works on categorical variables that are integers:
- from sklearn.preprocessing import OneHotEncoder
- # Transform using the OneHotEncoder
- encoder = OneHotEncoder(sparse=False)
- # encoder.fit finds the unique values that appear in which_bin
- encoder.fit(which_bin)
- # transform creates the one-hot encoding
- X_binned = encoder.transform(which_bin)
- print("Top 5 rows in X_binned:\n{}".format(X_binned[:5]))
Because we specified 10 bins, the transformed dataset X_binned now is made up of 10 features:
- print "X_binned.shape = %s\n" % str(X_binned.shape)
Now we build a new linear regression model and a new decision tree model on the one-hot-encoded data. The result is visualized in Figure 4-2, together with the bin boundaries, shown as dotted black lines:
- ch5_t04.py
- #!/usr/bin/env python
- import numpy as np
- import matplotlib.pyplot as plt
- import mglearn
- from sklearn.linear_model import LinearRegression
- from sklearn.tree import DecisionTreeRegressor
- from sklearn.preprocessing import OneHotEncoder
- X, y = mglearn.datasets.make_wave(n_samples=100)
- line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
- # Define the bin edges and record which bin each data point falls into
- bins = np.linspace(-3, 3, 11)
- print("bins: {}".format(bins))
- which_bin = np.digitize(X, bins=bins)
- print("\nData points:\n{}".format(X[:5]))
- print("\nBin membership for data points:\n{}".format(which_bin[:5]))
- # One-hot-encode the bin membership
- encoder = OneHotEncoder(sparse=False)
- # encoder.fit finds the unique values that appear in which_bin
- encoder.fit(which_bin)
- # transform creates the one-hot encoding
- X_binned = encoder.transform(which_bin)
- print("Top 5 rows in X_binned:\n{}".format(X_binned[:5]))
- print("X_binned.shape: {}".format(X_binned.shape))
- # Apply the same binning and encoding to the visualization line
- line_binned = encoder.transform(np.digitize(line, bins=bins))
- reg = LinearRegression().fit(X_binned, y)
- plt.plot(line, reg.predict(line_binned), label='linear regression binned')
- reg = DecisionTreeRegressor(min_samples_split=3).fit(X_binned, y)
- plt.plot(line, reg.predict(line_binned), label='decision tree binned')
- plt.plot(X[:, 0], y, 'o', c='k')
- plt.vlines(bins, -3, 3, linewidth=1, alpha=.2)
- plt.legend(loc="best")
- plt.ylabel("Regression output")
- plt.xlabel("Input feature")
- plt.show()
The dashed line and solid line are exactly on top of each other, meaning the linear regression model and the decision tree make exactly the same predictions. For each bin, they predict a constant value. As the features are constant within each bin, any model must predict the same value for all points within a bin. Comparing what the models learned before binning the features and after, we see that the linear model became much more flexible, because it now has a different value for each bin, while the decision tree model got much less flexible. Binning features generally has no beneficial effect for tree-based models, as these models can learn to split up the data anywhere. In a sense, that means decision trees can learn whatever binning is most useful for predicting on this data. Additionally, decision trees look at multiple features at once, while binning is usually done on a per-feature basis. However, the linear model benefited greatly in expressiveness from the transformation of the data.
If there are good reasons to use a linear model for a particular dataset, say because it is very large and high dimensional, but some features have nonlinear relations with the output, binning can be a great way to increase modeling power.
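To make the gain for the linear model concrete, here is a minimal sketch (assuming X, y, and X_binned from the listings above; the exact scores will depend on the data) that compares the R^2 of a linear regression on the raw feature and on the binned features:
- from sklearn.linear_model import LinearRegression
- from sklearn.model_selection import train_test_split
- # Split the raw and the binned representations with the same random split
- X_train, X_test, Xb_train, Xb_test, y_train, y_test = train_test_split(
-     X, X_binned, y, random_state=0)
- # R^2 of a linear model on the single raw feature
- print("Raw feature: {:.2f}".format(
-     LinearRegression().fit(X_train, y_train).score(X_test, y_test)))
- # R^2 of a linear model on the 10 one-hot bin features
- print("Binned feature: {:.2f}".format(
-     LinearRegression().fit(Xb_train, y_train).score(Xb_test, y_test)))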