Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such tool but your data includes categorical features. To represent them as numbers typically one converts each categorical feature using “one-hot encoding”, that is from a value like “BMW” or “Mercedes” to a vector of zeros and one 1.
This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn. Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: not very intuitive, somewhat steep learning curve. For any questions you may have, Google + StackOverflow combo works well as a source of answers.
Pandas has get_dummies() function which does what we’re after. The following code will replace categorical columns with their one-hot representations:
We’ll use Pandas to load the data, do some cleaning and send it to Scikit-learn’s DictVectorizer. OneHotEncoder is another option. The difference is as follows:
1. OneHotEncoder takes as input categorical values encoded as integers - you can get them from LabelEncoder.
2. DictVectorizer expects data as a list of dictionaries, where each dictionary is a data row with column names as keys:
The representation above is redundant, because to encode three values you need two indicator columns. In general, one needs d - 1 columns for d values. This is not a big deal, but apparently some methods will complain about collinearity. The solution is to drop one of the columns. It won’t result in information loss, because in the redundant scheme with d columns one of the indicators must be non-zero, so if two out of three are zeros then the third must be 1. And if one among the two is positive than the third must be zero.
To convert some columns from a data frame to a list of dicts, we call df.to_dict( orient = 'records' ):
If you have a few categorical columns, you can list them as above. In the Analytics Edge competition, there are about 100 categorical columns, so in this case it’s easier to drop columns which are not categorical:
Using the vectorizer
If the data has missing values, they will become NaNs in the resulting Numpy arrays. Therefore it’s advisable to fill them in with Pandas first:
Handling binary features with missing values
If you have missing values in a binary feature, there’s an alternative representation:
It worked better in case of the Analytics Edge competition: an SVM trained on one-hot encoded data with d indicators scored 0.768 in terms of AUC, while the alternative representation yielded 0.778. That simple solution would give you 30th place out of 1686 contenders.