程式扎記: [ ML Article Collection ] Converting categorical data into numbers with Pandas and Scikit-learn

Thursday, March 29, 2018

Source From Here
Preface
Many machine learning tools accept only numbers as input. This can be a problem if you want to use such a tool but your data includes categorical features. To represent them as numbers, one typically converts each categorical feature using “one-hot encoding”, that is, from a value like “BMW” or “Mercedes” to a vector of zeros with a single 1.
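
To make the idea concrete, here is a minimal sketch in plain Python. The helper name one_hot and the list of brands are illustrative assumptions, not part of any library:

```python
def one_hot(value, categories):
    """Return a list of zeros with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical category list for a "brand" feature.
brands = ["Audi", "BMW", "Mercedes"]

bmw_vector = one_hot("BMW", brands)
mercedes_vector = one_hot("Mercedes", brands)

print(bmw_vector)       # [0, 1, 0]
print(mercedes_vector)  # [0, 0, 1]
```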

This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn. Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: it is not very intuitive and has a somewhat steep learning curve. For any questions you may have, the Google + StackOverflow combo works well as a source of answers.

Pandas has a get_dummies() function which does what we’re after. The following code replaces categorical columns with their one-hot representations:

>>> import pandas as pd
>>> df = pd.DataFrame(data={'name':['john', 'peter', 'ken'], 'age':[23,34,41]})
>>> cols_to_transform = ['name']  # List of column names to transform
>>> # Pass the list as `columns` (the second positional argument is `prefix`);
>>> # dtype=int keeps 0/1 output on pandas versions that default to bool
>>> df_with_dummies = pd.get_dummies(df, columns=cols_to_transform, dtype=int)
>>> df_with_dummies
   age  name_john  name_ken  name_peter
0   23          1         0           0
1   34          0         0           1
2   41          0         1           0

We’ll use Pandas to load the data, do some cleaning and send it to Scikit-learn’s DictVectorizer. OneHotEncoder is another option. The difference is as follows:
1. OneHotEncoder takes as input categorical values encoded as integers; you can get them from LabelEncoder. (Since scikit-learn 0.20 it also accepts string categories directly.)
2. DictVectorizer expects data as a list of dictionaries, where each dictionary is a data row with column names as keys:

[ { 'foo': 1, 'bar': 'z' },   
  { 'foo': 2, 'bar': 'a' },  
  { 'foo': 3, 'bar': 'c' } ]  

After vectorizing and saving as CSV it would look like this (DictVectorizer sorts feature names by default, so the indicator columns come before foo):

bar=a,bar=c,bar=z,foo
0,0,1,1
1,0,0,2
0,1,0,3

Notice the column names and that DictVectorizer doesn’t touch numeric values.
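
The difference between the two encoders can be sketched as follows. This is a minimal example, assuming scikit-learn 0.20 or later so that OneHotEncoder accepts strings directly; the variable names are illustrative. With default settings DictVectorizer sorts its feature names, so the column order does not follow the order of keys in the input dictionaries.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import OneHotEncoder

rows = [
    {"foo": 1, "bar": "z"},
    {"foo": 2, "bar": "a"},
    {"foo": 3, "bar": "c"},
]

# DictVectorizer: string values become "<key>=<value>" indicator columns,
# numeric values are passed through unchanged.
vec = DictVectorizer(sparse=False)
X_dict = vec.fit_transform(rows)
print(vec.feature_names_)  # feature names, sorted by default
print(X_dict)

# OneHotEncoder: works on a 2-D array holding one or more categorical columns.
bar_column = np.array([[row["bar"]] for row in rows])
enc = OneHotEncoder()
X_onehot = enc.fit_transform(bar_column).toarray()  # default output is sparse
print(enc.categories_)
print(X_onehot)
```

Note that OneHotEncoder only sees the bar column here; the numeric foo column would have to be joined back separately, whereas DictVectorizer handles both kinds of values in one pass.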

The representation above is redundant: to encode three values you need only two indicator columns. In general, one needs d - 1 columns for d values. This is not a big deal, but some methods will complain about collinearity. The solution is to drop one of the columns. It won’t result in information loss, because in the redundant scheme with d columns exactly one of the indicators must be non-zero, so if two out of three are zeros then the third must be 1. And if one of the two is 1, then the third must be zero.
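
One way to get the d - 1 scheme is the drop_first option of get_dummies(). The sketch below is not from the original post; it assumes pandas 0.23 or later (for the dtype argument) and shows that the dropped indicator can be reconstructed from the remaining ones.

```python
import pandas as pd

df = pd.DataFrame({"name": ["john", "peter", "ken"], "age": [23, 34, 41]})

# Full scheme: d = 3 indicator columns for the three names.
full = pd.get_dummies(df, columns=["name"], dtype=int)

# Reduced scheme: drop_first=True removes the first level ("john"),
# leaving d - 1 = 2 indicators.
reduced = pd.get_dummies(df, columns=["name"], drop_first=True, dtype=int)

# The dropped indicator is 1 exactly when all remaining indicators are 0.
recovered_john = 1 - reduced[["name_ken", "name_peter"]].sum(axis=1)

print(reduced)
print((recovered_john == full["name_john"]).all())
```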

Pandas
To convert some columns from a data frame to a list of dicts, we call df.to_dict( orient = 'records' ):

>>> df = pd.DataFrame(data={'name':['john', 'peter', 'ken'], 'age':[23,34,41]})
>>> cols_to_retain = ['name']
>>> cat_dict = df[cols_to_retain].to_dict()
>>> cat_dict
{'name': {0: 'john', 1: 'peter', 2: 'ken'}}
>>> df[cols_to_retain].to_dict(orient='records')
[{'name': 'john'}, {'name': 'peter'}, {'name': 'ken'}]

If you have a few categorical columns, you can list them as above. In the Analytics Edge competition, there are about 100 categorical columns, so in this case it’s easier to drop columns which are not categorical:

>>> cols_to_drop = ['name']
>>> cat_dict = df.drop(cols_to_drop, axis=1).to_dict(orient='records')  # Column 'name' is dropped
>>> cat_dict
[{'age': 23}, {'age': 34}, {'age': 41}]
>>> df  # The original dataframe is not changed
   age   name
0   23   john
1   34  peter
2   41    ken
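
When there are many categorical columns, another option (not used in the original post) is to let pandas select them by dtype. A minimal sketch, assuming every non-numeric column is categorical; the city column is a hypothetical addition for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["john", "peter", "ken"],
    "city": ["Taipei", "Tokyo", "Taipei"],  # hypothetical extra categorical column
    "age": [23, 34, 41],
})

# Keep every column that is not numeric; here that is "name" and "city".
cat_df = df.select_dtypes(exclude="number")
cat_dict = cat_df.to_dict(orient="records")

print(list(cat_df.columns))
print(cat_dict)
```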

Using the vectorizer

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X  # Columns are ('bar', 'baz', 'foo')
array([[ 2., 0., 1.],
       [ 0., 1., 3.]])
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})  # Features unseen during fit are ignored
array([[ 0., 0., 4.]])
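
Putting the two steps together, a data frame can be passed through to_dict() and then into the vectorizer. The following is a minimal sketch using the same toy data as earlier; the new_row record is a made-up example of a category that was not seen during fitting.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({"name": ["john", "peter", "ken"], "age": [23, 34, 41]})

# One dict per row, with column names as keys.
records = df.to_dict(orient="records")

v = DictVectorizer(sparse=False)
X = v.fit_transform(records)

print(v.feature_names_)  # column names of X, in order
print(X)

# A category unseen during fit gets no indicator column: all name=... entries are 0.
new_row = [{"name": "mary", "age": 30}]
X_new = v.transform(new_row)
print(X_new)
```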

If the data has missing values, they will become NaNs in the resulting Numpy arrays. Therefore it’s advisable to fill them in with Pandas first:

cat_data = cat_data_with_missing_values.fillna( 'NA' )  

This way, the vectorizer will create an additional column <feature>=NA for each feature with NAs:

>>> df = pd.DataFrame(data=[{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}])
>>> df
   bar  baz  foo
0  2.0  NaN    1
1  NaN  1.0    3
>>> v.fit_transform(df.to_dict(orient='records'))
array([[  2.,  nan,   1.],
       [ nan,   1.,   3.]])
>>> v.fit_transform(df.fillna('NA').to_dict(orient='records'))
array([[ 2.,  0.,  0.,  1.,  1.],
       [ 0.,  1.,  1.,  0.,  3.]])
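
In the example above the features are numeric, so filling with 'NA' turns each one into a numeric column plus an =NA indicator. For a genuinely categorical feature the effect is simply one more category. The sketch below restricts the fill to the categorical column; that restriction, and the column names color and size, are assumptions for illustration rather than something the original post does.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({
    "color": ["red", None, "blue"],  # categorical feature with a missing value
    "size": [1, 2, 3],
})

# Fill missing values only in the categorical columns.
cat_cols = ["color"]
filled = df.copy()
filled[cat_cols] = filled[cat_cols].fillna("NA")

v = DictVectorizer(sparse=False)
X = v.fit_transform(filled.to_dict(orient="records"))

print(v.feature_names_)
print(X)
```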

Handling binary features with missing values
If you have missing values in a binary feature, there’s an alternative representation:

* -1 for negatives
* 0 for missing values
* 1 for positives

It worked better in the case of the Analytics Edge competition: an SVM trained on one-hot encoded data with d indicators scored 0.768 in terms of AUC, while the alternative representation yielded 0.778. That simple solution would give you 30th place out of 1686 contenders.
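
The -1 / 0 / 1 representation can be produced with a mapping followed by fillna(). This is a minimal sketch under the assumption that the binary feature is stored as the strings "yes" and "no"; the series name and values are hypothetical.

```python
import pandas as pd

# Hypothetical binary feature with one missing value.
answers = pd.Series(["yes", "no", None, "yes"], name="likes_cars")

# Map positives to 1 and negatives to -1. Anything not in the mapping,
# including missing values, becomes NaN and is then filled with 0.
encoded = answers.map({"yes": 1, "no": -1}).fillna(0).astype(int)

print(encoded.tolist())  # [1, -1, 0, 1]
```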
