Thursday, March 29, 2018

[ ML Article Collection ] Converting categorical data into numbers with Pandas and Scikit-learn

Source From Here 
Preface 
Many machine learning tools will only accept numbers as input. This may be a problem if you want to use such a tool but your data includes categorical features. To represent them as numbers, one typically converts each categorical feature using “one-hot encoding”, that is, from a value like “BMW” or “Mercedes” to a vector of zeros with a single 1. 

This functionality is available in some software libraries. We load data using Pandas, then convert categorical columns with DictVectorizer from scikit-learn. Pandas is a popular Python library inspired by data frames in R. It allows easier manipulation of tabular numeric and non-numeric data. Downsides: not very intuitive, somewhat steep learning curve. For any questions you may have, the Google + StackOverflow combo works well as a source of answers. 

Pandas has a get_dummies() function which does what we’re after. The following code will replace categorical columns with their one-hot representations: 
>>> import pandas as pd
>>> df = pd.DataFrame(data={'name':['john', 'peter', 'ken'], 'age':[23,34,41]})
>>> cols_to_transform = ['name']  # list of column names to transform
>>> df_with_dummies = pd.get_dummies(df, columns=cols_to_transform)  # transform column 'name' into one-hot encoding
>>> df_with_dummies
   age  name_john  name_ken  name_peter
0   23          1         0           0
1   34          0         0           1
2   41          0         1           0

We’ll use Pandas to load the data, do some cleaning, and send it to Scikit-learn’s DictVectorizer. OneHotEncoder is another option. The difference is as follows: 
1. OneHotEncoder takes as input categorical values encoded as integers - you can get them from LabelEncoder (a short sketch of this route follows the example below)
2. DictVectorizer expects data as a list of dictionaries, where each dictionary is a data row with column names as keys: 
[ { 'foo': 1, 'bar': 'z' },
  { 'foo': 2, 'bar': 'a' },
  { 'foo': 3, 'bar': 'c' } ]
After vectorizing and saving as CSV it would look like this: 
foo,bar=z,bar=a,bar=c
1,1,0,0
2,0,1,0
3,0,0,1
Notice the column names and that DictVectorizer doesn’t touch numeric values. 
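
A minimal sketch of that OneHotEncoder route, reusing the toy 'name' values from earlier (newer scikit-learn releases can also one-hot encode strings directly, so the LabelEncoder step is only needed on older versions): 
>>> import numpy as np
>>> from sklearn.preprocessing import LabelEncoder, OneHotEncoder
>>> names = np.array(['john', 'peter', 'ken'])
>>> labels = LabelEncoder().fit_transform(names)  # strings -> integers: john=0, ken=1, peter=2
>>> labels
array([0, 2, 1])
>>> OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()  # expects a 2-D array; one indicator column per value
array([[ 1.,  0.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.]])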

The representation above is redundant, because to encode three values you need only two indicator columns. In general, one needs d - 1 columns for d values. This is not a big deal, but apparently some methods will complain about collinearity. The solution is to drop one of the columns. It won’t result in information loss, because in the redundant scheme with d columns exactly one of the indicators must be non-zero, so if two out of three are zeros then the third must be 1, and if one of the first two is 1 then the third must be zero. 
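
In Pandas this can be done with the drop_first argument of get_dummies() (available in reasonably recent Pandas versions; depending on the version the indicators may print as booleans rather than 0/1). A small sketch using the earlier toy frame: 
>>> pd.get_dummies(df, columns=['name'], drop_first=True)  # 'name_john' is dropped; an all-zero row now means john
   age  name_ken  name_peter
0   23         0           0
1   34         0           1
2   41         1           0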

Pandas 
To convert some columns from a data frame to a list of dicts, we call df.to_dict(orient='records'): 
>>> df = pd.DataFrame(data={'name':['john', 'peter', 'ken'], 'age':[23,34,41]})
>>> cols_to_retain = ['name']
>>> cat_dict = df[cols_to_retain].to_dict()
>>> cat_dict
{'name': {0: 'john', 1: 'peter', 2: 'ken'}}
>>> df[cols_to_retain].to_dict(orient='records')
[{'name': 'john'}, {'name': 'peter'}, {'name': 'ken'}]

If you have a few categorical columns, you can list them as above. In the Analytics Edge competition, there are about 100 categorical columns, so in this case it’s easier to drop columns which are not categorical: 
>>> cols_to_drop = ['name']
>>> cat_dict = df.drop(cols_to_drop, axis=1).to_dict(orient='records')  # column 'name' is dropped
>>> cat_dict
[{'age': 23}, {'age': 34}, {'age': 41}]
>>> df  # the original DataFrame is not changed
   age   name
0   23   john
1   34  peter
2   41    ken

Using the vectorizer 
>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X  # with columns ('bar', 'baz', 'foo')
array([[ 2.,  0.,  1.],
       [ 0.,  1.,  3.]])

>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})  # the unseen feature is simply missing from the output
array([[ 0., 0., 4.]])

If the data has missing values, they will become NaNs in the resulting Numpy arrays. Therefore it’s advisable to fill them in with Pandas first: 
cat_data = cat_data_with_missing_values.fillna('NA')
This way, the vectorizer will create an additional "=NA" column for each feature that has NAs: 
>>> df = pd.DataFrame(data=[{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}])
>>> df
   bar  baz  foo
0  2.0  NaN    1
1  NaN  1.0    3

>>> v.fit_transform(df.to_dict(orient='records'))
array([[  2.,  nan,   1.],
       [ nan,   1.,   3.]])

>>> v.fit_transform(df.fillna('NA').to_dict(orient='records'))
array([[ 2.,  0.,  0.,  1.,  1.],
       [ 0.,  1.,  1.,  0.,  3.]])
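
To check which indicator columns the vectorizer created (including the "=NA" ones), you can ask it for its feature names; note that the method is named get_feature_names() in older scikit-learn releases and get_feature_names_out() in newer ones: 
>>> v.get_feature_names()  # v.get_feature_names_out() on recent scikit-learn
['bar', 'bar=NA', 'baz', 'baz=NA', 'foo']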


Handling binary features with missing values 
If you have missing values in a binary feature, there’s an alternative representation: 
* -1 for negatives
* 0 for missing values
* 1 for positives

It worked better in the case of the Analytics Edge competition: an SVM trained on one-hot encoded data with d indicators scored 0.768 in terms of AUC, while the alternative representation yielded 0.778. That simple solution would place you 30th out of 1686 contenders.
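
A minimal sketch of producing this -1/0/1 encoding with Pandas (the toy Series here is hypothetical, not from the competition data): 
>>> s = pd.Series([1.0, 0.0, None, 1.0])  # a binary feature with one missing value
>>> s.map({1.0: 1, 0.0: -1}).fillna(0).astype(int)  # positives -> 1, negatives -> -1, missing -> 0
0    1
1   -1
2    0
3    1
dtype: int64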
