程式扎記: [ Scikit-learn ] FAQ - Passing categorical data to Sklearn Decision Tree

Sunday, March 4, 2018

Source From Here
Question
Several posts discuss how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says the following:

Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.

But running the following script:

import pandas as pd  
from sklearn.tree import DecisionTreeClassifier  
  
data = pd.DataFrame()  
data['A'] = ['a','a','b','a']  
data['B'] = ['b','b','a','b']  
data['C'] = [0, 0, 1, 0]  
data['Class'] = ['n','n','y','n']  
  
tree = DecisionTreeClassifier()  
tree.fit(data[['A','B','C']], data['Class'])  

outputs the following error:

...
ValueError: could not convert string to float: b
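
To see where this error comes from, here is a minimal sketch. It assumes (this is not stated in the question) that the estimator converts its input to a floating-point NumPy array before fitting, which is the step that fails on string columns. The variable names are illustrative only.

```python
import numpy as np

# Column 'A' from the question: plain strings, no numeric meaning.
column_a = ['a', 'a', 'b', 'a']

try:
    # Assumption: this mirrors the float conversion done before fitting.
    np.asarray(column_a, dtype=np.float32)
    message = None
except ValueError as err:
    message = str(err)

print(message)  # prints the ValueError message raised by the conversion
```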

How-To
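Use the tools Scikit-Learn provides for this purpose; the main reason is that they integrate easily into a Pipeline. Although the documentation quoted above describes decision trees as able to handle categorical data, the Scikit-Learn implementation expects numeric input, so categorical columns must be encoded first. Instead of writing a custom function, you can use LabelEncoder, which maps each distinct label to an integer. (LabelEncoder is documented as an encoder for target labels; for feature columns, newer Scikit-Learn releases also provide OrdinalEncoder.)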

Refer to the following code from the documentation:

from sklearn import preprocessing  
  
le = preprocessing.LabelEncoder()  
le.fit(["paris", "paris", "tokyo", "amsterdam"])  
le.transform(["tokyo", "tokyo", "paris"])   

This encodes the strings as integers that machine learning algorithms can consume. The encoder also supports mapping the integers back to strings: simply call inverse_transform, as follows:

>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> labels = le.transform(["tokyo", "tokyo", "paris"])
>>> print(labels)
[2 2 1]
>>> print(le.inverse_transform(labels))
['tokyo' 'tokyo' 'paris']
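
The same idea can be sketched for feature columns. LabelEncoder works on one 1-D sequence at a time, whereas OrdinalEncoder encodes several columns in a single call. Note the assumption: OrdinalEncoder lives in sklearn.preprocessing from Scikit-Learn 0.20 onward, which is newer than this post. The DataFrame reuses columns A and B from the question.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Feature columns A and B from the question.
features = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
})

encoder = OrdinalEncoder()
# Each column gets its own sorted category list: 'a' -> 0.0, 'b' -> 1.0.
encoded = encoder.fit_transform(features)
print(encoded)
print(encoder.categories_)

# As with LabelEncoder, the mapping can be reversed.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

The fitted encoder keeps one category list per column in categories_, so the same object can later encode test data consistently.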

Also note that for many classifiers other than decision trees, such as logistic regression or SVM, you would usually encode categorical variables with One-Hot encoding, which Scikit-learn supports through the OneHotEncoder class. For the decision tree here, LabelEncoder is enough, so the original code can be rewritten as:

import pandas as pd  
import numpy as np  
from sklearn.tree import DecisionTreeClassifier  
from sklearn import preprocessing  
  
  
data = pd.DataFrame()  
data['A'] = ['a','a','b','a']  
data['B'] = ['b','b','a','b']  
data['C'] = [0, 0, 1, 0]  
data['Class'] = ['n','n','y','n']  
  
ndata = data.copy()  
  
# Transform category into label  
a_le = preprocessing.LabelEncoder()  
a_le.fit(data['A'])  # fit() collects the unique values itself
ndata['A'] = a_le.transform(ndata['A'])  
print("ndata['A']:\n{}\n".format(ndata['A']))  
  
# Transform category into label  
b_le = preprocessing.LabelEncoder()  
b_le.fit(data['B'])  # fit() collects the unique values itself
ndata['B'] = b_le.transform(ndata['B'])  
print("ndata['B']:\n{}\n".format(ndata['B']))  
  
# Training
tree = DecisionTreeClassifier()  
tree.fit(ndata[['A','B','C']], ndata['Class'])  
  
# Predicting  
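def transform(test_data):
    r'''
    Encode the categorical columns of the testing data with the fitted encoders
    '''
    # Build a float array: assigning into the string array would store the codes as strings
    n_test_data = np.empty(test_data.shape, dtype=float)
    n_test_data[:,0] = a_le.transform(test_data[:,0])
    n_test_data[:,1] = b_le.transform(test_data[:,1])
    n_test_data[:,2] = test_data[:,2].astype(float)
    print('testing data:\n{}\n'.format(n_test_data))
    return n_test_data

# NumPy stores this mixed list as an array of strings
test_data = np.array([['b', 'a', 1], ['a', 'b', 0]])
print("Prediction: {}".format(tree.predict(transform(test_data))))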
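
The two points made earlier, Pipeline integration and One-Hot encoding, can be combined in one sketch. This is an assumed alternative to the LabelEncoder version above, not the approach taken in the original answer. It relies on ColumnTransformer (sklearn.compose, Scikit-Learn 0.20 or later), and the step names 'onehot', 'preprocess', and 'tree' are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

# One-hot encode the categorical columns; pass the numeric column through unchanged.
preprocess = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B'])],
    remainder='passthrough',
)

# The pipeline applies the same encoding at fit time and at predict time.
model = Pipeline(steps=[
    ('preprocess', preprocess),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
model.fit(data[['A', 'B', 'C']], data['Class'])

test_data = pd.DataFrame({'A': ['b', 'a'], 'B': ['a', 'b'], 'C': [1, 0]})
predictions = model.predict(test_data)
print("Prediction: {}".format(predictions))
```

Because the encoder is a pipeline step, no separate transform helper is needed for the test data, and handle_unknown='ignore' keeps prediction from failing on a category that was not seen during training.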

Supplement
* FAQ - Assigning to columns in NumPy?
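
The reason column assignment in NumPy is relevant here can be sketched as follows. A mixed list such as the test data above is stored by NumPy as a string array, so integers assigned into one of its columns are kept as strings. The variable names are illustrative only.

```python
import numpy as np

# Mixing strings and integers yields a Unicode string array.
raw = np.array([['b', 'a', 1], ['a', 'b', 0]])
print(raw.dtype.kind)  # 'U' means a Unicode string dtype

# Assigning integers into a column converts them to strings.
as_strings = raw.copy()
as_strings[:, 0] = [1, 0]
print(as_strings[:, 0])

# Writing into a separate float array keeps the values numeric.
numeric = np.empty(raw.shape, dtype=float)
numeric[:, 0] = [1, 0]                  # encoded column A
numeric[:, 1] = [0, 1]                  # encoded column B
numeric[:, 2] = raw[:, 2].astype(float)  # numeric column C parsed from strings
print(numeric)
```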
