Sunday, March 4, 2018

[ Scikit-learn ] FAQ - Passing categorical data to Sklearn Decision Tree

Source From Here 
Question 
There are several posts about how to encode categorical data for Sklearn decision trees, but the Sklearn documentation says this:
Able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable. See algorithms for more information.

But running the following script: 
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    data = pd.DataFrame()
    data['A'] = ['a', 'a', 'b', 'a']
    data['B'] = ['b', 'b', 'a', 'b']
    data['C'] = [0, 0, 1, 0]
    data['Class'] = ['n', 'n', 'y', 'n']

    tree = DecisionTreeClassifier()
    tree.fit(data[['A', 'B', 'C']], data['Class'])
outputs the following error: 
...
ValueError: could not convert string to float: b

How-To 
Use the tools Scikit-Learn provides for this purpose. The main reason for doing so is that they integrate easily into a Pipeline. Scikit-Learn itself ships very good classes for handling categorical data. Instead of writing your own function, you should use LabelEncoder, which is designed for turning string labels into integers.
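To illustrate the Pipeline point, here is a minimal sketch using the toy data from the question. Note the swap: it uses OrdinalEncoder rather than LabelEncoder, because LabelEncoder works on a single 1-D array of labels, while OrdinalEncoder encodes several feature columns at once and fits inside a ColumnTransformer. It assumes scikit-learn 0.20 or later; the step names 'cat', 'encode' and 'tree' are arbitrary choices for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Same toy data as in the question
data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

# Encode the string columns A and B as integers; pass C through unchanged
encode = ColumnTransformer(
    [('cat', OrdinalEncoder(), ['A', 'B'])],
    remainder='passthrough',
)

model = Pipeline([
    ('encode', encode),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
model.fit(data[['A', 'B', 'C']], data['Class'])

# New rows are passed as raw strings; the pipeline encodes them itself
new_rows = pd.DataFrame({'A': ['b', 'a'], 'B': ['a', 'b'], 'C': [1, 0]})
predictions = model.predict(new_rows)
print(predictions)
```

Because the encoder lives inside the pipeline, the same fitted mapping is applied at training and prediction time without any manual bookkeeping.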

Refer to the following code from the documentation: 
    from sklearn import preprocessing

    le = preprocessing.LabelEncoder()
    le.fit(["paris", "paris", "tokyo", "amsterdam"])
    le.transform(["tokyo", "tokyo", "paris"])
This automatically encodes the strings into numbers for your machine learning algorithms. It also supports going back from integers to strings: simply call inverse_transform as follows:
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit(["paris", "paris", "tokyo", "amsterdam"])
LabelEncoder()
>>> labels = le.transform(["tokyo", "tokyo", "paris"])
>>> print(labels)
[2 2 1]
>>> print(le.inverse_transform(labels))
['tokyo' 'tokyo' 'paris']
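The integer codes in the session above are not arbitrary. As a small sketch, the fitted encoder's classes_ attribute holds the unique labels in sorted order, and each label's code is its index in that array. The name mapping is only an illustrative variable.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

# classes_ holds the unique labels in sorted order;
# the code assigned to a label is its position in that array
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)
```

This is why "tokyo" becomes 2 and "paris" becomes 1 in the transcript.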

Also note that for many other classifiers besides decision trees, such as logistic regression or SVM, you would want to encode your categorical variables with One-Hot encoding. Scikit-learn supports this as well through the OneHotEncoder class. For the decision tree in the question, the original code can be rewritten with LabelEncoder as:
    import numpy as np
    import pandas as pd
    from sklearn import preprocessing
    from sklearn.tree import DecisionTreeClassifier

    data = pd.DataFrame()
    data['A'] = ['a', 'a', 'b', 'a']
    data['B'] = ['b', 'b', 'a', 'b']
    data['C'] = [0, 0, 1, 0]
    data['Class'] = ['n', 'n', 'y', 'n']

    ndata = data.copy()

    # Encode column A: fit on its values and replace the strings with integer labels
    a_le = preprocessing.LabelEncoder()
    ndata['A'] = a_le.fit_transform(data['A'])
    print("ndata['A']:\n{}\n".format(ndata['A']))

    # Encode column B the same way, with its own encoder
    b_le = preprocessing.LabelEncoder()
    ndata['B'] = b_le.fit_transform(data['B'])
    print("ndata['B']:\n{}\n".format(ndata['B']))

    # Training
    tree = DecisionTreeClassifier()
    tree.fit(ndata[['A', 'B', 'C']], ndata['Class'])

    # Predicting
    def transform(test_data):
        '''
        Apply the fitted encoders to the first two columns of the test data
        '''
        n_test_data = test_data.copy()
        n_test_data[:, 0] = a_le.transform(n_test_data[:, 0])
        n_test_data[:, 1] = b_le.transform(n_test_data[:, 1])
        print('testing data:\n{}\n'.format(n_test_data))
        # Every column is numeric now, so cast the object array to float
        return n_test_data.astype(float)

    # dtype=object keeps strings and integers side by side; in a plain string
    # array the encoded labels would be turned back into strings on assignment
    test_data = np.array([['b', 'a', 1], ['a', 'b', 0]], dtype=object)
    print("Prediction: {}".format(tree.predict(transform(test_data))))
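For the One-Hot encoding mentioned earlier for classifiers such as logistic regression, a rough sketch on the same toy data could look like the following. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly. The variable names and the handle_unknown='ignore' setting are choices made for this example, not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

# One binary column per category: A -> (A=a, A=b), B -> (B=a, B=b)
encoder = OneHotEncoder(handle_unknown='ignore')
onehot = encoder.fit_transform(data[['A', 'B']]).toarray()
print(encoder.categories_)
print(onehot)

# Append the already-numeric column C to the one-hot columns
X = np.hstack([onehot, data[['C']].values])

clf = LogisticRegression()
clf.fit(X, data['Class'])

# New rows go through the same fitted encoder before prediction
new_rows = pd.DataFrame({'A': ['b', 'a'], 'B': ['a', 'b'], 'C': [1, 0]})
X_new = np.hstack([encoder.transform(new_rows[['A', 'B']]).toarray(),
                   new_rows[['C']].values])
print(clf.predict(X_new))
```

Unlike integer labels, the one-hot columns do not impose an artificial ordering on the categories, which matters for linear models.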
Supplement 
FAQ - Assigning to columns in NumPy? 
