Question
There are several posts about how to encode categorical data for Sklearn decision trees, but from the Sklearn documentation, we got these:
But running the following script raises an error:
- import pandas as pd
- from sklearn.tree import DecisionTreeClassifier
- data = pd.DataFrame()
- data['A'] = ['a','a','b','a']
- data['B'] = ['b','b','a','b']
- data['C'] = [0, 0, 1, 0]
- data['Class'] = ['n','n','y','n']
- tree = DecisionTreeClassifier()
- tree.fit(data[['A','B','C']], data['Class'])
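The script above stops at `tree.fit` because the tree implementation expects numeric feature values and cannot convert the strings in columns A and B. A minimal sketch that makes the failure visible, assuming it surfaces as a `ValueError` (the exact message varies between scikit-learn versions; `fit_error` is a name introduced here for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

tree = DecisionTreeClassifier()
fit_error = None  # hypothetical holder for the exception, for illustration only
try:
    # Columns A and B hold strings, which cannot be cast to floats.
    tree.fit(data[['A', 'B', 'C']], data['Class'])
except ValueError as err:
    fit_error = err
    print("fit failed: {}".format(err))
```

Note that the string class labels in `Class` are not the problem; only the feature columns need to be numeric.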
How-To
Use the tools provided by Scikit-Learn for this purpose; the main reason is that they can be easily integrated into a Pipeline. Scikit-Learn itself provides good classes for handling categorical data. Instead of writing your own custom function, you can use LabelEncoder, which maps each distinct category to an integer. Note that the Scikit-Learn documentation describes LabelEncoder as intended for target labels (y); OrdinalEncoder is its counterpart for feature columns.
Refer to the following code from the documentation:
- from sklearn import preprocessing
- le = preprocessing.LabelEncoder()
- le.fit(["paris", "paris", "tokyo", "amsterdam"])
- le.transform(["tokyo", "tokyo", "paris"])
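To see what the encoder actually learned, the fitted object can be inspected. The sketch below reuses the documentation example and only adds calls to `classes_` and `inverse_transform`; the variable names `codes` and `cities` are introduced here for illustration.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])

# Categories are stored sorted; the index of each one is its integer code.
print(le.classes_.tolist())   # ['amsterdam', 'paris', 'tokyo']

codes = le.transform(["tokyo", "tokyo", "paris"])
print(codes.tolist())         # [2, 2, 1]

# inverse_transform maps the integer codes back to the original strings.
cities = le.inverse_transform(codes)
print(cities.tolist())        # ['tokyo', 'tokyo', 'paris']
```

Duplicates passed to `fit` are collapsed automatically, so the raw column can be given to `fit` directly.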
Also note that for many other classifiers besides decision trees, such as logistic regression or SVM, you would want to encode your categorical variables using One-Hot encoding, since integer labels would impose an artificial order on the categories. Scikit-learn supports this as well through the OneHotEncoder class. Using LabelEncoder, the original code can be rewritten as:
- import pandas as pd
- import numpy as np
- from sklearn.tree import DecisionTreeClassifier
- from sklearn import preprocessing
- data = pd.DataFrame()
- data['A'] = ['a','a','b','a']
- data['B'] = ['b','b','a','b']
- data['C'] = [0, 0, 1, 0]
- data['Class'] = ['n','n','y','n']
- ndata = data.copy()
- # Transform category into label
- a_le = preprocessing.LabelEncoder()
- a_cate_set = set()
- for e in data['A']:
-     a_cate_set.add(e)
- a_le.fit(list(a_cate_set))
- ndata['A'] = a_le.transform(ndata['A'])
- print("ndata['A']:\n{}\n".format(ndata['A']))
- # Transform category into label
- b_le = preprocessing.LabelEncoder()
- b_cate_set = set()
- for e in data['B']:
-     b_cate_set.add(e)
- b_le.fit(list(b_cate_set))
- ndata['B'] = b_le.transform(ndata['B'])
- print("ndata['B']:\n{}\n".format(ndata['B']))
- # Training
- tree = DecisionTreeClassifier()
- tree.fit(ndata[['A','B','C']], ndata['Class'])
- # Predicting
- def transform(test_data):
-     r'''
-     Do the transformation for the input testing data
-     '''
-     n_test_data = test_data.copy()
-     n_test_data[:,0] = a_le.transform(n_test_data[:,0])
-     n_test_data[:,1] = b_le.transform(n_test_data[:,1])
-     print('testing data:\n{}\n'.format(n_test_data))
-     return n_test_data
- test_data = np.array([['b', 'a', 1], ['a', 'b', 0]])
- print("Prediction: {}".format(tree.predict(transform(test_data))))
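As a sketch of the Pipeline integration and One-Hot encoding mentioned earlier, the same toy data can be encoded and fitted in one object. This is one possible arrangement, not the only one: `ColumnTransformer` applies `OneHotEncoder` to the string columns and passes `C` through, and the step names `'onehot'`, `'encode'`, and `'tree'` are arbitrary labels chosen for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

# One-hot encode the string columns A and B; keep the numeric column C as is.
encoder = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['A', 'B'])],
    remainder='passthrough',
)

# Encoding and classification are fitted together as a single estimator.
model = Pipeline([
    ('encode', encoder),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
model.fit(data[['A', 'B', 'C']], data['Class'])

# Raw, unencoded rows can be passed straight to predict.
test_data = pd.DataFrame({'A': ['b', 'a'], 'B': ['a', 'b'], 'C': [1, 0]})
predictions = model.predict(test_data)
print("Prediction: {}".format(predictions))
```

Because the encoder lives inside the pipeline, no separate `transform` helper is needed at prediction time. The last step could be swapped for `LogisticRegression` or `SVC` if one of those classifiers is preferred.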
* FAQ - Assigning to columns in NumPy?
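On the NumPy column assignment used in `transform` above: `test_data` mixes strings and integers, so NumPy stores every element as a string, and integer codes assigned into a column are converted to strings as well. A small sketch of that behaviour, with `dtype=object` shown as one possible way (an assumption of this example, not part of the original code) to keep the assigned integers as integers:

```python
import numpy as np

# Mixing strings and integers makes NumPy pick a string dtype for the array.
test_data = np.array([['b', 'a', 1], ['a', 'b', 0]])
print(test_data.dtype.kind)      # 'U' (unicode string)

# Assigning integers to a column of a string array stores them as strings.
encoded = test_data.copy()
encoded[:, 0] = [1, 0]
print(encoded[:, 0].tolist())    # ['1', '0']

# With dtype=object each element keeps its own Python type.
obj_data = np.array([['b', 'a', 1], ['a', 'b', 0]], dtype=object)
obj_data[:, 0] = [1, 0]
print(obj_data[:, 0].tolist())   # [1, 0]
```

The string version still works with the tree because the numeric strings are converted to floats during prediction, but an object array or a DataFrame makes the intended types explicit.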