Monday, March 15, 2021

[ ML Article Collection ] Feature Importance and Feature Selection With XGBoost in Python

Preface

(article source) A benefit of using ensembles of decision tree methods, like gradient boosting, is that they can automatically provide estimates of feature importance from a trained predictive model.

In this post you will discover how you can estimate the importance of features for a predictive modeling problem using the XGBoost library in Python. After reading this post you will know:
* How feature importance is calculated using the gradient boosting algorithm.
* How to plot feature importance in Python calculated by the XGBoost model.
* How to use feature importance calculated by XGBoost to perform feature selection.

For the sample code that follows to work, you need to import the packages below:
  import pandas as pd
  import numpy as np
  import matplotlib.pyplot as plt
  from numpy import sort
  from numpy import loadtxt
  from xgboost import XGBClassifier
  from matplotlib import pyplot
  from xgboost import plot_importance
  from kutils.analysis import fiplot
  from sklearn.metrics import accuracy_score
  from sklearn.model_selection import train_test_split
  from sklearn.feature_selection import SelectFromModel
Feature Importance in Gradient Boosting
A benefit of using gradient boosting is that after the boosted trees are constructed, it is relatively straightforward to retrieve importance scores for each attribute.

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance.

This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other.

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points or another more specific error function.

The feature importances are then averaged across all of the decision trees within the model.

For more technical information on how feature importance is calculated in boosted decision trees, see Section 10.13.1 “Relative Importance of Predictor Variables” of the book The Elements of Statistical Learning: Data Mining, Inference, and Prediction, page 367.
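
In the notation used there (a brief sketch; see the book for the exact statement), the squared relative importance of variable j in a single tree T with J - 1 internal nodes, and its average over the M trees in the ensemble, are roughly:

  \hat{I}_j^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbb{1}\big(v(t) = j\big)
  \qquad
  \hat{I}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j^2(T_m)

where \hat{i}_t^2 is the estimated improvement in the splitting criterion at internal node t and v(t) is the variable chosen for the split at that node.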

Also, see Matthew Drury's answer to the StackOverflow question "Relative variable importance for Boosting", where he provides a very detailed and practical answer.

Manually Plot Feature Importance
A trained XGBoost model automatically calculates feature importance on your predictive modeling problem.

These importance scores are available in the feature_importances_ member variable of the trained model. For example, they can be printed directly as follows:
  print(model.feature_importances_)
We can plot these scores on a bar chart directly to get a visual indication of the relative importance of each feature in the dataset. For example:
  # plot
  pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
  pyplot.show()
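
As an aside (not part of the original article), the underlying Booster object also exposes importance scores broken down by type; a minimal sketch, assuming a trained XGBClassifier named model as above:
  # 'weight' = how many times a feature is used to split the data,
  # 'gain'   = average improvement in the objective from splits on the feature,
  # 'cover'  = average number of samples affected by those splits
  booster = model.get_booster()
  for importance_type in ('weight', 'gain', 'cover'):
      print(importance_type, booster.get_score(importance_type=importance_type))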

We can demonstrate this by training an XGBoost model on the Pima Indians onset of diabetes dataset and creating a bar chart from the calculated feature importances. Firstly, let's download the dataset and place it in your current working directory.
* Dataset File.
* Dataset Details.

Below are the feature names:
* Number of times pregnant
* Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* Diastolic blood pressure (mm Hg)
* Triceps skin fold thickness (mm)
* 2-Hour serum insulin (mu U/ml)
* Body mass index (weight in kg/(height in m)^2)
* Diabetes pedigree function
* Age (years)
* Class variable (0 or 1)

The sample code below trains a model and displays the feature importance that comes with the trained model:
  # load data
  dataset = loadtxt('../../datas/pima-indians-diabetes.data.csv', delimiter=",")
  # split data into X and y
  X = dataset[:,0:8]
  y = dataset[:,8]

  # fit model with training data
  model = XGBClassifier(eval_metric='logloss', use_label_encoder=False)
  model.fit(X, y)

  # feature importance
  print(model.feature_importances_)

  # plot
  plt.rcParams['figure.figsize'] = [7, 5]
  pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
  pyplot.show()


A downside of this plot is that the features are ordered by their input index rather than their importance. We could sort the features before plotting ourselves (a quick sketch follows), but thankfully there is a built-in plot function to help us.
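
For completeness, a minimal manual-sorting sketch (not in the original article), assuming numpy is imported as np and model is the classifier trained above:
  # sort importances from most to least important, keeping the original feature indices
  importances = model.feature_importances_
  order = np.argsort(importances)[::-1]
  pyplot.bar(range(len(importances)), importances[order])
  pyplot.xticks(range(len(importances)), order)   # label each bar with its feature index
  pyplot.show()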

Using the Built-in XGBoost Feature Importance Plot
The XGBoost library provides a built-in function to plot features ordered by their importance.

The function is called plot_importance() and can be used as follows:
  plot_importance(model)
  pyplot.show()


You can see that features are automatically named according to their index in the input array (X) from F0 to F7.

Manually mapping these indices to names in the problem description, we can see that the plot shows F6 has the highest importance and F4 has the lowest importance. However, it is not clear enough.
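
To make the plot readable without the kutils helper shown below, one option (an illustrative sketch, not from the original article) is to map the default f0..f7 labels back to the dataset's feature names via the Booster scores; the short names here are just shorthand for the fields listed earlier:
  feature_names = ['num_pregnant', 'plasma_glucose', 'blood_pressure', 'skin_fold',
                   'serum_insulin', 'bmi', 'pedigree', 'age']
  scores = model.get_booster().get_score(importance_type='weight')   # e.g. {'f0': 42, ...}
  named_scores = {feature_names[int(k[1:])]: v for k, v in scores.items()}
  for name, score in sorted(named_scores.items(), key=lambda kv: kv[1], reverse=True):
      print("%s: %s" % (name, score))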

Below is another example that uses the kutils.analysis package to draw the feature importance with readable feature names:
  df = pd.DataFrame(
      X,
      columns=[
          'number_of_times_pregnant',
          'plasma_glucose',
          'diastolic_blood_pressure',
          'triceps_skin_fold_thickness',
          'serum_insulin',
          'body_mass_index',
          'diabetes_pedigree_function',
          'age'
      ]
  )

  futils = fiplot.Utils(df, y)
  ax = futils.treelike_fi(model, ytick_fontsize=15)


Feature Selection with XGBoost Feature Importance Scores
Feature importance scores can be used for feature selection in scikit-learn.

This is done using the SelectFromModel class that takes a model and can transform a dataset into a subset with selected features.

This class can take a pre-trained model, such as one trained on the entire training dataset. It can then use a threshold to decide which features to select. This threshold is used when you call the transform() method on the SelectFromModel instance to consistently select the same features on the training dataset and the test dataset.

In the example below, we first train and then evaluate an XGBoost model on the entire training dataset and the test dataset, respectively.

Using the feature importances calculated from the training dataset, we then wrap the model in a SelectFromModel instance. We use this to select features on the training dataset, train a model from the selected subset of features, then evaluate the model on the test set, subject to the same feature selection scheme.

For example:
  # select features using threshold
  selection = SelectFromModel(model, threshold=thresh, prefit=True)
  select_X_train = selection.transform(X_train)
  # train model
  selection_model = XGBClassifier()
  selection_model.fit(select_X_train, y_train)
  # eval model
  select_X_test = selection.transform(X_test)
  y_pred = selection_model.predict(select_X_test)
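
As a small sanity check (not in the original article), SelectFromModel can also report which feature indices it kept for a given threshold:
  # indices of the features retained by the selector for the current threshold
  kept = selection.get_support(indices=True)
  print("Selected feature indices:", kept)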

For interest, we can test multiple thresholds for selecting features by feature importance. Specifically, we can use the importance of each input variable as a threshold in turn, essentially allowing us to test each subset of features ranked by importance, starting with all features and ending with the subset containing only the most important feature.

The complete code listing is provided below:
  # split data into train and test sets
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

  # fit model with training data
  model = XGBClassifier()
  model.fit(X_train, y_train)

  # make predictions for test data and evaluate
  y_pred = model.predict(X_test)
  predictions = [round(value) for value in y_pred]
  accuracy = accuracy_score(y_test, predictions)
  print("Accuracy: %.2f%%" % (accuracy * 100.0))
Output:
Accuracy: 74.03%

We obtain an accuracy of about 74% by training the model with all features. Let's see whether feature selection can help us improve the model:
  model.feature_importances_
Output:
  array([0.09731667, 0.23725505, 0.1002797 , 0.09353314, 0.10012697,
         0.16409664, 0.0994484 , 0.1079434 ], dtype=float32)
Now let's collect the accuracy during the feature selection process:
  # Fit model using each importance as a threshold
  thresholds = sort(model.feature_importances_)
  score_list = []
  for thresh in thresholds:
      # select features using threshold
      selection = SelectFromModel(model, threshold=thresh, prefit=True)
      select_X_train = selection.transform(X_train)

      # train model
      selection_model = XGBClassifier()
      selection_model.fit(select_X_train, y_train)

      # eval model
      select_X_test = selection.transform(X_test)
      y_pred = selection_model.predict(select_X_test)
      predictions = [round(value) for value in y_pred]
      accuracy = accuracy_score(y_test, predictions)
      score_list.append(accuracy)
      print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

  score_df = pd.DataFrame(
      list(map(lambda t: [t[0]*100, t[1]], zip(score_list, range(8, 0, -1)))),
      columns=['accuracy', 'number_of_feature']
  )
  score_df


  plt.rcParams['figure.figsize'] = [7, 5]
  ax = score_df.plot.bar(x='number_of_feature', y='accuracy', rot=0)


We can see that the performance of the model generally decreases as the number of selected features decreases.

On this problem there is a trade-off between the number of features and test set accuracy, and we could decide to take a less complex model (fewer attributes, such as n=5) and accept a modest decrease in estimated accuracy, from 74.03% down to 73.37%.

This is likely to be a noisy estimate on such a small dataset, but it may be a more useful strategy on a larger dataset, especially when using cross validation as the model evaluation scheme.
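
As a rough illustration of that cross-validation idea (an illustrative sketch, not part of the original article; it reuses X, y and the imports above, and additionally assumes scikit-learn's cross_val_score):
  from sklearn.model_selection import cross_val_score

  # evaluate each importance threshold with 5-fold cross validation instead of a single split
  for thresh in sort(model.feature_importances_):
      selection = SelectFromModel(model, threshold=thresh, prefit=True)
      X_selected = selection.transform(X)
      cv_model = XGBClassifier(eval_metric='logloss', use_label_encoder=False)
      scores = cross_val_score(cv_model, X_selected, y, cv=5, scoring='accuracy')
      print("Thresh=%.3f, n=%d, CV accuracy: %.2f%% (+/- %.2f%%)"
            % (thresh, X_selected.shape[1], scores.mean() * 100, scores.std() * 100))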

