Friday, April 26, 2019

[ ML 文章收集 ] ColumnTransformer Meets Natural Language Processing

Source From Here 
Preface 
How to combine several feature extraction mechanisms or transformations into a single transformer in a scikit-learn pipeline

There are several ways to append or combine different feature types. One is scipy.sparse.hstack, which stacks sparse matrices horizontally. However, I will introduce the hot new kid on the block: the ColumnTransformer from sklearn. If you would like to try it out, you will need to upgrade your scikit-learn to 0.20 or later. 
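For comparison, here is a minimal sketch of the scipy.sparse.hstack approach, assuming a DataFrame df that already has the item_description and brand_name columns used later in this post, with missing values filled:
  from scipy.sparse import hstack
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.preprocessing import OneHotEncoder

  # Vectorize each feature type separately...
  text_matrix = TfidfVectorizer().fit_transform(df['item_description'])
  brand_matrix = OneHotEncoder().fit_transform(df[['brand_name']])
  # ...then glue the sparse blocks together column-wise by hand
  X = hstack([text_matrix, brand_matrix]).tocsr()

ColumnTransformer does this gluing for you, and keeps it inside a single fit/transform object that can live in a pipeline.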

The Data 
An excellent data set for this demonstration comes from the Mercari Price Suggestion Challenge, which asks for a machine learning model that automatically suggests the right product prices. The data can be found here
  import pandas as pd
  import numpy as np
  import matplotlib.pyplot as plt
  import seaborn as sns
  from scipy import stats
  from scipy.stats import norm, skew
  from sklearn import preprocessing
  from sklearn.linear_model import Ridge
  from sklearn.metrics import mean_squared_error
  from sklearn.model_selection import train_test_split
  from sklearn.compose import ColumnTransformer
  from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
  from sklearn.preprocessing import OneHotEncoder
  from sklearn.pipeline import make_pipeline
  from sklearn.pipeline import Pipeline

  pd.set_option('display.float_format', lambda x: '%.3f' % x)
  df = pd.read_csv('train.tsv', sep='\t')


Our data contains heterogeneous data types: numeric, categorical, and text. We want to use different pre-processing steps and transformations for these different types of columns. For example, we may want to one-hot encode (sklearn.preprocessing.OneHotEncoder) the categorical features and TF-IDF vectorize (sklearn.feature_extraction.text.TfidfVectorizer) the text features. 
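Before transforming anything, it is worth confirming those mixed types directly (a small check, not in the original post):
  # object columns (name, category_name, brand_name, item_description) hold
  # text/categorical data; item_condition_id, price and shipping are numeric
  print(df.dtypes)
  print(df.head())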

Here “price” is the target feature that we will predict. 

Data Pre-processing 

Target feature — price 



First, let's remove rows with price = 0 and explore the distribution: 
  df = df.loc[df['price'] > 0]

  sns.distplot(df['price'], fit=norm)
  (mu, sigma) = norm.fit(df['price'])
  print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

  plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
  plt.ylabel('Frequency')
  plt.title('Price distribution')

  fig = plt.figure()
  res = stats.probplot(df['price'], plot=plt)
  plt.show()
 

The target feature price is right-skewed. Since linear models prefer normally distributed data, we will transform price to make it more normally distributed. 
  sns.distplot(np.log1p(df['price']), fit=norm)
  (mu, sigma) = norm.fit(np.log1p(df['price']))
  print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

  plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
             loc='best')
  plt.ylabel('Frequency')
  plt.title('Log (Price+1) distribution')

  fig = plt.figure()
  res = stats.probplot(np.log1p(df['price']), plot=plt)
  plt.show()
 



The shape of the distribution looks better. Let's replace the price field with log(price + 1) in place: 
  1. df["price"] = np.log1p(df["price"])  
Feature Engineering 
* Fill missing “category_name” with “Other” and convert “category_name” to the category data type.
* Fill missing “brand_name” with “unknown”.
* Determine the popular brands and set the rest to “Other”.
* Fill missing “item_description” with “None”.
* Convert “item_condition_id” to the category data type.
* Convert “brand_name” to the category data type.

Corresponding code (fillna / astype / value_counts): 
  NUM_BRANDS = 2500
  NAME_MIN_DF = 10
  MAX_FEAT_DESCP = 50000

  df["category_name"] = df["category_name"].fillna("Other").astype("category")
  df["brand_name"] = df["brand_name"].fillna("unknown")

  pop_brands = df["brand_name"].value_counts().index[:NUM_BRANDS]
  df.loc[~df["brand_name"].isin(pop_brands), "brand_name"] = "Other"

  df["item_description"] = df["item_description"].fillna("None")
  df["item_condition_id"] = df["item_condition_id"].astype("category")
  df["brand_name"] = df["brand_name"].astype("category")
Our features and target: 
  target = df.price.values
  features = df[['name', 'item_condition_id', 'category_name', 'brand_name', 'shipping', 'item_description']].copy()
Split the data into training and test sets: 
  X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)
The following is how to apply ColumnTransformer. Surprisingly, it is very simple:
* One-hot encode “item_condition_id” & “brand_name”.
* CountVectorize (convert a collection of text documents to a matrix of token counts) “category_name” & “name”.
* TfidfVectorize “item_description”.
* Keep the remaining “shipping” feature by setting remainder='passthrough'. Its values are appended to the end of the transformation:

  preprocess = ColumnTransformer(
      [('item_condition_category', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['item_condition_id']),
       ('brand_name_category', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['brand_name']),
       ('category_name_countvec', CountVectorizer(), 'category_name'),
       ('name_countvec', CountVectorizer(min_df=NAME_MIN_DF), 'name'),
       ('description_tfidf', TfidfVectorizer(max_features=MAX_FEAT_DESCP, stop_words='english', ngram_range=(1, 3)), 'item_description')],
      remainder='passthrough')
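Note that the text vectorizers receive a single column name (they need 1-D input), while OneHotEncoder receives a list of columns (it needs 2-D input). To see what the transformer produces, you can fit it on its own first, a quick sketch before the pipeline below:
  # fit_transform returns one sparse matrix: the one-hot blocks, the count
  # and tfidf blocks, and the passthrough "shipping" column appended last
  X_demo = preprocess.fit_transform(X_train)
  print(X_demo.shape)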
Model & Evaluation 
We will combine this ColumnTransformer pre-processing step with a Ridge regression in a Pipeline to predict the price. 
  model = make_pipeline(
      preprocess,
      Ridge(solver="lsqr", fit_intercept=False))

  model.fit(X_train, y_train)

  y_train_pred = model.predict(X_train)
  y_pred = model.predict(X_test)

  train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
  test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
  print('Train RMSE: %.4f' % train_rmse)
  print('Test RMSE: %.4f' % test_rmse)
Execution result: 
Train RMSE: 0.4550
Test RMSE: 0.4682
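These RMSE values are in log-price units because the target was log1p-transformed. To turn the model's output into actual suggested prices, map it back with np.expm1, for example:
  # Convert log-scale predictions back to dollar amounts
  suggested_prices = np.expm1(model.predict(X_test[:5]))
  print(suggested_prices)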

The Jupyter notebook can be found on GitHub. Enjoy the rest of the week! 

Supplement 
A Data Science Workflow
