Preface
There are several ways to append or combine different feature types. One is scipy.sparse.hstack, which stacks sparse matrices horizontally. Here, however, I will introduce the new kid on the block: the ColumnTransformer class from scikit-learn. If you would like to try it out, you will need scikit-learn 0.20 or later.
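For comparison, the manual scipy.sparse.hstack route might look roughly like the sketch below. The toy DataFrame and its column names are invented for illustration and are not part of the Mercari data; the point is only that each transformer is fitted separately and the resulting matrices are glued together by hand.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "name": ["red dress", "blue jeans", "red shoes"],
    "condition": [1, 2, 1],
    "shipping": [0, 1, 1],
})

# Fit one transformer per column; each returns a sparse matrix
name_counts = CountVectorizer().fit_transform(toy["name"])            # text column
condition_onehot = OneHotEncoder().fit_transform(toy[["condition"]])  # categorical column
shipping = csr_matrix(toy[["shipping"]].values)                       # numeric column, unchanged

# Stack the blocks side by side into a single feature matrix
X = hstack([name_counts, condition_onehot, shipping]).tocsr()
print(X.shape)
```

This works, but every fitted transformer has to be kept around and re-applied in the same order at prediction time, which is the bookkeeping ColumnTransformer takes over.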
The Data
An excellent data set for this demonstration is the Mercari Price Suggestion Challenge, whose goal is to build a machine learning model that automatically suggests the right product prices. The data can be found here:
- import pandas as pd
- import numpy as np
- import matplotlib.pyplot as plt
- import seaborn as sns
- from scipy import stats
- from scipy.stats import norm, skew
- from sklearn import preprocessing
- from sklearn.linear_model import Ridge
- from sklearn.metrics import mean_squared_error
- from sklearn.model_selection import train_test_split
- from sklearn.compose import ColumnTransformer
- from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
- from sklearn.preprocessing import OneHotEncoder
- from sklearn.pipeline import make_pipeline
- from sklearn.pipeline import Pipeline
- pd.set_option('display.float_format', lambda x: '%.3f' % x)
- df = pd.read_csv('train.tsv', sep = '\t')
Our data contains heterogeneous data types: numeric, categorical and text. We want to apply different pre-processing steps and transformations to these different types of columns. For example, we may want to one-hot encode the categorical features (sklearn.preprocessing.OneHotEncoder) and TF-IDF vectorize the text features (sklearn.feature_extraction.text.TfidfVectorizer).
Here, price is the target that we will predict.
Data Pre-processing
Target feature — price
First, let's remove the rows where price is 0 and explore the distribution of the remaining prices:
- df = df.loc[df['price'] > 0]
- sns.distplot(df['price'], fit = norm);
- (mu, sigma) = norm.fit(df['price'])
- print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
- plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
- plt.ylabel('Frequency')
- plt.title('Price distribution')
- fig = plt.figure()
- res = stats.probplot(df['price'], plot=plt)
- plt.show();
- sns.distplot(np.log1p(df['price']), fit = norm);
- (mu, sigma) = norm.fit(np.log1p(df['price']))
- print( '\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
- plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
- plt.ylabel('Frequency')
- plt.title('Log (Price+1) distribution')
- fig = plt.figure()
- res = stats.probplot(np.log1p(df['price']), plot=plt)
- plt.show();
After the log transform, the distribution looks much closer to normal. Let's replace the price column with log(price + 1) in place:
- df["price"] = np.log1p(df["price"])
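To make the effect of the transform concrete, here is a small sketch on synthetic data. The log-normal sample below is an assumption standing in for the real prices; it only illustrates that np.log1p reduces right skew and that np.expm1 undoes it exactly, which is what you would use to turn predictions back into prices.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (an assumption, not the Mercari data)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)

# log1p(x) = log(1 + x), which is also safe for values close to 0
log_prices = np.log1p(prices)

print("skew before: {:.2f}".format(skew(prices)))
print("skew after:  {:.2f}".format(skew(log_prices)))

# expm1 is the exact inverse of log1p
recovered = np.expm1(log_prices)
print(np.allclose(recovered, prices))
```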
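The corresponding code fills in missing values, keeps only the most popular brands, and converts the categorical columns (fillna / value_counts / astype):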
- NUM_BRANDS = 2500
- NAME_MIN_DF = 10
- MAX_FEAT_DESCP = 50000
- df["category_name"] = df["category_name"].fillna("Other").astype("category")
- df["brand_name"] = df["brand_name"].fillna("unknown")
- pop_brands = df["brand_name"].value_counts().index[:NUM_BRANDS]
- df.loc[~df["brand_name"].isin(pop_brands), "brand_name"] = "Other"
- df["item_description"] = df["item_description"].fillna("None")
- df["item_condition_id"] = df["item_condition_id"].astype("category")
- df["brand_name"] = df["brand_name"].astype("category")
- target = df.price.values
- features = df[['name', 'item_condition_id', 'category_name', 'brand_name', 'shipping', 'item_description']].copy()
- X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state=0)
- preprocess = ColumnTransformer(
- [('item_condition_category', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['item_condition_id']),
- ('brand_name_category', OneHotEncoder(dtype='int', handle_unknown='ignore'), ['brand_name']),
- ('category_name_countvec', CountVectorizer(), 'category_name'),
- ('name_countvec', CountVectorizer(min_df=NAME_MIN_DF), 'name'),
- ('description_tfidf', TfidfVectorizer(max_features = MAX_FEAT_DESCP, stop_words = 'english', ngram_range=(1,3)), 'item_description')],
- remainder='passthrough')
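One detail in the transformer list above is easy to miss: the one-hot encoders receive their column as a list (['brand_name']), while the text vectorizers receive a plain string ('name'). A minimal sketch on a toy frame, with invented column values, shows the same pattern at a size that is easy to inspect:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "name": ["red dress", "blue jeans", "red shoes"],
    "brand": ["Acme", "Other", "Acme"],
    "shipping": [0, 1, 1],
})

toy_preprocess = ColumnTransformer(
    [
        # List selector: the encoder is handed a 2-D, one-column frame
        ("brand_onehot", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
        # String selector: the vectorizer is handed a 1-D Series of strings
        ("name_counts", CountVectorizer(), "name"),
    ],
    remainder="passthrough",  # 'shipping' is appended as-is
)

X_toy = toy_preprocess.fit_transform(toy)
print(X_toy.shape)  # 2 brand columns + 5 word columns + 1 shipping column
```

Text vectorizers expect an iterable of strings, so passing the column name as a list would hand them a 2-D frame instead, and they would not see the individual documents.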
We will combine this ColumnTransformer-based preprocessing step with a Ridge regression in a Pipeline to predict the price.
- model = make_pipeline(
- preprocess,
- Ridge(solver = "lsqr", fit_intercept=False))
- model.fit(X_train, y_train)
- y_train_pred = model.predict(X_train)
- y_pred = model.predict(X_test)
- # The target is log1p(price), so both scores are RMSE on the log scale
- train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
- test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
- print('Train RMSE: %.4f' % train_rmse)
- print('Test RMSE: %.4f' % test_rmse)
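Because the model was trained on log1p(price), the RMSE printed above is measured on the log scale. The sketch below, using a handful of made-up prices rather than real model output, shows that this is the same number as the root mean squared log error (RMSLE) computed on prices converted back with np.expm1.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Made-up prices in dollars, for illustration only
true_prices = np.array([10.0, 25.0, 8.0, 120.0])
# Stand-in for model.predict(...) output, which lives on the log1p scale
pred_log = np.log1p(np.array([12.0, 20.0, 9.0, 100.0]))

# RMSE on the log scale, as computed in the post
rmse_log = np.sqrt(mean_squared_error(np.log1p(true_prices), pred_log))

# Convert predictions back to dollars and compute RMSLE on the original scale
pred_prices = np.expm1(pred_log)
rmsle = np.sqrt(mean_squared_log_error(true_prices, pred_prices))

print(np.isclose(rmse_log, rmsle))
```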
The Jupyter notebook can be found on GitHub. Enjoy the rest of the week!
Supplement
* A Data Science Workflow