There are several ways to append or combine different feature types. One is scipy.sparse.hstack, which stacks sparse matrices horizontally. However, I will introduce a different one, the new hot baby on the block: the ColumnTransformer class from sklearn. If you would like to try it out, you will need scikit-learn 0.20 or later.
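For comparison, here is a minimal sketch of the scipy.sparse.hstack approach. The two matrices are made-up toy values standing in for, say, a text feature block and a categorical feature block; they are not taken from the Mercari data.

```python
import numpy as np
from scipy import sparse

# Two toy feature blocks for the same three rows (made-up values).
text_features = sparse.csr_matrix(np.array([[1, 0, 2], [0, 0, 3], [4, 0, 0]]))
category_features = sparse.csr_matrix(np.array([[1, 0], [0, 1], [1, 0]]))

# hstack concatenates column-wise, so both blocks must have the same number of rows.
combined = sparse.hstack([text_features, category_features], format="csr")
print(combined.shape)
```

This works, but you have to fit and apply each transformer yourself and keep track of the column order by hand.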
An excellent data set for this demonstration is the Mercari Price Suggestion Challenge, which asks you to build a machine learning model that automatically suggests the right product prices. The data can be found here:
Our data contains heterogeneous data types: numeric, categorical and text. We want to apply different pre-processing steps and transformations to those different types of columns. For example, we may want to one-hot encode the categorical features (sklearn.preprocessing.OneHotEncoder) and TF-IDF vectorize the text features (sklearn.feature_extraction.text.TfidfVectorizer).
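A minimal sketch of that idea on a toy DataFrame. The column names (brand, description, shipping) and the values are placeholders I made up for illustration, not necessarily the columns used in the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy data with one categorical, one text and one numeric column (made-up values).
toy = pd.DataFrame({
    "brand": ["acme", "zenith", "acme"],
    "description": ["red cotton shirt", "blue denim jeans", "red wool scarf"],
    "shipping": [1, 0, 1],
})

preprocess = ColumnTransformer(
    transformers=[
        # OneHotEncoder expects 2-D input, so the column is given as a list.
        ("brand_onehot", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
        # TfidfVectorizer expects a 1-D sequence of strings, so the column is given as a plain string.
        ("desc_tfidf", TfidfVectorizer(), "description"),
    ],
    remainder="passthrough",  # keep the numeric column unchanged
)

features = preprocess.fit_transform(toy)
print(features.shape)
```

Note the difference between passing `["brand"]` and `"description"`: a list selects a 2-D slice, while a single string selects a 1-D column, which is what the text vectorizer needs.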
Here “Price” is the target feature that we will predict.
Target feature — price
Firstly, let's remove the rows where price = 0 and explore the distribution of price:
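A small sketch of the filtering step, assuming the data sits in a pandas DataFrame with a price column. The values below are made up so that the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the training data; only the price column matters here.
toy = pd.DataFrame({"price": [0.0, 10.0, 25.0, 0.0, 3.0]})

# Keep only the rows with a non-zero price.
nonzero = toy[toy["price"] != 0].reset_index(drop=True)

# Summary statistics give a quick look at the distribution before plotting it.
print(nonzero["price"].describe())
```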
The shape of the graph looks better. Let's replace the price field with log(price + 1) in place:
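One way to do this is with numpy's log1p, which computes log(1 + x). The sketch below again uses made-up prices and also shows how to get back to the original scale with expm1.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": [3.0, 10.0, 25.0]})

# Overwrite price with log(price + 1).
toy["price"] = np.log1p(toy["price"])

# expm1 is the inverse transform, useful for turning predictions back into prices.
recovered = np.expm1(toy["price"])
print(recovered.round(2).tolist())
```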
Corresponding code: (fillna/astype/value_counts)
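The original code is not reproduced here, so the following is only a guess at what a fillna/astype/value_counts step typically looks like. The column names brand and condition and their values are hypothetical.

```python
import pandas as pd

# Hypothetical columns: a text-like categorical with a gap, and an integer-coded categorical.
toy = pd.DataFrame({
    "brand": ["acme", None, "zenith", "acme"],
    "condition": [1, 3, 2, 1],
})

# Replace missing values with an explicit placeholder label.
toy["brand"] = toy["brand"].fillna("missing")

# Treat the integer code as a category rather than a number.
toy["condition"] = toy["condition"].astype("category")

# Count how often each label occurs.
brand_counts = toy["brand"].value_counts()
print(brand_counts)
```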
We will combine this ColumnTransformer-based preprocessing step with a regression in a Pipeline to predict the price.
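A minimal sketch of such a pipeline on toy data. Ridge is my choice of regressor for illustration and may not be the one used in the notebook; the column names and values are also made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made-up values); price is the target.
toy = pd.DataFrame({
    "brand": ["acme", "zenith", "acme", "zenith"],
    "description": ["red cotton shirt", "blue denim jeans",
                    "red wool scarf", "black leather boots"],
    "price": [10.0, 40.0, 15.0, 80.0],
})
X = toy[["brand", "description"]]
y = np.log1p(toy["price"])  # train on log(price + 1)

model = Pipeline(steps=[
    ("preprocess", ColumnTransformer(transformers=[
        ("brand_onehot", OneHotEncoder(handle_unknown="ignore"), ["brand"]),
        ("desc_tfidf", TfidfVectorizer(), "description"),
    ])),
    ("regress", Ridge(alpha=1.0)),  # assumption: any scikit-learn regressor could go here
])

model.fit(X, y)

# Predictions come out on the log scale, so convert them back to prices.
predicted_price = np.expm1(model.predict(X))
print(predicted_price.round(2))
```

Because the preprocessing lives inside the Pipeline, calling fit or predict on raw DataFrames applies the same transformations every time, which also makes cross-validation straightforward.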
The Jupyter notebook can be found on GitHub. Enjoy the rest of the week!