程式扎記

This message was edited 3 times. Last update was at 12/09/2020 15:17:04

Question
I want to apply scaling (using StandardScaler) to a pandas DataFrame. The following code returns a numpy array, so I lose all the column names and indices. This is not what I want:

features = df[["col1", "col2", "col3", "col4"]]  
autoscaler = StandardScaler()  
features = autoscaler.fit_transform(features)  

A "solution" I found online is:

features = features.apply(lambda x: autoscaler.fit_transform(x))  

How do I apply scaling to the pandas DataFrame while leaving the DataFrame intact, without copying the data if possible?

How-To
You can convert the numpy array back to a DataFrame, as shown below:

>>> import pandas as pd
>>> import numpy as np
>>> datas = np.random.randint(0, 100, size=(10, 4))
>>> datas
array([[48, 66, 39, 43],
       [30, 38, 22, 55],
       [93, 71, 25, 42],
       [26, 94, 31, 42],
       [97, 14, 10, 76],
       [27, 90, 83,  4],
       [48, 40, 13, 81],
       [48, 79, 19, 51],
       [13,  2, 27, 19],
       [75, 49, 58, 89]])

>>> df = pd.DataFrame(datas, index=range(10, 20), columns=['c1', 'c2', 'c3', 'c4'], dtype='float64')
>>> df
      c1    c2    c3    c4
10  48.0  66.0  39.0  43.0
11  30.0  38.0  22.0  55.0
12  93.0  71.0  25.0  42.0
13  26.0  94.0  31.0  42.0
14  97.0  14.0  10.0  76.0
15  27.0  90.0  83.0   4.0
16  48.0  40.0  13.0  81.0
17  48.0  79.0  19.0  51.0
18  13.0   2.0  27.0  19.0
19  75.0  49.0  58.0  89.0

Now call fit_transform on the DataFrame's values to get the scaled_features array:

from sklearn.preprocessing import StandardScaler  
scaled_features = StandardScaler().fit_transform(df.values)  
print(scaled_features[:3, :])  # a plain ndarray: the indices are lost

This gives an array like the following (the exact values depend on the random data generated above):

array([[-1.89007341,  0.05636005,  1.74514417,  0.46669562],  
       [ 1.26558518, -1.35264122,  0.82178747,  0.59282958],  
       [ 0.93341059,  0.37841748, -0.60941542,  0.59282958]])  
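
For reference, StandardScaler standardizes each column as z = (x - mean) / std, using the population standard deviation (ddof=0). The sketch below does the same computation in pandas alone, so the index and column names are never dropped. The small frame is just the first three rows of c1 and c2 copied from the frame above for brevity, so the resulting values will not match a scaling of the full data.

```python
import numpy as np
import pandas as pd

# First three rows of c1 and c2 from the frame above, kept small for illustration.
df = pd.DataFrame(
    {"c1": [48.0, 30.0, 93.0], "c2": [66.0, 38.0, 71.0]},
    index=[10, 11, 12],
)

# Column-wise z-score; ddof=0 matches StandardScaler's population std.
zscores = (df - df.mean()) / df.std(ddof=0)

print(zscores.index.tolist())    # row labels are preserved
print(zscores.columns.tolist())  # column names are preserved
```

This is only a way to see what the scaler computes. It does not give you a fitted scaler object that can be reused on new data.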

Assign the scaled data to a DataFrame (Note: use the index and columns keyword arguments to keep your original indices and column names):

scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)  
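
The steps above can be collected into one small function. The following is a minimal sketch: scale_dataframe is a hypothetical helper name (it is not part of scikit-learn or pandas), and the seeded random data only stands in for the frame shown earlier.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_dataframe(df, columns=None):
    """Return a scaled copy of df[columns] that keeps the original labels.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    columns = list(df.columns) if columns is None else list(columns)
    scaled = StandardScaler().fit_transform(df[columns])  # plain ndarray
    # Re-attach the row index and column names that the ndarray dropped.
    return pd.DataFrame(scaled, index=df.index, columns=columns)


# Stand-in data with the same shape and labels as the example frame.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 100, size=(10, 4)).astype("float64"),
    index=range(10, 20),
    columns=["c1", "c2", "c3", "c4"],
)

scaled_df = scale_dataframe(df, columns=["c1", "c2"])
print(scaled_df.head(3))
```

Note that this builds a new DataFrame, so the data is copied and the original df is left unchanged.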

Another option is the sklearn-pandas package, which focuses on making scikit-learn easier to use with pandas. It is especially useful when you need to apply more than one type of transformation to different column subsets of the DataFrame, which is the more common scenario. The package is documented, but this is how you would achieve the transformation performed above:

from sklearn_pandas import DataFrameMapper  
  
mapper = DataFrameMapper([(df.columns, StandardScaler())])  
scaled_features = mapper.fit_transform(df.copy(), 4)  
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)  

Saturday, September 12, 2020

[sklearn FAQ] How to use sklearn fit_transform with pandas and return dataframe instead of numpy
