Saturday, September 12, 2020

[ sklearn FAQ ] How to use sklearn fit_transform with pandas and return dataframe instead of numpy

 Source From Here

Question
I want to apply scaling (using StandardScaler) to a pandas dataframe. The following code returns a numpy array, so I lose all the column names and indices. This is not what I want:
features = df[["col1", "col2", "col3", "col4"]]
autoscaler = StandardScaler()
features = autoscaler.fit_transform(features)
A "solution" I found online is:
features = features.apply(lambda x: autoscaler.fit_transform(x))
How do I apply scaling to the pandas dataframe, leaving the dataframe intact? Without copying the data if possible.

How-To
You could convert the numpy array back to a DataFrame as below:
>>> import pandas as pd
>>> import numpy as np
>>> datas = np.random.randint(0, 100, size=(10, 4))
>>> datas
array([[48, 66, 39, 43],
       [30, 38, 22, 55],
       [93, 71, 25, 42],
       [26, 94, 31, 42],
       [97, 14, 10, 76],
       [27, 90, 83,  4],
       [48, 40, 13, 81],
       [48, 79, 19, 51],
       [13,  2, 27, 19],
       [75, 49, 58, 89]])

>>> df = pd.DataFrame(datas, index=range(10, 20), columns=['c1', 'c2', 'c3', 'c4'], dtype='float64')
>>> df
      c1    c2    c3    c4
10  48.0  66.0  39.0  43.0
11  30.0  38.0  22.0  55.0
12  93.0  71.0  25.0  42.0
13  26.0  94.0  31.0  42.0
14  97.0  14.0  10.0  76.0
15  27.0  90.0  83.0   4.0
16  48.0  40.0  13.0  81.0
17  48.0  79.0  19.0  51.0
18  13.0   2.0  27.0  19.0
19  75.0  49.0  58.0  89.0

Now fit_transform the DataFrame to get the scaled_features array:
from sklearn.preprocessing import StandardScaler

scaled_features = StandardScaler().fit_transform(df.values)
print(scaled_features[:3, :])  # the indices and column names are lost
Will get:
array([[-1.89007341,  0.05636005,  1.74514417,  0.46669562],
       [ 1.26558518, -1.35264122,  0.82178747,  0.59282958],
       [ 0.93341059,  0.37841748, -0.60941542,  0.59282958]])
Assign the scaled data to a DataFrame (Note: use the index and columns keyword arguments to keep your original indices and column names):
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
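Since the question also asks about leaving the dataframe intact without copying it, here is a minimal sketch (assuming every column is numeric, and reusing the df defined above) that writes the scaled values straight back into the existing frame instead of building a new one:

from sklearn.preprocessing import StandardScaler

# Write the scaled values back into the same DataFrame object:
# the index, the column names and the df object itself are preserved.
scaler = StandardScaler()
df.loc[:, :] = scaler.fit_transform(df.values)

print(df.head(3))  # still a DataFrame with the original index and columns

The fitted scaler keeps its statistics (mean_, scale_), so scaler.inverse_transform can later recover the original values if needed.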
I also came across the sklearn-pandas package, which focuses on making scikit-learn easier to use with pandas. sklearn-pandas is especially useful when you need to apply more than one type of transformation to column subsets of a DataFrame, which is a common scenario in practice. It is documented, and this is how you would perform the transformation we just did:
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([(df.columns, StandardScaler())])
scaled_features = mapper.fit_transform(df.copy())
scaled_features_df = pd.DataFrame(scaled_features, index=df.index, columns=df.columns)
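As a sketch of that more common scenario of mixing transformers on column subsets (the column names c1-c4 come from the example above; exact behaviour may differ slightly between sklearn-pandas versions), you could standardize some columns and min-max scale the others in a single mapper:

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn_pandas import DataFrameMapper
import pandas as pd

# One mapper, two different transformations on different column subsets.
mapper = DataFrameMapper([
    (['c1', 'c2'], StandardScaler()),   # standardize c1 and c2
    (['c3', 'c4'], MinMaxScaler()),     # rescale c3 and c4 to [0, 1]
])
mixed = mapper.fit_transform(df.copy())
mixed_df = pd.DataFrame(mixed, index=df.index, columns=['c1', 'c2', 'c3', 'c4'])

Depending on the sklearn-pandas version, DataFrameMapper also accepts a df_out=True argument that makes fit_transform return a DataFrame directly, which removes the manual pd.DataFrame(...) reconstruction step.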
