2019年1月22日 星期二

[ Python FAQ ] Efficiently calculate the difference between two rows in a DataFrame

Source From Here 
Question 
Suppose I have a DataFrame like the one below:
    >>> import pandas as pd
    >>> import numpy as np
    >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['f1', 'f2'], index=['r1', 'r2'])
    >>> df
        f1  f2
    r1   1   2
    r2   3   4

How can I efficiently calculate the absolute difference between rows r1 and r2 and store the result in a new row r3? That is, the result should look like this:
    >>> diff_dat = []
    >>> for cn in df.columns:
    ...     diff_dat.append(abs(df[cn]['r1'] - df[cn]['r2']))
    ...
    >>> diff_dat
    [2, 2]
    >>> df.append(pd.DataFrame([diff_dat], index=['r3'], columns=df.columns))
        f1  f2
    r1   1   2
    r2   3   4
    r3   2   2
How-To 
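You can use diff(). Note that the appended row keeps the label r2 rather than r3, and the values become floats because diff() introduces NaN:
    In [576]: df.append(df.diff().dropna().abs())
    Out[583]:
         f1   f2
    r1  1.0  2.0
    r2  3.0  4.0
    r2  2.0  2.0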
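DataFrame.append was removed in pandas 2.0, so the call above only runs on older versions. A minimal sketch of the same idea with pd.concat follows. The rename step that relabels the difference row as r3 is an addition for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['f1', 'f2'], index=['r1', 'r2'])

# diff() subtracts the previous row; the first row becomes NaN and is dropped.
delta = df.diff().dropna().abs()

# Added step (not in the original answer): relabel the surviving row 'r2' as 'r3'.
delta = delta.rename(index={'r2': 'r3'})

# pd.concat takes the place of the removed DataFrame.append.
result = pd.concat([df, delta])
print(result)
```

The values in the result are floats, as in the output above, because the intermediate diff() frame contains NaN.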
Alternatively, use loc to select the two rows, subtract them, take the absolute value, and finally add the new row by setting with enlargement:
    df.loc['r3'] = (df.loc['r1'] - df.loc['r2']).abs()
    print(df)
        f1  f2
    r1   1   2
    r2   3   4
    r3   2   2
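The loc approach can be wrapped in a small helper that works for any pair of row labels. This is only a sketch: the function name add_abs_diff_row and its parameters are hypothetical, and it works on a copy so the input DataFrame is left unchanged.

```python
import pandas as pd


def add_abs_diff_row(frame, a, b, new_label):
    """Return a copy of frame with |row a - row b| stored under new_label."""
    out = frame.copy()  # leave the caller's DataFrame untouched
    # Setting with enlargement: assigning to a label that does not exist adds a row.
    out.loc[new_label] = (out.loc[a] - out.loc[b]).abs()
    return out


df = pd.DataFrame([[1, 2], [3, 4]], columns=['f1', 'f2'], index=['r1', 'r2'])
result = add_abs_diff_row(df, 'r1', 'r2', 'r3')
print(result)
```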
Performance for 1000 columns: 
    np.random.seed(123)
    df = pd.DataFrame(np.random.randint(10, size=(2, 1000)), index=['r1', 'r2']).add_prefix('f') - 5

    # Mayank Porwal's solution
    In [40]: %timeit df.append(df.diff().dropna().abs())
    1.51 ms ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    # jezrael's solution
    In [41]: %timeit df.loc['r3'] = (df.loc['r1'] - df.loc['r2']).abs()
    663 µs ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

    # NaT3z's solution
    In [42]: %timeit df.loc["r3"] = df.apply(lambda c: abs(c["r1"] - c["r2"]), axis=0)
    967 µs ± 80.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
To improve performance further, you can use NumPy:
    In [49]: %timeit df.loc['r3'] = np.abs(df.loc['r1'].values - df.loc['r2'].values)
    414 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
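The %timeit magic only works inside IPython. As a rough sketch, the pandas and NumPy variants can also be compared in a plain script with the standard timeit module. The function names with_pandas and with_numpy are hypothetical, to_numpy() is used here in place of .values, and the measured times will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
base = pd.DataFrame(np.random.randint(10, size=(2, 1000)), index=['r1', 'r2']).add_prefix('f') - 5


def with_pandas(frame):
    # Subtract the two rows as Series, then take the absolute value.
    frame.loc['r3'] = (frame.loc['r1'] - frame.loc['r2']).abs()
    return frame


def with_numpy(frame):
    # Subtract the underlying arrays directly, skipping index alignment.
    frame.loc['r3'] = np.abs(frame.loc['r1'].to_numpy() - frame.loc['r2'].to_numpy())
    return frame


a = with_pandas(base.copy())
b = with_numpy(base.copy())
same = bool((a.loc['r3'].to_numpy() == b.loc['r3'].to_numpy()).all())

runs = 200
t_pandas = timeit.timeit(lambda: with_pandas(a), number=runs)
t_numpy = timeit.timeit(lambda: with_numpy(b), number=runs)
print(f"same result: {same}")
print(f"pandas: {t_pandas / runs * 1e6:.0f} µs per call")
print(f"numpy:  {t_numpy / runs * 1e6:.0f} µs per call")
```

The first call to each function adds row r3, and the timed calls then overwrite it, which mirrors what the %timeit loops above do.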

Supplement 
Pandas Doc - Indexing and Selecting Data
