High-Performance Pandas: eval() and query()
As we’ve already seen in previous chapters, the power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into C via an intuitive syntax: examples are vectorized/broadcasted operations in NumPy, and grouping-type operations in Pandas. While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.
As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays. These are the eval() and query()functions, which rely on the Numexpr package. In this notebook we will walk through their use and give some rules of thumb about when you might think about using them.
Motivating query() and eval(): Compound Expressions
We’ve seen previously that NumPy and Pandas support fast vectorized operations; for example, when you are adding the elements of two arrays:
- In [101]: import numpy as np
- In [102]: rng = np.random.RandomState(42)
- In [107]: x = rng.rand(int(1E6))
- In [108]: y = rng.rand(int(1E6))
- In [109]: %timeit x + y
- 3.64 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- In [110]: %timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))
- 343 ms ± 5.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- In [111]: %timeit mask = (x > 0.5) & (y < 0.5)
- 2.09 ms ± 9.65 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
- In [112]: %timeit tmp1 = (x > 0.5); tmp2 = (y < 0.5); mask = tmp1 & tmp2
- 1.08 ms ± 5.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
- In [12]: np.allclose(mask, mask_numexpr)
- Out[12]: True
pandas.eval() for Efficient Operations
The eval() function in Pandas uses string expressions to efficiently compute operations using DataFrames. For example, consider the following DataFrames:
- In [1]: import pandas as pd
- In [2]: nrows, ncols = 100000, 100
- In [3]: import numpy as np
- In [4]: rng = np.random.RandomState(42)
- In [5]: df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for i in range(4))
- In [6]: %timeit df1 + df2 + df3 + df4
- 70.4 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
- In [7]: %timeit pd.eval('df1 + df2 + df3 + df4')
- 31.8 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Operations supported by pd.eval()
As of Pandas v0.16, pd.eval() supports a wide range of operations. To demonstrate these, we’ll use the following integer DataFrames:
- In [8]: df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3))) for i in range(5))
- In [9]: result1 = -df1 * df2 / (df3 + df4) - df5
- In [10]: result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
- In [11]: np.allclose(result1, result2)
- Out[11]: True
- In [12]: result1 = (df1 < df2) & (df2 <= df3) | (df3 != df4)
- In [13]: result2 = pd.eval('(df1 < df2) & (df2 <= df3) | (df3 != df4)')
- In [15]: np.allclose(result1, result2)
- Out[15]: True
- In [19]: result3 = pd.eval('(df1 < df2) and (df2 <= df3) or (df3 != df4)')
- In [23]: np.allclose(result1, result3)
- Out[23]: True
- In [31]: result1 = df2.T[0] + df3.iloc[1]
- In [32]: result2 = pd.eval('df2.T[0] + df3.iloc[1]')
- In [33]: np.allclose(result1, result2)
- Out[33]: True
DataFrame.eval() for Column-Wise Operations
Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method that works in similar ways. The benefit of the eval() method is that columns can be referred to by name. We’ll use this labeled array as an example:
- In [5]: df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
- In [6]: df.head()
- Out[6]:
- A B C
- 0 0.374540 0.950714 0.731994
- 1 0.598658 0.156019 0.155995
- 2 0.058084 0.866176 0.601115
- 3 0.708073 0.020584 0.969910
- 4 0.832443 0.212339 0.181825
- In [7]: result1 = (df['A'] + df['B']) / (df['C'] - 1)
- In [8]: result1.head()
- Out[8]:
- 0 -4.944867
- 1 -0.894161
- 2 -2.317108
- 3 -24.215802
- 4 -1.276966
- dtype: float64
- In [11]: result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
- In [12]: np.allclose(result1, result2)
- Out[12]: True
- In [13]: result3 = df.eval('(A + B) / (C - 1)')
- In [14]: np.allclose(result1, result3)
- Out[14]: True
Assignment in DataFrame.eval()
In addition to the options just discussed, DataFrame.eval() also allows assignment to any column. Let’s use the DataFrame from before, which has columns 'A', 'B', and 'C':
- In [5]: df.head()
- Out[5]:
- A B C
- 0 0.374540 0.950714 0.731994
- 1 0.598658 0.156019 0.155995
- 2 0.058084 0.866176 0.601115
- 3 0.708073 0.020584 0.969910
- 4 0.832443 0.212339 0.181825
- In [6]: df.eval('D = (A + B) / C', inplace=True)
- In [7]: df.head()
- Out[7]:
- A B C D
- 0 0.374540 0.950714 0.731994 1.810472
- 1 0.598658 0.156019 0.155995 4.837844
- 2 0.058084 0.866176 0.601115 1.537576
- 3 0.708073 0.020584 0.969910 0.751263
- 4 0.832443 0.212339 0.181825 5.746085
- In [8]: df.eval('D = (A - B) / C', inplace=True)
- In [9]: df.head()
- Out[9]:
- A B C D
- 0 0.374540 0.950714 0.731994 -0.787130
- 1 0.598658 0.156019 0.155995 2.837535
- 2 0.058084 0.866176 0.601115 -1.344323
- 3 0.708073 0.020584 0.969910 0.708816
- 4 0.832443 0.212339 0.181825 3.410442
The DataFrame.eval() method supports an additional syntax that lets it work with local Python variables. Consider the following:
- In [14]: column_mean = df.mean(1)
- In [15]: result1 = df['A'] + column_mean
- In [16]: result2 = df.eval('A + @column_mean')
- In [17]: np.allclose(result1, result2)
- Out[17]: True
DataFrame.query() Method
The DataFrame has another method based on evaluated strings, called the query() method. Consider the following:
- In [19]: result1 = df[(df.A < 0.5) & (df.B < 0.5)]
- In [20]: result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
- In [21]: np.allclose(result1, result2)
- Out[21]: True
- In [23]: result2 = df.query('A < 0.5 and B < 0.5')
- In [24]: np.allclose(result1, result2)
- Out[24]: True
- In [26]: Cmean
- Out[26]: 0.4934747264008902
- In [27]: result1 = df[(df.A < Cmean) & (df.B < Cmean)]
- In [28]: result2 = df.query('A < @Cmean and B < @Cmean')
- In [29]: np.allclose(result1, result2)
- Out[29]: True
When considering whether to use these functions, there are two considerations: computation time and memory use. Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas DataFrames will result in implicit creation of temporary arrays: For example, this:
- In [30]: x = df[(df.A < 0.5) & (df.B < 0.5)]
- tmp1 = df.A < 0.5
- tmp2 = df.B < 0.5
- tmp3 = tmp1 & tmp2
- x = df[tmp3]
- In [32]: df.values.nbytes
- Out[32]: 32000
We’ve covered most of the details of eval() and query() here; for more information on these, you can refer to the Pandas documentation. In particular, different parsers and engines can be specified for running these queries; for details on this, see the discussion within the “Enhancing Performance” section.
Further Resources
We’ve covered many of the basics of using Pandas effectively for data analysis. Still, much has been omitted from our discussion. To learn more about Pandas, I recommend the following resources:
Pandas online documentation
Python for Data Analysis
Pandas on Stack Overflow
Pandas on PyVideo
My hope is that, by using these resources, combined with the walk-through given in this chapter, you’ll be poised to use Pandas to tackle any data analysis problem you come across!
沒有留言:
張貼留言