程式扎記: [ Py DS ] Ch3 - Data Manipulation with Pandas (Part7)

Source From Here

High-Performance Pandas: eval() and query()
As we’ve already seen in previous chapters, the power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into C via an intuitive syntax: examples are vectorized/broadcasted operations in NumPy, and grouping-type operations in Pandas. While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.

As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays. These are the eval() and query()functions, which rely on the Numexpr package. In this notebook we will walk through their use and give some rules of thumb about when you might think about using them.

Motivating query() and eval(): Compound Expressions
We’ve seen previously that NumPy and Pandas support fast vectorized operations; for example, when you are adding the elements of two arrays:

view plaincopy to clipboardprint?
In [101]: import numpy as np  
In [102]: rng = np.random.RandomState(42)  
In [107]: x = rng.rand(int(1E6))  
In [108]: y = rng.rand(int(1E6))  
  
In [109]: %timeit x + y  
3.64 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  

As discussed in “Computation on NumPy Arrays: Universal Functions” on page 50, this is much faster than doing the addition via a Python loop or comprehension:

view plaincopy to clipboardprint?
In [110]: %timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))  
343 ms ± 5.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  

But this abstraction can become less efficient when you are computing compound expressions. For example, consider the following expression:

view plaincopy to clipboardprint?
In [111]: %timeit mask = (x > 0.5) & (y < 0.5)  
2.09 ms ± 9.65 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  

Because NumPy evaluates each subexpression, this is roughly equivalent to the following:

view plaincopy to clipboardprint?
In [112]: %timeit tmp1 = (x > 0.5); tmp2 = (y < 0.5); mask = tmp1 & tmp2  
1.08 ms ± 5.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  

In other words, every intermediate step is explicitly allocated in memory. If the x and y arrays are very large, this can lead to significant memory and computational overhead. The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays. The Numexpr documentation has more details, but for the time being it is sufficient to say that the library accepts a string giving the NumPy-style expression you’d like to compute:

view plaincopy to clipboardprint?
In [12]: np.allclose(mask, mask_numexpr)  
Out[12]: True  

The benefit here is that Numexpr evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays. The Pandas eval() and query() tools that we will discuss here are conceptually similar, and depend on the Numexpr package.

pandas.eval() for Efficient Operations
The eval() function in Pandas uses string expressions to efficiently compute operations using DataFrames. For example, consider the following DataFrames:

view plaincopy to clipboardprint?
In [1]: import pandas as pd  
In [2]: nrows, ncols = 100000, 100  
In [3]: import numpy as np  
In [4]: rng = np.random.RandomState(42)  
In [5]: df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols)) for i in range(4))  

To compute the sum of all four DataFrames using the typical Pandas approach, we can just write the sum:

view plaincopy to clipboardprint?
In [6]: %timeit df1 + df2 + df3 + df4  
70.4 ms ± 246 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  

We can compute the same result via pd.eval by constructing the expression as a string:

view plaincopy to clipboardprint?
In [7]: %timeit pd.eval('df1 + df2 + df3 + df4')  
31.8 ms ± 407 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  

The eval() version of this expression is about 50% faster (and uses much less memory), while giving the same result.

Operations supported by pd.eval()
As of Pandas v0.16, pd.eval() supports a wide range of operations. To demonstrate these, we’ll use the following integer DataFrames:

view plaincopy to clipboardprint?
In [8]: df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3))) for i in range(5))  

Arithmetic operators. pd.eval() supports all arithmetic operators. For example:

view plaincopy to clipboardprint?
In [9]: result1 = -df1 * df2 / (df3 + df4) - df5  
In [10]: result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')  
In [11]: np.allclose(result1, result2)  
Out[11]: True  

Comparison operators. pd.eval() supports all comparison operators, including chained expressions:

view plaincopy to clipboardprint?
In [12]: result1 = (df1 < df2) & (df2 <= df3) | (df3 != df4)  
In [13]: result2 = pd.eval('(df1 < df2) & (df2 <= df3) | (df3 != df4)')  
In [15]: np.allclose(result1, result2)  
Out[15]: True  

In addition, it supports the use of the literal and and or in Boolean expressions:

view plaincopy to clipboardprint?
In [19]: result3 = pd.eval('(df1 < df2) and (df2 <= df3) or (df3 != df4)')  
In [23]: np.allclose(result1, result3)  
Out[23]: True  

Object attributes and indices. pd.eval() supports access to object attributes via the obj.attr syntax, and indexes via the obj[index] syntax:

view plaincopy to clipboardprint?
In [31]: result1 = df2.T[0] + df3.iloc[1]  
In [32]: result2 = pd.eval('df2.T[0] + df3.iloc[1]')  
In [33]: np.allclose(result1, result2)  
Out[33]: True  

Other operations. Other operations, such as function calls, conditional statements, loops, and other more involved constructs, are currently not implemented in pd.eval(). If you’d like to execute these more complicated types of expressions, you can use the Numexpr library itself.

DataFrame.eval() for Column-Wise Operations
Just as Pandas has a top-level pd.eval() function, DataFrames have an eval() method that works in similar ways. The benefit of the eval() method is that columns can be referred to by name. We’ll use this labeled array as an example:

view plaincopy to clipboardprint?
In [5]: df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])  
In [6]: df.head()  
Out[6]:  
          A         B         C  
0  0.374540  0.950714  0.731994  
1  0.598658  0.156019  0.155995  
2  0.058084  0.866176  0.601115  
3  0.708073  0.020584  0.969910  
4  0.832443  0.212339  0.181825  

Using pd.eval() as above, we can compute expressions with the three columns like this:

view plaincopy to clipboardprint?
In [7]: result1 = (df['A'] + df['B']) / (df['C'] - 1)  
In [8]: result1.head()  
Out[8]:  
0    -4.944867  
1    -0.894161  
2    -2.317108  
3   -24.215802  
4    -1.276966  
dtype: float64  
  
In [11]: result2 = pd.eval("(df.A + df.B) / (df.C - 1)")  
In [12]: np.allclose(result1, result2)  
Out[12]: True  

The DataFrame.eval() method allows much more succinct evaluation of expressions with the columns:

view plaincopy to clipboardprint?
In [13]: result3 = df.eval('(A + B) / (C - 1)')  
In [14]: np.allclose(result1, result3)  
Out[14]: True  

Notice here that we treat column names as variables within the evaluated expression, and the result is what we would wish.

Assignment in DataFrame.eval()
In addition to the options just discussed, DataFrame.eval() also allows assignment to any column. Let’s use the DataFrame from before, which has columns 'A', 'B', and 'C':

view plaincopy to clipboardprint?
In [5]: df.head()  
Out[5]:  
          A         B         C  
0  0.374540  0.950714  0.731994  
1  0.598658  0.156019  0.155995  
2  0.058084  0.866176  0.601115  
3  0.708073  0.020584  0.969910  
4  0.832443  0.212339  0.181825  

We can use df.eval() to create a new column 'D' and assign to it a value computed from the other columns:

view plaincopy to clipboardprint?
In [6]: df.eval('D = (A + B) / C', inplace=True)  
In [7]: df.head()  
Out[7]:  
          A         B         C         D  
0  0.374540  0.950714  0.731994  1.810472  
1  0.598658  0.156019  0.155995  4.837844  
2  0.058084  0.866176  0.601115  1.537576  
3  0.708073  0.020584  0.969910  0.751263  
4  0.832443  0.212339  0.181825  5.746085  

In the same way, any existing column can be modified:

view plaincopy to clipboardprint?
In [8]: df.eval('D = (A - B) / C', inplace=True)  
In [9]: df.head()  
Out[9]:  
          A         B         C         D  
0  0.374540  0.950714  0.731994 -0.787130  
1  0.598658  0.156019  0.155995  2.837535  
2  0.058084  0.866176  0.601115 -1.344323  
3  0.708073  0.020584  0.969910  0.708816  
4  0.832443  0.212339  0.181825  3.410442  

Local variables in DataFrame.eval()
The DataFrame.eval() method supports an additional syntax that lets it work with local Python variables. Consider the following:

view plaincopy to clipboardprint?
In [14]: column_mean = df.mean(1)  
In [15]: result1 = df['A'] + column_mean  
In [16]: result2 = df.eval('A + @column_mean')  
In [17]: np.allclose(result1, result2)  
Out[17]: True  

The @ character here marks a variable name rather than a column name, and lets you efficiently evaluate expressions involving the two “namespaces”: the namespace of columns, and the namespace of Python objects. Notice that this @ character is only supported by the DataFrame.eval() method, not by the pandas.eval() function, because the pandas.eval() function only has access to the one (Python) namespace.

DataFrame.query() Method
The DataFrame has another method based on evaluated strings, called the query() method. Consider the following:

view plaincopy to clipboardprint?
In [19]: result1 = df[(df.A < 0.5) & (df.B < 0.5)]  
In [20]: result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')  
In [21]: np.allclose(result1, result2)  
Out[21]: True  

As with the example used in our discussion of DataFrame.eval(), this is an expression involving columns of the DataFrame. It cannot be expressed using the DataFrame.eval() syntax, however! Instead, for this type of filtering operation, you can use the query() method:

view plaincopy to clipboardprint?
In [23]: result2 = df.query('A < 0.5 and B < 0.5')  
In [24]: np.allclose(result1, result2)  
Out[24]: True  

In addition to being a more efficient computation, compared to the masking expression this is much easier to read and understand. Note that the query() method also accepts the @ flag to mark local variables:

view plaincopy to clipboardprint?
In [26]: Cmean  
Out[26]: 0.4934747264008902  
  
In [27]: result1 = df[(df.A < Cmean) & (df.B < Cmean)]  
In [28]: result2 = df.query('A < @Cmean and B < @Cmean')  
In [29]: np.allclose(result1, result2)  
Out[29]: True  

Performance: When to Use These Functions
When considering whether to use these functions, there are two considerations: computation time and memory use. Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas DataFrames will result in implicit creation of temporary arrays: For example, this:

view plaincopy to clipboardprint?
In [30]: x = df[(df.A < 0.5) & (df.B < 0.5)]  

is roughly equivalent to this:

view plaincopy to clipboardprint?
tmp1 = df.A < 0.5  
tmp2 = df.B < 0.5  
tmp3 = tmp1 & tmp2  
x = df[tmp3]  

If the size of the temporary DataFrames is significant compared to your available system memory (typically several gigabytes), then it’s a good idea to use an eval() or query() expression. You can check the approximate size of your array in bytes using this:

view plaincopy to clipboardprint?
In [32]: df.values.nbytes  
Out[32]: 32000  

On the performance side, eval() can be faster even when you are not maxing out your system memory. The issue is how your temporary DataFrames compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if they are much bigger, then eval() can avoid some potentially slow movement of values between the different memory caches. In practice, I find that the difference in computation time between the traditional methods and the eval/query method is usually not significant—if anything, the traditional method is faster for smaller arrays! The benefit of eval/query is mainly in the saved memory, and the sometimes cleaner syntax they offer.

We’ve covered most of the details of eval() and query() here; for more information on these, you can refer to the Pandas documentation. In particular, different parsers and engines can be specified for running these queries; for details on this, see the discussion within the “Enhancing Performance” section.

Further Resources
We’ve covered many of the basics of using Pandas effectively for data analysis. Still, much has been omitted from our discussion. To learn more about Pandas, I recommend the following resources:

Pandas online documentation

This is the go-to source for complete documentation of the package. While the examples in the documentation tend to be small generated datasets, the description of the options is complete and generally very useful for understanding the use of various functions.

Python for Data Analysis

Written by Wes McKinney (the original creator of Pandas), this book contains much more detail on the package than we had room for in this chapter. In particular, he takes a deep dive into tools for time series, which were his bread and butter as a financial consultant. The book also has many entertaining examples of applying Pandas to gain insight from real-world datasets. Keep in mind, though, that the book is now several years old, and the Pandas package has quite a few new features that this book does not cover (but be on the lookout for a new edition in 2017).

Pandas on Stack Overflow

Pandas has so many users that any question you have has likely been asked and answered on Stack Overflow. Using Pandas is a case where some Google-Fu is your best friend. Simply go to your favorite search engine and type in the question, problem, or error you’re coming across—more than likely you’ll find your answer on a Stack Overflow page.

Pandas on PyVideo

From PyCon to SciPy to PyData, many conferences have featured tutorials from Pandas developers and power users. The PyCon tutorials in particular tend to be given by very well-vetted presenters.

My hope is that, by using these resources, combined with the walk-through given in this chapter, you’ll be poised to use Pandas to tackle any data analysis problem you come across!

程式扎記

標籤

2019年1月4日星期五

[ Py DS ] Ch3 - Data Manipulation with Pandas (Part7)

沒有留言:

張貼留言

[Git 常見問題] error: The following untracked working tree files would be overwritten by merge

檢舉濫用情形

學習筆記

標籤

2019年1月4日 星期五