Pivot Tables
We have seen how the GroupBy abstraction lets us explore relationships within a dataset. A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple columnwise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. The difference between pivot tables and GroupBy can sometimes cause confusion; it helps me to think of pivot tables as essentially a multidimensional version of GroupBy aggregation. That is, you split-apply-combine, but both the split and the combine happen not across a one-dimensional index, but across a two-dimensional grid.
Motivating Pivot Tables
For the examples in this section, we’ll use the database of passengers on the Titanic, available through the Seaborn library (see “Visualization with Seaborn” on page 311):
- In [1]: import numpy as np
- In [2]: import pandas as pd
- In [3]: import seaborn as sns
- In [4]: titanic = sns.load_dataset('titanic')
- In [5]: titanic.head()
- Out[5]:
- survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
- 0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
- 1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
- 2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
- 3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
- 4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
Pivot Tables by Hand
To start learning more about this data, we might begin by grouping it according to gender, survival status, or some combination thereof. If you have read the previous section, you might be tempted to apply a GroupBy operation. For example, let's look at survival rate by gender:
- In [7]: titanic.groupby('sex')[['survived']].mean()
- Out[7]:
- survived
- sex
- female 0.742038
- male 0.188908
This is useful, but we might like to go one step deeper and look at survival by both sex and, say, class. Using the vocabulary of GroupBy, we might proceed using something like this: we group by class and gender, select survival, apply a mean aggregate, combine the resulting groups, and then unstack the hierarchical index to reveal the hidden multidimensionality. In code:
- In [8]: titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
- Out[8]:
- class First Second Third
- sex
- female 0.968085 0.921053 0.500000
- male 0.368852 0.157407 0.135447
Pivot Table Syntax
Here is the equivalent to the preceding operation using the pivot_table method of DataFrames:
- In [5]: titanic.pivot_table('survived', index='sex', columns='class')
- Out[5]:
- class First Second Third
- sex
- female 0.968085 0.921053 0.500000
- male 0.368852 0.157407 0.135447
Multilevel pivot tables
Just as in GroupBy, the grouping in pivot tables can be specified with multiple levels, and via a number of options. For example, we might be interested in looking at age as a third dimension. We'll bin the age using the pd.cut function:
- In [6]: age = pd.cut(titanic['age'], [0, 18, 80])
- In [8]: age.head()
- Out[8]:
- 0 (18, 80]
- 1 (18, 80]
- 2 (18, 80]
- 3 (18, 80]
- 4 (18, 80]
- Name: age, dtype: category
- Categories (2, interval[int64]): [(0, 18] < (18, 80]]
- In [9]: titanic.pivot_table('survived', ['sex', age], 'class')
- Out[9]:
- class First Second Third
- sex age
- female (0, 18] 0.909091 1.000000 0.511628
- (18, 80] 0.972973 0.900000 0.423729
- male (0, 18] 0.800000 0.600000 0.215686
- (18, 80] 0.375000 0.071429 0.133663
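We can apply the same strategy to the columns as well. As a minimal sketch, pd.qcut can automatically bin the fares into two quantiles, which then serve as a second column level alongside class:
- fare = pd.qcut(titanic['fare'], 2)  # split fares at the median into two bins
- titanic.pivot_table('survived', ['sex', age], [fare, 'class'])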
Additional pivot table options
The full call signature of the pivot_table method of DataFrames is as follows:
- '''
- Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.
- '''
- DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
The aggfunc keyword controls what type of aggregation is applied, which is a mean by default. As in GroupBy, the aggregation specification can be a string representing one of several common choices ('sum', 'mean', 'count', 'min', 'max', etc.) or a function that implements an aggregation (np.sum, min, sum, etc.). Additionally, it can be specified as a dictionary mapping a column to any of the desired options:
- In [2]: titanic.pivot_table(index='sex', columns='class',
- ...: aggfunc={'survived':sum, 'fare':'mean'})
- Out[2]:
- fare survived
- class First Second Third First Second Third
- sex
- female 106.125798 21.970121 16.118810 91 70 72
- male 67.226127 19.741782 12.661633 45 17 47
At times it’s useful to compute totals along each grouping. This can be done via the margins keyword:
- In [4]: titanic.pivot_table('survived', index='sex', columns='class', margins=True)
- Out[4]:
- class First Second Third All
- sex
- female 0.968085 0.921053 0.500000 0.742038
- male 0.368852 0.157407 0.135447 0.188908
- All 0.629630 0.472826 0.242363 0.383838
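The margin label, 'All' by default, can be changed with the margins_name keyword; a minimal sketch:
- titanic.pivot_table('survived', index='sex', columns='class', margins=True, margins_name='Total')  # totals row/column labeled 'Total'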
Example: Birthrate Data
As a more interesting example, let’s take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC). This data can be found at https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv (this dataset has been analyzed rather extensively by Andrew Gelman and his group; see, for example, this blog post):
- In [5]: births = pd.read_csv('Books/PyDS/births.csv')
- In [6]: births.head()
- Out[6]:
- year month day gender births
- 0 1969 1 1.0 F 4046
- 1 1969 1 1.0 M 4440
- 2 1969 1 2.0 F 4454
- 3 1969 1 2.0 M 4548
- 4 1969 1 3.0 F 4548
- In [6]: births['decade'] = 10 * (births['year'] // 10)
- In [7]: births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
- Out[7]:
- gender F M
- decade
- 1960 1753634 1846572
- 1970 16263075 17121550
- 1980 18310351 19243452
- 1990 19479454 20420553
- 2000 18229309 19106428
- In [9]: import matplotlib.pyplot as plt
- In [10]: sns.set()
- In [11]: births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
- In [12]: plt.ylabel('Total births per year')
- Out[12]: Text(0, 0.5, 'Total births per year')
- In [13]: plt.show()
With a simple pivot table and plot() method, we can immediately see the annual trend in births by gender.
Further data exploration
Though this doesn't necessarily relate to the pivot table, there are a few more interesting features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th). One easy way to remove these all at once is to cut outliers; we'll do this via a robust sigma-clipping operation. (The factor of 0.74 below comes from the interquartile range of a Gaussian distribution: for a normal distribution the IQR is about 1.35 standard deviations, so 0.74 times the IQR is a robust estimate of the standard deviation.)
- In [18]: quartiles = np.percentile(births['births'], [25, 50, 75])
- In [19]: mu = quartiles[1] # the median
- In [20]: sig = 0.74 * (quartiles[2] - quartiles[0])
- In [24]: quartiles
- Out[24]: array([4358. , 4814. , 5289.5])
- In [25]: births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
- In [26]: births['day'] = births['day'].astype(int) # set 'day' column to integer
- In [28]: births.index = pd.to_datetime(10000 * births.year + 100 * births.month + births.day, format='%Y%m%d') # create a datetime index from the year, month, day columns
- In [29]: births['dayofweek'] = births.index.dayofweek
- import matplotlib.pyplot as plt
- import matplotlib as mpl
- births.pivot_table('births', index='dayofweek',
- columns='decade', aggfunc='mean').plot()
- plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
- plt.ylabel('mean births by day');
Apparently births are slightly less common on weekends than on weekdays!
Another interesting view is to plot the mean number of births by the day of the year. Let’s first group the data by month and day separately:
- In [34]: births_by_date = births.pivot_table('births', [births.index.month, births.index.day])
- In [35]: births_by_date.head()
- Out[35]:
- births
- 1 1 4009.225
- 2 4247.400
- 3 4500.900
- 4 4571.350
- 5 4603.625
- In [36]: from datetime import datetime # pd.datetime was removed in recent pandas versions
- ...: births_by_date.index = [datetime(2012, month, day) for (month, day) in births_by_date.index] # 2012 is a leap year, so February 29th is valid
- In [37]: births_by_date.head()
- Out[37]:
- births
- 2012-01-01 4009.225
- 2012-01-02 4247.400
- 2012-01-03 4500.900
- 2012-01-04 4571.350
- 2012-01-05 4603.625
- # Plot the results
- fig, ax = plt.subplots(figsize=(12, 4))
- births_by_date.plot(ax=ax);
Looking at this short example, you can see that many of the Python and Pandas tools we’ve seen to this point can be combined and used to gain insight from a variety of datasets. We will see some more sophisticated applications of these data manipulations in future sections!
Vectorized String Operations
One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when one is working with (read: cleaning up) real-world data. In this section, we’ll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.
Introducing Pandas String Operations
We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:
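A minimal sketch of such a vectorized operation, using NumPy arithmetic:
- import numpy as np
- x = np.array([2, 3, 5, 7, 11, 13])
- x * 2 # elementwise: array([ 4, 6, 10, 14, 22, 26])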
This vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done. For arrays of strings, NumPy does not provide such simple access, and thus you’re stuck using a more verbose loop syntax:
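A minimal sketch of that loop syntax, using a list comprehension:
- data = ['peter', 'Paul', 'MARY', 'gUIDO']
- [s.capitalize() for s in data] # ['Peter', 'Paul', 'Mary', 'Guido']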
This is perhaps sufficient to work with some data, but it will break if there are any missing values. For example:
- In [11]: data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
- In [12]: [s.capitalize() for s in data]
- ---------------------------------------------------------------------------
- AttributeError Traceback (most recent call last)
- AttributeError: 'NoneType' object has no attribute 'capitalize'
- In [13]: import pandas as pd
- In [14]: names = pd.Series(data)
- In [16]: names
- Out[16]:
- 0 peter
- 1 Paul
- 2 None
- 3 MARY
- 4 gUIDO
- dtype: object
- In [17]: names.str.capitalize()
- Out[17]:
- 0 Peter
- 1 Paul
- 2 None
- 3 Mary
- 4 Guido
- dtype: object
Tables of Pandas String Methods
If you have a good understanding of string manipulation in Python, most of Pandas’ string syntax is intuitive enough that it’s probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties. The examples in this section use the following series of names:
- In [20]: monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin'])
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. For the complete list of supported methods, refer to the "Working with Text Data" section of the Pandas documentation.
Notice that these have various return values. Some, like lower(), return a series of strings:
- In [21]: monte.str.lower()
- Out[21]:
- 0 graham chapman
- 1 john cleese
- 2 terry gilliam
- 3 eric idle
- 4 terry jones
- 5 michael palin
- dtype: object
But some others return numbers:
- In [22]: monte.str.len()
- Out[22]:
- 0 14
- 1 11
- 2 13
- 3 9
- 4 11
- 5 13
- dtype: int64
Or Boolean values:
- In [23]: monte.str.startswith('T')
- Out[23]:
- 0 False
- 1 False
- 2 True
- 3 False
- 4 True
- 5 False
- dtype: bool
Still others return lists or other compound values for each element:
- In [24]: monte.str.split()
- Out[24]:
- 0 [Graham, Chapman]
- 1 [John, Cleese]
- 2 [Terry, Gilliam]
- 3 [Eric, Idle]
- 4 [Terry, Jones]
- 5 [Michael, Palin]
- dtype: object
Methods using regular expressions
In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python’s built-in re module:
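These include match(), extract(), findall(), replace(), contains(), count(), split(), and rsplit(). A minimal sketch of two of them, using the monte series from above:
- monte.str.contains('^T') # Boolean series: does each name match the pattern?
- monte.str.count('[aeiou]') # count of vowel occurrences in each name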
With these, you can do a wide range of interesting operations. For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:
- In [25]: monte.str.extract('([A-Za-z]+)')
- Out[25]:
- 0
- 0 Graham
- 1 John
- 2 Terry
- 3 Eric
- 4 Terry
- 5 Michael
Or we can find all names that start and end with a consonant, making use of the start-of-string (^) and end-of-string ($) regular expression characters:
- In [26]: monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
- Out[26]:
- 0 [Graham Chapman]
- 1 []
- 2 [Terry Gilliam]
- 3 []
- 4 [Terry Jones]
- 5 [Michael Palin]
- dtype: object
Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations, such as get(), slice(), cat(), repeat(), pad(), join(), and get_dummies().
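A minimal sketch of two of them, again on the monte series:
- monte.str.repeat(2) # concatenate each string with itself
- monte.str.pad(20, side='left') # left-pad each string with spaces to width 20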
Vectorized item access and slicing. The get() and slice() operations, in particular, enable vectorized element access from each array. For example, we can get a slice of the first three characters of each string using str.slice(0, 3). Note that this behavior is also available through Python's normal indexing syntax; for example, monte.str.slice(0, 3) is equivalent to monte.str[0:3]:
- In [27]: monte.str[0:3]
- Out[27]:
- 0 Gra
- 1 Joh
- 2 Ter
- 3 Eri
- 4 Ter
- 5 Mic
- dtype: object
These get() and slice() methods also let you access elements of arrays returned by split(). For example, to extract the last name of each entry, we can combine split() and get():
- In [28]: monte.str.split().str.get(-1)
- Out[28]:
- 0 Chapman
- 1 Cleese
- 2 Gilliam
- 3 Idle
- 4 Jones
- 5 Palin
- dtype: object
Indicator variables. Another method that requires a bit of extra explanation is get_dummies(), which is useful when your data has a column containing some sort of coded indicator. For example, the info column below encodes attributes such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":
- In [29]: full_monte = pd.DataFrame({'name': monte, 'info':['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D']})
- In [30]: full_monte
- Out[30]:
- name info
- 0 Graham Chapman B|C|D
- 1 John Cleese B|D
- 2 Terry Gilliam A|C
- 3 Eric Idle B|D
- 4 Terry Jones B|C
- 5 Michael Palin B|C|D
The get_dummies() routine lets you quickly split out these indicator variables into a DataFrame:
- In [31]: full_monte['info'].str.get_dummies('|')
- Out[31]:
- A B C D
- 0 0 1 1 1
- 1 0 1 0 1
- 2 1 0 1 0
- 3 0 1 0 1
- 4 0 1 1 0
- 5 0 1 1 1
Example: Recipe Database
These vectorized string operations become most useful in the process of cleaning up messy, real-world data. Here I’ll walk through an example of that, using an open recipe database compiled from various sources on the Web. Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.
The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well. As of spring 2016, this database is about 30 MB; it should be downloaded and unzipped before running the code below.
The database is in JSON format, so we will try pd.read_json to read it:
- >>> import pandas as pd
- >>> recipes = pd.read_json("recipeitems-latest.json")
- Traceback (most recent call last):
-   File "<stdin>", line 1, in <module>
-   File "/usr/lib64/python2.7/site-packages/pandas/io/json/json.py", line 354, in read_json
-     date_unit).parse()
-   File "/usr/lib64/python2.7/site-packages/pandas/io/json/json.py", line 422, in parse
-     self._parse_no_numpy()
-   File "/usr/lib64/python2.7/site-packages/pandas/io/json/json.py", line 639, in _parse_no_numpy
-     loads(json, precise_float=self.precise_float), dtype=None)
- ValueError: Trailing data
We get a ValueError mentioning "trailing data": each line of the file is itself valid JSON, but the full file is not. Let's check whether this interpretation is true:
- >>> with open('recipeitems-latest.json', 'r') as f:
- ... line = f.readline()
- ...
- >>> pd.read_json(line).shape
- (2, 12)
Yes, apparently each line is valid JSON, so we'll need to string the lines together:
- >>> with open('recipeitems-latest.json', 'r') as f:
- ... # Extract each line
- ... data = (line.strip() for line in f)
- ... # Reformat so each line is the element of a list
- ... data_json = "[{0}]".format(','.join(data))
- ...
- >>> recipes = pd.read_json(data_json) # read the result as a JSON
- >>> recipes.shape
- (173278, 17)
We see there are nearly 200,000 recipes and 17 columns. Let's take a look at one row to see what we have:
- >>> recipes.iloc[0]
- _id {u'$oid': u'5160756b96cc62079cc2db15'}
- cookTime PT30M
- creator NaN
- dateModified NaN
- datePublished 2013-03-11
- description Late Saturday afternoon, after Marlboro Man ha...
- image http://static.thepioneerwoman.com/cooking/file...
- ingredients Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
- name Drop Biscuits and Sausage Gravy
- prepTime PT10M
- recipeCategory NaN
- recipeInstructions NaN
- recipeYield 12
- source thepioneerwoman
- totalTime NaN
- ts {u'$date': 1365276011104}
- url http://thepioneerwoman.com/cooking/2013/03/dro...
- Name: 0, dtype: object
There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web. In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in. Let's start by taking a closer look at the ingredients:
- In [10]: recipes.ingredients.str.len().describe()
- Out[10]:
- count 173278.000000
- mean 244.617805
- std 146.705075
- min 0.000000
- 25% 147.000000
- 50% 221.000000
- 75% 314.000000
- max 9067.000000
- Name: ingredients, dtype: float64
The ingredient lists average about 250 characters long, with a minimum of 0 and a maximum of nearly 10,000 characters! Just out of curiosity, let's see which recipe has the longest ingredient list:
- In [13]: recipes.name[recipes.ingredients.str.len().idxmax()]
- Out[13]: 'Carrot Pineapple Spice & Brownie Layer Cake with Whipped Cream & Cream Cheese Frosting and Marzipan Carrots'
We can do other aggregate explorations; for example, let’s see how many of the recipes are for breakfast food:
- In [14]: recipes.description.str.contains('[Bb]reakfast').sum()
- Out[14]: 3524
Or how many of the recipes list cinnamon as an ingredient:
- In [15]: recipes.ingredients.str.contains('[Cc]innamon').sum()
- Out[15]: 10526
We could even look to see whether any recipes misspell the ingredient as "cinamon":
- In [16]: recipes.ingredients.str.contains('[Cc]inamon').sum()
- Out[16]: 11
A simple recipe recommender
Let's go a bit further and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all of them. While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row. So we will cheat a bit: we'll start with a list of common ingredients, and simply search to see whether they are in each recipe's ingredient list. For simplicity, let's just stick with herbs and spices for the time being:
- In [17]: spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley', 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
- In [21]: import re
- In [22]: spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE)) for spice in spice_list))
- In [23]: spice_df.head()
- Out[23]:
- salt pepper oregano sage parsley rosemary tarragon thyme paprika cumin
- 0 False False False True False False False False False False
- 1 False False False False False False False False False False
- 2 True True False False False False False False False True
- 3 False False False False False False False False False False
- 4 False False False False False False False False False False
Now, as an example, let's say we'd like to find a recipe that uses parsley, paprika, and tarragon. We can compute this very quickly using the query() method of DataFrames:
- In [24]: selection = spice_df.query('parsley & paprika & tarragon')
- In [25]: len(selection)
- Out[25]: 10
We find only 10 recipes with this combination; let's use the index returned by this selection to discover their names:
- In [26]: recipes.name[selection.index]
- Out[26]:
- 2069 All cremat with a Little Gem, dandelion and wa...
- 74964 Lobster with Thermidor butter
- 93768 Burton's Southern Fried Chicken with White Gravy
- 113926 Mijo's Slow Cooker Shredded Beef
- 137686 Asparagus Soup with Poached Eggs
- 140530 Fried Oyster Po’boys
- 158475 Lamb shank tagine with herb tabbouleh
- 158486 Southern fried chicken in buttermilk
- 163175 Fried Chicken Sliders with Pickles + Slaw
- 165243 Bar Tartine Cauliflower Salad
- Name: name, dtype: object
Supplement
* High-Performance Pandas: eval() and query()