Pivot Tables
We have seen how the GroupBy abstraction lets us explore relationships within a dataset. A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple columnwise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. The difference between pivot tables and GroupBy can sometimes cause confusion; it helps me to think of pivot tables as essentially a multidimensional version of GroupBy aggregation. That is, you split-apply-combine, but both the split and the combine happen not across a one-dimensional index, but across a two-dimensional grid.
Motivating Pivot Tables
For the examples in this section, we’ll use the database of passengers on the Titanic, available through the Seaborn library (see “Visualization with Seaborn” on page 311):
- In [1]: import numpy as np
- In [2]: import pandas as pd
- In [3]: import seaborn as sns
- In [4]: titanic = sns.load_dataset('titanic')
- In [5]: titanic.head()
- Out[5]:
- survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
- 0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
- 1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
- 2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
- 3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
- 4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
Pivot Tables by Hand
To start learning more about this data, we might begin by grouping it according to gender, survival status, or some combination thereof. If you have read the previous section, you might be tempted to apply a GroupBy operation. For example, let's look at survival rate by gender:
- In [7]: titanic.groupby('sex')[['survived']].mean()
- Out[7]:
- survived
- sex
- female 0.742038
- male 0.188908
This is useful, but we might like to go one step deeper and look at survival by both sex and, say, class. Using the vocabulary of GroupBy, we might proceed using something like this: we group by class and gender, select survival, apply a mean aggregate, combine the resulting groups, and then unstack the hierarchical index to reveal the hidden multidimensionality. In code:
- In [8]: titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
- Out[8]:
- class First Second Third
- sex
- female 0.968085 0.921053 0.500000
- male 0.368852 0.157407 0.135447
Pivot Table Syntax
Here is the equivalent to the preceding operation using the pivot_table method of DataFrames:
- In [5]: titanic.pivot_table('survived', index='sex', columns='class')
- Out[5]:
- class First Second Third
- sex
- female 0.968085 0.921053 0.500000
- male 0.368852 0.157407 0.135447
Multilevel pivot tables
Just as in GroupBy, the grouping in pivot tables can be specified with multiple levels, and via a number of options. For example, we might be interested in looking at age as a third dimension. We'll bin the age using the pd.cut function:
- In [6]: age = pd.cut(titanic['age'], [0, 18, 80])
- In [8]: age.head()
- Out[8]:
- 0 (18, 80]
- 1 (18, 80]
- 2 (18, 80]
- 3 (18, 80]
- 4 (18, 80]
- Name: age, dtype: category
- Categories (2, interval[int64]): [(0, 18] < (18, 80]]
- In [9]: titanic.pivot_table('survived', ['sex', age], 'class')
- Out[9]:
- class First Second Third
- sex age
- female (0, 18] 0.909091 1.000000 0.511628
- (18, 80] 0.972973 0.900000 0.423729
- male (0, 18] 0.800000 0.600000 0.215686
- (18, 80] 0.375000 0.071429 0.133663
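We can apply the same strategy to the columns as well. As a minimal sketch, pd.qcut can automatically bin the fares into two quantiles, which then serve as a second column level alongside class:
- fare = pd.qcut(titanic['fare'], 2)  # split fares at the median into two bins
- titanic.pivot_table('survived', ['sex', age], [fare, 'class'])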
Additional pivot table options
The full call signature of the pivot_table method of DataFrames is as follows:
- '''
- Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the resulting DataFrame.
- '''
- DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
The aggfunc keyword controls what type of aggregation is applied, which is a mean by default. As in GroupBy, the aggregation specification can be a string representing one of several common choices ('sum', 'mean', 'count', 'min', 'max', etc.) or a function that implements an aggregation (np.sum, min, sum, etc.). Additionally, it can be specified as a dictionary mapping a column to any of the desired options:
- In [2]: titanic.pivot_table(index='sex', columns='class',
- ...: aggfunc={'survived':sum, 'fare':'mean'})
- Out[2]:
- fare survived
- class First Second Third First Second Third
- sex
- female 106.125798 21.970121 16.118810 91 70 72
- male 67.226127 19.741782 12.661633 45 17 47
At times it’s useful to compute totals along each grouping. This can be done via the margins keyword:
- In [4]: titanic.pivot_table('survived', index='sex', columns='class', margins=True)
- Out[4]:
- class First Second Third All
- sex
- female 0.968085 0.921053 0.500000 0.742038
- male 0.368852 0.157407 0.135447 0.188908
- All 0.629630 0.472826 0.242363 0.383838
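The margin label, 'All' by default, can be changed with the margins_name keyword; a minimal sketch:
- titanic.pivot_table('survived', index='sex', columns='class', margins=True, margins_name='Total')  # totals row/column labeled 'Total'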
Example: Birthrate Data
As a more interesting example, let’s take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC). This data can be found at https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv (this dataset has been analyzed rather extensively by Andrew Gelman and his group; see, for example, this blog post):
- In [5]: births = pd.read_csv('Books/PyDS/births.csv')
- In [6]: births.head()
- Out[6]:
- year month day gender births
- 0 1969 1 1.0 F 4046
- 1 1969 1 1.0 M 4440
- 2 1969 1 2.0 F 4454
- 3 1969 1 2.0 M 4548
- 4 1969 1 3.0 F 4548
- In [6]: births['decade'] = 10 * (births['year'] // 10)
- In [7]: births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
- Out[7]:
- gender F M
- decade
- 1960 1753634 1846572
- 1970 16263075 17121550
- 1980 18310351 19243452
- 1990 19479454 20420553
- 2000 18229309 19106428
- In [9]: import matplotlib.pyplot as plt
- In [10]: sns.set()
- In [11]: births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
- In [12]: plt.ylabel('Total births per year')
- Out[12]: Text(0, 0.5, 'Total births per year')
- In [13]: plt.show()
With a simple pivot table and plot() method, we can immediately see the annual trend in births by gender.
Further data exploration
Though this doesn't necessarily relate to the pivot table, there are a few more interesting features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th). One easy way to remove these all at once is to cut outliers; we'll do this via a robust sigma-clipping operation. (The factor of 0.74 below comes from the interquartile range of a Gaussian distribution: for a normal distribution the IQR is about 1.35 standard deviations, so 0.74 times the IQR is a robust estimate of the standard deviation.)
- In [18]: quartiles = np.percentile(births['births'], [25, 50, 75])
- In [19]: mu = quartiles[1] # the median
- In [20]: sig = 0.74 * (quartiles[2] - quartiles[0])
- In [24]: quartiles
- Out[24]: array([4358. , 4814. , 5289.5])
- In [25]: births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
- In [26]: births['day'] = births['day'].astype(int) # set 'day' column to integer
- In [28]: births.index = pd.to_datetime(10000 * births.year + 100 * births.month + births.day, format='%Y%m%d') # create a datetime index from the year, month, day columns
- In [29]: births['dayofweek'] = births.index.dayofweek
- import matplotlib.pyplot as plt
- import matplotlib as mpl
- births.pivot_table('births', index='dayofweek',
- columns='decade', aggfunc='mean').plot()
- plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
- plt.ylabel('mean births by day');
Apparently births are slightly less common on weekends than on weekdays!
Another interesting view is to plot the mean number of births by the day of the year. Let’s first group the data by month and day separately:
- In [34]: births_by_date = births.pivot_table('births', [births.index.month, births.index.day])
- In [35]: births_by_date.head()
- Out[35]:
- births
- 1 1 4009.225
- 2 4247.400
- 3 4500.900
- 4 4571.350
- 5 4603.625
- In [36]: from datetime import datetime # pd.datetime was removed in recent pandas versions
- ...: births_by_date.index = [datetime(2012, month, day) for (month, day) in births_by_date.index] # 2012 is a leap year, so February 29th is valid
- In [37]: births_by_date.head()
- Out[37]:
- births
- 2012-01-01 4009.225
- 2012-01-02 4247.400
- 2012-01-03 4500.900
- 2012-01-04 4571.350
- 2012-01-05 4603.625
- # Plot the results
- fig, ax = plt.subplots(figsize=(12, 4))
- births_by_date.plot(ax=ax);
Looking at this short example, you can see that many of the Python and Pandas tools we’ve seen to this point can be combined and used to gain insight from a variety of datasets. We will see some more sophisticated applications of these data manipulations in future sections!
Vectorized String Operations
One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when one is working with (read: cleaning up) real-world data. In this section, we’ll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.
Introducing Pandas String Operations
We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:
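A minimal sketch of such a vectorized operation, using NumPy arithmetic:
- import numpy as np
- x = np.array([2, 3, 5, 7, 11, 13])
- x * 2 # elementwise: array([ 4, 6, 10, 14, 22, 26])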
This vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done. For arrays of strings, NumPy does not provide such simple access, and thus you’re stuck using a more verbose loop syntax:
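A minimal sketch of that loop syntax, using a list comprehension:
- data = ['peter', 'Paul', 'MARY', 'gUIDO']
- [s.capitalize() for s in data] # ['Peter', 'Paul', 'Mary', 'Guido']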
This is perhaps sufficient to work with some data, but it will break if there are any missing values. For example:
- In [11]: data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
- In [12]: [s.capitalize() for s in data]
- ---------------------------------------------------------------------------
- AttributeError Traceback (most recent call last)
- AttributeError: 'NoneType' object has no attribute 'capitalize'
- In [13]: import pandas as pd
- In [14]: names = pd.Series(data)
- In [16]: names
- Out[16]:
- 0 peter
- 1 Paul
- 2 None
- 3 MARY
- 4 gUIDO
- dtype: object
- In [17]: names.str.capitalize()
- Out[17]:
- 0 Peter
- 1 Paul
- 2 None
- 3 Mary
- 4 Guido
- dtype: object
Tables of Pandas String Methods
If you have a good understanding of string manipulation in Python, most of Pandas’ string syntax is intuitive enough that it’s probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties. The examples in this section use the following series of names:
- In [20]: monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin'])
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. For the complete list of supported methods, refer to the "Working with Text Data" section of the Pandas documentation.
Notice that these have various return values. Some, like lower(), return a series of strings:
- In [21]: monte.str.lower()
- Out[21]:
- 0 graham chapman
- 1 john cleese
- 2 terry gilliam
- 3 eric idle
- 4 terry jones
- 5 michael palin
- dtype: object
But some others return numbers:
- In [22]: monte.str.len()
- Out[22]:
- 0 14
- 1 11
- 2 13
- 3 9
- 4 11
- 5 13
- dtype: int64
Or Boolean values:
- In [23]: monte.str.startswith('T')
- Out[23]:
- 0 False
- 1 False
- 2 True
- 3 False
- 4 True
- 5 False
- dtype: bool
Still others return lists or other compound values for each element:
- In [24]: monte.str.split()
- Out[24]:
- 0 [Graham, Chapman]
- 1 [John, Cleese]
- 2 [Terry, Gilliam]
- 3 [Eric, Idle]
- 4 [Terry, Jones]
- 5 [Michael, Palin]
- dtype: object
Methods using regular expressions
In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python’s built-in re module:
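These include match(), extract(), findall(), replace(), contains(), count(), split(), and rsplit(). A minimal sketch of two of them, using the monte series from above:
- monte.str.contains('^T') # Boolean series: does each name match the pattern?
- monte.str.count('[aeiou]') # count of vowel occurrences in each name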
With these, you can do a wide range of interesting operations. For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:
- In [25]: monte.str.extract('([A-Za-z]+)')
- Out[25]:
- 0
- 0 Graham
- 1 John
- 2 Terry
- 3 Eric
- 4 Terry
- 5 Michael
Or we can find all names that start and end with a consonant, making use of the start-of-string (^) and end-of-string ($) regular expression characters:
- In [26]: monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
- Out[26]:
- 0 [Graham Chapman]
- 1 []
- 2 [Terry Gilliam]
- 3 []
- 4 [Terry Jones]
- 5 [Michael Palin]
- dtype: object
Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations, such as get(), slice(), cat(), repeat(), pad(), join(), and get_dummies().
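A minimal sketch of two of them, again on the monte series:
- monte.str.repeat(2) # concatenate each string with itself
- monte.str.pad(20, side='left') # left-pad each string with spaces to width 20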
Vectorized item access and slicing. The get() and slice() operations, in particular, enable vectorized element access from each array. For example, we can get a slice of the first three characters of each string using str.slice(0, 3). Note that this behavior is also available through Python's normal indexing syntax; for example, monte.str.slice(0, 3) is equivalent to monte.str[0:3]:
- In [27]: monte.str[0:3]
- Out[27]:
- 0 Gra
- 1 Joh
- 2 Ter
- 3 Eri
- 4 Ter
- 5 Mic
- dtype: object
These get() and slice() methods also let you access elements of arrays returned by split(). For example, to extract the last name of each entry, we can combine split() and get():
- In [28]: monte.str.split().str.get(-1)
- Out[28]:
- 0 Chapman
- 1 Cleese
- 2 Gilliam
- 3 Idle
- 4 Jones
- 5 Palin
- dtype: object
Indicator variables. Another method that requires a bit of extra explanation is get_dummies(), which is useful when your data has a column containing some sort of coded indicator. For example, the info column below encodes attributes such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":
- In [29]: full_monte = pd.DataFrame({'name': monte, 'info':['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D']})
- In [30]: full_monte
- Out[30]:
- name info
- 0 Graham Chapman B|C|D
- 1 John Cleese B|D
- 2 Terry Gilliam A|C
- 3 Eric Idle B|D
- 4 Terry Jones B|C
- 5 Michael Palin B|C|D
The get_dummies() routine lets you quickly split out these indicator variables into a DataFrame:
- In [31]: full_monte['info'].str.get_dummies('|')
- Out[31]:
- A B C D
- 0 0 1 1 1
- 1 0 1 0 1
- 2 1 0 1 0
- 3 0 1 0 1
- 4 0 1 1 0
- 5 0 1 1 1
Example: Recipe Database
These vectorized string operations become most useful in the process of cleaning up messy, real-world data. Here I’ll walk through an example of that, using an open recipe database compiled from various sources on the Web. Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.
The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well. As of spring 2016, this database is about 30 MB; it should be downloaded and unzipped before running the code below.
The database is in JSON format, so we will try pd.read_json to read it:
- >>> import pandas as pd
- >>> recipes = pd.read_json("recipeitems-latest.json")
- Traceback (most recent call last):
-   File "<stdin>", line 1, in <module>
-   File "/usr/lib64/python2.7/site-packages/pandas/io/json/json.py", line 354, in read_json
-     date_unit).parse()
-   File "/usr/lib64/python2.7/site-packages/pandas/io/json/json.py", line 422, in parse
-     self._parse_no_numpy()
-   File "/usr/lib64/python2.7/site-packages/pandas/io/json/json.py", line 639, in _parse_no_numpy
-     loads(json, precise_float=self.precise_float), dtype=None)
- ValueError: Trailing data
We get a ValueError mentioning "trailing data": each line of the file is itself valid JSON, but the full file is not. Let's check whether this interpretation is true:
- >>> with open('recipeitems-latest.json', 'r') as f:
- ... line = f.readline()
- ...
- >>> pd.read_json(line).shape
- (2, 12)
Yes, apparently each line is valid JSON, so we'll need to string the lines together:
- >>> with open('recipeitems-latest.json', 'r') as f:
- ... # Extract each line
- ... data = (line.strip() for line in f)
- ... # Reformat so each line is the element of a list
- ... data_json = "[{0}]".format(','.join(data))
- ...
- >>> recipes = pd.read_json(data_json) # read the result as a JSON
- >>> recipes.shape
- (173278, 17)
We see there are nearly 200,000 recipes and 17 columns. Let's take a look at one row to see what we have:
- >>> recipes.iloc[0]
- _id {u'$oid': u'5160756b96cc62079cc2db15'}
- cookTime PT30M
- creator NaN
- dateModified NaN
- datePublished 2013-03-11
- description Late Saturday afternoon, after Marlboro Man ha...
- image http://static.thepioneerwoman.com/cooking/file...
- ingredients Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
- name Drop Biscuits and Sausage Gravy
- prepTime PT10M
- recipeCategory NaN
- recipeInstructions NaN
- recipeYield 12
- source thepioneerwoman
- totalTime NaN
- ts {u'$date': 1365276011104}
- url http://thepioneerwoman.com/cooking/2013/03/dro...
- Name: 0, dtype: object
There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web. In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in. Let's start by taking a closer look at the ingredients:
- In [10]: recipes.ingredients.str.len().describe()
- Out[10]:
- count 173278.000000
- mean 244.617805
- std 146.705075
- min 0.000000
- 25% 147.000000
- 50% 221.000000
- 75% 314.000000
- max 9067.000000
- Name: ingredients, dtype: float64
The ingredient lists average about 250 characters long, with a minimum of 0 and a maximum of nearly 10,000 characters! Just out of curiosity, let's see which recipe has the longest ingredient list:
- In [13]: recipes.name[recipes.ingredients.str.len().idxmax()]
- Out[13]: 'Carrot Pineapple Spice & Brownie Layer Cake with Whipped Cream & Cream Cheese Frosting and Marzipan Carrots'
We can do other aggregate explorations; for example, let’s see how many of the recipes are for breakfast food:
- In [14]: recipes.description.str.contains('[Bb]reakfast').sum()
- Out[14]: 3524
Or how many of the recipes list cinnamon as an ingredient:
- In [15]: recipes.ingredients.str.contains('[Cc]innamon').sum()
- Out[15]: 10526
We could even look to see whether any recipes misspell the ingredient as "cinamon":
- In [16]: recipes.ingredients.str.contains('[Cc]inamon').sum()
- Out[16]: 11
A simple recipe recommender
Let's go a bit further and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all of them. While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row. So we will cheat a bit: we'll start with a list of common ingredients, and simply search to see whether they are in each recipe's ingredient list. For simplicity, let's just stick with herbs and spices for the time being:
- In [17]: spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley', 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
- In [21]: import re
- In [22]: spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE)) for spice in spice_list))
- In [23]: spice_df.head()
- Out[23]:
- salt pepper oregano sage parsley rosemary tarragon thyme paprika cumin
- 0 False False False True False False False False False False
- 1 False False False False False False False False False False
- 2 True True False False False False False False False True
- 3 False False False False False False False False False False
- 4 False False False False False False False False False False
Now, as an example, let's say we'd like to find a recipe that uses parsley, paprika, and tarragon. We can compute this very quickly using the query() method of DataFrames:
- In [24]: selection = spice_df.query('parsley & paprika & tarragon')
- In [25]: len(selection)
- Out[25]: 10
We find only 10 recipes with this combination; let's use the index returned by this selection to discover their names:
- In [26]: recipes.name[selection.index]
- Out[26]:
- 2069 All cremat with a Little Gem, dandelion and wa...
- 74964 Lobster with Thermidor butter
- 93768 Burton's Southern Fried Chicken with White Gravy
- 113926 Mijo's Slow Cooker Shredded Beef
- 137686 Asparagus Soup with Poached Eggs
- 140530 Fried Oyster Po’boys
- 158475 Lamb shank tagine with herb tabbouleh
- 158486 Southern fried chicken in buttermilk
- 163175 Fried Chicken Sliders with Pickles + Slaw
- 165243 Bar Tartine Cauliflower Salad
- Name: name, dtype: object
Supplement
* High-Performance Pandas: eval() and query()