Friday, December 7, 2018

[ Py DS ] Ch3 - Data Manipulation with Pandas (Part5)

Source From Here 


Pivot Tables 
We have seen how the GroupBy abstraction lets us explore relationships within a dataset. A pivot table is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data. The pivot table takes simple columnwise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. The difference between pivot tables and GroupBy can sometimes cause confusion; it helps me to think of pivot tables as essentially a multidimensional version of GroupBy aggregation. That is, you split-apply-combine, but both the split and the combine happen not across a one-dimensional index, but across a two-dimensional grid. 

Motivating Pivot Tables 
For the examples in this section, we’ll use the database of passengers on the Titanic, available through the Seaborn library (see “Visualization with Seaborn” on page 311): 
  1. In [1]: import numpy as np  
  2. In [2]: import pandas as pd  
  3. In [3]: import seaborn as sns  
  4. In [4]: titanic = sns.load_dataset('titanic')  
  5. In [5]: titanic.head()  
  6. Out[5]:  
  7.    survived  pclass     sex   age  sibsp  parch     fare embarked  class    who  adult_male deck  embark_town alive  alone  
  8. 0         0       3    male  22.0      1      0   7.2500        S  Third    man        True  NaN  Southampton    no  False  
  9. 1         1       1  female  38.0      1      0  71.2833        C  First  woman       False    C    Cherbourg   yes  False  
  10. 2         1       3  female  26.0      0      0   7.9250        S  Third  woman       False  NaN  Southampton   yes   True  
  11. 3         1       1  female  35.0      1      0  53.1000        S  First  woman       False    C  Southampton   yes  False  
  12. 4         0       3    male  35.0      0      0   8.0500        S  Third    man        True  NaN  Southampton    no   True  
This contains a wealth of information on each passenger of that ill-fated voyage, including gender, age, class, fare paid, and much more. 

Pivot Tables by Hand 
To start learning more about this data, we might begin by grouping it according to gender, survival status, or some combination thereof. If you have read the previous section, you might be tempted to apply a GroupBy operation—for example, let’s look at survival rate by gender: 
  1. In [7]: titanic.groupby('sex')[['survived']].mean()  
  2. Out[7]:  
  3.         survived  
  4. sex  
  5. female  0.742038  
  6. male    0.188908  
This immediately gives us some insight: overall, three of every four females on board survived, while only one in five males survived! 

This is useful, but we might like to go one step deeper and look at survival by both sex and, say, class. Using the vocabulary of GroupBy, we might proceed using something like this: we group by class and gender, select survival, apply a mean aggregate, combine the resulting groups, and then unstack the hierarchical index to reveal the hidden multidimensionality. In code: 
  1. In [8]: titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()  
  2. Out[8]:  
  3. class      First    Second     Third  
  4. sex  
  5. female  0.968085  0.921053  0.500000  
  6. male    0.368852  0.157407  0.135447  
This gives us a better idea of how both gender and class affected survival, but the code is starting to look a bit garbled. While each step of this pipeline makes sense in light of the tools we’ve previously discussed, the long string of code is not particularly easy to read or use. This two-dimensional GroupBy is common enough that Pandas includes a convenience routine, pivot_table, which succinctly handles this type of multidimensional aggregation. 

Pivot Table Syntax 
Here is the equivalent to the preceding operation using the pivot_table method of DataFrames: 
  1. In [5]: titanic.pivot_table('survived', index='sex', columns='class')  
  2. Out[5]:  
  3. class      First    Second     Third  
  4. sex  
  5. female  0.968085  0.921053  0.500000  
  6. male    0.368852  0.157407  0.135447  
This is eminently more readable than the GroupBy approach, and produces the same result. As you might expect of an early 20th-century transatlantic cruise, the survival gradient favors both women and higher classes. First-class women survived with near certainty (hi, Rose!), while only one in ten third-class men survived (sorry, Jack!). 

Multilevel pivot tables 
Just as in the GroupBy, the grouping in pivot tables can be specified with multiple levels, and via a number of options. For example, we might be interested in looking at age as a third dimension. We’ll bin the age using the pd.cut function: 
  1. In [6]: age = pd.cut(titanic['age'], [0, 18, 80])  
  2. In [8]: age.head()  
  3. Out[8]:  
  4. 0    (18, 80]  
  5. 1    (18, 80]  
  6. 2    (18, 80]  
  7. 3    (18, 80]  
  8. 4    (18, 80]  
  9. Name: age, dtype: category  
  10. Categories (2, interval[int64]): [(0, 18] < (18, 80]]  
  11.   
  12. In [9]: titanic.pivot_table('survived', ['sex', age], 'class')  
  13. Out[9]:  
  14. class               First    Second     Third  
  15. sex    age  
  16. female (0, 18]   0.909091  1.000000  0.511628  
  17.        (18, 80]  0.972973  0.900000  0.423729  
  18. male   (0, 18]   0.800000  0.600000  0.215686  
  19.        (18, 80]  0.375000  0.071429  0.133663  
The result is a four-dimensional aggregation with hierarchical indices (see “Hierarchical Indexing” on page 128), shown in a grid demonstrating the relationship between the values. 

Additional pivot table options 
The full call signature of the pivot_table method of DataFrames is as follows: 
  1. '''  
  2. Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame  
  3. '''  
  4. DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')  
We’ve already seen examples of the first three arguments; here we’ll take a quick look at the remaining ones. Two of the options, fill_value and dropna, have to do with missing data and are fairly straightforward; a brief sketch of them follows. 
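As a quick, hedged sketch of these two options (the choice of 'deck' as the column is purely illustrative), fill_value substitutes a value for any group combination that would otherwise appear as NaN, while dropna controls whether all-NA columns are kept or dropped:

  # fill_value: if a sex/deck combination happened to contain no passengers,
  # the corresponding cell would be NaN; fill_value=0 shows 0 instead.
  titanic.pivot_table('survived', index='sex', columns='deck',
                      aggfunc='count', fill_value=0)

  # dropna=False: keep columns even when every entry in them is NA.
  titanic.pivot_table('survived', index='sex', columns='deck',
                      aggfunc='count', dropna=False)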

The aggfunc keyword controls what type of aggregation is applied, which is a mean by default. As in the GroupBy, the aggregation specification can be a string representing one of several common choices ('sum', 'mean', 'count', 'min', 'max', etc.) or a function that implements an aggregation (np.sum(), min(), sum(), etc.). Additionally, it can be specified as a dictionary mapping a column to any of the above desired options: 
  1. In [2]: titanic.pivot_table(index='sex', columns='class',  
  2.    ...:                              aggfunc={'survived':sum, 'fare':'mean'})  
  3. Out[2]:  
  4.               fare                       survived  
  5. class        First     Second      Third    First Second Third  
  6. sex  
  7. female  106.125798  21.970121  16.118810       91     70    72  
  8. male     67.226127  19.741782  12.661633       45     17    47  
Notice also here that we’ve omitted the values keyword; when you’re specifying a mapping for aggfunc, this is determined automatically. 

At times it’s useful to compute totals along each grouping. This can be done via the margins keyword: 
  1. In [4]: titanic.pivot_table('survived', index='sex', columns='class', margins=True)  
  2. Out[4]:  
  3. class      First    Second     Third       All  
  4. sex  
  5. female  0.968085  0.921053  0.500000  0.742038  
  6. male    0.368852  0.157407  0.135447  0.188908  
  7. All     0.629630  0.472826  0.242363  0.383838  
Here this automatically gives us information about the class-agnostic survival rate by gender, the gender-agnostic survival rate by class, and the overall survival rate of 38%. The margin label can be specified with the margins_name keyword, which defaults to "All". 
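The choice of label is arbitrary; as a minimal sketch:

  # Rename the margins row/column from the default "All"
  titanic.pivot_table('survived', index='sex', columns='class',
                      margins=True, margins_name='Overall')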

Example: Birthrate Data 
As a more interesting example, let’s take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC). This data can be found at https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv (this dataset has been analyzed rather extensively by Andrew Gelman and his group; see, for example, this blog post): 
  1. In [5]: births = pd.read_csv('Books/PyDS/births.csv')  
  2. In [6]: births.head()  
  3. Out[6]:  
  4.    year  month  day gender  births  
  5. 0  1969      1  1.0      F    4046  
  6. 1  1969      1  1.0      M    4440  
  7. 2  1969      1  2.0      F    4454  
  8. 3  1969      1  2.0      M    4548  
  9. 4  1969      1  3.0      F    4548  
Taking a look at the data, we see that it’s relatively simple—it contains the number of births grouped by date and gender. We can start to understand this data a bit more by using a pivot table. Let’s add a decade column, and take a look at male and female births as a function of decade: 
  1. In [6]: births['decade'] = 10 * (births['year'] // 10)  
  2.   
  3. In [7]: births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')  
  4. Out[7]:  
  5. gender         F         M  
  6. decade  
  7. 1960     1753634   1846572  
  8. 1970    16263075  17121550  
  9. 1980    18310351  19243452  
  10. 1990    19479454  20420553  
  11. 2000    18229309  19106428  
We immediately see that male births outnumber female births in every decade. To see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to visualize the total number of births by year (Figure 3-2; see Chapter 4 for a discussion of plotting with Matplotlib): 

  1. In [9]: import matplotlib.pyplot as plt  
  2. In [10]: sns.set()  
  3. In [11]: births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()  

  4.   
  5. In [12]: plt.ylabel('Total births per year')  
  6. Out[12]: Text(0, 0.5, 'Total births per year')  

  7.   
  8. In [13]: plt.show()  

[Figure 3-2: total number of births per year, by gender]


With a simple pivot table and plot() method, we can immediately see the annual trend in births by gender. 

Further data exploration 
Though this doesn’t necessarily relate to the pivot table, there are a few more interesting features we can pull out of this dataset using the Pandas tools covered up to this point. We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th). One easy way to remove these all at once is to cut outliers; we’ll do this via a robust sigma-clipping operation: 
  1. In [18]: quartiles = np.percentile(births['births'], [25, 50, 75])  
  2. In [19]: mu = quartiles[1]   # the median, a robust estimate of the mean  
  3. In [20]: sig = 0.74 * (quartiles[2] - quartiles[0])  
This final line is a robust estimate of the sample standard deviation, where the 0.74 comes from the interquartile range of a Gaussian distribution. With this we can use the query() method (discussed further in “High-Performance Pandas: eval() and query()” on page 208) to filter out rows with births outside these values: 
  1. In [24]: quartiles  
  2. Out[24]: array([4358. , 4814. , 5289.5])  
  3.   
  4. In [25]: births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')  
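As a side note, the 0.74 factor can be checked directly: for a Gaussian distribution the interquartile range is roughly 1.35 sigma, so sigma ≈ IQR / 1.35 ≈ 0.74 × IQR. A small verification sketch, assuming SciPy is available:

  from scipy import stats

  # IQR of a standard normal = ppf(0.75) - ppf(0.25) ≈ 1.349
  iqr_gaussian = stats.norm.ppf(0.75) - stats.norm.ppf(0.25)
  print(1.0 / iqr_gaussian)   # ≈ 0.7413, the 0.74 used above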
Next we set the day column to integers; previously it had been a string because some rows in the dataset contained the value 'null': 
  1. In [26]: births['day'] = births['day'].astype(int)  # set 'day' column to integer  
Finally, we can combine the day, month, and year to create a Date index (see “Working with Time Series” on page 188). This allows us to quickly compute the weekday corresponding to each row: 
  1. In [28]: births.index = pd.to_datetime(10000 * births.year + 100 * births.month + births.day, format='%Y%m%d')  
  2.   
  3. In [29]: births['dayofweek'] = births.index.dayofweek  
Using this we can plot births by weekday for several decades (Figure 3-3): 
  1. import matplotlib.pyplot as plt  
  2. import matplotlib as mpl  
  3. births.pivot_table('births', index='dayofweek',  
  4. columns='decade', aggfunc='mean').plot()  
  5. plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])  
  6. plt.ylabel('mean births by day');  

[Figure 3-3: births by weekday for several decades]
Apparently births are slightly less common on weekends than on weekdays! 

Another interesting view is to plot the mean number of births by the day of the year. Let’s first group the data by month and day separately: 
  1. In [34]: births_by_date = births.pivot_table('births', [births.index.month, births.index.day])  
  2.   
  3. In [35]: births_by_date.head()  
  4. Out[35]:  
  5.        births  
  6. 1 1  4009.225  
  7.   2  4247.400  
  8.   3  4500.900  
  9.   4  4571.350  
  10.   5  4603.625  
The result is a multi-index over months and days. To make this easily plottable, let’s turn these months and days into a date by associating them with a dummy year variable (making sure to choose a leap year so February 29th is correctly handled!): 
  1. In [36]: births_by_date.index = [pd.datetime(2012, month, day) for (month, day) in births_by_date.index]  
  2.   
  3. In [37]: births_by_date.head()  
  4. Out[37]:  
  5.               births  
  6. 2012-01-01  4009.225  
  7. 2012-01-02  4247.400  
  8. 2012-01-03  4500.900  
  9. 2012-01-04  4571.350  
  10. 2012-01-05  4603.625  
Focusing on the month and day only, we now have a time series reflecting the average number of births by date of the year. From this, we can use the plot method to plot the data (Figure 3-4). It reveals some interesting trends: 
  1. # Plot the results  
  2. fig, ax = plt.subplots(figsize=(12, 4))  
  3. births_by_date.plot(ax=ax);  

[Figure 3-4: mean number of births by day of the year]
Looking at this short example, you can see that many of the Python and Pandas tools we’ve seen to this point can be combined and used to gain insight from a variety of datasets. We will see some more sophisticated applications of these data manipulations in future sections! 

Vectorized String Operations 
One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that become an essential piece of the type of munging required when one is working with (read: cleaning up) real-world data. In this section, we’ll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet. 

Introducing Pandas String Operations 
We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example: 
In [6]: x = np.array([2, 3, 5, 7, 11, 13])
In [8]: x * 2
Out[8]: array([ 4, 6, 10, 14, 22, 26])

This vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done. For arrays of strings, NumPy does not provide such simple access, and thus you’re stuck using a more verbose loop syntax: 
In [9]: data = ['peter', 'Paul', 'MARY', 'gUIDO']
In [10]: [s.capitalize() for s in data]
Out[10]: ['Peter', 'Paul', 'Mary', 'Guido']

This is perhaps sufficient to work with some data, but it will break if there are any missing values. For example: 
  1. In [11]: data = ['peter', 'Paul', None, 'MARY', 'gUIDO']  
  2. In [12]: [s.capitalize() for s in data]  
  3. ---------------------------------------------------------------------------  
  4. AttributeError                            Traceback (most recent call last)  
  5. <ipython-input-12-bb1b424b25b0> in <module>  
  6. ----> 1 [s.capitalize() for s in data]  
  7.   
  8. <ipython-input-12-bb1b424b25b0> in <listcomp>(.0)  
  9. ----> 1 [s.capitalize() for s in data]  
  10.   
  11. AttributeError: 'NoneType' object has no attribute 'capitalize'  

    Pandas includes features to address both this need for vectorized string operations and for correctly handling missing data via the str attribute of Pandas Series and Index objects containing strings. So, for example, suppose we create a Pandas Series with this data: 
    1. In [13]: import pandas as pd  
    2. In [14]: names = pd.Series(data)  
    3. In [16]: names  
    4. Out[16]:  
    5. 0    peter  
    6. 1     Paul  
    7. 2     None  
    8. 3     MARY  
    9. 4    gUIDO  
    10. dtype: object  
    We can now call a single method that will capitalize all the entries, while skipping over any missing values: 
    1. In [17]: names.str.capitalize()  
    2. Out[17]:  
    3. 0    Peter  
    4. 1     Paul  
    5. 2     None  
    6. 3     Mary  
    7. 4    Guido  
    8. dtype: object  
    Using tab completion on this str attribute will list all the vectorized string methods available to Pandas. 
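    Outside an interactive session, a rough equivalent of tab completion is to inspect the accessor programmatically (a minimal sketch, using the names series above):
    # List the public vectorized string methods available on .str
    [m for m in dir(names.str) if not m.startswith('_')]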

    Tables of Pandas String Methods 
    If you have a good understanding of string manipulation in Python, most of Pandas’ string syntax is intuitive enough that it’s probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties. The examples in this section use the following series of names: 
    1. In [20]: monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam', 'Eric Idle', 'Terry Jones', 'Michael Palin'])  
    Methods similar to Python string methods 
    Nearly all of Python’s built-in string methods are mirrored by a Pandas vectorized string method. For the full list of supported methods, refer to the Pandas documentation section "Working with Text Data". 

    Notice that these have various return values. Some, like lower(), return a series of strings: 
    1. In [21]: monte.str.lower()  
    2. Out[21]:  
    3. 0    graham chapman  
    4. 1       john cleese  
    5. 2     terry gilliam  
    6. 3         eric idle  
    7. 4       terry jones  
    8. 5     michael palin  
    9. dtype: object  
    But some others return numbers: 
    1. In [22]: monte.str.len()  
    2. Out[22]:  
    3. 0    14  
    4. 1    11  
    5. 2    13  
    6. 3     9  
    7. 4    11  
    8. 5    13  
    9. dtype: int64  
    Or Boolean values: 
    1. In [23]: monte.str.startswith('T')  
    2. Out[23]:  
    3. 0    False  
    4. 1    False  
    5. 2     True  
    6. 3    False  
    7. 4     True  
    8. 5    False  
    9. dtype: bool  
    Still others return lists or other compound values for each element: 
    1. In [24]: monte.str.split()  
    2. Out[24]:  
    3. 0    [Graham, Chapman]  
    4. 1       [John, Cleese]  
    5. 2     [Terry, Gilliam]  
    6. 3         [Eric, Idle]  
    7. 4       [Terry, Jones]  
    8. 5     [Michael, Palin]  
    9. dtype: object  
    We’ll see further manipulations of this kind of series-of-lists object as we continue our discussion. 

    Methods using regular expressions 
    In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python’s built-in re module: 
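    The full table of these methods isn't reproduced here; the regex-enabled ones include match(), extract(), findall(), replace(), contains(), count(), split(), and rsplit(). A minimal sketch of a few of them, using the monte series from above:
    monte.str.contains('Terry')     # does the pattern occur anywhere in each element?
    monte.str.count('[aeiou]')      # count pattern occurrences per element
    monte.str.replace(' ', '_')     # replace pattern occurrences per element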


    With these, you can do a wide range of interesting operations. For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element: 
    1. In [25]: monte.str.extract('([A-Za-z]+)')  
    2. Out[25]:  
    3.          0  
    4. 0   Graham  
    5. 1     John  
    6. 2    Terry  
    7. 3     Eric  
    8. 4    Terry  
    9. 5  Michael  
    Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (^) and end-of-string ($) regular expression characters: 
    1. In [26]: monte.str.findall(r'^[^AEIOU].*[^aeiou]$')  
    2. Out[26]:  
    3. 0    [Graham Chapman]  
    4. 1                  []  
    5. 2     [Terry Gilliam]  
    6. 3                  []  
    7. 4       [Terry Jones]  
    8. 5     [Michael Palin]  
    9. dtype: object  
    The ability to concisely apply regular expressions across Series or DataFrame entries opens up many possibilities for analysis and cleaning of data. 

    Miscellaneous methods 
    Finally, there are some miscellaneous methods that enable other convenient operations (see Table 3-5). 
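    Table 3-5 itself isn't reproduced here; among the methods it covers are get(), slice(), slice_replace(), cat(), repeat(), normalize(), pad(), wrap(), join(), and get_dummies(). A small sketch of two of them, using the same monte series:
    monte.str.pad(20, side='right', fillchar='.')   # pad each string out to width 20
    monte.str.split().str.join('_')                 # re-join the split pieces with '_'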


    Vectorized item access and slicing. The get() and slice() operations, in particular, enable vectorized element access from each array. For example, we can get a slice of the first three characters of each array using str.slice(0, 3). Note that this behavior is also available through Python’s normal indexing syntax—for example, df.str.slice(0, 3) is equivalent to df.str[0:3]: 
    1. In [27]: monte.str[0:3]  
    2. Out[27]:  
    3. 0    Gra  
    4. 1    Joh  
    5. 2    Ter  
    6. 3    Eri  
    7. 4    Ter  
    8. 5    Mic  
    9. dtype: object  
    Indexing via df.str.get(i) and df.str[i] is similar. 
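    For instance, both of the following return the first character of each name (a minimal illustration):
    monte.str.get(0)   # first character of each element
    monte.str[0]       # equivalent indexing syntax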

    These get() and slice() methods also let you access elements of arrays returned by split(). For example, to extract the last name of each entry, we can combine split() and get(): 
    1. In [28]: monte.str.split().str.get(-1)  
    2. Out[28]:  
    3. 0    Chapman  
    4. 1     Cleese  
    5. 2    Gilliam  
    6. 3       Idle  
    7. 4      Jones  
    8. 5      Palin  
    9. dtype: object  
    Indicator variables. Another method that requires a bit of extra explanation is the get_dummies() method. This is useful when your data has a column containing some sort of coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A=“born in America,” B=“born in the United Kingdom,” C=“likes cheese,” D=“likes spam”: 
    1. In [29]: full_monte = pd.DataFrame({'name': monte, 'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D']})  
    2.   
    3. In [30]: full_monte  
    4. Out[30]:  
    5.              name   info  
    6. 0  Graham Chapman  B|C|D  
    7. 1     John Cleese    B|D  
    8. 2   Terry Gilliam    A|C  
    9. 3       Eric Idle    B|D  
    10. 4     Terry Jones    B|C  
    11. 5   Michael Palin  B|C|D  
    The get_dummies() routine lets you quickly split out these indicator variables into a DataFrame: 
    1. In [31]: full_monte['info'].str.get_dummies('|')  
    2. Out[31]:  
    3.    A  B  C  D  
    4. 0  0  1  1  1  
    5. 1  0  1  0  1  
    6. 2  1  0  1  0  
    7. 3  0  1  0  1  
    8. 4  0  1  1  0  
    9. 5  0  1  1  1  
    With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data. 

    Example: Recipe Database 
    These vectorized string operations become most useful in the process of cleaning up messy, real-world data. Here I’ll walk through an example of that, using an open recipe database compiled from various sources on the Web. Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand. 

    The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well. As of spring 2016, this database is about 30 MB, and can be downloaded and unzipped with these commands: 
    # wget https://github.com/sameergarg/scala-elasticsearch/...onf/recipeitems-latest.json.gz
    # gunzip recipeitems-latest.json.gz
    # ls -hl
    total 136M
    -rw-r--r-- 1 root root 136M Dec 8 10:24 recipeitems-latest.json

    The database is in JSON format, so we will try pd.read_json to read it: 
    1. >>> import pandas as pd  
    2. >>> recipes = pd.read_json("recipeitems-latest.json")  
    3. Traceback (most recent call last):  
    4.   File "<stdin>", line 1, in <module>  
    5.   File "/usr/lib64/python2.7/site-packages/pandas/io/json/json.py", line 354, in read_json  
    6.     date_unit).parse()  
    7.   File "/usr/lib64/python2.7/site-packages/pandas/io/json/json.py", line 422, in parse  
    8.     self._parse_no_numpy()  
    9.   File "/usr/lib64/python2.7/site-packages/pandas/io/json/json.py", line 639, in _parse_no_numpy  
    10.     loads(json, precise_float=self.precise_float), dtype=None)  
    11. ValueError: Trailing data  
    Oops! We get a ValueError mentioning that there is “trailing data.” Searching for this error on the Internet, it seems that it’s due to using a file in which each line is itself a valid JSON, but the full file is not. Let’s check if this interpretation is true: 
    1. >>> with open('recipeitems-latest.json', 'r') as f:  
    2. ...     line = f.readline()  
    3. ...  
    4. >>> pd.read_json(line).shape  
    5. (2, 12)  
    Yes, apparently each line is a valid JSON, so we’ll need to string them together. One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with pd.read_json
    1. >>> with open('recipeitems-latest.json', 'r') as f: # read the entire file into a python array  
    2. ...     # Extract each line  
    3. ...     data = (line.strip() for line in f)  
    4. ...     # Reformat so each line is the element of a list  
    5. ...     data_json = "[{0}]".format(','.join(data))  
    6. ...  
    7. >>> recipes = pd.read_json(data_json) # read the result as a JSON  
    8. >>> recipes.shape  
    9. (173278, 17)  
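    As an aside, newer pandas versions (0.19 and later) can read this line-delimited JSON format directly, so the string-concatenation step is not strictly necessary; a hedged alternative sketch:
    # Read newline-delimited JSON: one JSON object per line
    recipes = pd.read_json('recipeitems-latest.json', lines=True)
    Either route produces the same DataFrame.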
    We see there are nearly 200,000 recipes, and 17 columns. Let’s take a look at one row to see what we have: 
    1. >>> recipes.iloc[0]  
    2. _id                              {u'$oid': u'5160756b96cc62079cc2db15'}  
    3. cookTime                                                          PT30M  
    4. creator                                                             NaN  
    5. dateModified                                                        NaN  
    6. datePublished                                                2013-03-11  
    7. description           Late Saturday afternoon, after Marlboro Man ha...  
    8. image                 http://static.thepioneerwoman.com/cooking/file...  
    9. ingredients           Biscuits\n3 cups All-purpose Flour\n2 Tablespo...  
    10. name                                    Drop Biscuits and Sausage Gravy  
    11. prepTime                                                          PT10M  
    12. recipeCategory                                                      NaN  
    13. recipeInstructions                                                  NaN  
    14. recipeYield                                                          12  
    15. source                                                  thepioneerwoman  
    16. totalTime                                                           NaN  
    17. ts                                            {u'$date': 1365276011104}  
    18. url                   http://thepioneerwoman.com/cooking/2013/03/dro...  
    19. Name: 0, dtype: object  
    There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web. In particular, the ingredient list is in string format; we’re going to have to carefully extract the information we’re interested in. Let’s start by taking a closer look at the ingredients: 
    1. In [10]: recipes.ingredients.str.len().describe()  
    2. Out[10]:  
    3. count    173278.000000  
    4. mean        244.617805  
    5. std         146.705075  
    6. min           0.000000  
    7. 25%         147.000000  
    8. 50%         221.000000  
    9. 75%         314.000000  
    10. max        9067.000000  
    11. Name: ingredients, dtype: float64  
    The ingredient lists average 250 characters long, with a minimum of 0 and a maximum of nearly 10,000 characters! Just out of curiosity, let’s see which recipe has the longest ingredient list: 
    1. In [13]: recipes.name[recipes.ingredients.str.len().idxmax()]  
    2. Out[13]: 'Carrot Pineapple Spice & Brownie Layer Cake with Whipped Cream & Cream Cheese Frosting and Marzipan Carrots'  
    That certainly looks like an involved recipe. 

    We can do other aggregate explorations; for example, let’s see how many of the recipes are for breakfast food: 
    1. In [14]: recipes.description.str.contains('[Bb]reakfast').sum()  
    2. Out[14]: 3524  
    Or how many of the recipes list cinnamon as an ingredient: 
    1. In [15]: recipes.ingredients.str.contains('[Cc]innamon').sum()  
    2. Out[15]: 10526  
    We could even look to see whether any recipes misspell the ingredient as “cinamon”: 
    1. In [16]: recipes.ingredients.str.contains('[Cc]inamon').sum()  
    2. Out[16]: 11  
    This is the type of essential data exploration that is possible with Pandas string tools. It is data munging like this that Python really excels at. 

    A simple recipe recommender 
    Let’s go a bit further, and start on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients. While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row. So we will cheat a bit: we’ll start with a list of common ingredients, and simply search to see whether they are in each recipe’s ingredient list. For simplicity, let’s just stick with herbs and spices for the time being: 
    1. In [17]: spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley', 'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']  
    We can then build a Boolean DataFrame consisting of True and False values, indicating whether this ingredient appears in the list: 
    1. In [22]: import re; spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE)) for spice in spice_list))  
    2.   
    3. In [23]: spice_df.head()  
    4. Out[23]:  
    5.     salt  pepper  oregano   sage  parsley  rosemary  tarragon  thyme  paprika  cumin  
    6. 0  False   False    False   True    False     False     False  False    False  False  
    7. 1  False   False    False  False    False     False     False  False    False  False  
    8. 2   True    True    False  False    False     False     False  False    False   True  
    9. 3  False   False    False  False    False     False     False  False    False  False  
    10. 4  False   False    False  False    False     False     False  False    False  False  
    Now, as an example, let’s say we’d like to find a recipe that uses parsley, paprika, and tarragon. We can compute this very quickly using the query() method of DataFrames: 
    1. In [24]: selection = spice_df.query('parsley & paprika & tarragon')  
    2. In [25]: len(selection)  
    3. Out[25]: 10  
    We find only 10 recipes with this combination; let’s use the index returned by this selection to discover the names of the recipes that have this combination: 
    1. In [26]: recipes.name[selection.index]  
    2. Out[26]:  
    3. 2069      All cremat with a Little Gem, dandelion and wa...  
    4. 74964                         Lobster with Thermidor butter  
    5. 93768      Burton's Southern Fried Chicken with White Gravy  
    6. 113926                     Mijo's Slow Cooker Shredded Beef  
    7. 137686                     Asparagus Soup with Poached Eggs  
    8. 140530                                 Fried Oyster Po’boys  
    9. 158475                Lamb shank tagine with herb tabbouleh  
    10. 158486                 Southern fried chicken in buttermilk  
    11. 163175            Fried Chicken Sliders with Pickles + Slaw  
    12. 165243                        Bar Tartine Cauliflower Salad  
    13. Name: name, dtype: object  
    Now that we have narrowed down our recipe selection by a factor of almost 20,000, we are in a position to make a more informed decision about what we’d like to cook for dinner. This points to the truism that in data science, cleaning and munging of real-world data often comprises the majority of the work, and Pandas provides the tools that can help you do this efficiently. 

    Supplement 
    High-Performance Pandas: eval() and query()
