Thursday, October 25, 2018

[ Py DS ] Ch3 - Data Manipulation with Pandas (Part2)


Source From Here 
Operating on Data in Pandas 
One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.). Pandas inherits much of this functionality from NumPy, and the ufuncs that we introduced earlier are key to this. 

Pandas includes a couple useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc. This means that keeping the context of data and combining data from different sources—both potentially error-prone tasks with raw NumPy arrays—become essentially foolproof ones with Pandas. We will additionally see that there are well-defined operations between one-dimensional Series structures and two-dimensional DataFrame structures. 

Ufuncs: Index Preservation 
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects. Let’s start by defining a simple Series and DataFrame on which to demonstrate this: 
In [23]: import pandas as pd

In [24]: import numpy as np

In [25]: rng = np.random.RandomState(42)

In [26]: ser = pd.Series(rng.randint(0, 10, 4))

In [27]: ser
Out[27]:
0    6
1    3
2    7
3    4
dtype: int32

In [28]: df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])

In [29]: df
Out[29]:
   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4
If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices preserved: 
In [30]: np.exp(ser)
Out[30]:
0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [31]: np.sin(df * np.pi / 4)
Out[31]:
          A             B         C             D
0 -1.000000  7.071068e-01  1.000000 -1.000000e+00
1 -0.707107  1.224647e-16  0.707107 -7.071068e-01
2 -0.707107  1.000000e+00 -0.707107  1.224647e-16
UFuncs: Index Alignment 
For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. This is very convenient when you are working with incomplete data, as we’ll see in some of the examples that follow. 

Index alignment in Series 
As an example, suppose we are combining two different data sources, and find only the top three US states by area and the top three US states by population:
In [32]: area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
    ...:                   'California': 423967}, name='area')
    ...: population = pd.Series({'California': 38332521, 'Texas': 26448193,
    ...:                         'New York': 19651127}, name='population')

In [33]: population / area
Out[33]:
Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64
The resulting array contains the union of indices of the two input arrays, which we could determine using standard Python set arithmetic on these indices: 
In [35]: area.index | population.index
Out[35]: Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
Any item for which one or the other does not have an entry is marked with NaN, or “Not a Number,” which is how Pandas marks missing data (see further discussion of missing data in “Handling Missing Data” on page 119). This index matching is implemented this way for any of Python’s built-in arithmetic expressions; any missing values are filled in with NaN by default:
In [36]: A = pd.Series([2, 4, 6], index=[0, 1, 2])
In [37]: B = pd.Series([1, 3, 4], index=[1, 2, 3])
In [38]: A + B
Out[38]:
0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
If using NaN values is not the desired behavior, we can modify the fill value using appropriate object methods in place of the operators. For example, calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification of the fill value for any elements in A or B that might be missing: 
In [39]: A.add(B, fill_value=0)
Out[39]:
0    2.0
1    5.0
2    9.0
3    4.0
dtype: float64
Index alignment in DataFrame 
A similar type of alignment takes place for both columns and indices when you are performing operations on DataFrames: 
In [40]: A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))

In [41]: A
Out[41]:
   A   B
0  1  11
1  5   1

In [42]: B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))

In [43]: B
Out[43]:
   B  A  C
0  4  0  9
1  5  8  0
2  9  2  6

In [44]: A + B
Out[44]:
      A     B   C
0   1.0  15.0 NaN
1  13.0   6.0 NaN
2   NaN   NaN NaN
Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted. As was the case with Series, we can use the associated object’s arithmetic method and pass any desired fill_value to be used in place of missing entries. Here we’ll fill with the mean of all values in A (which we compute by first stacking the rows of A): 
In [47]: fill = A.stack().mean()

In [48]: fill
Out[48]: 4.5

In [49]: A.mean()
Out[49]:
A    3.0
B    6.0
dtype: float64

In [50]: A.add(B, fill_value=fill)
Out[50]:
      A     B     C
0   1.0  15.0  13.5
1  13.0   6.0   4.5
2   6.5  13.5  10.5
Table 3-1 lists Python operators and their equivalent Pandas object methods:

Python operator | Pandas method(s)
+               | add()
-               | sub(), subtract()
*               | mul(), multiply()
/               | truediv(), div(), divide()
//              | floordiv()
%               | mod()
**              | pow()
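To make the equivalence concrete, here is a minimal sketch (reusing Series A and B shaped like the ones above) showing that the operator and its method counterpart agree, and that only the method form accepts a fill value:

import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 4], index=[1, 2, 3])

# The operator and its method counterpart produce identical results...
print((A * B).equals(A.mul(B)))   # True
# ...but only the method form can supply a fill value for missing entries:
print(A.mul(B, fill_value=1))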


Ufuncs: Operations Between DataFrame and Series 
When you are performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and one-dimensional NumPy array. Consider one common operation, where we find the difference of a two-dimensional array and one of its rows: 
In [51]: A = rng.randint(10, size=(3, 4))
In [52]: A
Out[52]:
array([[3, 8, 2, 4],
       [2, 6, 4, 8],
       [6, 1, 3, 8]])

In [53]: A - A[0]
Out[53]:
array([[ 0,  0,  0,  0],
       [-1, -2,  2,  4],
       [ 3, -7,  1,  4]])
According to NumPy’s broadcasting rules (see “Computation on Arrays: Broadcasting” on page 63), subtraction between a two-dimensional array and one of its rows is applied row-wise. In Pandas, the convention similarly operates row-wise by default: 
In [54]: df = pd.DataFrame(A, columns=list('QRST'))

In [55]: df
Out[55]:
   Q  R  S  T
0  3  8  2  4
1  2  6  4  8
2  6  1  3  8

In [56]: df - df.iloc[0]
Out[56]:
   Q  R  S  T
0  0  0  0  0
1 -1 -2  2  4
2  3 -7  1  4
If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the axis keyword:
In [59]: df.subtract(df['R'], axis=0)
Out[59]:
   Q  R  S  T
0 -5  0 -6 -4
1 -4  0 -2  2
2  5  0  2  7
Note that these DataFrame/Series operations, like the operations discussed before, will automatically align indices between the two elements: 
In [62]: df.iloc[0]
Out[62]:
Q    3
R    8
S    2
T    4
Name: 0, dtype: int32

In [63]: df.iloc[0, ::2]
Out[63]:
Q    3
S    2
Name: 0, dtype: int32

In [64]: df - df.iloc[0, ::2]
Out[64]:
     Q   R    S   T
0  0.0 NaN  0.0 NaN
1 -1.0 NaN  2.0 NaN
2  3.0 NaN  1.0 NaN
This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when you are working with heterogeneous and/or misaligned data in raw NumPy arrays. 

Handling Missing Data 
The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways. In this section, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python. Here and throughout the book, we’ll refer to missing data in general as null, NaN, or NA values. 

Trade-Offs in Missing Data Conventions 
A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame. Generally, they revolve around one of two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry. In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value. In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with –9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point specification. 

None of these approaches is without trade-offs: use of a separate mask array requires allocation of an additional Boolean array, which adds overhead in both storage and computation. A sentinel value reduces the range of valid values that can be represented, and may require extra (often non-optimized) logic in CPU and GPU arithmetic. Common special values like NaN are not available for all data types. As in most cases where no universally optimal choice exists, different languages and systems use different conventions. For example, the R language uses reserved bit patterns within each data type as sentinel values indicating missing data, while the SciDB system uses an extra byte attached to every cell to indicate a NA state. 

Missing Data in Pandas 
The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types. 

Pandas could have followed R’s lead in specifying bit patterns for each individual data type to indicate nullness, but this approach turns out to be rather unwieldy. While R contains four basic data types, NumPy supports far more than this: for example, while R has a single integer type, NumPy supports fourteen basic integer types once you account for available precisions, signedness, and endianness of the encoding. Reserving a specific bit pattern in all available NumPy types would lead to an unwieldy amount of overhead in special-casing various operations for various types, likely even requiring a new fork of the NumPy package. Further, for the smaller data types (such as 8-bit integers), sacrificing a bit to use as a mask will significantly reduce the range of values it can represent. 

NumPy does have support for masked arrays—that is, arrays that have a separate Boolean mask array attached for marking data as “good” or “bad.” Pandas could have derived from this, but the overhead in storage, computation, and code maintenance makes that an unattractive choice. 
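For reference, here is a minimal sketch of NumPy's masked-array approach (the one Pandas chose not to build on): a separate Boolean mask marks entries as invalid, and reductions skip them.

import numpy as np

# The mask array is extra storage carried alongside the data
data = np.ma.masked_array([1, 2, 3, 4], mask=[False, True, False, False])
print(data.sum())   # 8 -- the masked entry (2) is ignored
print(data.mean())  # 2.666...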

With these constraints in mind, Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object. This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest. 

None: Pythonic missing data 
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because None is a Python object, it cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects): 
In [1]: import numpy as np
In [2]: import pandas as pd

In [3]: vals1 = np.array([1, None, 3, 4])
In [4]: vals1
Out[4]: array([1, None, 3, 4], dtype=object)
This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects. While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types:
In [5]: for dtype in ['object', 'int']:
   ...:     print('dtype=', dtype)
   ...:     %timeit np.arange(1E6, dtype=dtype).sum()
   ...:     print()
   ...:
dtype= object
58.6 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype= int
2.21 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The use of Python objects in an array also means that if you perform aggregations like sum() or min() across an array with a None value, you will generally get an error which reflects the fact that addition between an integer and None is undefined. 
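A short illustration of this (the exact exception message varies across NumPy versions):

import numpy as np

vals1 = np.array([1, None, 3, 4])
try:
    vals1.sum()
except TypeError as e:
    # Addition between an integer and None is undefined at the Python level
    print(type(e), e)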

NaN: Missing numerical data 
The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation: 
In [7]: vals2 = np.array([1, np.nan, 3, 4])

In [8]: vals2.dtype
Out[8]: dtype('float64')
Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code. You should be aware that NaN is a bit like a data virus—it infects any other object it touches. Regardless of the operation, the result of arithmetic with NaN will be another NaN:
In [10]: 1 + np.nan
Out[10]: nan

In [11]: 0 * np.nan
Out[11]: nan
Note that this means that aggregates over the values are well defined (i.e., they don’t result in an error) but not always useful: 
In [13]: vals2.sum(), vals2.min(), vals2.max()
c:\users\johnlee\appdata\local\programs\python\python37\lib\site-packages\numpy\core\_methods.py:32: RuntimeWarning: invalid value encountered in reduce
  return umr_minimum(a, axis, None, out, keepdims, initial)
c:\users\johnlee\appdata\local\programs\python\python37\lib\site-packages\numpy\core\_methods.py:28: RuntimeWarning: invalid value encountered in reduce
  return umr_maximum(a, axis, None, out, keepdims, initial)
Out[13]: (nan, nan, nan)
NumPy does provide some special aggregations that will ignore these missing values:
In [4]: np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
Out[4]: (8.0, 1.0, 4.0)
Keep in mind that NaN is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types. 

NaN and None in Pandas 
NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate: 
In [3]: pd.Series([1, np.nan, 2, None])
Out[3]:
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
For types that don’t have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to np.nan, it will automatically be upcast to a floating-point type to accommodate the NA:
In [4]: x = pd.Series(range(2), dtype=int)
In [5]: x
Out[5]:
0    0
1    1
dtype: int32

In [6]: x[0] = None

In [7]: x
Out[7]:
0    NaN
1    1.0
dtype: float64
Notice that in addition to casting the integer array to floating point, Pandas automatically converts the None to a NaN value. (Be aware that there is a proposal to add a native integer NA to Pandas in the future; as of this writing, it has not been included.) While this type of magic may feel a bit hackish compared to the more unified approach to NA values in domain-specific languages like R, the Pandas sentinel/casting approach works quite well in practice and in my experience only rarely causes issues. Table 3-2 lists the upcasting conventions in Pandas when NA values are introduced:

Typeclass | Conversion when storing NAs | NA sentinel value
floating  | No change                   | np.nan
object    | No change                   | None or np.nan
integer   | Cast to float64             | np.nan
boolean   | Cast to object              | None or np.nan

Keep in mind that in Pandas, string data is always stored with an object dtype. 
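A quick check of this, as a minimal sketch:

import pandas as pd

s = pd.Series(['apple', 'banana', None])
print(s.dtype)     # object -- strings are stored as Python objects
print(s.isnull())  # the None entry is detected as missing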

Operating on Null Values 
As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are: 

isnull()
Detect missing values for an array-like object. This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetime-like arrays).

notnull()
Detect non-missing values for an array-like object. This function takes a scalar or array-like object and indicates whether values are valid (not missing, which is NaN in numeric arrays, None or NaN in object arrays, NaT in datetime-like arrays).

dropna()
Return a filtered version of the data.

fillna()
Return a copy of the data with missing values filled or imputed.

We will conclude this section with a brief exploration and demonstration of these routines. 

Detecting null values 
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data. For example: 
In [3]: data = pd.Series([1, np.nan, 'hello', None])
In [4]: data.isnull()
Out[4]:
0    False
1     True
2    False
3     True
dtype: bool
As mentioned in “Data Indexing and Selection” on page 107, Boolean masks can be used directly as a Series or DataFrame index: 
In [5]: data[data.notnull()]
Out[5]:
0        1
2    hello
dtype: object
The isnull() and notnull() methods produce similar Boolean results for DataFrames. 
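For example, a minimal sketch of isnull() on a small DataFrame, which returns an element-wise Boolean DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5]])
print(df.isnull())
#        0      1      2
# 0  False   True  False
# 1  False  False  False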

Dropping null values 
In addition to the masking used before, there are the convenience methods dropna() (which removes NA values) and fillna() (which fills in NA values). For a Series, the result is straightforward: 
In [6]: data.dropna()
Out[6]:
0        1
2    hello
dtype: object
For a DataFrame, there are more options. Consider the following DataFrame:
In [9]: df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
In [10]: df
Out[10]:
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
We cannot drop single values from a DataFrame; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so dropna() gives a number of options for a DataFrame. 

By default, dropna() will drop all rows in which any null value is present: 
In [11]: df.dropna()
Out[11]:
     0    1  2
1  2.0  3.0  5
Alternatively, you can drop NA values along a different axis; axis=1 drops all columns containing a null value: 
In [12]: df.dropna(axis='columns')
Out[12]:
   2
0  2
1  5
2  6
But this drops some good data as well; you might rather be interested in dropping rows or columns with all NA values, or a majority of NA values. This can be specified through the how or thresh parameters, which allow fine control of the number of nulls to allow through. 

The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values: 
In [13]: df[3] = np.nan
In [14]: df
Out[14]:
     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN

In [15]: df.dropna(axis='columns', how='all')
Out[15]:
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6
For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept: 
In [16]: df.dropna(axis='rows', thresh=3)
Out[16]:
     0    1  2   3
1  2.0  3.0  5 NaN
Here the first and last row have been dropped, because they contain only two non-null values. 

Filling null values 
Sometimes rather than dropping NA values, you’d rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced. 

Consider the following Series:
In [4]: data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

In [5]: data
Out[5]:
a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64
We can fill NA entries with a single value, such as zero: 
In [6]: data.fillna(0)
Out[6]:
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
We can specify a forward-fill to propagate the previous value forward: 
In [7]: data.fillna(method='ffill')
Out[7]:
a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64
Or we can specify a back-fill to propagate the next values backward: 
In [8]: data.fillna(method='bfill')
Out[8]:
a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
For DataFrames, the options are similar, but we can also specify an axis along which the fills take place: 
In [11]: df
Out[11]:
     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6

In [12]: df.fillna(method='ffill', axis=1)
Out[12]:
     0    1    2
0  1.0  1.0  2.0
1  2.0  3.0  5.0
2  NaN  4.0  6.0
Notice that if a previous value is not available during a forward fill, the NA value remains. 
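If you want those leading entries filled as well, one pattern (a sketch, not the only option) is to chain a constant fill after the forward fill:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
# Forward-fill along the columns, then fill whatever remains
# (a leading NA with no previous value) with a constant:
print(df.fillna(method='ffill', axis=1).fillna(0))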

Hierarchical Indexing 
Up to this point we’ve been focused primarily on one-dimensional and two-dimensional data, stored in Pandas Series and DataFrame objects, respectively. Often it is useful to go beyond this and store higher-dimensional data—that is, data indexed by more than one or two keys. While Pandas does provide Panel and Panel4D objects that natively handle three-dimensional and four-dimensional data (see “Panel Data” on page 141), a far more common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects. 

In this section, we’ll explore the direct creation of MultiIndex objects; considerations around indexing, slicing, and computing statistics across multiply indexed data; and useful routines for converting between simple and hierarchically indexed representations of your data. We begin with the standard imports: 
  1. In [13]: import pandas as pd  
  2. In [14]: import numpy as np  
A Multiply Indexed Series 
Let’s start by considering how we might represent two-dimensional data within a one-dimensional Series. For concreteness, we will consider a series of data where each point has a character and numerical key. 

The bad way 
Suppose you would like to track data about states from two different years. Using the Pandas tools we’ve already covered, you might be tempted to simply use Python tuples as keys: 
In [15]: index = [('California', 2000), ('California', 2010), ('New York', 2000), ('New York', 2010), ('Texas', 2000), ('Texas', 2010)]
In [16]: population = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
In [17]: pop = pd.Series(population, index=index)

In [18]: pop
Out[18]:
(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64
With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index: 
In [19]: pop[('California', 2010):('Texas', 2000)]
Out[19]:
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64
But the convenience ends there. For example, if you need to select all values from 2010, you’ll need to do some messy (and potentially slow) munging to make it happen: 
In [21]: pop[[i for i in pop.index if i[1] == 2010]]
Out[21]:
(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64
This produces the desired result, but is not as clean (or as efficient for large datasets) as the slicing syntax we’ve grown to love in Pandas. 

The better way: Pandas MultiIndex 
Fortunately, Pandas provides a better way. Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the type of operations we wish to have. We can create a multi-index from the tuples as follows: 
In [22]: index = pd.MultiIndex.from_tuples(index)

In [23]: index
Out[23]:
MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
Notice that the MultiIndex contains multiple levels of indexing—in this case, the state names and the years, as well as multiple labels for each data point which encode these levels. If we reindex our series with this MultiIndex, we see the hierarchical representation of the data: 
In [24]: pop = pop.reindex(index)

In [25]: pop
Out[25]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
Here the first two columns of the Series representation show the multiple index values, while the third column shows the data. Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it. 

Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation: 
In [26]: pop[:, 2010]
Out[26]:
California    37253956
New York      19378102
Texas         25145561
dtype: int64
The result is a singly indexed array with just the keys we’re interested in. This syntax is much more convenient (and the operation is much more efficient!) than the homespun tuple-based multi-indexing solution that we started with. We’ll now further discuss this sort of indexing operation on hierarchically indexed data. 

MultiIndex as extra dimension 
You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:
In [27]: pop_df = pop.unstack()
In [28]: pop_df
Out[28]:
                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561
Naturally, the stack() method provides the opposite operation: 
In [29]: pop_df.stack()
Out[29]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
Seeing this, you might wonder why we would bother with hierarchical indexing at all. The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame. Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame:
In [31]: pop_df = pd.DataFrame({'total': pop, 'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]})
In [32]: pop_df
Out[32]:
                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014
In addition, all the ufuncs and other functionality discussed in “Operating on Data in Pandas” on page 115 work with hierarchical indices as well. Here we compute the fraction of people under 18 by year, given the above data:
In [33]: f_u18 = pop_df['under18'] / pop_df['total']
In [34]: f_u18.unstack()
Out[34]:
                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568
This allows us to easily and quickly manipulate and explore even high-dimensional data. 

Methods of MultiIndex Creation 
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example: 
In [35]: df = pd.DataFrame(np.random.rand(4, 2), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=['data1', 'data2'])
In [36]: df
Out[36]:
        data1     data2
a 1  0.299141  0.084822
  2  0.459519  0.774651
b 1  0.952114  0.258307
  2  0.794663  0.276138
The work of creating the MultiIndex is done in the background. Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default: 
In [39]: data = {('California', 2000): 12345, ('California', 2010): 12346, ('Texas', 2000): 67890, ('Texas', 2010): 67891, ('New York', 2000): 1, ('New York', 2010): 2}

In [40]: pd.Series(data)
Out[40]:
California  2000    12345
            2010    12346
Texas       2000    67890
            2010    67891
New York    2000        1
            2010        2
dtype: int64
Nevertheless, it is sometimes useful to explicitly create a MultiIndex; we’ll see a couple of these methods here. 

Explicit MultiIndex constructors 
For more flexibility in how the index is constructed, you can instead use the class method constructors available in the pd.MultiIndex. For example, as we did before, you can construct the MultiIndex from a simple list of arrays, giving the index values within each level: 
In [41]: pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
Out[41]:
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
You can construct it from a list of tuples, giving the multiple index values of each point: 
In [42]: pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
Out[42]:
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
You can even construct it from a Cartesian product of single indices: 
In [43]: pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
Out[43]:
MultiIndex(levels=[['a', 'b'], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
You can pass any of these objects as the index argument when creating a Series or DataFrame, or to the reindex method of an existing Series or DataFrame. 
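For instance, a minimal sketch of handing one of these constructors directly to a Series:

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
ser = pd.Series(np.arange(4), index=mi)
print(ser)
# a  1    0
#    2    1
# b  1    2
#    2    3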

MultiIndex level names 
Sometimes it is convenient to name the levels of the MultiIndex. You can accomplish this by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact: 
In [9]: pop
Out[9]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [10]: pop.index.names = ['state', 'year']

In [11]: pop
Out[11]:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
With more involved datasets, this can be a useful way to keep track of the meaning of various index values. 

MultiIndex for columns 
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data: 
In [12]: index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])

In [13]: columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])

In [14]: data = np.round(np.random.randn(4, 6), 1)  # mock some data

In [15]: data[:, ::2] *= 10

In [16]: data += 37

In [17]: data
Out[17]:
array([[33. , 35. , 33. , 38.5, 45. , 36.3],
       [41. , 38.1, 41. , 37. , 36. , 36.3],
       [31. , 36.8, 39. , 37.9, 26. , 36.9],
       [60. , 36.1, 41. , 37.4, 39. , 38.3]])

In [18]: index
Out[18]:
MultiIndex(levels=[[2013, 2014], [1, 2]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['year', 'visit'])

In [19]: health_data = pd.DataFrame(data, index=index, columns=columns)

In [20]: health_data
Out[20]:
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      33.0  35.0  33.0  38.5  45.0  36.3
     2      41.0  38.1  41.0  37.0  36.0  36.3
2014 1      31.0  36.8  39.0  37.9  26.0  36.9
     2      60.0  36.1  41.0  37.4  39.0  38.3
For complicated records containing multiple labeled measurements across multiple times for many subjects (people, countries, cities, etc.), use of hierarchical rows and columns can be extremely convenient! 
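As a taste of that convenience, here is a hedged sketch (rebuilding a frame shaped like health_data with random values) showing that indexing with just the top column level pulls out one subject's entire sub-DataFrame:

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health_data = pd.DataFrame(np.round(np.random.randn(4, 6), 1) + 37,
                           index=index, columns=columns)
# Selecting a top-level column returns that subject's HR and Temp columns:
print(health_data['Guido'])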

Indexing and Slicing a MultiIndex 
Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions. We’ll first look at indexing multiply indexed Series, and then multiply indexed DataFrames. 

Multiply indexed Series 
Consider the multiply indexed Series of state populations we saw earlier: 
In [21]: pop
Out[21]:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
We can access single elements by indexing with multiple terms: 
In [22]: pop['California', 2000]
Out[22]: 33871648
The MultiIndex also supports partial indexing, or indexing just one of the levels in the index. The result is another Series, with the lower-level indices maintained: 
In [23]: pop['California']
Out[23]:
year
2000    33871648
2010    37253956
dtype: int64
Partial slicing is available as well, as long as the MultiIndex is sorted (see discussion in “Sorted and unsorted indices” on page 137): 
In [24]: pop.loc['California':'New York']
Out[24]:
state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64
With sorted indices, we can perform partial indexing on lower levels by passing an empty slice in the first index: 
In [25]: pop[:, 2000]
Out[25]:
state
California    33871648
New York      18976457
Texas         20851820
dtype: int64
Other types of indexing and selection (discussed in “Data Indexing and Selection” on page 107) work as well; for example, selection based on Boolean masks: 
In [26]: pop[pop > 22000000]
Out[26]:
state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64
Selection based on fancy indexing also works: 
In [27]: pop[['California', 'Texas']]
Out[27]:
state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64
Multiply indexed DataFrames 
A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical DataFrame from before: 
In [28]: health_data
Out[28]:
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      33.0  35.0  33.0  38.5  45.0  36.3
     2      41.0  38.1  41.0  37.0  36.0  36.3
2014 1      31.0  36.8  39.0  37.9  26.0  36.9
     2      60.0  36.1  41.0  37.4  39.0  38.3
Remember that columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns. For example, we can recover Guido’s heart rate data with a simple operation: 
In [29]: health_data['Guido', 'HR']
Out[29]:
year  visit
2013  1        33.0
      2        41.0
2014  1        39.0
      2        41.0
Name: (Guido, HR), dtype: float64
Also, as with the single-index case, we can use the loc, iloc, and ix indexers introduced in “Data Indexing and Selection” on page 107. For example: 
In [30]: health_data.iloc[:2, :2]
Out[30]:
subject      Bob
type          HR  Temp
year visit
2013 1      33.0  35.0
     2      41.0  38.1
These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices. For example: 
In [31]: health_data.loc[:, ('Bob', 'HR')]
Out[31]:
year  visit
2013  1        33.0
      2        41.0
2014  1        31.0
      2        60.0
Name: (Bob, HR), dtype: float64
Working with slices within these index tuples is not especially convenient; trying to create a slice within a tuple will lead to a syntax error: 
In [32]: health_data.loc[(:, 1), (:, 'HR')]
  File "", line 1
    health_data.loc[(:, 1), (:, 'HR')]
                     ^
SyntaxError: invalid syntax
You could get around this by building the desired slice explicitly using Python’s built-in slice() function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation. For example: 
In [33]: idx = pd.IndexSlice

In [34]: health_data.loc[idx[:, 1], idx[:, 'HR']]
Out[34]:
subject      Bob Guido   Sue
type          HR    HR    HR
year visit
2013 1      33.0  33.0  45.0
2014 1      31.0  39.0  26.0
There are so many ways to interact with data in multiply indexed Series and DataFrames, and as with many tools in this book the best way to become familiar with them is to try them out! 

Rearranging Multi-Indices 
One of the keys to working with multiply indexed data is knowing how to effectively transform the data. There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations. We saw a brief example of this in the stack() and unstack() methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we’ll explore them here. 

Sorted and unsorted indices 
Earlier, we briefly mentioned a caveat, but we should emphasize it more here. Many of the MultiIndex slicing operations will fail if the index is not sorted. Let’s take a look at this here. 

We’ll start by creating some simple multiply indexed data where the indices are not lexicographically sorted: 
In [3]: index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
In [4]: data = pd.Series(np.random.rand(6), index=index)
In [5]: data.index.names = ['char', 'int']

In [6]: data
Out[6]:
char  int
a     1      0.501981
      2      0.594293
c     1      0.779701
      2      0.517394
b     1      0.699419
      2      0.620710
dtype: float64
If we try to take a partial slice of this index, it will result in an error: 
In [7]: try:
   ...:     data['a':'b']
   ...: except KeyError as e:
   ...:     print(type(e))
   ...:     print(e)
   ...:
<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'
Although it is not entirely clear from the error message, this is the result of the MultiIndex not being sorted. For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order. Pandas provides a number of convenience routines to perform this type of sorting; examples are the sort_index() and sortlevel() methods of the DataFrame. We’ll use the simplest, sort_index(), here: 
In [6]: data = data.sort_index()
In [7]: data
Out[7]:
char  int
a     1      0.164083
      2      0.890591
b     1      0.461503
      2      0.991245
c     1      0.750072
      2      0.202940
dtype: float64
With the index sorted in this way, partial slicing will work as expected: 
In [8]: data['a':'b']
Out[8]:
char  int
a     1      0.164083
      2      0.890591
b     1      0.461503
      2      0.991245
dtype: float64
Stacking and unstacking indices 
As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use: 
In [20]: pop.unstack(level=0)
Out[20]:
      California  New York     Texas
2000    33871648  18976457  20851820
2010    37253956  19378102  25145561

In [21]: pop.unstack(level=1)
Out[21]:
                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561
The opposite of unstack() is stack(), which here can be used to recover the original series: 
In [26]: pop.unstack().stack()
Out[26]:
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
Index setting and resetting 
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method. Calling this on the population Series will result in a DataFrame with state and year columns holding the information that was formerly in the index. For clarity, we can optionally specify the name of the data for the column representation: 
In [27]: pop_flat = pop.reset_index(name='population')
In [38]: pop_flat
Out[38]:
        state  year  population
0  California  2000    33871648
1  California  2010    37253956
2    New York  2000    18976457
3    New York  2010    19378102
4       Texas  2000    20851820
5       Texas  2010    25145561

In [29]: pop_flat.__class__
Out[29]: pandas.core.frame.DataFrame
Often when you are working with data in the real world, the raw input data looks like this and it’s useful to build a MultiIndex from the column values. This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame
In [37]: pop_flat.set_index(['state', 'year'])
Out[37]:
                 population
state      year
California 2000    33871648
           2010    37253956
New York   2000    18976457
           2010    19378102
Texas      2000    20851820
           2010    25145561
In practice, I find this type of reindexing to be one of the more useful patterns when I encounter real-world datasets. 

Data Aggregations on Multi-Indices 
We’ve previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data, these can be passed a level parameter that controls which subset of the data the aggregate is computed on. For example, let’s return to our health data: 
In [48]: health_data
Out[48]:
subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      53.0  38.2  49.0  38.0  39.0  38.1
     2      45.0  37.6  21.0  38.0  35.0  37.6
2014 1      37.0  37.5  44.0  36.8  33.0  38.9
     2      32.0  37.4  35.0  35.9  35.0  37.4
Perhaps we’d like to average out the measurements in the two visits each year. We can do this by naming the index level we’d like to explore, in this case the year: 
In [49]: data_mean = health_data.mean(level='year')
In [50]: data_mean
Out[50]:
subject   Bob        Guido          Sue
type       HR   Temp    HR   Temp    HR   Temp
year
2013     49.0  37.90  35.0  38.00  37.0  37.85
2014     34.5  37.45  39.5  36.35  34.0  38.15
By further making use of the axis keyword, we can take the mean among levels on the columns as well: 
In [51]: data_mean.mean(axis=1, level='type')
Out[51]:
type         HR       Temp
year
2013  40.333333  37.916667
2014  36.000000  37.316667
Thus in two lines, we’ve been able to find the average heart rate and temperature measured among all subjects in all visits each year. This syntax is actually a shortcut to the GroupBy functionality, which we will discuss in “Aggregation and Grouping” on page 158. While this is a toy example, many real-world datasets have similar hierarchical structure.
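Because this is a shortcut to GroupBy, the same two aggregations can be spelled explicitly with groupby. A hedged sketch on a small random frame (assuming a pandas version contemporary with this post; later pandas releases deprecate the level= keyword on aggregations in favor of explicit groupby):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
cols = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                  names=['subject', 'type'])
health = pd.DataFrame(np.random.rand(4, 4), index=idx, columns=cols)

by_year = health.groupby(level='year').mean()           # same as mean(level='year')
by_type = by_year.groupby(axis=1, level='type').mean()  # same as mean(axis=1, level='type')
print(by_type)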
