Operating on Data in Pandas
One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.). Pandas inherits much of this functionality from NumPy, and the ufuncs introduced earlier are key to this.
Pandas includes a couple of useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc. This means that keeping the context of data and combining data from different sources—both potentially error-prone tasks with raw NumPy arrays—become essentially foolproof with Pandas. We will additionally see that there are well-defined operations between one-dimensional Series structures and two-dimensional DataFrame structures.
Ufuncs: Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects. Let’s start by defining a simple Series and DataFrame on which to demonstrate this:
- In [23]: import pandas as pd
- In [24]: import numpy as np
- In [25]: rng = np.random.RandomState(42)
- In [26]: ser = pd.Series(rng.randint(0, 10, 4))
- In [27]: ser
- Out[27]:
- 0 6
- 1 3
- 2 7
- 3 4
- dtype: int32
- In [28]: df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns = ['A', 'B', 'C', 'D'])
- In [29]: df
- Out[29]:
- A B C D
- 0 6 9 2 6
- 1 7 4 3 7
- 2 7 2 5 4
- In [30]: np.exp(ser)
- Out[30]:
- 0 403.428793
- 1 20.085537
- 2 1096.633158
- 3 54.598150
- dtype: float64
- In [31]: np.sin(df * np.pi / 4)
- Out[31]:
- A B C D
- 0 -1.000000 7.071068e-01 1.000000 -1.000000e+00
- 1 -0.707107 1.224647e-16 0.707107 -7.071068e-01
- 2 -0.707107 1.000000e+00 -0.707107 1.224647e-16
For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. This is very convenient when you are working with incomplete data, as we’ll see in some of the examples that follow.
Index alignment in Series
As an example, suppose we are combining two different data sources, and find only the top three US states by area and the top three US states by population:
- In [32]: area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
- ...: 'California': 423967}, name='area')
- ...: population = pd.Series({'California': 38332521, 'Texas': 26448193,
- ...: 'New York': 19651127}, name='population')
- ...:
- ...:
- In [33]: population / area
- Out[33]:
- Alaska NaN
- California 90.413926
- New York NaN
- Texas 38.018740
- dtype: float64
The resulting Series contains the union of the indices of the two inputs, which we could determine directly using set arithmetic on the indices:
- In [35]: area.index | population.index
- Out[35]: Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
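Note that the | operator shown here performs the index union in older Pandas releases; in recent versions it is deprecated for this purpose in favor of the explicit union() method. A minimal, version-safe equivalent using the area and population Series defined above:
- area.index.union(population.index)
- # Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')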
- In [36]: A = pd.Series([2, 4, 6], index=[0, 1, 2])
- In [37]: B = pd.Series([1, 3, 4], index=[1, 2, 3])
- In [38]: A + B
- Out[38]:
- 0 NaN
- 1 5.0
- 2 9.0
- 3 NaN
- dtype: float64
If filling with NaN is not the desired behavior, the fill value can be modified by using the equivalent object method in place of the operator:
- In [39]: A.add(B, fill_value=0)
- Out[39]:
- 0 2.0
- 1 5.0
- 2 9.0
- 3 4.0
- dtype: float64
A similar type of alignment takes place for both columns and indices when you are performing operations on DataFrames:
- In [40]: A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
- In [41]: A
- Out[41]:
- A B
- 0 1 11
- 1 5 1
- In [42]: B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
- In [43]: B
- Out[43]:
- B A C
- 0 4 0 9
- 1 5 8 0
- 2 9 2 6
- In [44]: A + B
- Out[44]:
- A B C
- 0 1.0 15.0 NaN
- 1 13.0 6.0 NaN
- 2 NaN NaN NaN
Here we fill with the mean of all values in A (computed by first stacking the rows of A):
- In [47]: fill = A.stack().mean()
- In [48]: fill
- Out[48]: 4.5
- In [49]: A.mean()
- Out[49]:
- A 3.0
- B 6.0
- dtype: float64
- In [50]: A.add(B, fill_value=fill)
- Out[50]:
- A B C
- 0 1.0 15.0 13.5
- 1 13.0 6.0 4.5
- 2 6.5 13.5 10.5
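For reference, each Python operator used above has a corresponding Pandas object method, and the methods are what accept extras such as fill_value and axis. A short sketch of the standard correspondence (s1 and s2 are illustrative names):
- # +  -> add()             -  -> sub(), subtract()
- # *  -> mul(), multiply()    /  -> truediv(), div(), divide()
- # // -> floordiv()        %  -> mod()        ** -> pow()
- s1 = pd.Series([2, 4, 6], index=[0, 1, 2])
- s2 = pd.Series([1, 3, 4], index=[1, 2, 3])
- (s1 + s2).equals(s1.add(s2))    # True: operator and method give identical results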
Ufuncs: Operations Between DataFrame and Series
When you are performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained. Operations between a DataFrame and a Series are similar to operations between a two-dimensional and one-dimensional NumPy array. Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:
- In [51]: A = rng.randint(10, size=(3,4))
- In [52]: A
- Out[52]:
- array([[3, 8, 2, 4],
- [2, 6, 4, 8],
- [6, 1, 3, 8]])
- In [53]: A - A[0]
- Out[53]:
- array([[ 0, 0, 0, 0],
- [-1, -2, 2, 4],
- [ 3, -7, 1, 4]])
- In [54]: df = pd.DataFrame(A, columns=list('QRST'))
- In [55]: df
- Out[55]:
- Q R S T
- 0 3 8 2 4
- 1 2 6 4 8
- 2 6 1 3 8
- In [56]: df - df.iloc[0]
- Out[56]:
- Q R S T
- 0 0 0 0 0
- 1 -1 -2 2 4
- 2 3 -7 1 4
If you would instead like to operate column-wise, you can use the object methods mentioned earlier while specifying the axis keyword:
- In [59]: df.subtract(df['R'], axis=0)
- Out[59]:
- Q R S T
- 0 -5 0 -6 -4
- 1 -4 0 -2 2
- 2 5 0 2 7
- In [62]: df.iloc[0]
- Out[62]:
- Q 3
- R 8
- S 2
- T 4
- Name: 0, dtype: int32
- In [63]: df.iloc[0, ::2]
- Out[63]:
- Q 3
- S 2
- Name: 0, dtype: int32
- In [64]: df - df.iloc[0, ::2]
- Out[64]:
- Q R S T
- 0 0.0 NaN 0.0 NaN
- 1 -1.0 NaN 2.0 NaN
- 2 3.0 NaN 1.0 NaN
Handling Missing Data
The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing. To make matters even more complicated, different data sources may indicate missing data in different ways. In this section, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python. Here and in what follows, we'll refer to missing data in general as null, NaN, or NA values.
Trade-Offs in Missing Data Conventions
A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame. Generally, they revolve around one of two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry. In the masking approach, the mask might be an entirely separate Boolean array, or it
may involve appropriation of one bit in the data representation to locally indicate the null status of a value. In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with –9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value that is part of the IEEE floating-point specification.
None of these approaches is without trade-offs: use of a separate mask array requires allocation of an additional Boolean array, which adds overhead in both storage and computation. A sentinel value reduces the range of valid values that can be represented, and may require extra (often non-optimized) logic in CPU and GPU arithmetic. Common special values like NaN are not available for all data types. As in most cases where no universally optimal choice exists, different languages and systems use different conventions. For example, the R language uses reserved bit patterns within each data type as sentinel values indicating missing data, while the SciDB system uses an extra byte attached to every cell to indicate an NA state.
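To make the two strategies concrete, here is a small NumPy sketch (the -9999 sentinel is an arbitrary, illustrative choice):
- import numpy as np
- data = np.array([1.0, -9999.0, 3.0, 4.0])      # sentinel approach: -9999 marks a missing entry
- valid = data != -9999                           # recover a Boolean mask from the sentinel
- data[valid].mean()                              # 2.666..., computed over valid entries only
- mask = np.array([False, True, False, False])    # masking approach: a separate Boolean array
- data[~mask].mean()                              # same result, with the mask stored alongside the data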
Missing Data in Pandas
The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.
Pandas could have followed R’s lead in specifying bit patterns for each individual data type to indicate nullness, but this approach turns out to be rather unwieldy. While R contains four basic data types, NumPy supports far more than this: for example, while R has a single integer type, NumPy supports fourteen basic integer types once you account for available precisions, signedness, and endianness of the encoding. Reserving a specific bit pattern in all available NumPy types would lead to an unwieldy amount of overhead in special-casing various operations for various types, likely even requiring a new fork of the NumPy package. Further, for the smaller data types (such as 8-bit integers), sacrificing a bit to use as a mask will significantly reduce the range of values it can represent.
NumPy does have support for masked arrays—that is, arrays that have a separate Boolean mask array attached for marking data as “good” or “bad.” Pandas could have derived from this, but the overhead in storage, computation, and code maintenance makes that an unattractive choice.
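For completeness, a minimal example of the masked-array machinery mentioned above:
- import numpy.ma as ma
- arr = ma.masked_array([1, 2, 3, 4], mask=[False, True, False, False])
- arr.sum()    # 8: masked entries are ignored in reductions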
With these constraints in mind, Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object. This choice has some side effects, as we will see, but in practice it ends up being a good compromise in most cases of interest.
None: Pythonic missing data
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code. Because None is a Python object, it cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects):
- In [1]: import numpy as np
- In [2]: import pandas as pd
- In [3]: vals1 = np.array([1, None, 3, 4])
- In [4]: vals1
- Out[4]: array([1, None, 3, 4], dtype=object)
- In [5]: for dtype in ['object', 'int']:
- ...: print('dtype=', dtype)
- ...: %timeit np.arange(1E6, dtype=dtype).sum()
- ...: print()
- ...:
- dtype= object
- 58.6 ms ± 267 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
- dtype= int
- 2.21 ms ± 28.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
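Beyond being slower, the object dtype also means that aggregations over an array containing None will generally raise an error, since adding an integer and None is undefined:
- try:
-     vals1.sum()
- except TypeError as e:
-     print(e)    # unsupported operand type(s) for +: 'int' and 'NoneType'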
NaN: Missing numerical data
The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
- In [7]: vals2 = np.array([1, np.nan, 3, 4])
- In [8]: vals2.dtype
- Out[8]: dtype('float64')
- In [10]: 1 + np.nan
- Out[10]: nan
- In [11]: 0 * np.nan
- Out[11]: nan
- In [13]: vals2.sum(), vals2.min(), vals2.max()
- c:\users\johnlee\appdata\local\programs\python\python37\lib\site-packages\numpy\core\_methods.py:32: RuntimeWarning: invalid value encountered in reduce
- return umr_minimum(a, axis, None, out, keepdims, initial)
- c:\users\johnlee\appdata\local\programs\python\python37\lib\site-packages\numpy\core\_methods.py:28: RuntimeWarning: invalid value encountered in reduce
- return umr_maximum(a, axis, None, out, keepdims, initial)
- Out[13]: (nan, nan, nan)
- In [4]: np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
- Out[4]: (8.0, 1.0, 4.0)
NaN and None in Pandas
NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:
- In [3]: pd.Series([1, np.nan, 2, None])
- Out[3]:
- 0 1.0
- 1 NaN
- 2 2.0
- 3 NaN
- dtype: float64
- In [4]: x = pd.Series(range(2), dtype=int)
- In [5]: x
- Out[5]:
- 0 0
- 1 1
- dtype: int32
- In [6]: x[0] = None
- In [7]: x
- Out[7]:
- 0 NaN
- 1 1.0
- dtype: float64
Keep in mind that in Pandas, string data is always stored with an object dtype.
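As an aside, newer Pandas releases (1.0 and later) also provide optional nullable extension dtypes that keep their type when missing values are introduced, using the pd.NA sentinel; these are opt-in via an explicit dtype. A minimal sketch:
- s = pd.Series([1, None, 3], dtype='Int64')    # note the capital 'I' in the extension dtype
- s.dtype                                        # Int64: stays integer; the missing entry displays as <NA>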
Operating on Null Values
As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:
* isnull(): generates a Boolean mask indicating missing values
* notnull(): the opposite of isnull()
* dropna(): returns a filtered version of the data
* fillna(): returns a copy of the data with missing values filled or imputed
We will conclude this section with a brief exploration and demonstration of these routines.
Detecting null values
Pandas data structures have two useful methods for detecting null data: isnull() and notnull(). Either one will return a Boolean mask over the data. For example:
- In [3]: data = pd.Series([1, np.nan, 'hello', None])
- In [4]: data.isnull()
- Out[4]:
- 0 False
- 1 True
- 2 False
- 3 True
- dtype: bool
- In [5]: data[data.notnull()]
- Out[5]:
- 0 1
- 2 hello
- dtype: object
Dropping null values
In addition to the masking used before, there are the convenience methods dropna() (which removes NA values) and fillna() (which fills in NA values). For a Series, the result is straightforward:
- In [6]: data.dropna()
- Out[6]:
- 0 1
- 2 hello
- dtype: object
For a DataFrame, there are more options, because we cannot drop single values; we can only drop entire rows or columns. Consider the following DataFrame:
- In [9]: df = pd.DataFrame([[1, np.nan, 2], [2, 3, 5], [np.nan, 4, 6]])
- In [10]: df
- Out[10]:
- 0 1 2
- 0 1.0 NaN 2
- 1 2.0 3.0 5
- 2 NaN 4.0 6
By default, dropna() will drop all rows in which any null value is present:
- In [11]: df.dropna()
- Out[11]:
- 0 1 2
- 1 2.0 3.0 5
- In [12]: df.dropna(axis='columns')
- Out[12]:
- 2
- 0 2
- 1 5
- 2 6
The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify how='all', which will only drop rows/columns that are all null values. For finer-grained control, the thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept, as the last example below demonstrates:
- In [13]: df[3] = np.nan
- In [14]: df
- Out[14]:
- 0 1 2 3
- 0 1.0 NaN 2 NaN
- 1 2.0 3.0 5 NaN
- 2 NaN 4.0 6 NaN
- In [15]: df.dropna(axis='columns', how='all')
- Out[15]:
- 0 1 2
- 0 1.0 NaN 2
- 1 2.0 3.0 5
- 2 NaN 4.0 6
- In [16]: df.dropna(axis='rows', thresh=3)
- Out[16]:
- 0 1 2 3
- 1 2.0 3.0 5 NaN
Filling null values
Sometimes rather than dropping NA values, you’d rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced.
Consider the following Series:
- In [4]: data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
- In [5]: data
- Out[5]:
- a 1.0
- b NaN
- c 2.0
- d NaN
- e 3.0
- dtype: float64
- In [6]: data.fillna(0)
- Out[6]:
- a 1.0
- b 0.0
- c 2.0
- d 0.0
- e 3.0
- dtype: float64
- In [7]: data.fillna(method='ffill')
- Out[7]:
- a 1.0
- b 1.0
- c 2.0
- d 2.0
- e 3.0
- dtype: float64
- In [8]: data.fillna(method='bfill')
- Out[8]:
- a 1.0
- b 2.0
- c 2.0
- d 3.0
- e 3.0
- dtype: float64
- In [11]: df
- Out[11]:
- 0 1 2
- 0 1.0 NaN 2
- 1 2.0 3.0 5
- 2 NaN 4.0 6
- In [12]: df.fillna(method='ffill', axis=1)
- Out[12]:
- 0 1 2
- 0 1.0 1.0 2.0
- 1 2.0 3.0 5.0
- 2 NaN 4.0 6.0
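Note that in recent Pandas releases the method keyword to fillna() has been deprecated in favor of the dedicated ffill() and bfill() methods; assuming the data and df objects above, the equivalents are:
- data.ffill()         # same as data.fillna(method='ffill')
- df.ffill(axis=1)     # same as df.fillna(method='ffill', axis=1)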
Hierarchical Indexing
Up to this point we’ve been focused primarily on one-dimensional and two-dimensional data, stored in Pandas Series and DataFrame objects, respectively. Often it is useful to go beyond this and store higher-dimensional data—that is, data indexed by more than one or two keys. While older versions of Pandas provided Panel and Panel4D objects to natively handle three-dimensional and four-dimensional data, a far more common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.
In this section, we’ll explore the direct creation of MultiIndex objects; considerations around indexing, slicing, and computing statistics across multiply indexed data; and useful routines for converting between simple and hierarchically indexed representations of your data. We begin with the standard imports:
- In [13]: import pandas as pd
- In [14]: import numpy as np
Let’s start by considering how we might represent two-dimensional data within a one-dimensional Series. For concreteness, we will consider a series of data where each point has a character and numerical key.
The bad way
Suppose you would like to track data about states from two different years. Using the Pandas tools we’ve already covered, you might be tempted to simply use Python tuples as keys:
- In [15]: index = [('California', 2000), ('California', 2010), ('New York', 2000), ('New York', 2010), ('Texas', 2000), ('Texas', 2010)]
- In [16]: population = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
- In [17]: pop = pd.Series(population, index=index)
- In [18]: pop
- Out[18]:
- (California, 2000) 33871648
- (California, 2010) 37253956
- (New York, 2000) 18976457
- (New York, 2010) 19378102
- (Texas, 2000) 20851820
- (Texas, 2010) 25145561
- dtype: int64
With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:
- In [19]: pop[('California', 2010):('Texas', 2000)]
- Out[19]:
- (California, 2010) 37253956
- (New York, 2000) 18976457
- (New York, 2010) 19378102
- (Texas, 2000) 20851820
- dtype: int64
But if you need to select all values from 2010, for example, you'll need to do some messy (and potentially slow) munging to make it happen:
- In [21]: pop[[i for i in pop.index if i[1] == 2010]]
- Out[21]:
- (California, 2010) 37253956
- (New York, 2010) 19378102
- (Texas, 2010) 25145561
- dtype: int64
The better way: Pandas MultiIndex
Fortunately, Pandas provides a better way. Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the type of operations we wish to have. We can create a multi-index from the tuples as follows:
- In [22]: index = pd.MultiIndex.from_tuples(index)
- In [23]: index
- Out[23]:
- MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
- labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
- In [24]: pop = pop.reindex(index)
- In [25]: pop
- Out[25]:
- California 2000 33871648
- 2010 37253956
- New York 2000 18976457
- 2010 19378102
- Texas 2000 20851820
- 2010 25145561
- dtype: int64
Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:
- In [26]: pop[:, 2010]
- Out[26]:
- California 37253956
- New York 19378102
- Texas 25145561
- dtype: int64
MultiIndex as extra dimension
You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:
- In [27]: pop_df = pop.unstack()
- In [28]: pop_df
- Out[28]:
- 2000 2010
- California 33871648 37253956
- New York 18976457 19378102
- Texas 20851820 25145561
- In [29]: pop_df.stack()
- Out[29]:
- California 2000 33871648
- 2010 37253956
- New York 2000 18976457
- 2010 19378102
- Texas 2000 20851820
- 2010 25145561
- dtype: int64
Each extra level in a multi-index represents an extra dimension of data; we can take advantage of this to, for example, add another column of demographic data (say, population under 18) for each state and year:
- In [31]: pop_df = pd.DataFrame({'total':pop, 'under18':[9267089, 9284094, 4687374, 4318033, 5906301, 6879014]})
- In [32]: pop_df
- Out[32]:
- total under18
- California 2000 33871648 9267089
- 2010 37253956 9284094
- New York 2000 18976457 4687374
- 2010 19378102 4318033
- Texas 2000 20851820 5906301
- 2010 25145561 6879014
- In [33]: f_u18 = pop_df['under18'] / pop_df['total']
- In [34]: f_u18.unstack()
- Out[34]:
- 2000 2010
- California 0.273594 0.249211
- New York 0.247010 0.222831
- Texas 0.283251 0.273568
Methods of MultiIndex Creation
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:
- In [35]: df = pd.DataFrame(np.random.rand(4,2), index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]], columns=['data1', 'data2'])
- In [36]: df
- Out[36]:
- data1 data2
- a 1 0.299141 0.084822
- 2 0.459519 0.774651
- b 1 0.952114 0.258307
- 2 0.794663 0.276138
Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:
- In [39]: data = {('California', 2000): 12345, ('California', 2010): 12346, ('Texas', 2000): 67890, ('Texas', 2010): 67891, ('New York', 2000): 1, ('New York', 2010): 2}
- In [40]: pd.Series(data)
- Out[40]:
- California 2000 12345
- 2010 12346
- Texas 2000 67890
- 2010 67891
- New York 2000 1
- 2010 2
- dtype: int64
Explicit MultiIndex constructors
For more flexibility in how the index is constructed, you can instead use the class method constructors available on pd.MultiIndex. For example, as we did before, you can construct the MultiIndex from a simple list of arrays, giving the index values within each level:
- In [41]: pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
- Out[41]:
- MultiIndex(levels=[['a', 'b'], [1, 2]],
- labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
- In [42]: pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
- Out[42]:
- MultiIndex(levels=[['a', 'b'], [1, 2]],
- labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
- In [43]: pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
- Out[43]:
- MultiIndex(levels=[['a', 'b'], [1, 2]],
- labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
MultiIndex level names
Sometimes it is convenient to name the levels of the MultiIndex. You can accomplish this by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact:
- In [9]: pop
- Out[9]:
- California 2000 33871648
- 2010 37253956
- New York 2000 18976457
- 2010 19378102
- Texas 2000 20851820
- 2010 25145561
- dtype: int64
- In [10]: pop.index.names = ['state', 'year']
- In [11]: pop
- Out[11]:
- state year
- California 2000 33871648
- 2010 37253956
- New York 2000 18976457
- 2010 19378102
- Texas 2000 20851820
- 2010 25145561
- dtype: int64
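As mentioned, the level names can also be passed directly to any of the MultiIndex constructors, for example:
- pd.MultiIndex.from_product([['California', 'New York', 'Texas'], [2000, 2010]], names=['state', 'year'])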
MultiIndex for columns
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data:
- In [12]: index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
- In [13]: columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])
- In [14]: data = np.round(np.random.randn(4, 6), 1) # mock some data
- In [15]: data[:, ::2] *= 10
- In [16]: data += 37
- In [17]: data
- Out[17]:
- array([[33. , 35. , 33. , 38.5, 45. , 36.3],
- [41. , 38.1, 41. , 37. , 36. , 36.3],
- [31. , 36.8, 39. , 37.9, 26. , 36.9],
- [60. , 36.1, 41. , 37.4, 39. , 38.3]])
- In [18]: index
- Out[18]:
- MultiIndex(levels=[[2013, 2014], [1, 2]],
- labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
- names=['year', 'visit'])
- In [19]: health_data = pd.DataFrame(data, index=index, columns=columns)
- In [20]: health_data
- Out[20]:
- subject Bob Guido Sue
- type HR Temp HR Temp HR Temp
- year visit
- 2013 1 33.0 35.0 33.0 38.5 45.0 36.3
- 2 41.0 38.1 41.0 37.0 36.0 36.3
- 2014 1 31.0 36.8 39.0 37.9 26.0 36.9
- 2 60.0 36.1 41.0 37.4 39.0 38.3
Indexing and Slicing a MultiIndex
Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions. We’ll first look at indexing multiply indexed Series, and then multiply indexed DataFrames.
Multiply indexed Series
Consider the multiply indexed Series of state populations we saw earlier:
- In [21]: pop
- Out[21]:
- state year
- California 2000 33871648
- 2010 37253956
- New York 2000 18976457
- 2010 19378102
- Texas 2000 20851820
- 2010 25145561
- dtype: int64
- In [22]: pop['California', 2000]
- Out[22]: 33871648
- In [23]: pop['California']
- Out[23]:
- year
- 2000 33871648
- 2010 37253956
- dtype: int64
- In [24]: pop.loc['California':'New York']
- Out[24]:
- state year
- California 2000 33871648
- 2010 37253956
- New York 2000 18976457
- 2010 19378102
- dtype: int64
- In [25]: pop[:, 2000]
- Out[25]:
- state
- California 33871648
- New York 18976457
- Texas 20851820
- dtype: int64
- In [26]: pop[pop > 22000000]
- Out[26]:
- state year
- California 2000 33871648
- 2010 37253956
- Texas 2010 25145561
- dtype: int64
- In [27]: pop[['California', 'Texas']]
- Out[27]:
- state year
- California 2000 33871648
- 2010 37253956
- Texas 2000 20851820
- 2010 25145561
- dtype: int64
A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical DataFrame from before:
- In [28]: health_data
- Out[28]:
- subject Bob Guido Sue
- type HR Temp HR Temp HR Temp
- year visit
- 2013 1 33.0 35.0 33.0 38.5 45.0 36.3
- 2 41.0 38.1 41.0 37.0 36.0 36.3
- 2014 1 31.0 36.8 39.0 37.9 26.0 36.9
- 2 60.0 36.1 41.0 37.4 39.0 38.3
- In [29]: health_data['Guido', 'HR']
- Out[29]:
- year visit
- 2013 1 33.0
- 2 41.0
- 2014 1 39.0
- 2 41.0
- Name: (Guido, HR), dtype: float64
- In [30]: health_data.iloc[:2, :2]
- Out[30]:
- subject Bob
- type HR Temp
- year visit
- 2013 1 33.0 35.0
- 2 41.0 38.1
- In [31]: health_data.loc[:, ('Bob', 'HR')]
- Out[31]:
- year visit
- 2013 1 33.0
- 2 41.0
- 2014 1 31.0
- 2 60.0
- Name: (Bob, HR), dtype: float64
- In [32]: health_data.loc[(:, 1), (:, 'HR')]
- SyntaxError: invalid syntax
Trying to create a slice within an index tuple like this leads to a syntax error. A better way is to use the IndexSlice object, which Pandas provides for precisely this situation:
- In [33]: idx = pd.IndexSlice
- In [34]: health_data.loc[idx[:, 1], idx[:, 'HR']]
- Out[34]:
- subject Bob Guido Sue
- type HR HR HR
- year visit
- 2013 1 33.0 33.0 45.0
- 2014 1 31.0 39.0 26.0
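An equivalent, if more verbose, workaround is to build the slices explicitly with Python's built-in slice() function:
- health_data.loc[(slice(None), 1), (slice(None), 'HR')]    # same selection as the idx example above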
Rearranging Multi-Indices
One of the keys to working with multiply indexed data is knowing how to effectively transform the data. There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations. We saw a brief example of this in the stack() and unstack() methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we’ll explore them here.
Sorted and unsorted indices
Many of the MultiIndex slicing operations will fail if the index is not sorted; this is a caveat worth emphasizing, so let's take a closer look at it here.
We'll start by creating some simple multiply indexed data where the indices are not lexicographically sorted:
- In [3]: index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
- In [4]: data = pd.Series(np.random.rand(6), index=index)
- In [5]: data.index.names = ['char', 'int']
- In [6]: data
- Out[6]:
- char int
- a 1 0.501981
- 2 0.594293
- c 1 0.779701
- 2 0.517394
- b 1 0.699419
- 2 0.620710
- dtype: float64
- In [7]: try:
- ...: data['a':'b']
- ...: except KeyError as e:
- ...: print(type(e))
- ...: print(e)
- ...:
- <class 'pandas.errors.UnsortedIndexError'>
- 'Key length (1) was greater than MultiIndex lexsort depth (0)'
- In [6]: data = data.sort_index()
- In [7]: data
- Out[7]:
- char int
- a 1 0.164083
- 2 0.890591
- b 1 0.461503
- 2 0.991245
- c 1 0.750072
- 2 0.202940
- dtype: float64
- In [8]: data['a':'b']
- Out[8]:
- char int
- a 1 0.164083
- 2 0.890591
- b 1 0.461503
- 2 0.991245
- dtype: float64
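If you are unsure whether an index is sorted before slicing, you can check it directly; is_monotonic_increasing is a standard Index attribute:
- data.index.is_monotonic_increasing    # False before sort_index(), True after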
As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:
- In [20]: pop.unstack(level=0)
- Out[20]:
- California New York Texas
- 2000 33871648 18976457 20851820
- 2010 37253956 19378102 25145561
- In [21]: pop.unstack(level=1)
- Out[21]:
- 2000 2010
- California 33871648 37253956
- New York 18976457 19378102
- Texas 20851820 25145561
- In [26]: pop.unstack().stack()
- Out[26]:
- California 2000 33871648
- 2010 37253956
- New York 2000 18976457
- 2010 19378102
- Texas 2000 20851820
- 2010 25145561
- dtype: int64
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method. Calling this on the population Series will result in a DataFrame with state and year columns holding the information that was formerly in the index. For clarity, we can optionally specify the name of the data for the column representation:
- In [27]: pop_flat = pop.reset_index(name='population')
- In [38]: pop_flat
- Out[38]:
- state year population
- 0 California 2000 33871648
- 1 California 2010 37253956
- 2 New York 2000 18976457
- 3 New York 2010 19378102
- 4 Texas 2000 20851820
- 5 Texas 2010 25145561
- In [29]: pop_flat.__class__
- Out[29]: pandas.core.frame.DataFrame
- In [37]: pop_flat.set_index(['state', 'year'])
- Out[37]:
- population
- state year
- California 2000 33871648
- 2010 37253956
- New York 2000 18976457
- 2010 19378102
- Texas 2000 20851820
- 2010 25145561
Data Aggregations on Multi-Indices
We’ve previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data, these can be passed a level parameter that controls which subset of the data the aggregate is computed on. For example, let’s return to our health data:
- In [48]: health_data
- Out[48]:
- subject Bob Guido Sue
- type HR Temp HR Temp HR Temp
- year visit
- 2013 1 53.0 38.2 49.0 38.0 39.0 38.1
- 2 45.0 37.6 21.0 38.0 35.0 37.6
- 2014 1 37.0 37.5 44.0 36.8 33.0 38.9
- 2 32.0 37.4 35.0 35.9 35.0 37.4
- In [49]: data_mean = health_data.mean(level='year')
- In [50]: data_mean
- Out[50]:
- subject Bob Guido Sue
- type HR Temp HR Temp HR Temp
- year
- 2013 49.0 37.90 35.0 38.00 37.0 37.85
- 2014 34.5 37.45 39.5 36.35 34.0 38.15
- In [51]: data_mean.mean(axis=1, level='type')
- Out[51]:
- type HR Temp
- year
- 2013 40.333333 37.916667
- 2014 36.000000 37.316667
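Note that in recent Pandas releases the level keyword to aggregation methods has been removed; the same results are obtained with an explicit groupby. A sketch assuming the health_data and data_mean objects above:
- data_mean = health_data.groupby(level='year').mean()    # replaces health_data.mean(level='year')
- data_mean.T.groupby(level='type').mean().T              # replaces data_mean.mean(axis=1, level='type')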