Source From Here
Comparisons, Masks, and Boolean Logic
This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays. Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold. In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.
Example: Counting Rainy Days
Imagine you have a series of data that represents the amount of precipitation each day for a year in a given city. For example, here we’ll load the daily rainfall statistics for the city of Seattle in 2014, using Pandas:
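A minimal reconstruction of the loading step (the file path and PRCP column are assumptions based on the book’s dataset; PRCP is recorded in tenths of a millimeter):
- import numpy as np
- import pandas as pd
- # hypothetical path to the Seattle 2014 precipitation data
- rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
- inches = rainfall / 254  # 1/10 mm -> inches
- inches.shape  # -> (365,)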
The array contains 365 values, giving daily rainfall in inches from January 1 to December 31, 2014. As a first quick visualization, let’s look at the histogram of rainy days shown in Figure 2-6, which was generated using Matplotlib:
- %matplotlib inline
- import matplotlib.pyplot as plt
- plt.hist(inches, 40)
Figure 2-6. Histogram of 2014 rainfall in Seattle
This histogram gives us a general idea of what the data looks like: despite its reputation, the vast majority of days in Seattle saw near zero measured rainfall in 2014. But this doesn’t do a good job of conveying some information we’d like to see: for example, how many rainy days were there in the year? What is the average precipitation on those rainy days? How many days were there with more than half an inch of rain?
Digging into the data
One approach to this would be to answer these questions by hand: loop through the data, incrementing a counter each time we see values in some desired range. For reasons discussed throughout this chapter, such an approach is very inefficient, both from the standpoint of time writing code and time computing the result. We saw in “Computation on NumPy Arrays: Universal Functions” on page 50 that NumPy’s ufuncs can be used in place of loops to do fast element-wise arithmetic operations on arrays; in the same way, we can use other ufuncs to do element-wise comparisons over arrays, and we can then manipulate the results to answer the questions we have. We’ll leave the data aside for right now, and discuss some general tools in NumPy to use masking to quickly answer these types of questions.
Comparison Operators as ufuncs
In “Computation on NumPy Arrays: Universal Functions” on page 50 we introduced ufuncs, and focused in particular on arithmetic operators. We saw that using +, -, *, /, and others on arrays leads to element-wise operations. NumPy also implements comparison operators such as < (less than) and > (greater than) as element-wise ufuncs. The result of these comparison operators is always an array with a Boolean data type. All six of the standard comparison operations are available:
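- # a sketch with a small example array
- x = np.array([1, 2, 3, 4, 5])
- x < 3   # less than         -> array([ True,  True, False, False, False])
- x > 3   # greater than      -> array([False, False, False,  True,  True])
- x <= 3  # less or equal     -> array([ True,  True,  True, False, False])
- x >= 3  # greater or equal  -> array([False, False,  True,  True,  True])
- x != 3  # not equal         -> array([ True,  True, False,  True,  True])
- x == 3  # equal             -> array([False, False,  True, False, False])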
It is also possible to do an element-by-element comparison of two arrays, and to include compound expressions:
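- # compare the results of two ufunc expressions element by element
- (2 * x) == (x ** 2)  # -> array([False,  True, False, False, False])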
As in the case of arithmetic operators, the comparison operators are implemented as ufuncs in NumPy; for example, when you write x < 3, internally NumPy uses np.less(x, 3). A summary of the comparison operators and their equivalent ufunc is shown here:
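Operator    Equivalent ufunc
==          np.equal
!=          np.not_equal
<           np.less
<=          np.less_equal
>           np.greater
>=          np.greater_equal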
Just as in the case of arithmetic ufuncs, these will work on arrays of any size and shape. Here is a two-dimensional example:
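- rng = np.random.RandomState(0)  # seed chosen for reproducibility
- x = rng.randint(10, size=(3, 4))
- x
- # -> array([[5, 0, 3, 3],
- #           [7, 9, 3, 5],
- #           [2, 4, 7, 6]])
- x < 6  # Boolean array of the same shape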
In each case, the result is a Boolean array, and NumPy provides a number of straightforward patterns for working with these Boolean results.
Working with Boolean Arrays
Given a Boolean array, there are a host of useful operations you can do. We’ll work with x, the two-dimensional array we created earlier:
Counting entries
To count the number of True entries in a Boolean array, np.count_nonzero is useful:
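- np.count_nonzero(x < 6)  # -> 8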
We see that there are eight array entries that are less than 6. Another way to get at this information is to use np.sum; in this case, False is interpreted as 0, and True is interpreted as 1:
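- np.sum(x < 6)  # -> 8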
The benefit of sum() is that, as with other NumPy aggregation functions, this summation can be done along rows or columns as well:
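- # how many values are less than 6 in each row?
- np.sum(x < 6, axis=1)  # -> array([4, 2, 2])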
This counts the number of values less than 6 in each row of the matrix.
If we’re interested in quickly checking whether any or all the values are true, we can use (you guessed it) np.any() or np.all():
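- np.any(x > 8)   # are any values greater than 8?  -> True
- np.any(x < 0)   # are any values less than zero?  -> False
- np.all(x < 10)  # are all values less than 10?    -> True
- np.all(x == 6)  # are all values equal to 6?      -> False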
np.all() and np.any() can be used along particular axes as well. For example:
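- # are all values in each row less than 8?
- np.all(x < 8, axis=1)  # -> array([ True, False,  True])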
Here all the elements in the first and third rows are less than 8, while this is not the case for the second row.
Finally, a quick warning: as mentioned in “Aggregations: Min, Max, and Everything in Between” on page 58, Python has built-in sum(), any(), and all() functions. These have a different syntax than the NumPy versions, and in particular will fail or produce unintended results when used on multidimensional arrays. Be sure that you are using np.sum(), np.any(), and np.all() for these examples!
Boolean operators
We’ve already seen how we might count, say, all days with rain less than four inches, or all days with rain greater than two inches. But what if we want to know about all days with rain less than one inch and greater than half an inch? This is accomplished through Python’s bitwise logic operators, &, |, ^, and ~. Like with the standard arithmetic operators, NumPy overloads these as ufuncs that work element-wise on (usually Boolean) arrays.
For example, we can address this sort of compound question as follows:
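- np.sum((inches > 0.5) & (inches < 1))  # -> 29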
So we see that there are 29 days with rainfall between 0.5 and 1.0 inches. Note that the parentheses here are important—because of operator precedence rules, with parentheses removed this expression would be evaluated as follows, which results in an error:
- inches > (0.5 & inches) < 1
Using the equivalence of A AND B and NOT (NOT A OR NOT B) (De Morgan’s law, which you may remember if you’ve taken an introductory logic course), we can compute the same result in a different manner:
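- np.sum(~((inches <= 0.5) | (inches >= 1)))  # -> 29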
Combining comparison operators and Boolean operators on arrays can lead to a wide range of efficient logical operations. The following table summarizes the bitwise Boolean operators and their equivalent ufuncs:
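Operator    Equivalent ufunc
&           np.bitwise_and
|           np.bitwise_or
^           np.bitwise_xor
~           np.bitwise_not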
Using these tools, we might start to answer the types of questions we have about our weather data. Here are some examples of results we can compute when combining masking with aggregations:
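A sketch of such queries (the printed labels are illustrative):
- print("Number days without rain:      ", np.sum(inches == 0))
- print("Number days with rain:         ", np.sum(inches != 0))
- print("Days with more than 0.5 inches:", np.sum(inches > 0.5))
- print("Rainy days with < 0.2 inches:  ", np.sum((inches > 0) & (inches < 0.2)))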
Boolean Arrays as Masks
In the preceding section, we looked at aggregates computed directly on Boolean arrays. A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves. Returning to our x array from before, suppose we want an array of all values in the array that are less than, say, 5:
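- x < 5     # the Boolean mask
- x[x < 5]  # the mask used as an index -> array([0, 3, 3, 3, 2, 4])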
What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all the values in positions at which the mask array is True. We are then free to operate on these values as we wish. For example, we can compute some relevant statistics on our Seattle rain data:
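A sketch of such statistics (the day-172 summer cutoff is an approximation; June 21st is roughly the 172nd day of the year):
- rainy = (inches > 0)  # mask of all rainy days
- days = np.arange(365)
- summer = (days > 172) & (days < 262)  # mask of all summer days
- print("Median precip on rainy days in 2014 (inches):   ", np.median(inches[rainy]))
- print("Median precip on summer days in 2014 (inches):  ", np.median(inches[summer]))
- print("Maximum precip on summer days in 2014 (inches): ", np.max(inches[summer]))
- print("Median precip on non-summer rainy days (inches):", np.median(inches[rainy & ~summer]))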
By combining Boolean operations, masking operations, and aggregates, we can very quickly answer these sorts of questions for our dataset.
Fancy Indexing
In the previous sections, we saw how to access and modify portions of arrays using simple indices (e.g., arr[0]), slices (e.g., arr[:5]), and Boolean masks (e.g., arr[arr > 0]). In this section, we’ll look at another style of array indexing, known as fancy indexing. Fancy indexing is like the simple indexing we’ve already seen, but we pass arrays of indices in place of single scalars. This allows us to very quickly access and modify complicated subsets of an array’s values.
Exploring Fancy Indexing
Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at once. For example, consider the following array:
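- import numpy as np
- rand = np.random.RandomState(42)  # seed chosen for reproducibility
- x = rand.randint(100, size=10)
- print(x)  # -> [51 92 14 71 60 20 82 86 74 74]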
Suppose we want to access three different elements. We could do it like this:
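- [x[3], x[7], x[2]]  # -> [71, 86, 14]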
Alternatively, we can pass a single list or array of indices to obtain the same result:
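- ind = [3, 7, 4]
- x[ind]  # -> array([71, 86, 60])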
With fancy indexing, the shape of the result reflects the shape of the index arrays rather than the shape of the array being indexed:
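- ind = np.array([[3, 7],
-                 [4, 5]])
- x[ind]
- # -> array([[71, 86],
- #           [60, 20]])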
Fancy indexing also works in multiple dimensions. Consider the following array:
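- X = np.arange(12).reshape((3, 4))
- # pair a row index with a column index for each element
- row = np.array([0, 1, 2])
- col = np.array([2, 1, 3])
- X[row, col]  # -> array([ 2,  5, 11])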
Notice that the first value in the result is X[0, 2], the second is X[1, 1], and the third is X[2, 3]. The pairing of indices in fancy indexing follows all the broadcasting rules that were mentioned in “Computation on Arrays: Broadcasting” on page 63. So, for example, if we combine a column vector and a row vector within the indices, we get a two-dimensional result:
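- X[row[:, np.newaxis], col]
- # -> array([[ 2,  1,  3],
- #           [ 6,  5,  7],
- #           [10,  9, 11]])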
It is always important to remember with fancy indexing that the return value reflects the broadcasted shape of the indices, rather than the shape of the array being indexed.
Combined Indexing
For even more powerful operations, fancy indexing can be combined with the other indexing schemes we’ve seen:
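- X[2, [2, 0, 1]]   # fancy indexing combined with a simple index -> array([10,  8,  9])
- X[1:, [2, 0, 1]]  # fancy indexing combined with slicing
- mask = np.array([1, 0, 1, 0], dtype=bool)
- X[row[:, np.newaxis], mask]  # fancy indexing combined with masking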
All of these indexing options combined lead to a very flexible set of operations for accessing and modifying array values.
Example: Selecting Random Points
One common use of fancy indexing is the selection of subsets of rows from a matrix. For example, we might have an N by D matrix representing N points in D dimensions, such as the following points drawn from a two-dimensional normal distribution:
- import numpy as np
- rand = np.random.RandomState(42)  # `rand` is used but never defined in the excerpt; seed 42 is an assumption
- mean = [0, 0]
- cov = [[1, 2], [2, 5]]
- X = rand.multivariate_normal(mean, cov, 100)
- X.shape  # -> (100, 2)
Using the plotting tools we will discuss in Chapter 4, we can visualize these points as a scatter plot (Figure 2-7):
- %matplotlib inline
- import matplotlib.pyplot as plt
- import seaborn; seaborn.set() # for plot styling
- plt.scatter(X[:, 0], X[:, 1]);
Figure 2-7. Normally distributed points
Let’s use fancy indexing to select 20 random points. We’ll do this by first choosing 20 random indices with no repeats, and use these indices to select a portion of the original array:
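- indices = np.random.choice(X.shape[0], 20, replace=False)
- selection = X[indices]  # fancy indexing on the rows
- selection.shape  # -> (20, 2)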
Now to see which points were selected, let’s over-plot large circles at the locations of the selected points (Figure 2-8):
- plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
- plt.scatter(selection[:, 0], selection[:, 1], facecolor='none', s=200);
This sort of strategy is often used to quickly partition datasets, as is often needed in train/test splitting for validation of statistical models, and in sampling approaches to answering statistical questions.
Modifying Values with Fancy Indexing
Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array. For example, imagine we have an array of indices and we’d like to set the corresponding items in an array to some value:
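- x = np.arange(10)
- i = np.array([2, 1, 8, 4])
- x[i] = 99
- print(x)  # -> [ 0 99 99  3 99  5  6  7 99  9]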
We can use any assignment-type operator for this. For example:
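- x[i] -= 10
- print(x)  # -> [ 0 89 89  3 89  5  6  7 89  9]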
Notice, though, that repeated indices with these operations can cause some potentially unexpected results. Consider the following:
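- x = np.zeros(10)
- x[[0, 0]] = [4, 6]
- print(x)  # -> [6. 0. 0. 0. 0. 0. 0. 0. 0. 0.]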
Where did the 4 go? The result of this operation is to first assign x[0] = 4, followed by x[0] = 6. The result, of course, is that x[0] contains the value 6. Fair enough, but consider this operation:
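- i = [2, 3, 3, 4, 4, 4]
- x[i] += 1
- x  # -> array([6., 0., 1., 1., 1., 0., 0., 0., 0., 0.])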
You might expect that x[3] would contain the value 2, and x[4] would contain the value 3, as this is how many times each index is repeated. Why is this not the case? Conceptually, this is because x[i] += 1 is meant as shorthand for x[i] = x[i] + 1. x[i] + 1 is evaluated, and then the result is assigned to the indices in x. With this in mind, it is not the augmentation that happens multiple times, but the assignment, which leads to the rather nonintuitive results.
So what if you want the other behavior where the operation is repeated? For this, you can use the at() method of ufuncs (available since NumPy 1.8), and do the following:
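- x = np.zeros(10)
- np.add.at(x, i, 1)  # the addition is applied once per index, repeats included
- print(x)  # -> [0. 0. 1. 2. 3. 0. 0. 0. 0. 0.]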
Example: Binning Data
You can use these ideas to efficiently bin data to create a histogram by hand. For example, imagine we have 100 values and would like to quickly find where they fall within an array of bins. We could compute it using ufunc.at like this:
- np.random.seed(42)
- x = np.random.randn(100)
- # compute a histogram by hand
- bins = np.linspace(-5, 5, 20)
- counts = np.zeros_like(bins)
- # find the appropriate bin for each x
- i = np.searchsorted(bins, x)
- # add 1 to each of these bins
- np.add.at(counts, i, 1)
The counts now reflect the number of points within each bin—in other words, a histogram (Figure 2-9):
- # plot the results (drawstyle='steps' replaces linestyle='steps', which newer Matplotlib removed)
- plt.plot(bins, counts, drawstyle='steps');
Of course, it would be silly to have to do this each time you want to plot a histogram. This is why Matplotlib provides the plt.hist() routine, which does the same in a single line:
- plt.hist(x, bins, histtype='step');
This function will create a nearly identical plot to the one seen here. To compute the binning, Matplotlib uses the np.histogram function, which does a very similar computation to what we did before. Let’s compare the two here:
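- # in an IPython session; exact timings will vary by machine
- print("NumPy routine:")
- %timeit counts, edges = np.histogram(x, bins)
- print("Custom routine:")
- %timeit np.add.at(counts, np.searchsorted(bins, x), 1)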
Our own one-line algorithm is several times faster than the optimized algorithm in NumPy! How can this be? If you dig into the np.histogram source code (you can do this in IPython by typing np.histogram??), you’ll see that it’s quite a bit more involved than the simple search-and-count that we’ve done; this is because NumPy’s algorithm is more flexible, and in particular is designed for better performance when the number of data points becomes large:
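- x = np.random.randn(1000000)
- print("NumPy routine:")
- %timeit counts, edges = np.histogram(x, bins)
- print("Custom routine:")
- %timeit np.add.at(counts, np.searchsorted(bins, x), 1)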
What this comparison shows is that algorithmic efficiency is almost never a simple question. An algorithm efficient for large datasets will not always be the best choice for small datasets, and vice versa. But the advantage of coding this algorithm yourself is that with an understanding of these basic methods, you could use these building blocks to extend this to do some very interesting custom behaviors. The key to efficiently using Python in data-intensive applications is knowing about general convenience routines like np.histogram and when they’re appropriate, but also knowing how to make use of lower-level functionality when you need more pointed behavior.
Sorting Arrays
Up to this point we have been concerned mainly with tools to access and operate on array data with NumPy. This section covers algorithms related to sorting values in NumPy arrays. These algorithms are a favorite topic in introductory computer science courses: if you’ve ever taken one, you probably have had dreams (or, depending on your temperament, nightmares) about insertion sorts, selection sorts, merge sorts, quick sorts, bubble sorts, and many, many more. All are means of accomplishing a similar task: sorting the values in a list or array.
For example, a simple selection sort repeatedly finds the minimum value from a list, and makes swaps until the list is sorted. We can code this in just a few lines of Python:
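- import numpy as np
- def selection_sort(x):
-     for i in range(len(x)):
-         # index of the smallest remaining value
-         swap = i + np.argmin(x[i:])
-         (x[i], x[swap]) = (x[swap], x[i])
-     return x
- selection_sort(np.array([2, 1, 4, 3, 5]))  # -> array([1, 2, 3, 4, 5])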
As any first-year computer science major will tell you, the selection sort is useful for its simplicity, but is much too slow to be useful for larger arrays. For a list of N values, it requires N loops, each of which does on the order of N comparisons to find the swap value. In terms of the “big-O” notation often used to characterize these algorithms (see “Big-O Notation” on page 92), selection sort averages O(N^2): if you double the number of items in the list, the execution time will go up by about a factor of four.
Even selection sort, though, is much better than my all-time favorite sorting algorithm, the bogosort:
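- def bogosort(x):
-     # shuffle until the array happens to be sorted
-     while np.any(x[:-1] > x[1:]):
-         np.random.shuffle(x)
-     return x
- bogosort(np.array([2, 1, 4, 3, 5]))  # -> array([1, 2, 3, 4, 5])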
This silly sorting method relies on pure chance: it repeatedly applies a random shuffling of the array until the result happens to be sorted. With an average scaling of O(N × N!) (that’s N times N factorial), this should—quite obviously—never be used for any real computation. Fortunately, Python contains built-in sorting algorithms that are much more efficient than either of the simplistic algorithms just shown. We’ll start by looking at the Python built-ins, and then take a look at the routines included in NumPy and optimized for NumPy arrays.
Fast Sorting in NumPy: np.sort and np.argsort
Although Python has built-in sort and sorted functions to work with lists, we won’t discuss them here because NumPy’s np.sort function turns out to be much more efficient and useful for our purposes. By default np.sort uses an O(N log N) quicksort algorithm, though mergesort and heapsort are also available. For most applications, the default quicksort is more than sufficient.
To return a sorted version of the array without modifying the input, you can use np.sort:
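- x = np.array([2, 1, 4, 3, 5])
- np.sort(x)  # -> array([1, 2, 3, 4, 5]); x itself is left unchanged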
If you prefer to sort the array in-place, you can instead use the sort method of arrays:
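- x.sort()
- print(x)  # -> [1 2 3 4 5]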
A related function is argsort, which instead returns the indices of the sorted elements:
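- x = np.array([2, 1, 4, 3, 5])
- i = np.argsort(x)
- print(i)  # -> [1 0 3 2 4]
- x[i]      # fancy indexing with these indices yields the sorted array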
Sorting along rows or columns
A useful feature of NumPy’s sorting algorithms is the ability to sort along specific rows or columns of a multidimensional array using the axis argument. For example:
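- rand = np.random.RandomState(42)  # seed chosen for reproducibility
- X = rand.randint(0, 10, (4, 6))
- np.sort(X, axis=0)  # sort each column of X
- np.sort(X, axis=1)  # sort each row of X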
Keep in mind that this treats each row or column as an independent array, and any relationships between the row or column values will be lost!
Partial Sorts: Partitioning
Sometimes we’re not interested in sorting the entire array, but simply want to find the K smallest values in the array. NumPy provides this in the np.partition function. np.partition takes an array and a number K; the result is a new array with the smallest K values to the left of the partition, and the remaining values to the right, in arbitrary order:
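- x = np.array([7, 2, 3, 1, 6, 5, 4])
- np.partition(x, 3)
- # -> array([2, 1, 3, 4, 6, 5, 7]); order within each side is arbitrary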
Note that the first three values in the resulting array are the three smallest in the array, and the remaining array positions contain the remaining values. Within the two partitions, the elements have arbitrary order.
Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array:
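- np.partition(X, 2, axis=1)  # the two smallest values in each row come first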
The result is an array where the first two slots in each row contain the smallest values from that row, with the remaining values filling the remaining slots. Finally, just as there is an np.argsort that computes indices of the sort, there is an np.argpartition that computes indices of the partition. We’ll see this in action in the following section.
Example: k-Nearest Neighbors
Let’s quickly see how we might use this argsort function along multiple axes to find the nearest neighbors of each point in a set. We’ll start by creating a random set of 10 points on a two-dimensional plane. Using the standard convention, we’ll arrange these in a 10×2 array:
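- rand = np.random.RandomState(42)  # seed chosen for reproducibility
- X = rand.rand(10, 2)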
To get an idea of how these points look, let’s quickly scatter plot them (Figure 2-10):
- %matplotlib inline
- import matplotlib.pyplot as plt
- plt.scatter(X[:, 0], X[:, 1], s=100);
Now we’ll compute the distance between each pair of points. Recall that the squared distance between two points is the sum of the squared differences in each dimension; using the efficient broadcasting (“Computation on Arrays: Broadcasting” on page 63) and aggregation (“Aggregations: Min, Max, and Everything in Between” on page 58) routines provided by NumPy, we can compute the matrix of square distances in a single line of code:
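- dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)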
This operation has a lot packed into it, and it might be a bit confusing if you’re unfamiliar with NumPy’s broadcasting rules. When you come across code like this, it can be useful to break it down into its component steps:
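- # for each pair of points, compute differences in their coordinates
- differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
- differences.shape  # -> (10, 10, 2)
- # square the coordinate differences
- sq_differences = differences ** 2
- # sum the squared differences to get the squared distance
- dist_sq = sq_differences.sum(-1)
- dist_sq.shape  # -> (10, 10)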
Just to double-check what we are doing, we should see that the diagonal of this matrix (i.e., the set of distances between each point and itself) is all zero:
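- dist_sq.diagonal()  # -> array of ten zeros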
It checks out! With the pairwise square distances computed, we can now use np.argsort to sort along each row. The leftmost columns will then give the indices of the nearest neighbors:
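- nearest = np.argsort(dist_sq, axis=1)
- print(nearest)  # each row lists point indices from nearest to furthest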
Notice that the first column gives the numbers 0 through 9 in order: this is due to the fact that each point’s closest neighbor is itself, as we would expect.
By using a full sort here, we’ve actually done more work than we need to in this case. If we’re simply interested in the nearest k neighbors, all we need is to partition each row so that the smallest k + 1 squared distances come first, with larger distances filling the remaining positions of the array. We can do this with the np.argpartition function:
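- K = 2
- nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)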
In order to visualize this network of neighbors, let’s quickly plot the points along with lines representing the connections from each point to its two nearest neighbors (Figure 2-11):
- plt.scatter(X[:, 0], X[:, 1], s=100)
- # draw lines from each point to its two nearest neighbors
- K = 2
- for i in range(X.shape[0]):
-     for j in nearest_partition[i, :K+1]:
-         # plot a line from X[i] to X[j]
-         # use some zip magic to make it happen:
-         plt.plot(*zip(X[j], X[i]), color='black')
Each point in the plot has lines drawn to its two nearest neighbors. At first glance, it might seem strange that some of the points have more than two lines coming out of them: this is due to the fact that if point A is one of the two nearest neighbors of point B, this does not necessarily imply that point B is one of the two nearest neighbors of point A.
Structured Data: NumPy’s Structured Arrays
While often our data can be well represented by a homogeneous array of values, sometimes this is not the case. This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient storage for compound, heterogeneous data. While the patterns shown here are useful for simple operations, scenarios like this often lend themselves to the use of Pandas DataFrames, which we’ll explore in Chapter 3.
Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we’d like to store these values for use in a Python program. It would be possible to store these in three separate arrays:
- name = ['Alice', 'Bob', 'Cathy', 'Doug']
- age = [25, 45, 37, 19]
- weight = [55.0, 85.5, 68.0, 61.5]
But this is a bit clumsy. There’s nothing here that tells us that the three arrays are related; it would be more natural if we could use a single structure to store all of this data. NumPy can handle this through structured arrays, which are arrays with compound data types. Recall that previously we created a simple array using an expression like np.zeros(4, dtype=int); we can similarly create a structured array using a compound data type specification:
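- # a compound dtype: one string, one integer, one float per record
- data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
-                           'formats': ('U10', 'i4', 'f8')})
- print(data.dtype)  # -> [('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]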
Here 'U10' translates to “Unicode string of maximum length 10,” 'i4' translates to “4-byte (i.e., 32 bit) integer,” and 'f8' translates to “8-byte (i.e., 64 bit) float.” We’ll discuss other options for these type codes in the following section. Now that we’ve created an empty container array, we can fill the array with our lists of values:
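- data['name'] = name
- data['age'] = age
- data['weight'] = weight
- print(data)
- # -> [('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0)
- #     ('Doug', 19, 61.5)]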
As we had hoped, the data is now arranged together in one convenient block of memory. The handy thing with structured arrays is that you can now refer to values either by index or by name:
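- data['name']      # all names -> array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')
- data[0]           # first row -> ('Alice', 25, 55.0)
- data[-1]['name']  # name from the last row -> 'Doug'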
Using Boolean masking, this even allows you to do some more sophisticated operations such as filtering on age:
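- # names where age is under 30
- data[data['age'] < 30]['name']  # -> array(['Alice', 'Doug'], dtype='<U10')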
Note that if you’d like to do any operations that are any more complicated than these, you should probably consider the Pandas package, covered in the next chapter. As we’ll see, Pandas provides a DataFrame object, which is a structure built on NumPy arrays that offers a variety of useful data manipulation functionality similar to what we’ve shown here, as well as much, much more.
Creating Structured Arrays
Structured array data types can be specified in a number of ways. Earlier, we saw the dictionary method. For clarity, numerical types can be specified with Python types or NumPy dtypes instead:
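- np.dtype({'names': ('name', 'age', 'weight'),
-           'formats': ((np.str_, 10), int, np.float32)})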
A compound type can also be specified as a list of tuples:
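- np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])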
If the names of the types do not matter to you, you can specify the types alone in a comma-separated string:
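- np.dtype('S10,i4,f8')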
The shortened string format codes may seem confusing, but they are built on simple principles. The first (optional) character is < or >, which means “little endian” or “big endian,” respectively, and specifies the byte-ordering convention (endianness). The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see Table 2-4). The last character or characters represents the size of the object in bytes.
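Since Table 2-4 is not reproduced in this excerpt, here are the standard NumPy type characters for reference:
Character    Description              Example
'b'          Byte                     np.dtype('b')
'i'          Signed integer           np.dtype('i4') == np.int32
'u'          Unsigned integer         np.dtype('u1') == np.uint8
'f'          Floating point           np.dtype('f8') == np.float64
'c'          Complex floating point   np.dtype('c16') == np.complex128
'S', 'a'     String                   np.dtype('S5')
'U'          Unicode string           np.dtype('U') == np.str_
'V'          Raw data (void)          np.dtype('V') == np.void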
More Advanced Compound Types
It is possible to define even more advanced compound types. For example, you can create a type where each element contains an array or matrix of values. Here, we’ll create a data type with a mat component consisting of a 3×3 floating-point matrix:
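- tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
- X = np.zeros(1, dtype=tp)
- print(X[0])         # -> (0, [[0., 0., 0.], [0., 0., 0.], [0., 0., 0.]])
- print(X['mat'][0])  # the 3x3 matrix of the first record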
Now each element in the X array consists of an id and a 3×3 matrix. Why would you use this rather than a simple multidimensional array, or perhaps a Python dictionary? The reason is that this NumPy dtype directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program. If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you’ll probably find structured arrays quite useful!
RecordArrays: Structured Arrays with a Twist
NumPy also provides the np.recarray class, which is almost identical to the structured arrays just described, but with one additional feature: fields can be accessed as attributes rather than as dictionary keys. Recall that we previously accessed the ages by writing:
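- data['age']  # -> array([25, 45, 37, 19], dtype=int32)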
If we view our data as a record array instead, we can access this with slightly fewer keystrokes:
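- data_rec = data.view(np.recarray)
- data_rec.age  # -> array([25, 45, 37, 19], dtype=int32)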
The downside is that for record arrays, there is some extra overhead involved in accessing the fields, even when using the same syntax. We can see this here:
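- # in an IPython session; exact timings will vary by machine
- %timeit data['age']
- %timeit data_rec['age']
- %timeit data_rec.age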
Whether the more convenient notation is worth the additional overhead will depend on your own application.
Supplement
* FAQ - How can I open the interactive Matplotlib window in IPython notebook?
- %matplotlib tk