程式扎記: [ Py DS ] Ch2 - Introduction to NumPy (Part1)

Source From Here
Preface
This chapter, along with Chapter 3, outlines techniques for effectively loading, storing, and manipulating in-memory data in Python. The topic is very broad: datasets can come from a wide range of sources and a wide range of formats, including collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else. Despite this apparent heterogeneity, it will help us to think of all data fundamentally as arrays of numbers.

For example, images—particularly digital images—can be thought of as simply two-dimensional arrays of numbers representing pixel brightness across the area. Sound clips can be thought of as one-dimensional arrays of intensity versus time. Text can be converted in various ways into numerical representations, perhaps binary digits representing the frequency of certain words or pairs of words. No matter what the data are, the first step in making them analyzable will be to transform them into arrays of numbers.

For this reason, efficient storage and manipulation of numerical arrays is absolutely fundamental to the process of doing data science. We’ll now take a look at the specialized tools that Python has for handling such numerical arrays: the NumPy package and the Pandas package (discussed in Chapter 3).

This chapter will cover NumPy in detail. NumPy (short for Numerical Python) provides an efficient interface to store and operate on dense data buffers. In some ways, NumPy arrays are like Python’s built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size. NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.

You can go to the NumPy website and follow the installation instructions found there. Once you do, you can import NumPy and double-check the version:

>>> import numpy
>>> numpy.__version__
'1.11.3'

Throughout this chapter, and indeed the rest of the book, you’ll find that this is the way we will import and use NumPy:

>>> import numpy as np

Understanding Data Types in Python
Effective data-driven science and computation requires understanding how data is stored and manipulated. This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy improves on this. Understanding this difference is fundamental to understanding much of the material throughout the rest of the book.

Users of Python are often drawn in by its ease of use, one piece of which is dynamic typing. While a statically typed language like C or Java requires each variable to be explicitly declared, a dynamically typed language like Python skips this specification. This sort of flexibility is one piece that makes Python and other dynamically typed languages convenient and easy to use. Understanding how this works is an important piece of learning to analyze data efficiently and effectively with Python. But what this type flexibility also points to is the fact that Python variables are more than just their value; they also contain extra information about the type of the value. We’ll explore this more in the sections that follow.

A Python Integer Is More Than Just an Integer
The standard Python implementation is written in C. This means that every Python object is simply a cleverly disguised C structure, which contains not only its value, but other information as well. For example, when we define an integer in Python, such as x = 10000, x is not just a “raw” integer. It’s actually a pointer to a compound C structure, which contains several values. Looking through the Python 3.4 source code, we find that the integer (long) type definition effectively looks like this (once the C macros are expanded):

struct _longobject {  
    long ob_refcnt;  
    PyTypeObject *ob_type;  
    size_t ob_size;  
    long ob_digit[1];  
};  

A single integer in Python 3.4 actually contains four pieces:

• ob_refcnt, a reference count that helps Python silently handle memory allocation and deallocation
• ob_type, which encodes the type of the variable
• ob_size, which specifies the size of the following data members
• ob_digit, which contains the actual integer value that we expect the Python variable to represent

This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C, as illustrated in Figure 2-1.

Here PyObject_HEAD is the part of the structure containing the reference count, type code, and other pieces mentioned before.

Notice the difference here: a C integer is essentially a label for a position in memory whose bytes encode an integer value. A Python integer is a pointer to a position in memory containing all the Python object information, including the bytes that contain the integer value. This extra information in the Python integer structure is what allows Python to be coded so freely and dynamically. All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects.
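One rough way to see this overhead is to ask CPython how many bytes an integer object occupies. The sketch below uses the standard sys.getsizeof function; the exact number is an implementation detail that varies with Python version and platform, so the printed value is only illustrative.

```python
import sys

x = 10000
# Size in bytes of the whole Python int object (header plus digits),
# not just the raw integer value.
int_object_size = sys.getsizeof(x)
print(int_object_size)

# For comparison, a raw 64-bit C integer needs only 8 bytes.
raw_c_int64_size = 8
print(int_object_size > raw_c_int64_size)
```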

A Python List Is More Than Just a List
Let’s consider now what happens when we use a Python data structure that holds many Python objects. The standard mutable multielement container in Python is the list. We can create a list of integers as follows:

In [1]: L = list(range(10))

In [2]: L
Out[2]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [3]: type(L[0])
Out[3]: int

Because of Python’s dynamic typing, we can even create heterogeneous lists:

In [4]: L3 = [True, "2", 3.0, 4]

In [5]: [type(item) for item in L3]
Out[5]: [bool, str, float, int]

But this flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information—that is, each item is a complete Python object. In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array. The difference between a dynamic-type list and a fixed-type (NumPy-style) array is illustrated in Figure 2-2.
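As a rough illustration of that cost, the sketch below compares a list of 1,000 Python integers with a NumPy array holding the same values. The variable names are made up for this example, and the list's byte count is specific to CPython and the platform; only the direction of the difference matters.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list stores pointers; each element is a separate, complete Python object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(item) for item in py_list)

# The array stores the raw 8-byte values in one contiguous buffer.
array_bytes = np_array.nbytes

print(list_bytes, array_bytes)
```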

Fixed-Type Arrays in Python
Python offers several different options for storing data in efficient, fixed-type data buffers. The built-in array module can be used to create dense arrays of a uniform type:

In [6]: import array

In [7]: L = list(range(10))

In [8]: A = array.array('i', L)

In [9]: A
Out[9]: array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Here 'i' is a type code indicating the contents are integers. Much more useful, however, is the ndarray object of the NumPy package. While Python’s array object provides efficient storage of array-based data, NumPy adds to this efficient operations on that data. We will explore these operations in later sections; here we’ll demonstrate several ways of creating a NumPy array.
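To make the "efficient operations" point concrete, here is a small sketch contrasting how the same operator behaves on the two array types. The variable names are illustrative only.

```python
import array
import numpy as np

values = [0, 1, 2, 3, 4]
A = array.array('i', values)
N = np.array(values)

# For array.array, * means sequence repetition, as with lists.
repeated = A * 2
# For a NumPy ndarray, * is element-wise arithmetic.
doubled = N * 2

print(repeated)
print(doubled)
```

The array module gives compact storage, but element-wise arithmetic on it still needs an explicit Python loop; the ndarray applies the operation to every element in one call.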

We’ll start with the standard NumPy import, under the alias np:

In [10]: import numpy as np

Creating Arrays from Python Lists
First, we can use np.array to create arrays from Python lists:

In [11]: np.array([1, 4, 2, 5, 3])
Out[11]: array([1, 4, 2, 5, 3])

Remember that unlike Python lists, NumPy is constrained to arrays that all contain the same type. If types do not match, NumPy will upcast if possible (here, integers are upcast to floating point):

In [12]: np.array([3.14, 4, 2, 3])
Out[12]: array([3.14, 4. , 2. , 3. ])
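A quick way to confirm what upcasting did is to inspect the dtype attribute of the result. The sketch below also uses np.result_type, which reports the dtype NumPy would choose when combining two given dtypes.

```python
import numpy as np

ints = np.array([1, 4, 2, 5, 3])
mixed = np.array([3.14, 4, 2, 3])

print(ints.dtype)   # an integer dtype; the exact width depends on the platform
print(mixed.dtype)  # float64

# The common dtype for a 32-bit integer and a 64-bit float.
common = np.result_type(np.int32, np.float64)
print(common)
```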

If we want to explicitly set the data type of the resulting array, we can use the dtype keyword:

In [14]: np.array([1, 2, 3], dtype='float32')
Out[14]: array([1., 2., 3.], dtype=float32)

Finally, unlike Python lists, NumPy arrays can explicitly be multidimensional; here’s one way of initializing a multidimensional array using a list of lists:

In [16]: np.array([range(i, i+3) for i in [2, 4, 6]])
Out[16]:
array([[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])

The inner lists are treated as rows of the resulting two-dimensional array.

Creating Arrays from Scratch
Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy. Here are several examples:

# Create a length-10 integer array filled with zeros
In [17]: np.zeros(10, dtype=int)
Out[17]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Create a 3x5 floating-point array filled with 1s
In [18]: np.ones((3, 5), dtype=float)
Out[18]:
array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])

# Create a 3x5 array filled with 3.14
In [19]: np.full((3, 5), 3.14)
Out[19]:
array([[3.14, 3.14, 3.14, 3.14, 3.14],
[3.14, 3.14, 3.14, 3.14, 3.14],
[3.14, 3.14, 3.14, 3.14, 3.14]])

# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
In [20]: np.arange(0, 20, 2)
Out[20]: array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])

# Create an array of five values evenly spaced between 0 and 1
In [21]: np.linspace(0, 1, 5)
Out[21]: array([0. , 0.25, 0.5 , 0.75, 1. ])

# Create a 3x3 array of uniformly distributed
# random values between 0 and 1
In [22]: np.random.random((3, 3))
Out[22]:
array([[0.58672574, 0.17019719, 0.16979153],
[0.67292611, 0.82211835, 0.46544083],
[0.52492393, 0.83406001, 0.87660246]])

# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
In [23]: np.random.normal(0, 1, (3, 3))
Out[23]:
array([[-1.26474069, -1.66540897, 0.69832817],
[-1.19743365, 0.39172697, 1.10055077],
[-0.61757546, 1.17670274, -1.06770586]])

# Create a 3x3 array of random integers in the interval [0, 10)
In [24]: np.random.randint(0, 10, (3, 3))
Out[24]:
array([[0, 1, 9],
[2, 0, 7],
[0, 7, 0]])

# Create a 3x3 identity matrix
In [25]: np.eye(3)
Out[25]:
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])

# Create an uninitialized array of three floating-point values
# (np.empty defaults to a float dtype). The values will be whatever
# happens to already exist at that memory location
In [26]: np.empty(3)
Out[26]: array([1., 1., 1.])
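Because np.empty does not initialize its buffer, it is worth checking the dtype and filling the array before reading from it. A minimal sketch:

```python
import numpy as np

a = np.empty(3)             # default dtype is float64
b = np.empty(3, dtype=int)  # request integers explicitly

print(a.dtype, b.dtype)

# Contents are arbitrary until assigned, so fill before reading.
b[:] = 0
print(b)
```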

NumPy Standard Data Types
NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations. Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages. The standard NumPy data types are listed in Table 2-1. Note that when constructing an array, you can specify them using a string:

np.zeros(10, dtype='int16')  

Or using the associated NumPy object:

np.zeros(10, dtype=np.int16)  

More advanced type specification is possible, such as specifying big or little endian numbers; for more information, refer to the NumPy documentation. NumPy also supports compound data types, which will be covered in later sections.
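The limits of a given type can be queried directly rather than memorized. The following sketch uses np.iinfo and np.finfo, and also checks that the string and object forms of a dtype describe the same type.

```python
import numpy as np

# A dtype given as a string and as a NumPy object describe the same type.
same_type = np.dtype('int16') == np.dtype(np.int16)
print(same_type)

# np.iinfo reports the representable range of an integer type.
info = np.iinfo(np.int16)
print(info.min, info.max)

# np.finfo does the same for floating-point types.
print(np.finfo(np.float32).eps)
```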

The Basics of NumPy Arrays
Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas (Chapter 3) are built around the NumPy array. This section will present several examples using NumPy array manipulation to access data and subarrays, and to split, reshape, and join the arrays. While the types of operations shown here may seem a bit dry and pedantic, they comprise the building blocks of many other examples used throughout the book. Get to know them well!

We’ll cover a few categories of basic array manipulations here:

* Attributes of arrays: Determining the size, shape, memory consumption, and data types of arrays
* Indexing of arrays: Getting and setting the value of individual array elements
* Slicing of arrays: Getting and setting smaller subarrays within a larger array
* Reshaping of arrays: Changing the shape of a given array
* Joining and splitting of arrays: Combining multiple arrays into one, and splitting one array into many

NumPy Array Attributes
First let’s discuss some useful array attributes. We’ll start by defining three random arrays: a one-dimensional, two-dimensional, and three-dimensional array. We’ll use NumPy’s random number generator, which we will seed with a set value in order to ensure that the same random arrays are generated each time this code is run:
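The listing that defines these arrays does not appear here. A sketch consistent with the outputs shown later (a six-element x1 and an x3 of shape (3, 4, 5)) could look like the following; the seed value 0 and the upper bound of 10 are assumptions.

```python
import numpy as np

np.random.seed(0)  # assumed seed value, for reproducibility

x1 = np.random.randint(10, size=6)          # one-dimensional array
x2 = np.random.randint(10, size=(3, 4))     # two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # three-dimensional array

print(x1.shape, x2.shape, x3.shape)
```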

Each array has attributes ndim (the number of dimensions), shape (the size of each dimension), and size (the total size of the array):

In [34]: print("x3 ndim={}; shape={}; size={}; dtype={} ".format(x3.ndim, x3.shape, x3.size, x3.dtype))
x3 ndim=3; shape=(3, 4, 5); size=60; dtype=int32

Other attributes include itemsize, which lists the size (in bytes) of each array element, and nbytes, which lists the total size (in bytes) of the array:

In [35]: print("itemsize: {}; nbytes: {}".format(x3.itemsize, x3.nbytes))
itemsize: 4; nbytes: 240

In general, we expect that nbytes is equal to itemsize times size.
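This relationship is easy to verify. The sketch below fixes the dtype explicitly, since the default integer width differs between platforms.

```python
import numpy as np

# Explicit dtype so the result does not depend on the platform default.
a = np.zeros((3, 4, 5), dtype=np.int32)

print(a.itemsize)  # 4 bytes per int32 element
print(a.size)      # 3 * 4 * 5 = 60 elements
print(a.nbytes)    # 240
```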

Array Indexing: Accessing Single Elements
If you are familiar with Python’s standard list indexing, indexing in NumPy will feel quite familiar. In a one-dimensional array, you can access the ith value (counting from zero) by specifying the desired index in square brackets, just as with Python lists:

In [36]: x1
Out[36]: array([5, 0, 3, 3, 7, 9])

In [37]: x1[0]
Out[37]: 5

In [38]: x1[4]
Out[38]: 7

# To index from the end of the array, you can use negative indices:
In[8]: x1[-1]
Out[8]: 9

In[9]: x1[-2]
Out[9]: 7

# In a multidimensional array, you access items using a comma-separated tuple of indices:
In[10]: x2
Out[10]: array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])

In[11]: x2[0, 0]
Out[11]: 3

In[12]: x2[2, 0]
Out[12]: 1

In[13]: x2[2, -1]
Out[13]: 7

# You can also modify values using any of the above index notation:
In[14]: x2[0, 0] = 12

In[15]: x2
Out[15]: array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])

Keep in mind that, unlike Python lists, NumPy arrays have a fixed type. This means, for example, that if you attempt to insert a floating-point value into an integer array, the value will be silently truncated. Don’t be caught unaware by this behavior!

In[16]: x1[0] = 3.14159 # this will be truncated!
In[17]: x1
Out[17]: array([3, 0, 3, 3, 7, 9])
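Note that the value is truncated toward zero rather than rounded. If the fractional part matters, one option is to convert the array to a floating-point type first, as in this sketch (variable names are illustrative):

```python
import numpy as np

x = np.array([5, 0, 3])
x[0] = 3.14159   # stored as 3
x[1] = -2.7      # stored as -2: truncation is toward zero, not rounding

print(x)

# Converting to a floating-point array first keeps the fractional part.
xf = x.astype(float)
xf[0] = 3.14159
print(xf)
```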

Array Slicing: Accessing Subarrays
Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon (:) character. The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x, use this:

x[start:stop:step]

If any of these are unspecified, they default to the values start=0, stop=size of dimension, step=1. We’ll take a look at accessing subarrays in one dimension and in multiple dimensions.

One-dimensional subarrays

In [2]: x = np.arange(10)

In [3]: x
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [4]: x[:5]
Out[4]: array([0, 1, 2, 3, 4])

In [5]: x[5:]
Out[5]: array([5, 6, 7, 8, 9])

In [6]: x[4:7]
Out[6]: array([4, 5, 6])

In [8]: x[::2]
Out[8]: array([0, 2, 4, 6, 8])

In [9]: x[1::2]
Out[9]: array([1, 3, 5, 7, 9])

A potentially confusing case is when the step value is negative. In this case, the defaults for start and stop are swapped. This becomes a convenient way to reverse an array:

In [10]: x[::-1] # all elements, reversed
Out[10]: array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

In [11]: x[5::-2] # Reversed every other from index 5
Out[11]: array([5, 3, 1])
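The defaults for omitted slice values can be inspected with the built-in slice.indices method, which resolves a slice against a given length. This is plain Python rather than anything NumPy-specific, but the same rules apply to one-dimensional arrays.

```python
import numpy as np

x = np.arange(10)

# slice.indices(n) resolves omitted values for a sequence of length n.
forward = slice(None, None, None).indices(len(x))
backward = slice(None, None, -1).indices(len(x))
print(forward)   # (0, 10, 1)
print(backward)  # (9, -1, -1)

# With a negative step the slice starts at the last element and runs to the first.
reversed_x = x[::-1]
print(reversed_x)
```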

Multidimensional subarrays
Multidimensional slices work in the same way, with multiple slices separated by commas. For example:

In [18]: x2
Out[18]:
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])

In [19]: x2[:2, :3] # two rows, three columns
Out[19]:
array([[5, 0, 3],
[7, 9, 3]])

In [20]: x2[:, ::3] # all rows, every third column
Out[20]:
array([[5, 3],
[7, 5],
[2, 6]])

In [21]: x2[::-1, ::-1] # Finally, subarray dimensions can even be reversed together
Out[21]:
array([[6, 7, 4, 2],
[5, 3, 9, 7],
[3, 3, 0, 5]])

Accessing array rows and columns
One commonly needed routine is accessing single rows or columns of an array. You can do this by combining indexing and slicing, using an empty slice marked by a single colon (:):

In [22]: print(x2[:, 0]) # first column of x2
[5 7 2]

In [23]: print(x2[0, :]) # first row of x2
[5 0 3 3]

In [24]: print(x2[0]) # equivalent to x2[0, :]
[5 0 3 3]

Subarrays as no-copy views
One important—and extremely useful—thing to know about array slices is that they return views rather than copies of the array data. This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies. Consider our two-dimensional array from before:

In [25]: x2
Out[25]:
array([[5, 0, 3, 3],
[7, 9, 3, 5],
[2, 4, 7, 6]])

Let’s extract a 2×2 subarray from this:

In [26]: x2_sub = x2[:2, :2]

In [27]: print(x2_sub)
[[5 0]
[7 9]]

Now if we modify this subarray, we’ll see that the original array is changed! Observe:

In [8]: x2_sub[0, 0] = 99

In [9]: print(x2_sub)
[[99 0]
[ 7 9]]

In [10]: print(x2)
[[99 0 3 3]
[ 7 9 3 5]
[ 2 4 7 6]]

This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.
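Whether two arrays share a data buffer can be checked directly. The sketch below uses np.shares_memory and the base attribute; the names sub_view and sub_copy are illustrative.

```python
import numpy as np

x2 = np.array([[5, 0, 3, 3],
               [7, 9, 3, 5],
               [2, 4, 7, 6]])

sub_view = x2[:2, :2]         # slicing returns a view
sub_copy = x2[:2, :2].copy()  # an explicit, independent copy

print(np.shares_memory(x2, sub_view))  # True
print(np.shares_memory(x2, sub_copy))  # False
print(sub_view.base is x2)             # True: the view refers back to x2

# By contrast, slicing a Python list always produces a new list.
lst = [1, 2, 3]
lst_slice = lst[:2]
lst_slice[0] = 99
print(lst)  # [1, 2, 3]
```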

Creating copies of arrays
Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the copy() method:

In [11]: x2_sub_copy = x2[:2, :2].copy()

In [12]: print(x2_sub_copy)
[[99 0]
[ 7 9]]

# If we now modify this subarray, the original array is not touched:
In [13]: x2_sub_copy[0, 0] = 42

In [14]: print(x2_sub_copy)
[[42 0]
[ 7 9]]

In [15]: print(x2)
[[99 0 3 3]
[ 7 9 3 5]
[ 2 4 7 6]]

Reshaping of Arrays
Another useful type of operation is reshaping of arrays. The most flexible way of doing this is with the reshape() method. For example, if you want to put the numbers 1 through 9 in a 3×3 grid, you can do the following:

In [16]: grid = np.arange(1, 10).reshape((3, 3))

In [17]: print(grid)
[[1 2 3]
[4 5 6]
[7 8 9]]

Note that for this to work, the size of the initial array must match the size of the reshaped array. Where possible, the reshape method will use a no-copy view of the initial array, but with noncontiguous memory buffers this is not always the case.
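Both caveats can be observed in a short experiment. In this sketch, reshaping a contiguous array yields a view, reshaping its transpose forces a copy, and a size mismatch raises ValueError.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))

# Contiguous data: reshape can reuse the same buffer (a view).
view = a.reshape((3, 2))
print(np.shares_memory(a, view))    # True

# The transpose is not C-contiguous, so flattening it needs a copy.
copied = a.T.reshape(6)
print(np.shares_memory(a, copied))  # False

# A size mismatch raises ValueError.
try:
    a.reshape((4, 2))
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("reshape failed:", err)
```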

Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix. You can do this with the reshape method, or more easily by making use of the newaxis keyword within a slice operation:

In [2]: import numpy as np

In [3]: x = np.array([1, 2, 3])

In [4]: x[np.newaxis, :] # row vector via newaxis
Out[4]: array([[1, 2, 3]])

In [5]: x.reshape((3, 1)) # column vector via reshape
Out[5]:
array([[1],
[2],
[3]])

In [6]: x[:, np.newaxis]
Out[6]:
array([[1],
[2],
[3]])
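For reference, np.newaxis is simply an alias for None, and the resulting shapes can be checked directly. The sketch also shows reshape with -1, which lets NumPy infer one dimension from the array size.

```python
import numpy as np

x = np.array([1, 2, 3])

print(np.newaxis is None)      # True: newaxis is an alias for None

row = x[np.newaxis, :]
col = x[:, np.newaxis]
print(row.shape, col.shape)    # (1, 3) (3, 1)

# reshape with -1 lets NumPy infer that dimension from the array size.
col2 = x.reshape((-1, 1))
print(np.array_equal(col, col2))  # True
```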

Array Concatenation and Splitting
All of the preceding routines worked on single arrays. It’s also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays. We’ll take a look at those operations here.

Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished through the routines np.concatenate, np.vstack, and np.hstack. np.concatenate takes a tuple or list of arrays as its first argument, as we can see here:

In [7]: x = np.array([1, 2, 3])
In [8]: y = np.array([3, 2, 1])
In [10]: np.concatenate([x, y])
Out[10]: array([1, 2, 3, 3, 2, 1])

# You can also concatenate more than two arrays at once:
In [11]: z = [99, 99, 99]
In [12]: np.concatenate([x, y, z])
Out[12]: array([ 1, 2, 3, 3, 2, 1, 99, 99, 99])

# np.concatenate can also be used for two-dimensional arrays:
In [13]: grid = np.array([[1, 2, 3], [4, 5, 6]])

In [14]: np.concatenate([grid, grid])
Out[14]:
array([[1, 2, 3],
[4, 5, 6],
[1, 2, 3],
[4, 5, 6]])

In [15]: np.concatenate([grid, grid], axis=1) # concatenate along the second axis (zero-indexed)
Out[15]:
array([[1, 2, 3, 1, 2, 3],
[4, 5, 6, 4, 5, 6]])

For working with arrays of mixed dimensions, it can be clearer to use the np.vstack (vertical stack) and np.hstack (horizontal stack) functions:

In [16]: x = np.array([1, 2, 3])

In [17]: grid = np.array([[9, 8, 7], [6, 5, 4]])

In [18]: np.vstack([x, grid]) # vertically stack the arrays
Out[18]:
array([[1, 2, 3],
[9, 8, 7],
[6, 5, 4]])

In [19]: y = np.array([[99], [99]])

In [20]: np.hstack([grid, y]) # horizontally stack the arrays
Out[20]:
array([[ 9, 8, 7, 99],
[ 6, 5, 4, 99]])

Similarly, np.dstack will stack arrays along the third axis.
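A minimal sketch of np.dstack on two 2×3 arrays; the arrays a and b are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[9, 8, 7], [6, 5, 4]])

stacked = np.dstack([a, b])
print(stacked.shape)   # (2, 3, 2)
print(stacked[0, 0])   # [1 9]: the first element of a paired with that of b
```

Each input becomes one layer along the new third axis, so stacked[:, :, 0] recovers a and stacked[:, :, 1] recovers b.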

Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions np.split, np.hsplit, and np.vsplit. For each of these, we can pass a list of indices giving the split points:

In [1]: x = [1, 2, 3, 99, 99, 3, 2, 1]
In [4]: x1, x2, x3 = np.split(x, [3, 5]) # split points 3 and 5 give x[:3], x[3:5], x[5:]

In [5]: print(x1, x2, x3)
[1 2 3] [99 99] [3 2 1]

Notice that N split points lead to N + 1 subarrays. The related functions np.hsplit and np.vsplit are similar:

In [6]: grid = np.arange(16).reshape((4, 4))

In [7]: grid
Out[7]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])

In [8]: upper, lower = np.vsplit(grid, [2]) # upper = rows 0-1, lower = rows 2-3

In [9]: print(upper)
[[0 1 2 3]
[4 5 6 7]]

In [10]: print(lower)
[[ 8 9 10 11]
[12 13 14 15]]

In [11]: left, right = np.hsplit(grid, [2]) # Horizontal split

In [12]: print(left)
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]

In [13]: print(right)
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]

Similarly, np.dsplit will split arrays along the third axis.
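The sketch below gathers a few variations on splitting: a list of split points, an integer number of equal sections, np.array_split for uneven division, and np.dsplit on a three-dimensional array. The array names are illustrative.

```python
import numpy as np

x = np.arange(8)

# Two split points give three subarrays.
parts = np.split(x, [3, 5])
print(len(parts))  # 3

# An integer asks for that many equal sections; it must divide the length.
halves = np.split(x, 2)
print(halves)

# np.array_split tolerates uneven division.
uneven = np.array_split(x, 3)
print([len(p) for p in uneven])  # [3, 3, 2]

# np.dsplit splits along the third axis.
cube = np.arange(16).reshape((2, 2, 4))
front, back = np.dsplit(cube, [1])
print(front.shape, back.shape)  # (2, 2, 1) (2, 2, 3)
```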

Friday, July 20, 2018
