Outlook

In this chapter, we are going to learn mathematical data manipulations and plotting using NumPy and matplotlib in depth. As we already have seen, NumPy is the core mathematical tool for numerical computing with Python. We first cover how to handle mathematical objects using NumPy. And later, we learn how to produce good looking plots of those mathematical data sets using Matplotlib – another core plotting tool with Python.

Reading Materials

The materials of this chapter in part have been adopted and modified from:

Array manipulations in NumPy

In scientific computing, numerical arrays are essential to hold a sequence of numbers. For example, an array can contain:

  • values of an experiment/simulation at discrete space and time steps,
  • signal recorded by a measurement device, e.g. sound wave,
  • pixels of an image, grey-level or colour,
  • 3D data measured at different X-Y-Z positions, e.g. MRI scan, etc.

Wait a minute... we already have something like this, i.e., Python lists! Unfortunately, Python lists are not suitable for numerical computing. Let’s first see why.

Python lists vs. NumPy arrays

Python has lists as a built-in data type, e.g.:

>>> x = [1., 2., 3.]

defines a list that contains 3 real numbers and might be viewed as a vector. However, you cannot easily do arithmetic on such lists the way you can with vectors in Matlab, e.g. 2*x does not give [2., 4., 6.] as you might hope:

>>> 2*x
[1.0, 2.0, 3.0, 1.0, 2.0, 3.0]

instead it doubles the length of x, and x+x would give the same thing, which is not what you want in mathematical operations.

Two-dimensional arrays are also a bit clumsy in Python, as they have to be specified as a list of lists, e.g., a 3x2 array with the elements 11,12 in the first row, 21,22 in the second row, 31,32 in the third row would be specified by:

>>> A = [[11, 12], [21, 22], [31, 32]]

Note that indexing always starts with 0 in Python, so we find for example that:

>>> A[0]
[11, 12]

>>> A[1]
[21, 22]

>>> A[1][0]
21

Here A[0] refers to the 0-index element of A, which is itself a list [11, 12]. and A[1][0] can be understood as the 0-index element of A[1] = [21, 22].

You cannot work with A as a matrix, for example to multiply it by a vector, except by writing code that loops over the elements explicitly.

NumPy was developed to make it easy to do the sorts of things we want to do with matrices and vectors, and more generally n-dimensional arrays of real numbers.

For example:

>>> import numpy as np
>>> x = np.array([1., 2., 3.])
>>> x
array([ 1.,  2.,  3.])

>>> 2*x
array([ 2.,  4.,  6.])

>>> np.sqrt(x)
array([ 1.        ,  1.41421356,  1.73205081])

We see that we can multiply by a scalar or take component-wise square roots.

You may find it ugly to have to start NumPy command with np., as necessary here since we imported NumPy as np. Instead you could do:

>>> from numpy import *

and then just use sqrt, for example, and you will get the NumPy version. But in these notes and many Python examples you’ll see, the module is explicitly listed so it is clear where a function is coming from.

Performance comparison

NumPy arrays are more memory efficeint than the usual Python lists in handling fast numerical operations. To see the different behavior, we can use timeit module which gives a simple way to time small bits of Python code. For a more full description, see https://docs.python.org/2/library/timeit.html.

For instance, timing an operation on a regular Python’s list:

>>> import timeit
>>> setup = '''
... L=range(1000)
... '''
>>> timeit.timeit('[i**2 for i in L]',setup=setup)
51.10083198547363

Using the IPython shell, one can use the convenient %timeit special function instead:

In [22]: L = range(1000)
In [23]: %timeit [i**2 for i in L]
10000 loops, best of 3: 50.8 µs per loop

Notice that this is little bit different from the previous output from a standard Python interpreter, and gives a performance time per loop.

On the contrary, timimg an operation on a NumPy’s array of the same size:

>>> import timeit  # not needed if imported already
>>> setup = '''
... import numpy as np
... a = np.arange(1000)
... '''
>>> timeit.timeit('a**2',setup=setup)
0.8447329998016357

As before, if using the IPython shell:

In [20]: a = arange(1000)
In [21]: %timeit a**2
The slowest run took 8940.42 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 861 ns per loop

Again, this is a time it took per loop.

Creating NumPy data arrays

We looked at a brief example of initializing a NumPy array in the above, where individual items are given explicitly as inputs. In practice, however, we would want to avoid adding items one by one. We can instead use as follows.

  • Evenly spaced values:

    >>> import numpy as np
    >>> a = np.arange(10) # np.arange(n) provides indices from 0 to n-1
    >>> a
    array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> b = np.arange(1., 9., 2) # syntax : start, end, step
    >>> b
    array([ 1.,  3.,  5.,  7.])
    

    or by specifying the number of points:

    >>> c = np.linspace(0, 1, 6)
    >>> c
    array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])
    >>> d = np.linspace(0, 1, 5, endpoint=False)
    >>> d
    array([ 0. ,  0.2,  0.4,  0.6,  0.8])
    
  • To specifically initialize a real type array, you can do:

    >>> a = np.arange(10, dtype=np.float) # same as a = np.arange(10, dtype=np.float64)
    

    Otherwise, the default setup is integer array:

    >>> a = np.arange(10) # same as a = np.arange(10, dtype = np.int)
    
  • Constructors for common arrays:

    >>> a = np.ones((3,3))
    >>> a
    array([[ 1.,  1.,  1.],
           [ 1.,  1.,  1.],
           [ 1.,  1.,  1.]])
    >>> a.dtype
    dtype('float64')
    >>> b = np.ones(5, dtype=np.int)
    >>> b
    array([1, 1, 1, 1, 1])
    >>> c = np.zeros((2,2))
    >>> c
    array([[ 0.,  0.],
           [ 0.,  0.]])
    >>> d = np.eye(3)
    >>> d
    array([[ 1.,  0.,  0.],
           [ 0.,  1.,  0.],
           [ 0.,  0.,  1.]])
    

NumPy arrays indexing

The items of an array can be accessed the same way as other Python sequences (list, tuple):

>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = a[0], a[2], a[-1]
>>> b
(0, 2, 9)
>>> type(b)
tuple

For a multidimensional array a, indexes are tuples of integers, that is, for a 3D array, a[(i,j,k)], or simply a[i,j,k]:

>>> a = np.diag(np.arange(5))
>>> a
array([[0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 2, 0, 0],
       [0, 0, 0, 3, 0],
       [0, 0, 0, 0, 4]])
>>> a[1,1]
1
>>> a[2,1] = 10 # third line (or row), second column
>>> a
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  0,  0,  0],
       [ 0, 10,  2,  0,  0],
       [ 0,  0,  0,  3,  0],
       [ 0,  0,  0,  0,  4]])
>>> a[2]   # this is the same as a[2,:]
array([ 0, 10,  2,  0,  0])

Note that:

  • In 2D, the first dimension corresponds to lines (i.e., rows), the second to columns.
  • for an array a with more than one dimension, a[i] is interpreted as a[i,:].

NumPy array slicing

Like indexing, it’s similar to Python sequences slicing:

>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> a[2:9:3] # [start:end:step]
array([2, 5, 8])

Note that the last index is not included!:

>>> a[:4]
array([0, 1, 2, 3])

start:end:step is a slice object which represents the set of indexes range(start, end, step). A slice can be explicitly created:

>>> sl = slice(1, 9, 2)
>>> a = np.arange(10)
>>> b = 2*a + 1
>>> a, b
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19]))
>>> a[sl], b[sl]
(array([1, 3, 5, 7]), array([ 3,  7, 11, 15]))

All three slice components are not required: by default, start is 0, end is the last and step is 1:

>>> a[1:3]
array([1, 2])
>>> a[::2]
array([0, 2, 4, 6, 8])
>>> a[3:]
array([3, 4, 5, 6, 7, 8, 9])
  • What is the difference between a[0:1] and a[0::1]?

Of course, it works with multidimensional arrays:

>>> a = np.eye(5)
>>> a
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])
>>> a[2:4,:3] #3rd and 4th lines, 3 first columns
array([[ 0.,  0.,  1.],
       [ 0.,  0.,  0.]])

All elements specified by a slice can be easily modified:

>>> a[:3,:3] = 4
>>> a
array([[ 4.,  4.,  4.,  0.,  0.],
       [ 4.,  4.,  4.,  0.,  0.],
       [ 4.,  4.,  4.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

A small illustrated summary of NumPy indexing and slicing:

../../_images/numpy_indexing.png

Figure 1: NumPy array indexing.

You can also combine assignment and slicing:

>>> a = np.arange(10)
>>> a[5:] = 10
>>> a
array([ 0,  1,  2,  3,  4, 10, 10, 10, 10, 10])
>>> b = np.arange(5)
>>> a[5:] = b[::-1]
>>> a
array([0, 1, 2, 3, 4, 4, 3, 2, 1, 0])

Copies and views

A slicing operation creates a view on the original array, which is just a way of accessing array data. Thus the original array is not copied in memory. You can use np.may_share_memory() to check if two arrays share the same memory block. Note, however, that this uses heuristics and may give you false positives.

When modifying the view, the original array is modified as well:

>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = a[::2]
>>> b
array([0, 2, 4, 6, 8])
>>> np.may_share_memory(a, b)
True
>>> b[0] = 12
>>> b
array([12,  2,  4,  6,  8])
>>> a   # (!)
array([12,  1,  2,  3,  4,  5,  6,  7,  8,  9])

>>> id(a[0])
4491149720
>>> id(b[0])
4491149720

>>> id(a)
4584290864
>>> id(b)
4584290304

>>> a = np.arange(10)
>>> c = a[::2].copy()  # force a copy
>>> c[0] = 12
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> np.may_share_memory(a, c)
False

Exercise: Extend the above example and use for n in range(len(b)): to write a short for-loop that checks if the memory addresses of elements of b are the same as those of a[::2]. Repeat the same to check c and a.

Component-wise operations

One thing to watch out for if you are used to Matlab notations: In Matlab some operations (such as sqrt, sin, cos, exp, etc) can be applied to vectors or matrices and will automatically be applied component-wise. Other operations like * and / (multiplication and division) attempt to do things in terms of linear algebra, and so in Matlab, A*B gives the matrix product and only makes sense if the number of columns of A agrees with the number of rows of B. If you want a component-wise product of A and B you must use .* instead, that is, with a period . before the *.

In NumPy, * and / are applied component-wise, like any other operation. To get a matrix-product you must use dot:

>>> A = np.array([[1,2], [3,4]])
>>> B = np.array([[5,0], [0,7]])
>>> A
array([[1, 2],
       [3, 4]])
>>> B
array([[5, 0],
       [0, 7]])

>>> A*B          # this will not give a matrix-product, but a component-wise product
array([[ 5,  0],
       [ 0, 28]])

>>> np.dot(A,B)  # this is the expected matrix multiplication!
array([[ 5, 14],
       [15, 28]])

>>> np.dot(B,A) # this should be different from np.dot(A,B)!
array([[ 5, 10],
       [21, 28]])

Many other linear algebra tools can be found in NumPy. For example, to solve a linear system Ax = b using Gaussian Elimination, we can do:

>>> A
array([[1, 2],
       [3, 4]])

>>> b = np.array([2,3])
>>> x = np.linalg.solve(A,b)

>>> x
array([-1. ,  1.5])

To find the eigenvalues and eigenvectors of A:

>>> evals, evecs = np.linalg.eig(A)

>>> evals
array([-0.37228132,  5.37228132])

>>> evecs
array([[-0.82456484, -0.41597356],
       [ 0.56576746, -0.90937671]])

Note: You may be tempted to use the variable name lambda for the eigenvalues of a matrix, but this isn’t allowed in Python because lambda is a keyword of the language, see Lambda functions.

Basic reductions of NumPy arrays

To compute sums:

>>> x = np.array([1, 2, 3, 4])
>>> np.sum(x)
10
>>> x.sum()
10

Sum by rows and by columns:

>>> x = np.array([[1, 2], [3, 4]])
>>> x
array([[1, 2],
       [3, 4]])
>>> x.sum(axis=0)   # columns (first dimension)
array([4, 6])
>>> x[:, 0].sum(), x[:, 1].sum()
(4, 6)
>>> x.sum(axis=1)   # rows (second dimension)
array([3, 7])
>>> x[0, :].sum(), x[1, :].sum()
(3, 7)

Same idea in higher dimensions:

>>> x = np.array([ [[1,2],[3,4]], [[5,6],[7,8]] ] )
>>> x
array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])
>>> x.sum(axis = 0)
array([[ 6,  8],
       [10, 12]])
>>> x.sum(axis = 1)
array([[ 4,  6],
       [12, 14]])
>>> x.sum(axis = 2)
array([[ 3,  7],
       [11, 15]])

To obtain extrema:

>>> x = np.array([1, 3, 2])
>>> x.min()
1
>>> x.max()
3

>>> x.argmin()  # index of minimum
0
>>> x.argmax()  # index of maximum
1

Logical operations:

>>> np.all([True, True, False])
False
>>> np.any([True, True, False])
True

A few examples in statistics:

>>> x = np.array([1, 2, 3, 1])
>>> y = np.array([[1, 2, 3], [5, 6, 1]])
>>> x.mean()
1.75
>>> np.median(x)
1.5
>>> np.median(y, axis=-1) # last axis
array([ 2.,  5.])

>>> x.std()          # full population standard dev.
0.82915619758884995

Loading data files

Text files

Consider an example: populations.txt:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
# year	hare	lynx	carrot
1900	30e3	4e3	48300
1901	47.2e3	6.1e3	48200
1902	70.2e3	9.8e3	41500
1903	77.4e3	35.2e3	38200
1904	36.3e3	59.4e3	40600
1905	20.6e3	41.7e3	39800
1906	18.1e3	19e3	38600
1907	21.4e3	13e3	42300
1908	22e3	8.3e3	44500
1909	25.4e3	9.1e3	42100
1910	27.1e3	7.4e3	46000
1911	40.3e3	8e3	46800
1912	57e3	12.3e3	43800
1913	76.6e3	19.5e3	40900
1914	52.3e3	45.7e3	39400
1915	19.5e3	51.1e3	39000
1916	11.2e3	29.7e3	36700
1917	7.6e3	15.8e3	41800
1918	14.6e3	9.7e3	43300
1919	16.2e3	10.1e3	41300
1920	24.7e3	8.6e3	47300

To load populations.txt:

>>> data = np.loadtxt('./codes/populations.txt')
>>> data
array([[  1900.,  30000.,   4000.,  48300.],
       [  1901.,  47200.,   6100.,  48200.],
       [  1902.,  70200.,   9800.,  41500.],
...

To save it to another name, pop2.txt:

>>> np.savetxt('pop2.txt', data)
>>> data2 = np.loadtxt('pop2.txt')

Other well-known file formats

Other types of file formats that Python can open and read include:

  • NetCDF: scipy.io.netcdf_file, netcdf4-python, ...
  • Matlab: scipy.io.loadmat, scipy.io.savemat
  • IDL: scipy.io.readsav