Scientific Computing with Python using NumPy#

A Tufts University Data Lab Workshop
Written by Uku-Kaspar Uustalu

Website: tuftsdatalab.github.io/intro-numpy
Contact: datalab-support@elist.tufts.edu
Resources: datalab.tufts.edu

Last updated: 2021-11-15

Introduction to NumPy#

NumPy is the primary scientific computing library in Python that adds support for fast and efficient handling of large multi-dimensional arrays (matrices). It resembles MATLAB both in its functionality and design, making it a popular open-source alternative to the proprietary numerical computing software. However, there are some key differences between NumPy and MATLAB that this tutorial will attempt to outline.

The primary building block of NumPy is the numpy.ndarray, commonly referred to as a NumPy array or np.array after the np.array() function used to create it. It is a homogeneous multi-dimensional (n-dimensional) array that is fixed in size. (Note that this is somewhat different from MATLAB, which defaults to 2D matrices.) The NumPy np.array differs from a regular Python list in three key ways:

All elements in a np.array must be the same datatype. (For example, only integers or only floating-point numbers, but never both.)
The datatype of an np.array or its elements cannot be changed in-place. To do so, a new array must be created and elements have to be copied over and re-cast into a different datatype.
The size of a np.array is fixed - new elements cannot be added nor can elements be removed. However, elements can be overwritten as long as you keep the same datatype.
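
The three differences above can be seen in a short experiment. This is only a minimal sketch; the variable names (mixed, ints, floats, longer) are made up for illustration.

```python
import numpy as np

# Mixing integers and floats: NumPy converts everything to one common dtype.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)              # float64

# Writing a float into an integer array keeps the integer dtype,
# so the fractional part is dropped.
ints = np.array([1, 2, 3])
ints[0] = 7.9
print(ints)                     # [7 2 3]

# Changing the datatype produces a new array; the original is untouched.
floats = ints.astype(float)
print(ints.dtype, floats.dtype)

# "Appending" also builds a new array rather than growing the old one.
longer = np.append(ints, 4)
print(ints.size, longer.size)   # 3 4
```

Note that np.append() and .astype() both return new arrays, which is why repeatedly appending to a NumPy array inside a loop tends to be slow.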

Note that Python also contains a type of array called array.array which supports only one dimension and hence is very different from the NumPy np.array.

Importing NumPy#

To start working with NumPy in Python, we first need to import it. Because all NumPy functions and objects live inside the library itself, the name of the library has to be typed out every time one of them is used. (This applies to most libraries in Python.) Hence, it is common to import numpy under the alias np to avoid having to type out the full numpy over and over again.

import numpy as np

NumPy Arrays#

The easiest way to create a NumPy np.array is to use the np.array() constructor with an existing Python list or tuple.

np.array([1, 2, 3])

array([1, 2, 3])

To create a multi-dimensional np.array, we can simply use a nested list or tuple.

np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

np.array((((1, 2), (3, 4)), ((5, 6), (7, 8)), ((9, 10), (11, 12))))

array([[[ 1,  2],
        [ 3,  4]],

       [[ 5,  6],
        [ 7,  8]],

       [[ 9, 10],
        [11, 12]]])

Note that the np.array() constructor expects a single list or tuple as the first argument. A common error, especially for MATLAB users, is to call np.array() with multiple arguments (separate array elements) instead of a single list or tuple containing all the elements of the desired array.

# this will result in an error
np.array(1, 2, 3)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 2
      1 # this will result in an error
----> 2 np.array(1, 2, 3)

TypeError: array() takes from 1 to 2 positional arguments but 3 were given

# instead you should pass a *single* list as the first argument
np.array([1, 2, 3])

If desired, we can use the dtype argument to set the datatype of the np.array upon creation.

np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=str)

np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=complex)

Array Attributes#

NumPy arrays have multiple useful attributes that give us information about the array. Here are some of the most common and useful ones.

.size gives you the number of elements in the array
.shape gives you the dimensions (size in each dimension) of the array
.ndim gives you the number of dimensions (dimensionality) of the array
.dtype gives you the datatype of the array

Note that in MATLAB size gives you the dimensions of an array while in NumPy it gives you the number of elements. Hence, MATLAB users should take care to use shape to get the dimensions of an array instead of size.

For additional NumPy array attributes, check out the documentation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html

a = np.array([[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
              [[13, 14, 15, 16], [17, 18, 19, 20], [21, 22, 23, 24]]])
print(a)

# number of elements
a.size

# dimensions
a.shape

# number of dimensions
a.ndim

# datatype
a.dtype

Other ways of Creating Arrays#

Often we might want to create an empty array with desired dimensions or initialize an array with a common value. NumPy contains a lot of helpful constructors for that.

np.zeros() creates an array and fills it with zeros
np.ones() creates an array and fills it with ones
np.full() creates an array and sets all elements to a specified value

For a one-dimensional array, you just pass the number of elements to these constructors.

np.zeros(3)

np.ones(4)

np.full(5, fill_value=7)

If we would like a multi-dimensional array instead, we simply pass a tuple or list with the dimensions.

np.zeros((2, 3))

np.ones((2, 3, 4))

np.full([3, 2], fill_value=7)

Note how the order of the dimensions in the tuple is from highest to lowest. This order of dimensions applies across all of NumPy.

(2, 3) creates an array with two rows and three elements in each row
(2, 3, 4) creates an array with two panes with each pane consisting of three rows, each of which has four elements
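
One way to convince yourself of this ordering is to peel off one dimension at a time and look at the shape that remains. The sketch below uses a made-up variable name, panes.

```python
import numpy as np

panes = np.zeros((2, 3, 4))

print(panes.shape)        # (2, 3, 4): two panes
print(panes[0].shape)     # (3, 4): one pane has three rows
print(panes[0, 0].shape)  # (4,): one row has four elements
print(len(panes))         # 2: len() reports the size of the first dimension
```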

Also note how both np.zeros() and np.ones() create an array with floating-point elements by default. If we want an integer array, we have to specify that using the dtype argument.

np.zeros((2, 3), dtype=int)

np.ones([2, 3, 4], dtype=int)

There also exists a constructor np.empty() that creates an uninitialized array of specified dimensions. This means that it allocates space in the computer memory for this array, but does not change the contents of said memory. The result is not an array one would usually consider empty. Instead you get an array filled with random garbage, also commonly referred to as dead squirrels in computing jargon. Basically, an array created with np.empty() contains the numeric representation of whatever was in memory before the creation of the array.

The reason one would use np.empty() would be to quickly create an array that will be completely filled with totally new values later on. It should definitely not be used in cases where an empty array filled with zeros is desired as a starting point for a simulation or to iteratively compute array element values based on neighboring values. If you would like an array filled with zeros (that most would consider empty), be sure to use np.zeros() instead.

np.empty([5, 6])
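
As a rough illustration of the intended use, the sketch below allocates an array with np.empty() and then overwrites every single element before reading any of them. The name squares is hypothetical.

```python
import numpy as np

n = 5
squares = np.empty(n)       # contents are arbitrary at this point

for i in range(n):
    squares[i] = i ** 2     # every element gets overwritten

print(squares)              # [ 0.  1.  4.  9. 16.]
```

If even one element were left unassigned, it would still hold whatever happened to be in memory, which is why np.zeros() is the safer default.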

To create a sequence of elements, use np.arange() or np.linspace().

np.arange() works similarly to the built-in Python range() function and should be used for integer sequences
np.linspace() takes a start and end point and the number of elements desired (instead of a step) and is more suitable for use with floating-point numbers

np.arange(10)

np.arange(-10, 10, 2)

Note how np.arange() uses a half-open interval [start, stop) just like the normal range() function, meaning that the stop is excluded from the output.

np.arange(11)

np.arange(-10, 11, 2)

While possible, it is not recommended to use np.arange() with floating-point numbers as it is difficult to predict the final number of elements. (This is due to the somewhat imprecise way floating-point numbers are stored in computer memory.) Hence, it is recommended you use np.linspace() when working with floating-point numbers. (Note that np.linspace() can also be used with integers.)

np.linspace(0, np.pi, 20)
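
The difference can be sketched by building the same sequence both ways. With a floating-point step, the number of elements np.arange() produces comes from a rounded division, so the stop value may or may not end up in the result. With np.linspace() the count is given explicitly. The names stepped and spaced are only for illustration.

```python
import numpy as np

# Length depends on how (stop - start) / step happens to round.
stepped = np.arange(1, 1.3, 0.1)
print(len(stepped), stepped)

# Length is given directly and both endpoints are included.
spaced = np.linspace(1, 1.3, 4)
print(len(spaced), spaced)
```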

There are other more niche constructors in NumPy. For example, you can use np.eye() to create a unit matrix.

np.eye(5)

Creating Random Arrays#

Note that although np.empty() gives you an array with random garbage, the values themselves might not be random at all. To create an array with properly (pseudo-)random values, we should use constructors from np.random.

np.random.random() gives you an array with random elements uniformly distributed in a half-open range of [0.0, 1.0)
np.random.normal() gives you an array with normally distributed random elements and allows you to specify the mean and standard deviation of the normal distribution

np.random.random((3, 4))

np.random.normal(0, 1, (3, 4))  # mean of zero and a standard deviation of one

Re-run the blocks above and note how the random numbers change. While this is useful in some situations, it is detrimental to reproducibility. Often we would like to create some random values once and then use those same randomly generated values throughout our research or project. Because of this, you should always set a random seed when working with random variables. This way, the randomness is determined by the seed and you will always get the same random numbers when using the same seed, allowing you to share your work and reproduce your results. A random seed can be any non-negative integer.

np.random.seed(42)
np.random.random((3, 4))

Now re-run the block above and note how the random numbers remain the same across different runs.
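
The same effect can be checked within a single block by resetting the seed between two draws. This is a small sketch; the names first, second and third are made up.

```python
import numpy as np

np.random.seed(42)
first = np.random.random((3, 4))

np.random.seed(42)                     # reset to the same seed
second = np.random.random((3, 4))

print(np.array_equal(first, second))   # True

# Without resetting the seed, the next draw continues the sequence.
third = np.random.random((3, 4))
print(np.array_equal(first, third))    # False
```

Newer NumPy code often creates a separate generator with np.random.default_rng(seed) instead of seeding the global state, but the idea is the same.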

Basic Operations#

All arithmetic operators on NumPy arrays apply element-wise.

a = np.array([[ 1,  2,  3],
              [ 4,  5,  6],
              [ 7,  8,  9]])

b = np.array([[10, 11, 12],
              [13, 14, 15],
              [16, 17, 18]])

a + 1

b - 1

a + b

a - b

a * 2

a * b

b / 2

b / a

b // a

Note how the * operator does element-wise multiplication in NumPy while in MATLAB it does matrix multiplication.

To do matrix multiplication in NumPy, you must use the .dot() method or the @ operator.

np.dot(a, b)

a.dot(b)

a @ b
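
With small matrices the difference between the two kinds of multiplication is easy to verify by hand. The sketch below uses two made-up 2x2 arrays, left and right, so as not to overwrite a and b.

```python
import numpy as np

left = np.array([[1, 2],
                 [3, 4]])
right = np.array([[5, 6],
                  [7, 8]])

print(left * right)   # element-wise: [[5, 12], [21, 32]]
print(left @ right)   # matrix product: [[19, 22], [43, 50]]
```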

Logical operators also apply element-wise in NumPy and return a boolean array.

a > b

a < b

a == 2

# true if element is even
a % 2 == 0

Broadcasting#

Note how we could easily do both arithmetic and logical operations between NumPy arrays and scalar values (single numbers). This is because NumPy does something called broadcasting, which allows you to do arithmetic operations between arrays and scalars and also between arrays of different but compatible dimensions. Basically, NumPy behaves as if it copied over the single value or either of the arrays as many times as needed in order to match up the dimensions.

You can learn more about broadcasting here: https://numpy.org/doc/stable/user/basics.broadcasting.html

a = np.array([[1, 1, 1],
              [1, 1, 1],
              [1, 1, 1]])

b = np.array([2, 3, 4])

c = np.array([[2],
              [3],
              [4]])

a * b

a * c

b * c
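
To see what broadcasting does in each of these cases, it helps to look at the shapes involved. The following sketch uses made-up names (grid, row, col) and also shows what happens when the shapes are not compatible.

```python
import numpy as np

grid = np.ones((3, 3), dtype=int)
row = np.array([2, 3, 4])      # shape (3,)
col = row.reshape(3, 1)        # shape (3, 1)

print((grid * row).shape)      # (3, 3): row is repeated for every row of grid
print((grid * col).shape)      # (3, 3): col is repeated for every column of grid
print(row * col)               # (3,) with (3, 1) gives a 3x3 table of products

# Shapes that cannot be matched up raise an error.
try:
    grid * np.array([1, 2])
except ValueError as err:
    print("not broadcastable:", err)
```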

Modifying Arrays In-Place#

NumPy also allows the use of in-place operators like += and *= that should be familiar to C/C++ programmers. These allow you to modify the values of the array in-place without returning a new array.

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

a += 1

print(a)

a *= 2

print(a)
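
The difference between an in-place operator and regular arithmetic becomes visible when a second name refers to the same array. The sketch below (with made-up names values, alias and counts) also shows that an in-place operation cannot change the datatype of the array.

```python
import numpy as np

values = np.array([1, 2, 3])
alias = values            # second name for the same array

values += 1               # modifies the existing array
print(alias)              # [2 3 4]

values = values + 1       # builds a new array and rebinds the name
print(alias)              # still [2 3 4]
print(values)             # [3 4 5]

# Adding a float in place to an integer array is refused.
counts = np.array([1, 2, 3])
try:
    counts += 0.5
except TypeError as err:
    print("cannot add a float in place:", err)
```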

Universal Functions#

NumPy also contains numerous functions for common mathematical operations. These operate element-wise and also follow the broadcasting behavior described above. In NumPy documentation they are called universal functions and also commonly referred to as ufuncs.

You can find a list of all available universal functions here: https://numpy.org/doc/stable/reference/ufuncs.html

a = np.array([[ 1,  2,  3],
              [ 4,  5,  6],
              [ 7,  8,  9]])

b = np.array([[10, 11, 12],
              [13, 14, 15],
              [16, 17, 18]])

np.add(a, b)

np.add(a, 1)

np.exp(a)

np.sqrt(a)

np.sin(b)

Aggregation Functions#

NumPy also contains some handy aggregation functions that either operate on the whole array or along a specified axis of the array.

a = np.array([[1, 2],
              [5, 3],
              [4, 6]])

a.max()

a.min()

a.sum()

To run these aggregations along a certain axis, you have to specify the axis number using the axis named argument. The axis numbers range from 0 to ndim-1 and follow the order of the dimensions in .shape, with the first dimension having the number 0 and the last dimension having the number ndim-1. However, most of the time you will be dealing with two-dimensional arrays, in which case it is good to just keep in mind the following.

axis=0 performs the operation across rows and results in a single output value for each column
axis=1 performs the operation across columns and results in a single output value for each row

a.max(axis=0)

a.max(axis=1)
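
A quick way to remember this is that the axis you aggregate over disappears from the shape of the result. The sketch below repeats the 3x2 array from above under the made-up name table.

```python
import numpy as np

table = np.array([[1, 2],
                  [5, 3],
                  [4, 6]])      # shape (3, 2)

print(table.sum(axis=0))        # [10 11]: one value per column
print(table.sum(axis=1))        # [ 3  8 10]: one value per row
print(table.sum(axis=0).shape)  # (2,)
print(table.sum(axis=1).shape)  # (3,)
```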

Indexing and Slicing#

You can select elements or ranges of elements from NumPy arrays as you would from a built-in Python list. If you are an avid MATLAB user, just keep in mind these three key differences:

Python uses zero-based indexing, meaning that the first element of an array (or list) is at position zero.
Square brackets [ ] are the indexing operator in Python.
Negative indices count from the end, meaning that the last element of an array (or list) is at position [-1].

a = np.array([[ 1,  2,  3,  4],
              [ 5,  6,  7,  8],
              [ 9, 10, 11, 12],
              [13, 14, 15, 16],
              [17, 18, 19, 20]])
print(a)

a[0]

a[1]

a[1:3]

Remember that when using [start:end] to slice in Python, the end index is exclusive, meaning that the element at index end is not included in the slice.

You can also use [start:end:step] with NumPy arrays. (Remember that omitting the start index means slice from beginning and omitting the end index means slice until end.)

a[::2]

It is good to know that using -1 as the step when slicing reverses the selection.

a[0][::-1]

a[::-1]

To access elements from multi-dimensional arrays, we can use chained indexing.

a[-1][0]

a[0][1:3]

However, chained indexing has its limitations. For example, slicing a multi-dimensional array also returns a multi-dimensional array, often leading to confusion when using chained indexing.

You can also index multi-dimensional NumPy arrays by including multiple comma-separated indices or ranges in the [ ] indexing operator, one for each dimension. It is recommended to use this approach as opposed to chained indexing. Note that the order of dimensions is again from highest to lowest.
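
Here is a small sketch of the kind of confusion chained indexing can cause. It rebuilds the same 5x4 values under the made-up name grid and tries to get the first column of the second and third rows.

```python
import numpy as np

grid = np.arange(1, 21).reshape(5, 4)   # same values as the 5x4 array above

# Chained: grid[1:3] is still two-dimensional, so [0] picks its first row.
print(grid[1:3][0])    # [5 6 7 8]

# Comma-separated: rows 1 and 2, column 0.
print(grid[1:3, 0])    # [5 9]
```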

# print out the matrix again for reference
print(a)

# second element in first row
a[0, 1]

# first element in second row
a[1, 0]

# entire second row
a[1, :]

# entire second column
a[:, 1]

# the last element from the second and third rows
a[1:3, -1]

# the middle 3x2 selection
a[1:4, 1:3]

# the upper-right-most 4x3 selection
a[:4, 1:]

You can use indexing to change single elements and slicing to change entire selections.

a[0, 0] = 0
print(a)

a[3:, 2:] = 0
print(a)

Boolean Indexing#

To select elements from a NumPy array you can also use a boolean array with the same exact dimensions as the array you are trying to select elements from. Every element where the corresponding element in the supplied boolean array is True gets selected. This is also known as boolean indexing.

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

b = np.array([[True,  False, False],
              [False, True,  False],
              [False, False, True]])

a[b]

However, most of the time it is unrealistic to manually create a boolean array for indexing.

Luckily we know from before that NumPy applies logical operators element-wise, resulting in a boolean array. We can use this to easily select elements from an array based on a desired condition.

# get boolean array that is true if element is even
a % 2 == 0

# extract all even elements
a[a % 2 == 0]

# extract all odd elements
a[a % 2 != 0]

# extract all elements larger than the mean
a[a > a.mean()]

Iterating#

Iterating over multi-dimensional NumPy arrays is done with respect to the highest dimension (first axis).

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

for row in a:
    print(row)

To iterate over each element of a multi-dimensional array, one may use nested loops.

for row in a:
    for element in row:
        print(element)

However, nested loops are often inefficient and could easily lead to confusion and unmaintainable code. Hence, it is recommended to avoid nested loops if possible.

Luckily for you, NumPy includes functionality for easily iterating over all objects in a NumPy array. For example, you could use the .flat attribute.

for element in a.flat:
    print(element)

Note that .flat returns an iterator. Basically, that is just something that tells Python how to iterate over all the elements of the array using a for loop. It does not actually return a flattened one-dimensional version of the original array. To do that, we can use the .flatten() method.

a.flatten()

Row-Major vs Column-Major#

Note the output of .flatten() or using .flat with a for loop. This tells us something about the way NumPy arrays are stored in computer memory. As you can see, by default two-dimensional NumPy arrays are stored in memory row by row. When converting from a two-dimensional array to a one-dimensional array, we first get all the elements from the first row (in order), then all the elements from the second row and so on.

In computational jargon this is called row-major order. Row-major order is the default in the C and C++ programming languages and also in NumPy, but not in MATLAB or Fortran.

MATLAB and Fortran store the elements of a two-dimensional matrix in memory column by column. This is called column-major order.

Advanced MATLAB or Fortran users need to keep this difference in mind when using NumPy. However, when column-major (MATLAB-esque) behavior is desired, the default can be overridden using the order argument. This optional named argument is present in many NumPy functions that rely on the actual representation of the array in memory (like the functions for flattening an array).

order='F' results in Fortran-like column-major behavior
order='C' results in C-like row-major behavior (which is also the default)

The Wikipedia article on this provides a good overview if you are interested in learning more: https://en.wikipedia.org/wiki/Row-_and_column-major_order

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

a.flatten()

a.flatten(order='F')

Advanced: Mapping#

NumPy also provides functionality similar to the map() function in Python to apply a function to every element in a NumPy array. However, it is somewhat less straightforward. Instead of providing an interface to apply a scalar function to every element like the Python map() function does, NumPy provides us with np.vectorize() that takes a scalar function and converts it to a new vectorized function that works with NumPy arrays.

# function that adds 42 to a number
def add_42(x):
    return x + 42

# vectorized version of function above
vectorized_add_42 = np.vectorize(add_42)

# now we can apply the function on a whole NumPy array
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

vectorized_add_42(a)

To avoid having to define redundant functions for simple operations, we can also use lambda with np.vectorize().

np.vectorize(lambda x : x + 42)(a)
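
Keep in mind that the NumPy documentation describes np.vectorize() as a convenience that is essentially a loop under the hood, so it is not a way to speed things up. When the operation can be written with regular array arithmetic or universal functions, that is usually the better choice. A minimal comparison, using a made-up array called values:

```python
import numpy as np

def add_42(x):
    return x + 42

values = np.array([[1, 2, 3],
                   [4, 5, 6]])

via_vectorize = np.vectorize(add_42)(values)
via_arithmetic = values + 42

print(np.array_equal(via_vectorize, via_arithmetic))   # True
```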

Copy vs View#

Let’s say we have the following matrix a.

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

And let’s say we want to extract the bottom-right-most 2x2 elements from this matrix as a separate matrix b. The first thing that comes to mind would be to just use slicing with the assignment operator =.

b = a[1:, 1:]

print(b)

Now let’s modify the upper-left-most element of b.

b[0, 0] = 0

print(b)

After playing around with b and modifying its values we want to go back to a and take a look at the original values again.

print(a)

The values in our original matrix a have also changed!

That is because many NumPy operations return a view of the original array instead of a copy. This is computationally more efficient and allows NumPy to perform fast operations even on really large and complex arrays because the data is never copied over in computer memory. Instead we are shown the same array stored in memory using a slightly different view (you can think of it as a window) that perhaps blocks out some elements and changes the order of others. In particular, basic indexing and slicing (using integers and start:end:step ranges) always result in a view of the original array, never a copy. Indexing with a boolean array or an array of integers, on the other hand, returns a copy.

However, this is not how MATLAB handles things. In MATLAB, most operations result in a copy of the original array, allowing you to modify the outputs of various operations without having to worry about changing the original data. Hence, avid MATLAB users must keep in mind that this is not the case in NumPy to avoid unintentionally overwriting data.

It is also crucial to note that this behavior of returning a view is not universal in NumPy. While most operations return a view, some might return a copy. Furthermore, in some cases the same function or operation might sometimes return a view and other times return a copy, depending on the input and on what is possible or most efficient at the time. (For example, .reshape() returns a view whenever it can and a copy otherwise.) Hence, you should always read the documentation of a function or method to know for sure whether it returns a view or a copy in your particular use case.

However, when using NumPy, it is safe to assume that everything returns a view unless explicitly asked otherwise. To ensure you are working with a copy in NumPy, use the .copy() method.

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

b = a[1:, 1:].copy()

print(a)

print(b)

This time, b is actually referring to a whole new array that is separate from a. Previously, when we did not call the .copy() method, b was just a view onto a specific section of a.

b[0, 0] = 0

print(b)

print(a)
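
If you are ever unsure whether two arrays share the same data, one option is to ask NumPy directly with np.shares_memory(). The sketch below uses made-up names (original, window, duplicate, selected) and also shows that boolean indexing, unlike slicing, hands back a copy.

```python
import numpy as np

original = np.arange(1, 10).reshape(3, 3)

window = original[1:, 1:]             # basic slicing
duplicate = original[1:, 1:].copy()   # explicit copy
selected = original[original > 5]     # boolean indexing

print(np.shares_memory(original, window))      # True
print(np.shares_memory(original, duplicate))   # False
print(np.shares_memory(original, selected))    # False
```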

Shape Manipulation#

NumPy makes it really easy to manipulate the shape of an array. Note that these manipulations return a different view of the same array whenever possible and usually do not create a new array or change data in computer memory.

a = np.array([[ 1,  2,  3,  4,  5,  6],
              [ 7,  8,  9, 10, 11, 12],
              [13, 14, 15, 16, 17, 18]])

a.reshape(2, 9)

a.reshape(6, 3)

Note that reshaping is not the same as transposing. To transpose a two-dimensional array, we can use the .T attribute.

a.T

The .reshape() method can be easily combined with np.arange() or np.linspace().

a = np.arange(24).reshape(6,4)
print(a)

If we do not care about a particular axis/dimension, we can have NumPy infer one of the dimensions by denoting it with -1 in .reshape().

a.reshape(2, -1)

a.reshape(-1, 3)
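
The inferred dimension is simply the total number of elements divided by the product of the dimensions that were given, so it only works when that division comes out even. A short sketch with a made-up array called data:

```python
import numpy as np

data = np.arange(24)

print(data.reshape(2, -1).shape)      # (2, 12)
print(data.reshape(-1, 3).shape)      # (8, 3)
print(data.reshape(2, 3, -1).shape)   # (2, 3, 4)

# 24 elements cannot be split into 5 equal rows.
try:
    data.reshape(5, -1)
except ValueError as err:
    print("cannot reshape:", err)
```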

Bonus: File I/O with NumPy#

Once you are more comfortable with Python and NumPy arrays, it is highly likely that at some point you will have the need or desire to do one of the following:

Work with a preexisting dataset or matrix or save your results for future use
Export your results for publication or to share them with a friend or colleague
Conduct some parts of the analysis using a different program (like MATLAB)

All of these situations involve writing a NumPy array to a file and/or reading a matrix from a file and saving it as a NumPy array. Luckily for you, NumPy has built-in functionality to accommodate this. Here is an overview of all the file I/O functions available in NumPy: https://numpy.org/doc/stable/reference/routines.io.html

Writing a NumPy Array to a File#

Let’s say we have an array a that we would like to export to a file for some reason.

a = np.random.random((5,5))

print(a)

One option would be to use np.save() which saves the array to a binary .npy file.

np.save('array1', a)

When you open the File Browser from the left-hand menu, you should now see array1.npy along with some other files.

If you downloaded this notebook and are running it on a local instance of Jupyter, the array1.npy file is now saved into the same folder containing this notebook. You can use your system file browser (Explorer or Finder) to locate the file and take a look.

If you are running this notebook using Binder or Google Colab, the array1.npy file is temporarily stored on the server running the notebook. You can view the file only using the built-in file browser accessible via the left-hand menu. Also note that this file along with any other files you might create will be deleted from the server after you close the notebook. You can download any files you would like to save on your computer by right-clicking on the file in the left-hand browser and then selecting Download.

Note that if you try opening array1.npy, it will not work. That is because the file is in binary format, meaning it is not human-readable and can only be deciphered by NumPy. Saving NumPy arrays in binary format is a good option if you care about speed and efficiency and are only planning on using NumPy to work with the data.

However, in many cases you might actually want to be able to see the contents of the file and use it with other programs like MATLAB. In that case, it makes much more sense to save the NumPy array as a human-readable text file. This can be done using np.savetxt().

np.savetxt('array2.txt', a)

Because array2.txt is a human-readable text file, you can take a look at it by opening it with a text editor or double-clicking on it in the built-in file explorer on the left. Note how by default, the values are separated by spaces and the numbers are formatted using scientific notation.

To change the separator between the array items in the outputted text file, we can use the delimiter argument. For example, to produce a CSV (Comma-Separated Values) file, we can specify delimiter = ",".

To change the formatting of the values themselves, we can use the fmt argument along with a format string. Format strings are quite complex and can be very confusing to beginners. Assuming you will only be working with floating point numbers, here is a simple formula: "%.[precision][f|e]"

precision is the number of digits after the decimal point
f stands for floating-point notation
e stands for scientific notation

For example "%.9f" stands for floating-point notation with nine decimal places and "%.18e" stands for scientific notation with 18 digits after the decimal point (this is the default).
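
These format strings follow the same rules as the % string formatting operator in plain Python, so you can try them out on a single number before using them with np.savetxt(). The number below is arbitrary.

```python
value = 1234.56789

print("%.2f" % value)   # 1234.57
print("%.2e" % value)   # 1.23e+03
print("%.9f" % value)   # 1234.567890000
```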

If you would like to tweak the formatting even more, you can generate more complex format strings following this specification: https://docs.python.org/3/library/string.html#format-specification-mini-language

np.savetxt('array3.csv', a, fmt='%.12f', delimiter=',')

Now take a look at array3.csv and see how setting the delimiter and formatting string have changed the appearance of the output.

Reading a NumPy Array from a File#

To read a binary .npy file into a NumPy array, we can use np.load().

b = np.load('array1.npy')

To read data from a text file into a NumPy array, we can use either np.loadtxt() or np.genfromtxt().

np.loadtxt() is an older function and provides very basic functionality
np.genfromtxt() is a more flexible function that is more customizable and can handle missing values

Hence it is recommended you use np.genfromtxt() as a default. When using either function, you have to specify the delimiter argument if using anything other than whitespace.

A detailed guide on importing data with np.genfromtxt(): https://numpy.org/doc/stable/user/basics.io.genfromtxt.html

c = np.loadtxt('array2.txt')

d = np.genfromtxt('array3.csv', delimiter=',')

An important thing to note when saving floating-point arrays to text files is loss of significance. Because we can only store a set number of significant digits in the text file, it is possible that the number of significant digits will be reduced when writing data to a file, introducing round-off errors and causing precision loss.

Note that this is not the case when using the binary .npy format.

a == b

When writing to a text file using the default setting of scientific notation with 18 digits after the decimal point, precision loss does not occur under normal circumstances. However, note that this is dependent on the datatype of your array.

a == c

However, when specifying the number of decimal points or significant digits, or exporting with floating-point notation, precision loss is commonplace and very likely to occur.

a == d
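
Because a == b returns a whole boolean array, it can be more convenient to summarize the comparison with np.array_equal() (exact match) or np.allclose() (match within a tolerance). The sketch below does a round trip through text without touching any files by writing to an in-memory buffer instead. The helper round_trip and the array data are made up for illustration.

```python
import io
import numpy as np

data = np.array([[0.123456789, 1 / 3],
                 [2 / 3, np.pi]])

def round_trip(array, fmt):
    """Write array as text using fmt and read it back."""
    buffer = io.StringIO()
    np.savetxt(buffer, array, fmt=fmt)
    buffer.seek(0)
    return np.loadtxt(buffer)

full = round_trip(data, "%.18e")    # the np.savetxt() default format
short = round_trip(data, "%.3f")    # three decimal places only

print(np.array_equal(data, full))            # True
print(np.array_equal(data, short))           # False
print(np.allclose(data, short, atol=1e-3))   # True
```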

Advanced: File I/O With Python#

But what exactly happens when we use np.genfromtxt() to read data from a file? We can get a high-level overview of the mechanisms that take place in the background when we try to recreate the functionality using standard Python.

First, we have to open the file in order to be able to read data from it.

file = open('array3.csv')

Now we have a file object called file that gives us access to array3.csv. Using .readlines() with a file object, we can read all the lines from a file into a list.

lines = file.readlines()

lines

Now we have a list called lines, where each element is a line from the file array3.csv. Note that some cleaning needs to be done as these lines still contain whitespace characters like newlines.

cleaned_lines = []
for line in lines:
    line = line.strip()
    cleaned_lines.append(line)

cleaned_lines

The next step would be to convert each line to a list by splitting the string on the separator. This will lead to a list of lists, which is already quite similar to a two-dimensional NumPy array.

lists = []
for line in cleaned_lines:
    lst = line.split(',')
    lists.append(lst)

lists

Note how all the elements still have the type of str, meaning they are text, not numbers. Luckily there is an easy fix for that.

type(lists[0][0])

float_lists = []
for lst in lists:
    flst = []
    for element in lst:
        element = float(element)
        flst.append(element)
    float_lists.append(flst)

float_lists

type(float_lists[0][0])

Now we can use this list of lists to create a NumPy array.

e = np.array(float_lists)

We can confirm that we got the same result as we would have gotten using np.genfromtxt() by comparing it to the array d from before.

e == d

Finally we have to remember to close the file. This is very important to avoid any potential file corruption.

file.close()

Forgetting to close the file could lead to various issues and have serious consequences. Hence, it is commonplace to use open() in conjunction with a with statement. Any code executed within the block defined by the with statement has access to the file and any code outside of the block does not. This reduces the potential for errors and does not require you to manually close the connection to the file.

Also note how our previous processing involved looping over basically the same list numerous times. We can simplify this a little by looping over indices instead.

with open('array3.csv') as f:
    lines = f.readlines()

lines

for i in range(len(lines)):
    lines[i] = lines[i].strip().split(',')
    for j in range(len(lines[i])):
        lines[i][j] = float(lines[i][j])

lines

arr = np.array(lines)

arr

We can confirm that the result is indeed the same as before.

arr == e

Note that you can condense this even more by using map() with lambda and remembering that np.array() has a dtype argument.

with open('array3.csv') as f:
    arr2 = np.array(list(map(lambda x : x.strip().split(','), f.readlines())), dtype=float)

arr2

arr == arr2

However, as you can see, that already looks quite complicated and confusing. Plus, it is kind of ridiculous and completely unnecessary. Of course the easiest and most compact option would be to use np.genfromtxt() and that is what you should be using when attempting to read data from a text file into a NumPy array. As the saying goes, there is no point in reinventing the wheel.

However, if you ever feel the need (or desire) to read a file line by line using Python, remember that a combination of with, open() and .readlines() is the easiest option.
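
A file object can also be looped over directly, one line at a time, which pairs nicely with a list comprehension. The following is a self-contained sketch that first writes a tiny CSV file to a temporary folder. The file name example.csv and its contents are made up for illustration.

```python
import os
import tempfile
import numpy as np

# Write a small CSV file so the example does not depend on earlier cells.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.csv")   # hypothetical file name
with open(path, "w") as f:
    f.write("1.5,2.5,3.5\n4.5,5.5,6.5\n")

# Loop over the file object directly instead of calling .readlines().
with open(path) as f:
    rows = [line.strip().split(",") for line in f]

from_python = np.array(rows, dtype=float)
from_numpy = np.genfromtxt(path, delimiter=",")

print(np.array_equal(from_python, from_numpy))   # True
```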

Quick Overview of Matplotlib#

Matplotlib is the primary plotting library in Python and it is designed to resemble the plotting functionalities of MATLAB. While it provides all kinds of different plotting functionality, the matplotlib.pyplot module is used the most. It is common to import this module under the alias plt.

import matplotlib.pyplot as plt

Matplotlib works in a layered fashion. First you define your plot using plt.plot(x, y, ...), then you can use additional plt methods to add more layers to your plot or modify its appearance. Finally, you use plt.show() to show the plot or plt.savefig() to save it to an external file. Let’s see how Matplotlib works in practice by creating some trigonometric plots.

x = np.linspace(0, 2*np.pi, num=20)
y = np.sin(x)

plt.plot(x, y)
plt.show()
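Saving works much the same way as showing. The sketch below writes the same sine plot to a PNG file using plt.savefig(). The output file name sine.png and the temporary folder are made up for illustration, and the Agg backend is selected only so that the snippet also runs without a display; in a notebook you would normally leave that line out.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')   # assumption: non-interactive backend, needed only without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2*np.pi, num=20)
y = np.sin(x)

# hypothetical output location inside a temporary folder
out_path = os.path.join(tempfile.mkdtemp(), 'sine.png')

plt.plot(x, y)
plt.savefig(out_path)   # the file format is inferred from the .png extension
plt.close()             # discard the figure once it has been saved
```

If you want to both save and show a figure, it is usually safest to call plt.savefig() before plt.show(), as the figure may no longer be available after it has been shown.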

plt.plot() takes additional arguments that modify the appearance of the plot. See the documentation for details: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html

# we can specify the style of the plot using named arguments
plt.plot(x, y, color='red', linestyle='--', marker='o')
plt.show()

# or we could use a shorthand string
plt.plot(x, y, 'r--o')
plt.show()

We can easily add additional layers and stylistic elements to the plot.

plt.plot(x, y, 'r--o')
plt.plot(x, np.cos(x), 'b-*')
plt.title('Sin and Cos')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['sin', 'cos'])
plt.show()

Note that if we only supply one array as an input to plt.plot(), it uses the values of the array as y values and uses the indices of the array as x values.

plt.plot([2, 3, 6, 4, 8, 9, 5, 7, 1])
plt.show()
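One way to check this is to inspect the line object that plt.plot() returns. The following sketch assumes the Agg backend purely so that it can run without a display and does not show the plot.

```python
import matplotlib
matplotlib.use('Agg')   # assumption: non-interactive backend, needed only without a display
import matplotlib.pyplot as plt
import numpy as np

values = [2, 3, 6, 4, 8, 9, 5, 7, 1]

# plt.plot() returns a list of line objects, here with a single element
line, = plt.plot(values)

x_used = line.get_xdata()   # the x values Matplotlib filled in for us
y_used = line.get_ydata()   # the y values we supplied
plt.close()

print(x_used)
print(y_used)
```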

If we want to create a figure with several subplots, we can use plt.subplots() to create a grid of subplots. It takes the dimensions of the subplot grid as input, as in plt.subplots(rows, columns), and returns two objects. The first is a figure object and the second is a NumPy array containing the subplots. In Matplotlib, subplots are often called axes.

# create a more fine-grained array to work with
a = np.linspace(0, 2*np.pi, num=100)

# create a two-by-two grid for our subplots
fig, ax = plt.subplots(2, 2)

# create subplots
ax[0, 0].plot(a, np.sin(a))     # upper-left
ax[0, 1].plot(a, np.cos(a))     # upper-right
ax[1, 0].plot(a, np.tan(a))     # bottom-left
ax[1, 1].plot(a, -a)            # bottom-right

# show figure
plt.show()
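Because the subplots come back as a regular NumPy array, they can also be looped over. The sketch below is one possible way to do so using the flat attribute of the array, which visits the subplots in row-major order. The list of functions and the use of their names as titles are choices made for this illustration, and the Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use('Agg')   # assumption: non-interactive backend, needed only without a display
import matplotlib.pyplot as plt
import numpy as np

a = np.linspace(0, 2*np.pi, num=100)
fig, ax = plt.subplots(2, 2)

grid_type = type(ax)    # numpy.ndarray
grid_shape = ax.shape   # (2, 2)

# one function per subplot, listed in row-major order
functions = [np.sin, np.cos, np.tan, np.negative]

for axis, func in zip(ax.flat, functions):
    axis.plot(a, func(a))
    axis.set_title(func.__name__)

titles = [axis.get_title() for axis in ax.flat]
plt.close(fig)

print(grid_type, grid_shape, titles)
```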

A more MATLAB-esque way of creating subplots would be to use the alternative plt.subplot() method. Using this method, you can define a subplot using a three-number combination plt.subplot(rows, columns, index). The indexes of the subplots defined using this method increase in row-major order and, in true MATLAB fashion, begin with one.

plt.subplot(2, 2, 1)    # upper-left
plt.plot(a, np.sin(a))
plt.subplot(2, 2, 2)    # upper-right
plt.plot(a, np.cos(a))
plt.subplot(2, 2, 3)    # bottom-left
plt.plot(a, np.tan(a))
plt.subplot(2, 2, 4)    # bottom-right
plt.plot(a, -a)
plt.show()
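To relate the two approaches, the short sketch below converts the one-based index used by plt.subplot() into the zero-based row and column position that would be used to index the array returned by plt.subplots(). It is plain arithmetic and does not call Matplotlib at all; the names rows, columns and positions are made up for this illustration.

```python
rows, columns = 2, 2

positions = {}
for index in range(1, rows * columns + 1):
    # shift to zero-based, then split into (row, column) in row-major order
    row, column = divmod(index - 1, columns)
    positions[index] = (row, column)

print(positions)
```

For the two-by-two grid above, index 3 corresponds to row 1 and column 0, which is the bottom-left subplot.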

Additional Resources#

This notebook only introduced the core components of NumPy and Matplotlib and did not include any hands-on exercises. If you would also like to learn about some slightly more advanced aspects of NumPy and try your hand at some exercises involving NumPy and Matplotlib, check out the University of Helsinki Data Analysis with Python MOOC. Feel free to go through all of the content to get acquainted with all things Python and data analysis, but if you want to focus solely on NumPy and Matplotlib, check out these sections:

Basic NumPy: https://csmastersuh.github.io/data_analysis_with_python_2020/numpy.html
Advanced NumPy: https://csmastersuh.github.io/data_analysis_with_python_2020/numpy2.html
Matplotlib: https://csmastersuh.github.io/data_analysis_with_python_2020/matplotlib.html

Furthermore, the official NumPy documentation contains numerous tutorials and quickstart guides designed for users of different backgrounds:

Quickstart Tutorial: https://numpy.org/doc/stable/user/quickstart.html
NumPy Basics for Absolute Beginners: https://numpy.org/doc/stable/user/absolute_beginners.html
NumPy for MATLAB Users: https://numpy.org/doc/stable/user/numpy-for-matlab-users.html

The Matplotlib official documentation also contains multiple useful tutorials: https://matplotlib.org/tutorials/index.html