{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "vnvI1DEiVAe7" }, "source": [ "# Scientific Computing with Python using NumPy\n", "\n", "---\n", "\n", "**A Tufts University Data Lab Workshop**\\\n", "Written by Uku-Kaspar Uustalu\n", "\n", "Website: [tuftsdatalab.github.io/intro-numpy](https://tuftsdatalab.github.io/intro-numpy/)\\\n", "Contact: \\\n", "Resources: [datalab.tufts.edu](https://sites.tufts.edu/datalab/)\n", "\n", "Last updated: `2021-11-15`\n", "\n", "---" ] }, { "cell_type": "markdown", "metadata": { "id": "xBrY4uSiVAe8" }, "source": [ "## Introduction to NumPy\n", "\n", "**NumPy** is the primary scientific computing library in Python that adds support for fast and efficient handling of large multi-dimensional arrays (matrices). It resembles MATLAB both in its functionality and design, making it a popular open-source alternative to the proprietary numerical computing software. However, there are some key differences between NumPy and MATLAB that this tutorial will attempt to outline.\n", "\n", "The primary building block of NumPy is the `numpy.ndarray`, commonly referred to by the alias `numpy.array` or `np.array`. It is a homogeneous multi-dimensional (*n-dimensional*) array that is fixed in size. (Note that this is somewhat different from MATLAB, which defaults to 2D matrices.) The NumPy `np.array` differs from a regular Python `list` in three key ways:\n", "\n", "1. All elements in a `np.array` must be of the same datatype. (For example, only integers or only floating-point numbers, but never both.)\n", "2. The datatype of an `np.array` or its elements cannot be changed in-place. To do so, a new array must be created and elements have to be copied over and re-cast into a different datatype.\n", "3. The size of a `np.array` is fixed - new elements cannot be added nor can elements be removed. However, elements can be overwritten as long as you keep the same datatype.\n", "\n", "*Note that Python also contains a type of array called `array.array` which supports only one dimension and hence is very different from the NumPy `np.array`.*" ] }, { "cell_type": "markdown", "metadata": { "id": "W-w9YXunVAe8" }, "source": [ "---\n", "\n", "### Importing NumPy\n", "\n", "To start working with NumPy in Python, we first need to import it. As all NumPy functions and objects live within the library's namespace, you need to prefix them with the library name every time you use them. (This applies to most libraries in Python.) Hence, it is common to import NumPy under the alias `np` to avoid having to type out the full `numpy` over and over again." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yuZ66730VAe9" }, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "id": "orSBGzKwVAe9" }, "source": [ "---\n", "\n", "## NumPy Arrays\n", "The easiest way to create a NumPy `np.array` is to use the `np.array()` constructor with an existing Python `list` or `tuple`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4Deftn1oVAe9" }, "outputs": [], "source": [ "np.array([1, 2, 3])" ] }, { "cell_type": "markdown", "metadata": { "id": "GWo18pf2VAe9" }, "source": [ "To create a multi-dimensional `np.array`, we can simply use a nested `list` or `tuple`."
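] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we look at multiple dimensions, here is a minimal sketch verifying the homogeneity rules listed above: mixing integers and floats upcasts everything to floating-point, and assigning a float into an integer array silently truncates the value." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# mixing int and float upcasts the whole array to float\n", "a = np.array([1, 2, 3.5])\n", "print(a, a.dtype)\n", "\n", "# assigning a float into an integer array truncates the value\n", "b = np.array([1, 2, 3])\n", "b[0] = 9.7\n", "print(b, b.dtype)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that in mind, let's get back to creating multi-dimensional arrays from nested sequences."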
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mXwE3UEbVAe9" }, "outputs": [], "source": [ "np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Y1u5BcaKVAe-" }, "outputs": [], "source": [ "np.array((((1, 2), (3, 4)), ((5, 6), (7, 8)), ((9, 10), (11, 12))))" ] }, { "cell_type": "markdown", "metadata": { "id": "BiI3ri0nVAe-" }, "source": [ "Note that the `np.array()` constructor expects a ***single*** `list` or `tuple` as the first argument. A common error, especially for MATLAB users, is to call `np.array()` with multiple arguments (separate array elements) instead of a ***single*** `list` or `tuple` containing ***all*** the elements of the desired array." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NjBo9PvuVAe-" }, "outputs": [], "source": [ "# this will result in an error\n", "np.array(1, 2, 3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3X_QsmIEVAe-" }, "outputs": [], "source": [ "# instead you should pass a *single* list as the first argument\n", "np.array([1, 2, 3])" ] }, { "cell_type": "markdown", "metadata": { "id": "6k2a3bIdVAe-" }, "source": [ "If desired, we can use the `dtype` argument to set the datatype of the `np.array` upon creation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GjNVqFGxVAe-" }, "outputs": [], "source": [ "np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8tXMYgqTVAe-" }, "outputs": [], "source": [ "np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=str)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LG--dcV5VAe-" }, "outputs": [], "source": [ "np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=complex)" ] }, { "cell_type": "markdown", "metadata": { "id": "6gqvYI24VAe-" }, "source": [ "---\n", "\n", "### Array Attributes\n", "\n", "NumPy arrays have multiple useful attributes that give us information about the array. Here are some of the most common and useful ones.\n", "\n", "- `.size` gives you the **number of elements** in the array\n", "- `.shape` gives you the **dimensions** (size in each dimension) of the array\n", "- `.ndim` gives you the **number of dimensions** (dimensionality) of the array\n", "- `.dtype` gives you the **datatype** of the array\n", "\n", "Note that in MATLAB `size` gives you the **dimensions** of an array while in NumPy it gives you the **number of elements**. 
Hence, MATLAB users should take care to use `shape` to get the **dimensions** of an array instead of `size`.\n", "\n", "*For additional NumPy array attributes, check out the documentation: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "igM8u6ZsVAe-" }, "outputs": [], "source": [ "a = np.array([[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],\n", " [[13, 14, 15, 16], [17, 18, 19, 20], [21, 22, 23, 24]]])\n", "print(a)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gdgN7OJpVAe-" }, "outputs": [], "source": [ "# number of elements\n", "a.size" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zT7gIFD2VAe-" }, "outputs": [], "source": [ "# dimensions\n", "a.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hazGu-YUVAe-" }, "outputs": [], "source": [ "# number of dimensions\n", "a.ndim" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mhvlKekBVAe_" }, "outputs": [], "source": [ "# datatype\n", "a.dtype" ] }, { "cell_type": "markdown", "metadata": { "id": "j7IaMNyCVAe_" }, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": { "id": "K18Bm7XHVAe_" }, "source": [ "### Other ways of Creating Arrays\n", "\n", "Often we might want to create an empty array with desired dimensions or initialize an array with a common value. NumPy contains a lot of helpful constructors for that.\n", "\n", "- `np.zeros()` creates an array and fills it with zeros\n", "- `np.ones()` creates an array and fills it with ones\n", "- `np.full()` creates an array and sets all elements to a specified value\n", "\n", "For a one-dimensional array, you just pass the number of elements to these constructors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wfQFXdn_VAe_" }, "outputs": [], "source": [ "np.zeros(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tio-rHkGVAe_" }, "outputs": [], "source": [ "np.ones(4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Y7Q5HdqzVAe_" }, "outputs": [], "source": [ "np.full(5, fill_value=7)" ] }, { "cell_type": "markdown", "metadata": { "id": "z9jTgA5oVAe_" }, "source": [ "If we would like a multi-dimensional array instead, we simply pass a `tuple` or `list` with the dimensions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "70w0Pou9VAe_" }, "outputs": [], "source": [ "np.zeros((2, 3))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dZ4Hup1xVAe_" }, "outputs": [], "source": [ "np.ones((2, 3, 4))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nKH_wpGlVAe_" }, "outputs": [], "source": [ "np.full([3, 2], fill_value=7)" ] }, { "cell_type": "markdown", "metadata": { "id": "62kSlog8VAe_" }, "source": [ "Note how the order of the dimensions in the tuple is from highest to lowest. This order of dimensions applies across all of NumPy.\n", "- `(2, 3)` creates an array with two rows and three elements in each row\n", "- `(2, 3, 4)` creates an array with two panes with each pane consisting of three rows, each of which have four elements\n", "\n", "Also note how both `np.zeros()` and `np.ones()` create an array with floating-point elements by default. If we want an integer array, we have to specify that using the `dtype` argument." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vIbW4Hp9VAe_" }, "outputs": [], "source": [ "np.zeros((2, 3), dtype=int)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f8kknK9_VAe_" }, "outputs": [], "source": [ "np.ones([2, 3, 4], dtype=int)" ] }, { "cell_type": "markdown", "metadata": { "id": "aSpo37DzVAfA" }, "source": [ "There also exists a constructor `np.empty()` that creates an ***uninitialized*** array of specified dimensions. This means that it allocates space in the computer memory for this array, but does not change the contents of said memory. The result is not an array one would usually consider *empty*. Instead you get an array filled with random garbage, also commonly referred to as *dead squirrels* in computing jargon. Basically, an array created with `np.empty()` contains the numeric representation of **whatever was in memory before the creation of the array**.\n", "\n", "The reason one would use `np.empty()` would be to quickly create an array that will be **completely filled** with totally new values later on. It should definitely not be used in cases where an empty array filled with zeros is desired as a starting point for a simulation or to iteratively compute array element values based on neighboring values. If you would like an array filled with zeros (that most would consider *empty*), be sure to use `np.zeros()` instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8XcKGJh7VAfA" }, "outputs": [], "source": [ "np.empty([5, 6])" ] }, { "cell_type": "markdown", "metadata": { "id": "QP-VR8m1VAfA" }, "source": [ "To create a sequence of elements, use `np.arange()` or `np.linspace()`.\n", "\n", "- `np.arange()` works similarly to the built-in Python `range()` function and should be used for **integer** sequences\n", "- `np.linspace()` takes a start and end point and the ***number*** of elements desired (instead of a step) and is more suitable for use with **floating-point** numbers" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kThdf8TbVAfB" }, "outputs": [], "source": [ "np.arange(10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-90p7LDEVAfB" }, "outputs": [], "source": [ "np.arange(-10, 10, 2)" ] }, { "cell_type": "markdown", "metadata": { "id": "GG6QfoQiVAfE" }, "source": [ "Note how `np.arange()` uses a half-open interval *`[start, stop)`* just like the normal `range()` function, meaning that the *`stop`* is excluded from the output." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pfqto7L7VAfE" }, "outputs": [], "source": [ "np.arange(11)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Cbvs0PqaVAfE" }, "outputs": [], "source": [ "np.arange(-10, 11, 2)" ] }, { "cell_type": "markdown", "metadata": { "id": "Kv0YS5PTVAfE" }, "source": [ "While possible, it is not recommended to use `np.arange()` with **floating-point** numbers as it is difficult to predict the final number of elements. (This is due to the *somewhat imprecise* way floating-point numbers are stored in computer memory.) Hence, it is recommended you use `np.linspace()` when working with floating-point numbers. (Note that `np.linspace()` can also be used with integers.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "cE8ajzrCVAfE" }, "outputs": [], "source": [ "np.linspace(0, np.pi, 20)" ] }, { "cell_type": "markdown", "metadata": { "id": "i6YYl1uBVAfE" }, "source": [ "There are other more niche constructors in NumPy. For example, you can use `np.eye()` to create an identity matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "b2WvbhygVAfE" }, "outputs": [], "source": [ "np.eye(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "nLxO5xH2VAfE" }, "source": [ "---\n", "\n", "### Creating Random Arrays\n", "\n", "Note that although `np.empty()` gives you an array with *random garbage*, the values themselves might not be **random** at all. To create an array with truly random values, we should use constructors from `np.random`.\n", "\n", "- `np.random.random()` gives you an array with random elements uniformly distributed in a half-open range of *`[0.0, 1.0)`*\n", "- `np.random.normal()` allows you to specify the mean and standard deviation of the normal distribution the random elements are drawn from" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "awZMn3bHVAfE" }, "outputs": [], "source": [ "np.random.random((3, 4))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xRQ4UXl-VAfE" }, "outputs": [], "source": [ "np.random.normal(0, 1, (3, 4)) # mean of zero and a standard deviation of one" ] }, { "cell_type": "markdown", "metadata": { "id": "eUGpapjPVAfE" }, "source": [ "Re-run the blocks above and note how the random numbers change. While this is useful in some situations, it is detrimental to reproducibility. Often we would like to create some random values ***once*** and then use those same randomly generated values throughout our research or project. Because of this, you should always set a **random seed** when working with random variables. This way, the randomness is determined by the seed and you will always get the same random numbers when using the same seed, allowing you to share your work and reproduce your results. A random seed could be any *integer*." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mwVuBsarVAfE" }, "outputs": [], "source": [ "np.random.seed(42)\n", "np.random.random((3, 4))" ] }, { "cell_type": "markdown", "metadata": { "id": "NX5Z4RdTVAfE" }, "source": [ "Now re-run the block above and note how the random numbers remain the same across different runs." ] }, { "cell_type": "markdown", "metadata": { "id": "WJilDCK4VAfE" }, "source": [ "---\n", "\n", "## Basic Operations\n", "\n", "All arithmetic operators on NumPy arrays apply ***element-wise***."
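] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Element-wise operations require the two arrays to have matching (or, as we will see shortly, *broadcastable*) shapes. Here is a quick sketch of what happens otherwise." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# element-wise operations need compatible shapes\n", "# mismatched shapes raise an error\n", "try:\n", "    np.array([1, 2, 3]) + np.array([1, 2])\n", "except ValueError as error:\n", "    print(error)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's define two matrices to experiment with."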
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "crplyWxzVAfE" }, "outputs": [], "source": [ "a = np.array([[ 1, 2, 3],\n", " [ 4, 5, 6],\n", " [ 7, 8, 9]])\n", "\n", "b = np.array([[10, 11, 12],\n", " [13, 14, 15],\n", " [16, 17, 18]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rorki2Q7VAfE" }, "outputs": [], "source": [ "a + 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "J8qpYMtVVAfE" }, "outputs": [], "source": [ "b - 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "U376jYx6VAfE" }, "outputs": [], "source": [ "a + b" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "roe-tfVZVAfE" }, "outputs": [], "source": [ "a - b" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AoidY74lVAfF" }, "outputs": [], "source": [ "a * 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8O9ndCEcVAfF" }, "outputs": [], "source": [ "a * b" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DH_EpvWGVAfF" }, "outputs": [], "source": [ "b / 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "i-enIJiHVAfF" }, "outputs": [], "source": [ "b / a" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gl9943oWVAfF" }, "outputs": [], "source": [ "b // a" ] }, { "cell_type": "markdown", "metadata": { "id": "IVbMv-RvVAfF" }, "source": [ "Note how the `*` operator does ***element-wise*** multiplication in NumPy while in MATLAB it does matrix multiplication.\n", "\n", "To do ***matrix multiplication*** in NumPy, you must use the `.dot()` method or the `@` operator." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "He6n0QCCVAfF" }, "outputs": [], "source": [ "np.dot(a, b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "c1bii6J_VAfF" }, "outputs": [], "source": [ "a.dot(b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "t4bg5gtcVAfF" }, "outputs": [], "source": [ "a @ b" ] }, { "cell_type": "markdown", "metadata": { "id": "R73qUUA8VAfF" }, "source": [ "Logical operators also apply ***element-wise*** in NumPy and return a *boolean* array." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pPweq0c0VAfF" }, "outputs": [], "source": [ "a > b" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "V9vjKCOjVAfF" }, "outputs": [], "source": [ "a < b" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f_SpoM0jVAfF" }, "outputs": [], "source": [ "a == 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "G1Giw2JbVAfF" }, "outputs": [], "source": [ "# true if element is even\n", "a % 2 == 0" ] }, { "cell_type": "markdown", "metadata": { "id": "NxFH00vlVAfF" }, "source": [ "---\n", "\n", "### Broadcasting" ] }, { "cell_type": "markdown", "metadata": { "id": "MDP252odVAfF" }, "source": [ "Note how we could easily do both arithmetic and logical operations between NumPy arrays and scalar values (single numbers). This is because NumPy does something called ***broadcasting***, which allows you to do arithmetic operations between arrays and scalars and also between arrays of different but compatible dimensions. 
Basically, NumPy behaves as if it copied the single value or the smaller array as many times as needed to match up the dimensions (without actually duplicating any data in memory).\n", "\n", "*You can learn more about broadcasting here: https://numpy.org/doc/stable/user/basics.broadcasting.html*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "hvo9_x5HVAfF" }, "outputs": [], "source": [ "a = np.array([[1, 1, 1],\n", " [1, 1, 1],\n", " [1, 1, 1]])\n", "\n", "b = np.array([2, 3, 4])\n", "\n", "c = np.array([[2],\n", " [3],\n", " [4]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pSCm3UVHVAfF" }, "outputs": [], "source": [ "a * b" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mEqyoGjMVAfF" }, "outputs": [], "source": [ "a * c" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qny7r-ofVAfF" }, "outputs": [], "source": [ "b * c" ] }, { "cell_type": "markdown", "metadata": { "id": "i6FIVD-WVAfF" }, "source": [ "---\n", "\n", "### Modifying Arrays In-Place" ] }, { "cell_type": "markdown", "metadata": { "id": "UbxCmgXJVAfF" }, "source": [ "NumPy also allows the use of ***in-place*** operators like `+=` and `*=` that should be familiar to C/C++ programmers. These allow you to modify the values of the array *in-place* without returning a new array." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eKH6yVb8VAfF" }, "outputs": [], "source": [ "a = np.array([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9ZcsJOGwVAfF" }, "outputs": [], "source": [ "a += 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SQat_3TlVAfF" }, "outputs": [], "source": [ "print(a)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8HxKj99HVAfF" }, "outputs": [], "source": [ "a *= 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YuzQ8NiQVAfG" }, "outputs": [], "source": [ "print(a)" ] }, { "cell_type": "markdown", "metadata": { "id": "DRuVtCphVAfG" }, "source": [ "---\n", "\n", "## Universal Functions\n", "\n", "NumPy also contains numerous functions for common mathematical operations. These operate ***element-wise*** and also follow the *broadcasting* behavior described above. 
In NumPy documentation they are called *universal functions* and also commonly referred to as *ufuncs*.\n", "\n", "*You can find a list of all available universal functions here: https://numpy.org/doc/stable/reference/ufuncs.html*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2TdL6PDEVAfG" }, "outputs": [], "source": [ "a = np.array([[ 1, 2, 3],\n", " [ 4, 5, 6],\n", " [ 7, 8, 9]])\n", "\n", "b = np.array([[10, 11, 12],\n", " [13, 14, 15],\n", " [16, 17, 18]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "faYBTN7IVAfG" }, "outputs": [], "source": [ "np.add(a, b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nt03dfDkVAfG" }, "outputs": [], "source": [ "np.add(a, 1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4MYktV26VAfG" }, "outputs": [], "source": [ "np.exp(a)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FlF-RTqBVAfG" }, "outputs": [], "source": [ "np.sqrt(a)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Y463KVfpVAfG" }, "outputs": [], "source": [ "np.sin(b)" ] }, { "cell_type": "markdown", "metadata": { "id": "GEPGmkH-VAfG" }, "source": [ "---\n", "\n", "## Aggregation Functions\n", "\n", "NumPy also contains some handy aggregation functions that either operate on the whole array or along a specified axis of the array." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZHRq3BHTVAfG" }, "outputs": [], "source": [ "a = np.array([[1, 2],\n", " [5, 3],\n", " [4, 6]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TvdCLw-ZVAfG" }, "outputs": [], "source": [ "a.max()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PdLnepsMVAfG" }, "outputs": [], "source": [ "a.min()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JCy1TvV4VAfG" }, "outputs": [], "source": [ "a.sum()" ] }, { "cell_type": "markdown", "metadata": { "id": "3JLWBV6UVAfG" }, "source": [ "To run these aggregations along a certain axis, you have to specify the axis number using the `axis` named argument. The axis numbers range from `0` to `ndim-1` with the first axis (the highest dimension) having the number `0` and the last axis (the lowest dimension) having the number `ndim-1`. However, most of the time you will be dealing with two-dimensional arrays, in which case it is good to just keep in mind the following.\n", "\n", "- `axis=0` performs the operation **across rows** and results in a single output value for each column\n", "- `axis=1` performs the operation **across columns** and results in a single output value for each row" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Uk3hTyTpVAfG" }, "outputs": [], "source": [ "a.max(axis=0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "R1mi6z3qVAfG" }, "outputs": [], "source": [ "a.max(axis=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "s2mDJHqxVAfG" }, "source": [ "---\n", "\n", "## Indexing and Slicing\n", "\n", "You can select elements or ranges of elements from NumPy arrays as you would from a built-in Python `list`. If you are an avid MATLAB user, just keep in mind these three key differences:\n", "\n", "1. Python uses **zero-based indexing**, meaning that the first element of an array (or list) is at position zero.\n", "2. Square brackets **`[ ]`** are the indexing operator in Python.\n", "3. 
Negative indices count from the end, meaning that the **last** element of an array (or list) is at position `[-1]`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6a7omDqEVAfG" }, "outputs": [], "source": [ "a = np.array([[ 1, 2, 3, 4],\n", " [ 5, 6, 7, 8],\n", " [ 9, 10, 11, 12],\n", " [13, 14, 15, 16],\n", " [17, 18, 19, 20]])\n", "print(a)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9JwIaYrgVAfG" }, "outputs": [], "source": [ "a[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lgYFZjeGVAfG" }, "outputs": [], "source": [ "a[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "12VPMFroVAfG" }, "outputs": [], "source": [ "a[1:3]" ] }, { "cell_type": "markdown", "metadata": { "id": "j2qgIGnLVAfG" }, "source": [ "Remember that when using *`[start:end]`* to slice in Python, the *`end`* index is exclusive, meaning that the element at index *`end`* is not included in the slice.\n", "\n", "You can also use *`[start:end:step]`* with NumPy arrays. (Remember that omitting the *`start`* index means *slice from beginning* and omitting the *`end`* index means *slice until end*.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LXtCw49eVAfG" }, "outputs": [], "source": [ "a[::2]" ] }, { "cell_type": "markdown", "metadata": { "id": "DgN6cjeLVAfH" }, "source": [ "It is good to know that using `-1` as the step when slicing reverses the selection." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UNrUFIdVVAfH" }, "outputs": [], "source": [ "a[0][::-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "EYK2DCsIVAfH" }, "outputs": [], "source": [ "a[::-1]" ] }, { "cell_type": "markdown", "metadata": { "id": "_duM9-OpVAfH" }, "source": [ "To access elements from multi-dimensional arrays, we can use **chained indexing**." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "g7z_B8tXVAfH" }, "outputs": [], "source": [ "a[-1][0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "G-ZzHxaTVAfH" }, "outputs": [], "source": [ "a[0][1:3]" ] }, { "cell_type": "markdown", "metadata": { "id": "yUWRNh-0VAfH" }, "source": [ "However, chained indexing has its limitations. For example, slicing a multi-dimensional array also returns a multi-dimensional array, often leading to confusion when using chained indexing.\n", "\n", "You can also index multi-dimensional NumPy arrays by including multiple comma-separated indices or ranges in the **`[ ]`** indexing operator, one for each dimension. It is recommended to use this approach as opposed to chained indexing. Note that the order of dimensions is again from highest to lowest." 
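] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, `a[1:3, 0]` selects the first element of the second and third rows, whereas the chained `a[1:3][0]` first slices out those two rows and then selects the first *row* of that slice. Here is a quick sketch of the difference." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# comma-separated indexing: first element of rows 1 and 2\n", "print(a[1:3, 0])\n", "\n", "# chained indexing: first *row* of the two-row slice\n", "print(a[1:3][0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that distinction in mind, let's explore comma-separated indexing in more detail."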
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "baA565SiVAfH" }, "outputs": [], "source": [ "# print out the matrix again for reference\n", "print(a)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BbiZAFaIVAfH" }, "outputs": [], "source": [ "# second element in first row\n", "a[0, 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "p5Wg_R7XVAfH" }, "outputs": [], "source": [ "# first element in second row\n", "a[1, 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ylsavjJdVAfH" }, "outputs": [], "source": [ "# entire second row\n", "a[1, :]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "R8ol_pgrVAfH" }, "outputs": [], "source": [ "# entire second column\n", "a[:, 1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "V9EIMRXtVAfH" }, "outputs": [], "source": [ "# the last element from the second and third rows\n", "a[1:3, -1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0dNe-qvdVAfH" }, "outputs": [], "source": [ "# the middle 3x2 selection\n", "a[1:4, 1:3]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "w9nFqPU_VAfH" }, "outputs": [], "source": [ "# the upper-right-most 4x3 selection\n", "a[:4, 1:]" ] }, { "cell_type": "markdown", "metadata": { "id": "vWlBpCpXVAfH" }, "source": [ "You can use indexing to change single elements and slicing to change entire selections." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OY4RBUOmVAfH" }, "outputs": [], "source": [ "a[0, 0] = 0\n", "print(a)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gGzTxustVAfH" }, "outputs": [], "source": [ "a[3:, 2:] = 0\n", "print(a)" ] }, { "cell_type": "markdown", "metadata": { "id": "J1FT4ybfVAfH" }, "source": [ "---\n", "\n", "### Boolean Indexing" ] }, { "cell_type": "markdown", "metadata": { "id": "zVZ_C0JbVAfH" }, "source": [ "To select elements from a NumPy array you can also use a *boolean* array with the same exact dimensions as the array you are trying to select elements from. Every element where the corresponding element in the supplied *boolean* array is `True` gets selected. This is also known as ***boolean indexing***." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6sGrzi8jVAfH" }, "outputs": [], "source": [ "a = np.array([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])\n", "\n", "b = np.array([[True, False, False],\n", " [False, True, False],\n", " [False, False, True]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1WNe1GWUVAfH" }, "outputs": [], "source": [ "a[b]" ] }, { "cell_type": "markdown", "metadata": { "id": "-AxaV39EVAfH" }, "source": [ "However, most of the time it is unrealistic to manually create a *boolean* array for indexing.\n", "\n", "Luckily we know from before that NumPy applies logical operators ***element-wise***, resulting in a *boolean* array. We can use this to easily select elements from an array based on a desired condition." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-MqbrxyaVAfH" }, "outputs": [], "source": [ "# get boolean array that is true if element is even\n", "a % 2 == 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "pQf4L9ptVAfH" }, "outputs": [], "source": [ "# extract all even elements\n", "a[a % 2 == 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NxcnKgDcVAfH" }, "outputs": [], "source": [ "# extract all odd elements\n", "a[a % 2 != 0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HT6CkfE3VAfI" }, "outputs": [], "source": [ "# extract all elements larger than the mean\n", "a[a > a.mean()]" ] }, { "cell_type": "markdown", "metadata": { "id": "g6wlwqzmVAfI" }, "source": [ "---\n", "\n", "## Iterating\n", "\n", "Iterating over multi-dimensional NumPy arrays is done with respect to the highest dimension (first axis)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kpgKtJb7VAfI" }, "outputs": [], "source": [ "a = np.array([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zIbA4CISVAfI" }, "outputs": [], "source": [ "for row in a:\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": { "id": "C1czN4sjVAfI" }, "source": [ "To iterate over ***each element*** of a multi-dimensional array, one may use nested loops." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nAlZI3FkVAfI" }, "outputs": [], "source": [ "for row in a:\n", " for element in row:\n", " print(element)" ] }, { "cell_type": "markdown", "metadata": { "id": "abz46EgwVAfI" }, "source": [ "However, nested loops are often inefficient and could easily lead to confusion and unmaintainable code. Hence, it is recommended to avoid nested loops if possible.\n", "\n", "Luckily for you, NumPy includes functionality for easily iterating over all objects in a NumPy array. For example, you could use the `.flat` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1i1lDkvYVAfI" }, "outputs": [], "source": [ "for element in a.flat:\n", " print(element)" ] }, { "cell_type": "markdown", "metadata": { "id": "ebTyC9V3VAfI" }, "source": [ "Note that `.flat` returns an iterator. Basically, that is just something that tells Python how to iterate over ***all*** the elements of the array using a `for` loop. It does not actually return a flattened one-dimensional version of the original array. To do that, we can use the `.flatten()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Obo2KbvgVAfI" }, "outputs": [], "source": [ "a.flatten()" ] }, { "cell_type": "markdown", "metadata": { "id": "7W6s8rhbVAfI" }, "source": [ "---\n", "\n", "### Row-Major vs Column-Major\n", "\n", "Note the output of `.flatten()` or using `.flat` with a `for` loop. This tells us something about the way NumPy arrays are stored in computer memory. As you can see, by default two-dimensional NumPy arrays are stored in memory row by row. When converting from a two-dimensional array to a one-dimensional array, we first get all the elements from the first row (in order), then all the elements from the second row and so on.\n", "\n", "In computational jargon this is called ***row-major*** order. 
Row-major order is the default in the C and C++ programming languages and also in Python, but **not** in MATLAB or Fortran.\n", "\n", "MATLAB and Fortran store the elements of a two-dimensional matrix in memory column by column. This is called ***column-major*** order.\n", "\n", "Advanced MATLAB or Fortran users need to keep this difference in mind when using NumPy. However, when column-major (MATLAB-esque) behavior is desired, the default can be overridden using the `order` flag. This optional named argument is present in most NumPy functions that rely on the actual representation of the array in memory (like the functions for flattening an array).\n", "\n", "- `order='F'` results in **Fortran**-like ***column-major*** behavior\n", "- `order='C'` results in **C**-like ***row-major*** behavior (which is also the default)\n", "\n", "*The Wikipedia article on this provides a good overview if you are interested in learning more: https://en.wikipedia.org/wiki/Row-_and_column-major_order*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4URGwh-FVAfI" }, "outputs": [], "source": [ "a = np.array([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HCdVdbchVAfI" }, "outputs": [], "source": [ "a.flatten()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "emqVqtJCVAfI" }, "outputs": [], "source": [ "a.flatten(order='F')" ] }, { "cell_type": "markdown", "metadata": { "id": "JKjLC0LVVAfI" }, "source": [ "---\n", "\n", "### Advanced: Mapping\n", "\n", "NumPy also provides functionality similar to the `map()` function in Python to apply a function to every element in a NumPy array. However, it is somewhat less straightforward. Instead of providing an interface to apply a scalar function to every element like the Python `map()` function does, NumPy provides us with `np.vectorize()` that takes a scalar function and converts it to a new vectorized function that works with NumPy arrays." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xrbQ9YBgVAfI" }, "outputs": [], "source": [ "# function that adds 42 to a number\n", "def add_42(x):\n", " return x + 42" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JkR3q_oOVAfI" }, "outputs": [], "source": [ "# vectorized version of function above\n", "vectorized_add_42 = np.vectorize(add_42)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1_oV2fpUVAfI" }, "outputs": [], "source": [ "# now we can apply the function on a whole NumPy array\n", "a = np.array([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])\n", "\n", "vectorized_add_42(a)" ] }, { "cell_type": "markdown", "metadata": { "id": "jx3wvOz6VAfI" }, "source": [ "To avoid having to define redundant functions for simple operations, we can also use `lambda` with `np.vectorize()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "tFP2dQwqVAfI" }, "outputs": [], "source": [ "np.vectorize(lambda x : x + 42)(a)" ] }, { "cell_type": "markdown", "metadata": { "id": "w45JzcmjVAfI" }, "source": [ "---\n", "\n", "## Copy vs View\n", "\n", "Let's say we have the following matrix `a`."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PTckq2TZVAfI" }, "outputs": [], "source": [ "a = np.array([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])" ] }, { "cell_type": "markdown", "metadata": { "id": "mCodjcgEVAfI" }, "source": [ "And let's say we want to extract the bottom-right-most 2x2 elements from this matrix as a separate matrix `b`. The first thing that comes to mind would be to just use slicing with the assignment operator `=`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6mbsYbkPVAfI" }, "outputs": [], "source": [ "b = a[1:, 1:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qfgQWnvKVAfI" }, "outputs": [], "source": [ "print(b)" ] }, { "cell_type": "markdown", "metadata": { "id": "vbm3-rx-VAfJ" }, "source": [ "Now let's modify the upper-left-most element of `b`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "n6XUyJHOVAfJ" }, "outputs": [], "source": [ "b[0, 0] = 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "SjeukO2WVAfJ" }, "outputs": [], "source": [ "print(b)" ] }, { "cell_type": "markdown", "metadata": { "id": "g_zJQGcAVAfJ" }, "source": [ "After playing around with `b` and modifying its values we want to go back to `a` and take a look at the original values again." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lPYPsCehVAfJ" }, "outputs": [], "source": [ "print(a)" ] }, { "cell_type": "markdown", "metadata": { "id": "Q8oKbo_tVAfJ" }, "source": [ "**The values in our original matrix `a` have also changed!**\n", "\n", "That is because most NumPy operations return a ***view*** of the original array instead of a copy. This is computationally more efficient and allows NumPy to perform fast operations even on really large and complex arrays because the data is never copied over in computer memory. Instead we are shown the same array stored in memory using a slightly different view (you can think of it as a window) that perhaps blocks out some elements and changes the order of others. **Many** operations in NumPy, including **all** basic slicing operations, result in a different ***view*** of the original array rather than a copy. (Indexing with an integer array or a *boolean* array is a notable exception that returns a copy.)\n", "\n", "However, this is not how MATLAB handles things. In MATLAB, most operations result in a ***copy*** of the original array, allowing you to modify the outputs of various operations without having to worry about changing the original data. Hence, avid MATLAB users must keep in mind that this is not the case in NumPy to avoid unintentionally overwriting data.\n", "\n", "It is also crucial to note that this behavior of returning a ***view*** is not universal in NumPy. While **most** operations return a ***view***, some might return a ***copy***. Furthermore, due to internal optimizations in NumPy, in some cases the same function or operation might sometimes return a view and other times return a copy, depending on the input and whatever is most efficient at the time. Hence, you should **always read the documentation** of a function or method to know for sure whether it returns a view or a copy in your particular use case.\n", "\n", "However, when using NumPy, the safest mindset is to assume you are working with a ***view*** unless you have explicitly asked for a copy. To ensure you are working with a ***copy*** in NumPy, use the `.copy()` method."
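] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you are ever unsure whether two arrays share the same underlying data, `np.shares_memory()` offers a quick check. Below is a minimal sketch (the variable names `x`, `view`, and `copy` are just illustrative)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check whether a slice and an explicit copy share memory with the original\n", "x = np.arange(9).reshape(3, 3)\n", "view = x[1:, 1:] # basic slicing returns a view\n", "copy = x[1:, 1:].copy() # .copy() returns an independent array\n", "print(np.shares_memory(x, view)) # True: modifying view also modifies x\n", "print(np.shares_memory(x, copy)) # False: copy is independent of x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's repeat the earlier example, this time making an explicit copy of the slice."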
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "zabTGKK5VAfJ" }, "outputs": [], "source": [ "a = np.array([[1, 2, 3],\n", " [4, 5, 6],\n", " [7, 8, 9]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "X_n4C3PXVAfJ" }, "outputs": [], "source": [ "b = a[1:, 1:].copy()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "J2r5Dw7BVAfJ" }, "outputs": [], "source": [ "print(a)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5L2_yT6zVAfJ" }, "outputs": [], "source": [ "print(b)" ] }, { "cell_type": "markdown", "metadata": { "id": "MeeG_whpVAfJ" }, "source": [ "This time, `b` is actually referring to a whole new array that is separate from `a`. Previously, when we did not specify the `.copy()` method, `b` was just an alias that referred to a specific section of `a`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nny9V00cVAfJ" }, "outputs": [], "source": [ "b[0, 0] = 0" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6tSOcOENVAfJ" }, "outputs": [], "source": [ "print(b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "W6sxpNZhVAfJ" }, "outputs": [], "source": [ "print(a)" ] }, { "cell_type": "markdown", "metadata": { "id": "XGD3TGWrVAfJ" }, "source": [ "---\n", "## Shape Manipulation\n", "\n", "NumPy makes it really easy to manipulate the shape of an array. Note that all of these manipulations just return a different ***view*** of the same array and do not actually create a new array or change data in computer memory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QX6kG1vxVAfJ" }, "outputs": [], "source": [ "a = np.array([[ 1, 2, 3, 4, 5, 6],\n", " [ 7, 8, 9, 10, 11, 12],\n", " [13, 14, 15, 16, 17, 18]])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "C3dw3cg_VAfJ" }, "outputs": [], "source": [ "a.reshape(2, 9)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TbBrBWzeVAfJ" }, "outputs": [], "source": [ "a.reshape(6, 3)" ] }, { "cell_type": "markdown", "metadata": { "id": "W4bJhfixVAfJ" }, "source": [ "To transpose a two-dimensional array, we can also use the `.T` attribute." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "M-hMZhh-VAfJ" }, "outputs": [], "source": [ "a.T" ] }, { "cell_type": "markdown", "metadata": { "id": "NbRCKIC5VAfJ" }, "source": [ "The `.reshape()` method can be easily combined with `np.arange()` or `np.linspace()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6z_GumYNVAfJ" }, "outputs": [], "source": [ "a = np.arange(24).reshape(6,4)\n", "print(a)" ] }, { "cell_type": "markdown", "metadata": { "id": "MNva7svRVAfK" }, "source": [ "If we do not care about a particular axis/dimension, we can have NumPy infer ***one*** of the dimensions by denoting it with `-1` in `.reshape()`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "1nsgJggBVAfK" }, "outputs": [], "source": [ "a.reshape(2, -1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "DQR0z7JIVAfK" }, "outputs": [], "source": [ "a.reshape(-1, 3)" ] }, { "cell_type": "markdown", "metadata": { "id": "6arfMlBrVAfK" }, "source": [ "---\n", "\n", "## Bonus: File I/O with NumPy\n", "\n", "Once you are more comfortable with Python and NumPy arrays, it is highly likely that at some point you will have the need or desire to do one of the following:\n", "\n", "- Work with a preexisting dataset or matrix or save your results for future use\n", "- Export your results for publication or to share them with a friend or colleague\n", "- Conduct some parts of the analysis using a different program (like MATLAB)\n", "\n", "All of these situations involve writing a NumPy array to a file and/or reading a matrix from a file and saving it as a NumPy array. Luckily for you, NumPy has built-in functionality to accommodate this. Here is an overview of all the file I/O functions available in NumPy: https://numpy.org/doc/stable/reference/routines.io.html" ] }, { "cell_type": "markdown", "metadata": { "id": "t96L9bRhVAfK" }, "source": [ "---\n", "\n", "### Writing a NumPy Array to a File\n", "\n", "Let's say we have an array `a` that we would like to export to a file for some reason." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2AwGe3Q6VAfK" }, "outputs": [], "source": [ "a = np.random.random((5,5))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5a9zKwj6VAfK" }, "outputs": [], "source": [ "print(a)" ] }, { "cell_type": "markdown", "metadata": { "id": "Kqv7A5AXVAfK" }, "source": [ "One option would be to use `np.save()` which saves the array to a binary `.npy` file." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "myaFcs-0VAfK" }, "outputs": [], "source": [ "np.save('array1', a)" ] }, { "cell_type": "markdown", "metadata": { "id": "K0Uoqg6mVAfK" }, "source": [ "Now when you open the **File Browser** from the left-hand menu, you should now see `array1.npy` along with some other files.\n", "\n", "If you **downloaded** this notebook and are running it on a local instance of Jupyter, the `array1.npy` file is now saved into the same folder containing this notebook. You can use your system file browser (*Explorer* or *Finder*) to locate the file and take a look.\n", "\n", "If you are running this notebook using **Binder** or **Google Colab**, the `array1.npy` file is temporarily stored on the server running the notebook. You can view the file *only* using the built-in file browser accessible via the left-hand menu. Also note that this file along with any other files you might create will be deleted from the server after you close the notebook. You can download any files you would like to save on your computer by *right-clicking* on the file in the left-hand browser and then selecting *Download*.\n", "\n", "Note that if you try opening `array1.npy`, it will not work. That is because the file is in **binary** format, meaning it is not human-readable and can only be deciphered by NumPy. Saving NumPy arrays in binary format is a good option if you care about speed and efficiency and are only planning on using NumPy to work with the data.\n", "\n", "However, in many cases you might actually want to be able to see the contents of the file and use it with other programs like MATLAB. 
In that case, it makes much more sense to save the NumPy array as a human-readable text file. This can be done using `np.savetxt()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LGtTGKMLVAfK" }, "outputs": [], "source": [ "np.savetxt('array2.txt', a)" ] }, { "cell_type": "markdown", "metadata": { "id": "FdhggZFAVAfK" }, "source": [ "Because `array2.txt` is a human-readable text file, you can take a look at it by opening it with a text editor or *double-clicking* on it in the built-in file explorer on the left. Note how by default, the values are separated by spaces and the numbers are formatted using scientific notation.\n", "\n", "To change the separator between the array items in the outputted text file, we can use the `delimiter` argument. For example, to produce a **CSV** (Comma-Separated Values) file, we can specify `delimiter = \",\"`.\n", "\n", "To change the formatting of the values themselves, we can use the `fmt` argument along with a ***format string***. Format strings are quite complex and can be very confusing to beginners. Assuming you will only be working with floating point numbers, here is a simple formula: `\"%.[precision][f|e]\"`\n", "\n", "- `precision` is the number of decimal points or significant digits\n", "- `f` stands for floating-point notation\n", "- `e` stands for scientific notation\n", "\n", "For example `\"%.9f\"` stands for floating-point notation with nine decimal points and `\"%.16e\"` stands for scientific notation with 16 significant digits (this is the default).\n", "\n", "If you would like to tweak the formatting even more, you can generate more complex format strings following this specification: https://docs.python.org/3/library/string.html#format-specification-mini-language" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rKqZALglVAfK" }, "outputs": [], "source": [ "np.savetxt('array3.csv', a, fmt='%.12f', delimiter=',')" ] }, { "cell_type": "markdown", "metadata": { "id": "zS0bMw0GVAfK" }, "source": [ "Now take a look at `array3.csv` and see how setting the delimiter and formatting string have changed the appearance of the output." ] }, { "cell_type": "markdown", "metadata": { "id": "ZkbuaUuBVAfK" }, "source": [ "---\n", "\n", "### Reading a NumPy Array from a File\n", "\n", "To read a binary `.npy` file into a NumPy array, we can use `np.load()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GQTwLZ_jVAfK" }, "outputs": [], "source": [ "b = np.load('array1.npy')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "KokN17gxVAfK" }, "outputs": [], "source": [ "b" ] }, { "cell_type": "markdown", "metadata": { "id": "m4yJhsW9VAfK" }, "source": [ "To read data from a text file into a NumPy array, we can use either `np.loadtxt()` or `np.genfromtxt()`.\n", "\n", "- `np.loadtxt()` is an older function and provides very basic functionality\n", "- `np.genfromtxt()` is a newer function that is more customizable and can handle missing values\n", "\n", "Hence it is recommended you use `np.genfromtxt()` as a default. 
When using either function, you have to specify the `delimiter` argument if using anything other than whitespace.\n", "\n", "A detailed guide on importing data with `np.genfromtxt()`: https://numpy.org/doc/stable/user/basics.io.genfromtxt.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZgDFAFIPVAfK" }, "outputs": [], "source": [ "c = np.loadtxt('array2.txt')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "E2RtmV3IVAfK" }, "outputs": [], "source": [ "c" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "mpMfxB-hVAfL" }, "outputs": [], "source": [ "d = np.genfromtxt('array3.csv', delimiter=',')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nlAQhEfvVAfL" }, "outputs": [], "source": [ "d" ] }, { "cell_type": "markdown", "metadata": { "id": "8_GureZPVAfL" }, "source": [ "An important thing to note when saving floating-point arrays to text files is ***loss of significance***. Because we can only store a set number of significant digits in the text file, it is possible that the number of significant digits will be reduced when writing data to a file, introducing round-off errors and causing precision loss.\n", "\n", "Note that this is not the case when using the binary `.npy` format." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "wod2Lg0TVAfL" }, "outputs": [], "source": [ "a == b" ] }, { "cell_type": "markdown", "metadata": { "id": "w5-EIqD6VAfL" }, "source": [ "When writing to a text file using the default setting of scientific notation with 16 significant digits, precision loss does not occur under normal circumstances. However, note that this is dependent on the *datatype* of your array." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "H5telr-LVAfL" }, "outputs": [], "source": [ "a == c" ] }, { "cell_type": "markdown", "metadata": { "id": "xyv31xJwVAfL" }, "source": [ "However, when specifying the number of decimal points or significant digits, or exporting with floating-point notation, precision loss is commonplace and very likely to occur." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AEaJ9ek2VAfL" }, "outputs": [], "source": [ "a == d" ] }, { "cell_type": "markdown", "metadata": { "id": "xBkzngJHVAfL" }, "source": [ "---\n", "\n", "### Advanced: File I/O With Python\n", "\n", "But what exactly happens when we use `np.genfromtxt()` to read data from a file? We can get a high-level overview of the mechanisms that take place in the background when we try to recreate the functionality using standard Python.\n", "\n", "First, we have to open the file in order to be able to read data from it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "n0bj2cUpVAfL" }, "outputs": [], "source": [ "file = open('array3.csv')" ] }, { "cell_type": "markdown", "metadata": { "id": "TQPbV-h3VAfL" }, "source": [ "Now we have **file object** called `file` that gives us access to `array3.csv`. Using `.readlines()` with a file object, we can read all the lines from a file into a list." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "r0ExbFNHVAfL" }, "outputs": [], "source": [ "lines = file.readlines()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Gn_IhkYhVAfL" }, "outputs": [], "source": [ "lines" ] }, { "cell_type": "markdown", "metadata": { "id": "Q7OOzKw2VAfL" }, "source": [ "Now we have a list called `lines`, where each element is a line from the file `array3.csv`. Note that some cleaning needs to be done as these lines still contain whitespace characters like newlines." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eW4IuHdpVAfL" }, "outputs": [], "source": [ "cleaned_lines = []\n", "for line in lines:\n", " line = line.strip()\n", " cleaned_lines.append(line)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YglCfhQaVAfL" }, "outputs": [], "source": [ "cleaned_lines" ] }, { "cell_type": "markdown", "metadata": { "id": "06Xad2ntVAfL" }, "source": [ "The next step would be to convert each line to a list by splitting the string on the separator. This will lead to a list of lists, which is already quite similar to a two-dimensional NumPy array." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "2FTja6_HVAfL" }, "outputs": [], "source": [ "lists = []\n", "for line in cleaned_lines:\n", " lst = line.split(',')\n", " lists.append(lst)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bdpxPJgjVAfL" }, "outputs": [], "source": [ "lists" ] }, { "cell_type": "markdown", "metadata": { "id": "t7iURldjVAfL" }, "source": [ "Note how all the elements still have the type of `str`, meaning they are text, not numbers. Luckily there is an easy fix for that." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ERzO2VVvVAfL" }, "outputs": [], "source": [ "type(lists[0][0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "v0tKnrxKVAfL" }, "outputs": [], "source": [ "float_lists = []\n", "for lst in lists:\n", " flst = []\n", " for element in lst:\n", " element = float(element)\n", " flst.append(element)\n", " float_lists.append(flst)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Xjvr-jSjVAfL" }, "outputs": [], "source": [ "float_lists" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LwXvLCM5VAfL" }, "outputs": [], "source": [ "type(float_lists[0][0])" ] }, { "cell_type": "markdown", "metadata": { "id": "n1crG9tmVAfM" }, "source": [ "Now we can use this list of lists to create a NumPy array." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kDwzfRrnVAfM" }, "outputs": [], "source": [ "e = np.array(float_lists)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gVuC5HUdVAfM" }, "outputs": [], "source": [ "e" ] }, { "cell_type": "markdown", "metadata": { "id": "7h-H3qzzVAfM" }, "source": [ "We can confirm that we got the same result as we would have gotten using `np.genfromtxt()` by comparing it to the array `d` from before." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eTSFUsQSVAfM" }, "outputs": [], "source": [ "e == d" ] }, { "cell_type": "markdown", "metadata": { "id": "z82CfYUsVAfM" }, "source": [ "Finally we have to remember to close the file. This is very important to avoid any potential file corruption." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HdVjGrZ5VAfM" }, "outputs": [], "source": [ "file.close()" ] }, { "cell_type": "markdown", "metadata": { "id": "lR4hNeY2VAfM" }, "source": [ "Forgetting to close the file could lead to various issues and have serious consequences. Hence, it is commonplace to use `open()` in conjunction with a `with` statement. Any code executed within the block defined by the `with` statement has access to the file and any code outside of the block does not. This reduces the potential for errors and does not require you to manually close the connection to the file.\n", "\n", "Also note how our previous processing involved looping over basically the same list numerous times. We can simplify this a little by looping over indices instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "of1L-ee7VAfM" }, "outputs": [], "source": [ "with open('array3.csv') as f:\n", " lines = f.readlines()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ph8dZJsMVAfM" }, "outputs": [], "source": [ "lines" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jHoG_aIPVAfM" }, "outputs": [], "source": [ "for i in range(len(lines)):\n", "    lines[i] = lines[i].strip().split(',')\n", "    for j in range(len(lines[i])):\n", "        lines[i][j] = float(lines[i][j])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ttDPeRL1VAfM" }, "outputs": [], "source": [ "lines" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZoqODv6rVAfM" }, "outputs": [], "source": [ "arr = np.array(lines)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "iaFCEb1KVAfM" }, "outputs": [], "source": [ "arr" ] }, { "cell_type": "markdown", "metadata": { "id": "DVS7NQ7WVAfM" }, "source": [ "We can confirm that the result is indeed the same as before." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "5Eb9Pju1VAfM" }, "outputs": [], "source": [ "arr == e" ] }, { "cell_type": "markdown", "metadata": { "id": "_T-tR-pBVAfM" }, "source": [ "Note that you can condense this even more by using `map()` with `lambda` and remembering that `np.array()` has a `dtype` argument." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_2tx2K02VAfM" }, "outputs": [], "source": [ "with open('array3.csv') as f:\n", " arr2 = np.array(list(map(lambda x : x.strip().split(','), f.readlines())), dtype=float)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dkE_awxSVAfM" }, "outputs": [], "source": [ "arr2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "VdzlmkymVAfM" }, "outputs": [], "source": [ "arr == arr2" ] }, { "cell_type": "markdown", "metadata": { "id": "Ebq9aXW5VAfM" }, "source": [ "However, as you can see, that already looks quite complicated and confusing. Plus, it is kind of ridiculous and completely unnecessary. Of course the easiest and most compact option would be to use `np.genfromtxt()` and that is what you should be using when attempting to read data from a text file into a NumPy array. As the saying goes, there is no point in reinventing the wheel.\n", "\n", "However, if you ever feel the need (or desire) to read a file line by line using Python, remember that a combination of `with`, `open()` and `.readlines()` is the easiest option."
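] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One more detail worth knowing: a file object can also be iterated over directly, yielding one line at a time without first building an intermediate list. Here is a minimal sketch that simply prints the cleaned-up lines of `array3.csv`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# iterate over the file object directly, one line at a time\n", "with open('array3.csv') as f:\n", "    for line in f:\n", "        print(line.strip())"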
] }, { "cell_type": "markdown", "metadata": { "id": "cjFGI1dzVAfM", "tags": [] }, "source": [ "---\n", "\n", "## Quick Overview of Matplotlib\n", "\n", "Matplotlib is the primary plotting library in Python and it is designed to resemble the plotting functionalities of MATLAB. While it provides all kinds of different plotting functionality, the `matplotlib.plyplot` module is used the most. It is common to import this module under the alias `plt`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ov9wpNEzVAfM" }, "outputs": [], "source": [ "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "id": "6HwiRSAcVAfM" }, "source": [ "Matplotlib works in a layered fashion. First you define your plot using `plt.plot(x, y, ...)`, then you can use additional `plt` methods to add more layers to your plot or modify its appearance. Finally, you use `plt.show()` to show the plot or `plt.savefig()` to save it to an external file. Let's see how Matplotlib works in practice by creating some trigonometric plots." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "H6N8VixLVAfN" }, "outputs": [], "source": [ "x = np.linspace(0, 2*np.pi, num=20)\n", "y = np.sin(x)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "denISA3kVAfN" }, "outputs": [], "source": [ "plt.plot(x, y)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "J20M65UEVAfN" }, "source": [ "`plt.plot()` takes additional arguments that modify the appearance of the plot. See the documentation for details: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_3nRwHM8VAfN" }, "outputs": [], "source": [ "# we can specify the style of the plot using named arguments\n", "plt.plot(x, y, color='red', linestyle='--', marker='o')\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JjIInTpxVAfN" }, "outputs": [], "source": [ "# or we could use a shorthand string\n", "plt.plot(x, y, 'r--o')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "Y_HaDtBPVAfN" }, "source": [ "We can easily add additional layers and stylistic elements to the plot." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-lE0CgyuVAfN" }, "outputs": [], "source": [ "plt.plot(x, y, 'r--o')\n", "plt.plot(x, np.cos(x), 'b-*')\n", "plt.title('Sin and Cos')\n", "plt.xlabel('x')\n", "plt.ylabel('y')\n", "plt.legend(['sin', 'cos'])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "WNaTyEchVAfN" }, "source": [ "Note that if we only supply one array as an input to `plt.plot()`, it uses the values of the array as `y` values and uses the indices of the array as `x` values." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BzJAQSbOVAfN" }, "outputs": [], "source": [ "plt.plot([2, 3, 6, 4, 8, 9, 5, 7, 1])\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "6kdD5BEcVAfN" }, "source": [ "If we want to create a figure with several subplots, we can use `plt.subplots()` to create a grid of subplots. It takes the dimensions of the subplot grid as input *`plt.subplots(rows, columns)`* and returns tow objects. The first is a figure object and the second is a NumPy array containing the subplots. In Matplotlib, subplots are often called *axes*." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FGGdlXIHVAfN" }, "outputs": [], "source": [ "# create a more fine-grained array to work with\n", "a = np.linspace(0, 2*np.pi, num=100)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yXEDAU0YVAfN" }, "outputs": [], "source": [ "# create a two-by-two grid for our subplots\n", "fig, ax = plt.subplots(2, 2)\n", "\n", "# create subplots\n", "ax[0, 0].plot(a, np.sin(a)) # upper-left\n", "ax[0, 1].plot(a, np.cos(a)) # upper-right\n", "ax[1, 0].plot(a, np.tan(a)) # bottom-left\n", "ax[1, 1].plot(a, -a) # bottom-right\n", "\n", "# show figure\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "-Jy4umzyVAfN" }, "source": [ "A more MATLAB-esque way of creating subplots would be to use the alternative `plt.subplot()` method. Using this method, you can define subplot using a three-number combination `plt.subplot(rows, columns, index)`. The indexes of the subplots defined using this method increase in ***row-major*** order and, in true MATLAB fashion, begin with one." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7Aq35w-kVAfN" }, "outputs": [], "source": [ "plt.subplot(2, 2, 1) # upper-left\n", "plt.plot(a, np.sin(a))\n", "plt.subplot(2, 2, 2) # upper-right\n", "plt.plot(a, np.cos(a))\n", "plt.subplot(2, 2, 3) # bottom-left\n", "plt.plot(a, np.tan(a))\n", "plt.subplot(2, 2, 4) # bottom-right\n", "plt.plot(a, -a)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "id": "I-_E8Hh4VAfN" }, "source": [ "---\n", "\n", "## Additional Resources\n", "\n", "This notebook only introduced the core components of NumPy and Matplotlib and did not include any hands-on exercises. If you would also like to learn about some slightly more advanced aspects of NumPy **and** try your hand at some **exercises** involving NumPy and Matplotlib, check out the [University of Helsinki Data Analysis with Python MOOC](https://csmastersuh.github.io/data_analysis_with_python_2020/index.html). Feel free to go through all of the content to get acquainted with all things Python and data analysis, but if you want to focus solely on NumPy and Matplotlib, check out these sections:\n", "\n", "- Basic NumPy: https://csmastersuh.github.io/data_analysis_with_python_2020/numpy.html\n", "- Advanced NumPy: https://csmastersuh.github.io/data_analysis_with_python_2020/numpy2.html\n", "- Matplotlib: https://csmastersuh.github.io/data_analysis_with_python_2020/matplotlib.html\n", "\n", "Furthermore, the official NumPy documentation contains numerous tutorials and quickstart guides designed for users of different backgrounds:\n", "\n", "- Quickstart Tutorial: https://numpy.org/doc/stable/user/quickstart.html\n", "- NumPy Basics for Absolute Beginners: https://numpy.org/doc/stable/user/absolute_beginners.html\n", "- NumPy for MATLAB Users: https://numpy.org/doc/stable/user/numpy-for-matlab-users.html\n", "\n", "The Matplotlib official documentation also contains multiple useful tutorials: https://matplotlib.org/tutorials/index.html" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" }, "toc-autonumbering": false }, "nbformat": 4, "nbformat_minor": 0 }