Python for Data Analysis and Visualization#


Workshop Website | View on GitHub | datalab.tufts.edu | @TuftsDataLab

A Tufts University Data Lab Workshop
Written by Uku-Kaspar Uustalu

Python resources: go.tufts.edu/python
Questions: datalab-support@elist.tufts.edu
Feedback: uku-kaspar.uustalu@tufts.edu


Importing Libraries#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Quick Overview of Matplotlib#

Matplotlib works in a layered fashion. First you define your plot using plt.plot(x, y, ...), then you can use additional plt methods to add more layers to your plot or modify its appearance. Finally, you use plt.show() to show the plot or plt.savefig() to save it to an external file. Let’s see how Matplotlib works in practice by creating some trigonometric plots.

x = np.linspace(0, 2 * np.pi, num = 20)
y = np.sin(x)
plt.plot(x, y)
plt.show()
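
If we wanted to save the figure to a file instead of (or in addition to) displaying it, plt.savefig() does that. A minimal sketch, assuming we want a PNG named sine.png (the filename and resolution are just examples):

plt.plot(x, y)
plt.savefig('sine.png', dpi = 150, bbox_inches = 'tight')   # write the figure to disk
plt.close()                                                  # close the figure once saved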

plt.plot() takes additional arguments that modify the appearance of the plot. See the documentation for details: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html

# we can specify the style of the plot using named arguments
plt.plot(x, y, color = 'red', linestyle = '--', marker = 'o')
plt.show()
# or we could use a shorthand string
plt.plot(x, y, 'r--o')
plt.show()

We can easily add additional layers and stylistic elements to the plot.

plt.plot(x, y, 'r--o')
plt.plot(x, np.cos(x), 'b-*')
plt.title('Sin and Cos')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['sin', 'cos'])
plt.show()

Note that if we supply only one array to plt.plot(), it uses the array values as y values and the array indices as x values.

plt.plot([2, 3, 6, 4, 8, 9, 5, 7, 1])
plt.show()

If we want to create a figure with several subplots, we can use plt.subplots() to create a grid of subplots. It takes the dimensions of the subplot grid as input, plt.subplots(rows, columns), and returns two objects. The first is a figure object and the second is a NumPy array containing the subplots. In Matplotlib, subplots are often called axes.

# create a more fine-grained array to work with
a = np.linspace(0, 2 * np.pi, num = 100)
# create a two-by-two grid for our subplots
fig, ax = plt.subplots(2, 2)

# create subplots
ax[0, 0].plot(a, np.sin(a))     # upper-left
ax[0, 1].plot(a, np.cos(a))     # upper-right
ax[1, 0].plot(a, np.tan(a))     # bottom-left
ax[1, 1].plot(a, -a)            # bottom-right

# show figure
plt.show()
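
When the panels get crowded, per-subplot titles and a call to tight_layout() can help. A small sketch building on the fig, ax grid from above (the titles are just examples, not part of the original cells):

fig, ax = plt.subplots(2, 2)
ax[0, 0].plot(a, np.sin(a))
ax[0, 0].set_title('sin')
ax[0, 1].plot(a, np.cos(a))
ax[0, 1].set_title('cos')
ax[1, 0].plot(a, np.tan(a))
ax[1, 0].set_title('tan')
ax[1, 1].plot(a, -a)
ax[1, 1].set_title('-x')
fig.tight_layout()   # add padding so titles and tick labels do not overlap
plt.show()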

A more MATLAB-esque way of creating subplots would be to use the alternative plt.subplot() method. Using this method, you define a subplot with a three-number combination: plt.subplot(rows, columns, index). The indexes of subplots defined this way increase in row-major order and, in true MATLAB fashion, begin with one.

plt.subplot(2, 2, 1)    # upper-left
plt.plot(a, np.sin(a))
plt.subplot(2, 2, 2)    # upper-right
plt.plot(a, np.cos(a))
plt.subplot(2, 2, 3)    # bottom-left
plt.plot(a, np.tan(a))
plt.subplot(2, 2, 4)    # bottom-right
plt.plot(a, -a)
plt.show()

Working with Messy Data#

grades = pd.read_csv('data/grades.csv')
grades
              Name  Exam 1    exam2  Exam_3  exam4
0       John Smith      98       94      95     86
1     Mary Johnson      89       92      96     82
2  Robert Williams      88       72  absent     91
3   Jennifer Jones      92  excused      94     99
4     Linda Wilson      84       92      89     94
print(grades)
              Name  Exam 1    exam2  Exam_3  exam4
0       John Smith      98       94      95     86
1     Mary Johnson      89       92      96     82
2  Robert Williams      88       72  absent     91
3   Jennifer Jones      92  excused      94     99
4     Linda Wilson      84       92      89     94

Cleaning Column Names#

grades.rename(str.lower, axis = 'columns')
              name  exam 1    exam2  exam_3  exam4
0       John Smith      98       94      95     86
1     Mary Johnson      89       92      96     82
2  Robert Williams      88       72  absent     91
3   Jennifer Jones      92  excused      94     99
4     Linda Wilson      84       92      89     94
grades
              Name  Exam 1    exam2  Exam_3  exam4
0       John Smith      98       94      95     86
1     Mary Johnson      89       92      96     82
2  Robert Williams      88       72  absent     91
3   Jennifer Jones      92  excused      94     99
4     Linda Wilson      84       92      89     94
grades = grades.rename(str.lower, axis = 'columns')
grades
              name  exam 1    exam2  exam_3  exam4
0       John Smith      98       94      95     86
1     Mary Johnson      89       92      96     82
2  Robert Williams      88       72  absent     91
3   Jennifer Jones      92  excused      94     99
4     Linda Wilson      84       92      89     94
grades.rename(columns = {'exam 1': 'exam1', 'exam_3': 'exam3'}, inplace = True)
grades
              name  exam1    exam2   exam3  exam4
0       John Smith     98       94      95     86
1     Mary Johnson     89       92      96     82
2  Robert Williams     88       72  absent     91
3   Jennifer Jones     92  excused      94     99
4     Linda Wilson     84       92      89     94
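
Had we wanted to clean all of the column names in a single step, something like the following would also work on the freshly loaded frame. This is just a sketch, assuming we want lowercase names with spaces and underscores dropped (the raw variable is hypothetical):

# hypothetical re-read of the raw file, then one chained .str clean-up
raw = pd.read_csv('data/grades.csv')
raw.columns = raw.columns.str.lower().str.replace(' ', '').str.replace('_', '')
raw.columns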

Indexing and Datatypes#

grades.dtypes
name     object
exam1     int64
exam2    object
exam3    object
exam4     int64
dtype: object
grades['name']
0         John Smith
1       Mary Johnson
2    Robert Williams
3     Jennifer Jones
4       Linda Wilson
Name: name, dtype: object
grades.name
0         John Smith
1       Mary Johnson
2    Robert Williams
3     Jennifer Jones
4       Linda Wilson
Name: name, dtype: object
grades[['name']]
              name
0       John Smith
1     Mary Johnson
2  Robert Williams
3   Jennifer Jones
4     Linda Wilson
grades['name'][0]
'John Smith'
grades['name'][1]
'Mary Johnson'
grades.name[1]
'Mary Johnson'
grades['exam1'][1]
89
grades.exam2[1]
'92'
grades['exam3'][1]
'96'
grades.exam4[1]
82
grades
              name  exam1    exam2   exam3  exam4
0       John Smith     98       94      95     86
1     Mary Johnson     89       92      96     82
2  Robert Williams     88       72  absent     91
3   Jennifer Jones     92  excused      94     99
4     Linda Wilson     84       92      89     94
print(type(grades['exam3'][0]))
print(type(grades['exam3'][1]))
print(type(grades['exam3'][2]))
<class 'str'>
<class 'str'>
<class 'str'>
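
Instead of checking values one at a time, one quick way to inspect the Python type of every entry in a column is to map the built-in type function over the Series:

# show the type of each individual value in the exam3 column
grades['exam3'].map(type)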

Assigning Values and Working with Missing Data#

grades['exam3'][2] = 0
C:\Users\bzhou01\AppData\Local\Temp\ipykernel_22684\2777045509.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  grades['exam3'][2] = 0

Oh no, a really scary warning! What is happening?

Because Python uses something called pass-by-object-reference and does a lot of optimization in the background, the end user (that is, you) has little to no control over whether they are referencing the original object or a copy. This warning is just pandas letting us know that when using chained indexing to write a value, the behaviour is undefined, meaning that pandas cannot be sure whether you are writing to the original data frame or a temporary copy.

To learn more: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

grades
              name  exam1    exam2 exam3  exam4
0       John Smith     98       94    95     86
1     Mary Johnson     89       92    96     82
2  Robert Williams     88       72     0     91
3   Jennifer Jones     92  excused    94     99
4     Linda Wilson     84       92    89     94

Phew, this time we got lucky. However, with a different data frame the same approach might actually write the changes to a temporary copy and leave the original data frame unchanged. Chained indexing is dangerous and you should avoid using it to write values. What should we use instead?

There are a lot of options: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

  • To write a single value, use .at[row, column]

  • To write a range of values (or a single value), use .loc[row(s), column(s)]

Note that .at and .loc use row and column labels. The numbers 0, 1, 2, 3, and 4 that we see in front of the rows are actually row labels. By default, row labels match row indexes in pandas. However, quite often you will work with rows that have actual labels. Sometimes those labels might be numeric and resemble indexes, which leads to confusion and error. Hence, if you want to make sure you are using indexes, not labels, use .iat and .iloc instead.

grades.at[2, 'exam3']
0
grades.loc[3, 'exam2'] = np.NaN
grades
              name  exam1 exam2 exam3  exam4
0       John Smith     98    94    95     86
1     Mary Johnson     89    92    96     82
2  Robert Williams     88    72     0     91
3   Jennifer Jones     92   NaN    94     99
4     Linda Wilson     84    92    89     94
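
As noted above, .iat and .iloc are the positional (integer-based) counterparts of .at and .loc. A couple of read-only examples (the positions happen to match the default labels here):

grades.iat[2, 3]        # row position 2, column position 3 ('exam3') -- same cell as above
grades.iloc[0:2, 1:3]   # first two rows, the exam1 and exam2 columns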
grades.dtypes
name     object
exam1     int64
exam2    object
exam3    object
exam4     int64
dtype: object
grades['exam2'][0]
'94'
grades.exam3[0]
'95'
grades['exam2'] = pd.to_numeric(grades['exam2'])
grades['exam3'] = pd.to_numeric(grades.exam3)
grades
              name  exam1  exam2  exam3  exam4
0       John Smith     98   94.0     95     86
1     Mary Johnson     89   92.0     96     82
2  Robert Williams     88   72.0      0     91
3   Jennifer Jones     92    NaN     94     99
4     Linda Wilson     84   92.0     89     94
grades.dtypes
name      object
exam1      int64
exam2    float64
exam3      int64
exam4      int64
dtype: object

Aggregating Data#

grades['sum'] = grades['exam1'] + grades['exam2'] + grades['exam3'] + grades['exam4']
grades
              name  exam1  exam2  exam3  exam4    sum
0       John Smith     98   94.0     95     86  373.0
1     Mary Johnson     89   92.0     96     82  359.0
2  Robert Williams     88   72.0      0     91  251.0
3   Jennifer Jones     92    NaN     94     99    NaN
4     Linda Wilson     84   92.0     89     94  359.0
grades.drop('sum', axis = 'columns', inplace = True)
grades
              name  exam1  exam2  exam3  exam4
0       John Smith     98   94.0     95     86
1     Mary Johnson     89   92.0     96     82
2  Robert Williams     88   72.0      0     91
3   Jennifer Jones     92    NaN     94     99
4     Linda Wilson     84   92.0     89     94
grades['sum'] = grades.sum(axis = 'columns')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[77], line 1
----> 1 grades['sum'] = grades.sum(axis = 'columns')

File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11312, in DataFrame.sum(self, axis, skipna, numeric_only, min_count, **kwargs)
  11303 @doc(make_doc("sum", ndim=2))
  11304 def sum(
  11305     self,
   (...)
  11310     **kwargs,
  11311 ):
> 11312     result = super().sum(axis, skipna, numeric_only, min_count, **kwargs)
  11313     return result.__finalize__(self, method="sum")

File C:\JumboConda\lib\site-packages\pandas\core\generic.py:12078, in NDFrame.sum(self, axis, skipna, numeric_only, min_count, **kwargs)
  12070 def sum(
  12071     self,
  12072     axis: Axis | None = 0,
   (...)
  12076     **kwargs,
  12077 ):
> 12078     return self._min_count_stat_function(
  12079         "sum", nanops.nansum, axis, skipna, numeric_only, min_count, **kwargs
  12080     )

File C:\JumboConda\lib\site-packages\pandas\core\generic.py:12061, in NDFrame._min_count_stat_function(self, name, func, axis, skipna, numeric_only, min_count, **kwargs)
  12058 elif axis is lib.no_default:
  12059     axis = 0
> 12061 return self._reduce(
  12062     func,
  12063     name=name,
  12064     axis=axis,
  12065     skipna=skipna,
  12066     numeric_only=numeric_only,
  12067     min_count=min_count,
  12068 )

File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11204, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  11200     df = df.T
  11202 # After possibly _get_data and transposing, we are now in the
  11203 #  simple case where we can use BlockManager.reduce
> 11204 res = df._mgr.reduce(blk_func)
  11205 out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
  11206 if out_dtype is not None and out.dtype != "boolean":

File C:\JumboConda\lib\site-packages\pandas\core\internals\managers.py:1459, in BlockManager.reduce(self, func)
   1457 res_blocks: list[Block] = []
   1458 for blk in self.blocks:
-> 1459     nbs = blk.reduce(func)
   1460     res_blocks.extend(nbs)
   1462 index = Index([None])  # placeholder

File C:\JumboConda\lib\site-packages\pandas\core\internals\blocks.py:377, in Block.reduce(self, func)
    371 @final
    372 def reduce(self, func) -> list[Block]:
    373     # We will apply the function and reshape the result into a single-row
    374     #  Block with the same mgr_locs; squeezing will be done at a higher level
    375     assert self.ndim == 2
--> 377     result = func(self.values)
    379     if self.values.ndim == 1:
    380         res_values = result

File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11136, in DataFrame._reduce.<locals>.blk_func(values, axis)
  11134         return np.array([result])
  11135 else:
> 11136     return op(values, axis=axis, skipna=skipna, **kwds)

File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:85, in disallow.__call__.<locals>._f(*args, **kwargs)
     81     raise TypeError(
     82         f"reduction operation '{f_name}' not allowed for this dtype"
     83     )
     84 try:
---> 85     return f(*args, **kwargs)
     86 except ValueError as e:
     87     # we want to transform an object array
     88     # ValueError message to the more typical TypeError
     89     # e.g. this is normally a disallowed function on
     90     # object arrays that contain strings
     91     if is_object_dtype(args[0]):

File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    401 if datetimelike and mask is None:
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:
    407     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:477, in maybe_operate_rowwise.<locals>.newfunc(values, axis, **kwargs)
    474         results = [func(x, **kwargs) for x in arrs]
    475     return np.array(results)
--> 477 return func(values, axis=axis, **kwargs)

File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:646, in nansum(values, axis, skipna, min_count, mask)
    643 elif dtype.kind == "m":
    644     dtype_sum = np.dtype(np.float64)
--> 646 the_sum = values.sum(axis, dtype=dtype_sum)
    647 the_sum = _maybe_null_out(the_sum, axis, mask, values.shape, min_count=min_count)
    649 return the_sum

File C:\JumboConda\lib\site-packages\numpy\core\_methods.py:49, in _sum(a, axis, dtype, out, keepdims, initial, where)
     47 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     48          initial=_NoValue, where=True):
---> 49     return umr_sum(a, axis, dtype, out, keepdims, initial, where)

TypeError: can only concatenate str (not "int") to str
grades
              name  exam1  exam2  exam3  exam4
0       John Smith     98   94.0     95     86
1     Mary Johnson     89   92.0     96     82
2  Robert Williams     88   72.0      0     91
3   Jennifer Jones     92    NaN     94     99
4     Linda Wilson     84   92.0     89     94
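
The sum fails because axis = 'columns' tries to add up every column within each row, including the string-valued name column, and mixing strings with numbers raises the TypeError above. One possible fix (a sketch, not part of the original workshop) is to sum only the exam columns:

# sum only the numeric exam columns
# (skipna is True by default, so Jennifer's missing exam2 is simply skipped)
grades[['exam1', 'exam2', 'exam3', 'exam4']].sum(axis = 'columns')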

A Better Way of Working with Messy Data#

del grades
grades = pd.read_csv('data/grades.csv', na_values = 'excused')
grades
              Name  Exam 1  exam2  Exam_3  exam4
0       John Smith      98   94.0      95     86
1     Mary Johnson      89   92.0      96     82
2  Robert Williams      88   72.0  absent     91
3   Jennifer Jones      92    NaN      94     99
4     Linda Wilson      84   92.0      89     94
grades.rename(str.lower, axis = 'columns', inplace = True)
grades.rename(columns = {'exam 1': 'exam1', 'exam_3': 'exam3'}, inplace = True)
grades
              name  exam1  exam2   exam3  exam4
0       John Smith     98   94.0      95     86
1     Mary Johnson     89   92.0      96     82
2  Robert Williams     88   72.0  absent     91
3   Jennifer Jones     92    NaN      94     99
4     Linda Wilson     84   92.0      89     94
grades.dtypes
name      object
exam1      int64
exam2    float64
exam3     object
exam4      int64
dtype: object
grades['exam3'] = pd.to_numeric(grades['exam3'], errors = 'coerce')
grades
              name  exam1  exam2  exam3  exam4
0       John Smith     98   94.0   95.0     86
1     Mary Johnson     89   92.0   96.0     82
2  Robert Williams     88   72.0    NaN     91
3   Jennifer Jones     92    NaN   94.0     99
4     Linda Wilson     84   92.0   89.0     94
grades.dtypes
name      object
exam1      int64
exam2    float64
exam3    float64
exam4      int64
dtype: object
grades['exam3'] = grades['exam3'].fillna(0)
grades['name'][2]
'Robert Williams'
grades['mean'] = grades.mean(axis = 'columns')
grades
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[130], line 1
----> 1 grades['mean'] = grades.mean(axis = 'columns')
      2 grades
      3 grades["name"]

File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11335, in DataFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  11327 @doc(make_doc("mean", ndim=2))
  11328 def mean(
  11329     self,
   (...)
  11333     **kwargs,
  11334 ):
> 11335     result = super().mean(axis, skipna, numeric_only, **kwargs)
  11336     if isinstance(result, Series):
  11337         result = result.__finalize__(self, method="mean")

File C:\JumboConda\lib\site-packages\pandas\core\generic.py:11992, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
  11985 def mean(
  11986     self,
  11987     axis: Axis | None = 0,
   (...)
  11990     **kwargs,
  11991 ) -> Series | float:
> 11992     return self._stat_function(
  11993         "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
  11994     )

File C:\JumboConda\lib\site-packages\pandas\core\generic.py:11949, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  11945 nv.validate_func(name, (), kwargs)
  11947 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 11949 return self._reduce(
  11950     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  11951 )

File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11204, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  11200     df = df.T
  11202 # After possibly _get_data and transposing, we are now in the
  11203 #  simple case where we can use BlockManager.reduce
> 11204 res = df._mgr.reduce(blk_func)
  11205 out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
  11206 if out_dtype is not None and out.dtype != "boolean":

File C:\JumboConda\lib\site-packages\pandas\core\internals\managers.py:1459, in BlockManager.reduce(self, func)
   1457 res_blocks: list[Block] = []
   1458 for blk in self.blocks:
-> 1459     nbs = blk.reduce(func)
   1460     res_blocks.extend(nbs)
   1462 index = Index([None])  # placeholder

File C:\JumboConda\lib\site-packages\pandas\core\internals\blocks.py:377, in Block.reduce(self, func)
    371 @final
    372 def reduce(self, func) -> list[Block]:
    373     # We will apply the function and reshape the result into a single-row
    374     #  Block with the same mgr_locs; squeezing will be done at a higher level
    375     assert self.ndim == 2
--> 377     result = func(self.values)
    379     if self.values.ndim == 1:
    380         res_values = result

File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11136, in DataFrame._reduce.<locals>.blk_func(values, axis)
  11134         return np.array([result])
  11135 else:
> 11136     return op(values, axis=axis, skipna=skipna, **kwds)

File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    145         result = alt(values, axis=axis, skipna=skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    401 if datetimelike and mask is None:
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:
    407     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:719, in nanmean(values, axis, skipna, mask)
    716     dtype_count = dtype
    718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 719 the_sum = values.sum(axis, dtype=dtype_sum)
    720 the_sum = _ensure_numeric(the_sum)
    722 if axis is not None and getattr(the_sum, "ndim", False):

File C:\JumboConda\lib\site-packages\numpy\core\_methods.py:49, in _sum(a, axis, dtype, out, keepdims, initial, where)
     47 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     48          initial=_NoValue, where=True):
---> 49     return umr_sum(a, axis, dtype, out, keepdims, initial, where)

TypeError: can only concatenate str (not "int") to str
grades.loc[:, 'max'] = grades.max(axis = 'columns')
grades.loc['mean'] = grades.mean(axis = 'rows')
grades.loc['max', :] = grades.max(axis = 'rows')
grades
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[87], line 1
----> 1 grades.loc[:, 'max'] = grades.max(axis = 'columns')
      2 grades.loc['mean'] = grades.mean(axis = 'rows')
      3 grades.loc['max', :] = grades.max(axis = 'rows')

File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11298, in DataFrame.max(self, axis, skipna, numeric_only, **kwargs)
  11290 @doc(make_doc("max", ndim=2))
  11291 def max(
  11292     self,
   (...)
  11296     **kwargs,
  11297 ):
> 11298     result = super().max(axis, skipna, numeric_only, **kwargs)
  11299     if isinstance(result, Series):
  11300         result = result.__finalize__(self, method="max")

File C:\JumboConda\lib\site-packages\pandas\core\generic.py:11976, in NDFrame.max(self, axis, skipna, numeric_only, **kwargs)
  11969 def max(
  11970     self,
  11971     axis: Axis | None = 0,
   (...)
  11974     **kwargs,
  11975 ):
> 11976     return self._stat_function(
  11977         "max",
  11978         nanops.nanmax,
  11979         axis,
  11980         skipna,
  11981         numeric_only,
  11982         **kwargs,
  11983     )

File C:\JumboConda\lib\site-packages\pandas\core\generic.py:11949, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
  11945 nv.validate_func(name, (), kwargs)
  11947 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 11949 return self._reduce(
  11950     func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  11951 )

File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11204, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  11200     df = df.T
  11202 # After possibly _get_data and transposing, we are now in the
  11203 #  simple case where we can use BlockManager.reduce
> 11204 res = df._mgr.reduce(blk_func)
  11205 out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
  11206 if out_dtype is not None and out.dtype != "boolean":

File C:\JumboConda\lib\site-packages\pandas\core\internals\managers.py:1459, in BlockManager.reduce(self, func)
   1457 res_blocks: list[Block] = []
   1458 for blk in self.blocks:
-> 1459     nbs = blk.reduce(func)
   1460     res_blocks.extend(nbs)
   1462 index = Index([None])  # placeholder

File C:\JumboConda\lib\site-packages\pandas\core\internals\blocks.py:377, in Block.reduce(self, func)
    371 @final
    372 def reduce(self, func) -> list[Block]:
    373     # We will apply the function and reshape the result into a single-row
    374     #  Block with the same mgr_locs; squeezing will be done at a higher level
    375     assert self.ndim == 2
--> 377     result = func(self.values)
    379     if self.values.ndim == 1:
    380         res_values = result

File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11136, in DataFrame._reduce.<locals>.blk_func(values, axis)
  11134         return np.array([result])
  11135 else:
> 11136     return op(values, axis=axis, skipna=skipna, **kwds)

File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
    145         result = alt(values, axis=axis, skipna=skipna, **kwds)
    146 else:
--> 147     result = alt(values, axis=axis, skipna=skipna, **kwds)
    149 return result

File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
    401 if datetimelike and mask is None:
    402     mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
    406 if datetimelike:
    407     result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)

File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:1092, in _nanminmax.<locals>.reduction(values, axis, skipna, mask)
   1087     return _na_for_min_count(values, axis)
   1089 values, mask = _get_values(
   1090     values, skipna, fill_value_typ=fill_value_typ, mask=mask
   1091 )
-> 1092 result = getattr(values, meth)(axis)
   1093 result = _maybe_null_out(result, axis, mask, values.shape)
   1094 return result

File C:\JumboConda\lib\site-packages\numpy\core\_methods.py:41, in _amax(a, axis, out, keepdims, initial, where)
     39 def _amax(a, axis=None, out=None, keepdims=False,
     40           initial=_NoValue, where=True):
---> 41     return umr_maximum(a, axis, None, out, keepdims, initial, where)

TypeError: '>=' not supported between instances of 'str' and 'int'
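
The row-wise mean and max fail for the same reason the sum did earlier: the name column still holds strings. One possible fix (a sketch, not part of the original workshop; grades_by_name is just an illustrative name) is to move the names into the index so that only numeric columns remain:

# index by name so only numeric columns are left for row-wise statistics
grades_by_name = grades.set_index('name')
grades_by_name['mean'] = grades_by_name.mean(axis = 'columns')
# add a summary row with the per-exam class mean
grades_by_name.loc['class mean'] = grades_by_name.mean(axis = 'rows')
grades_by_name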

Working with Real Data#

avocados = pd.read_csv('data/avocado.csv')
avocados
date average_price total_volume 4046 4225 4770 total_bags small_bags large_bags xlarge_bags type year geography
0 2015-01-04 1.22 40873.28 2819.50 28287.42 49.90 9716.46 9186.93 529.53 0.00 conventional 2015 Albany
1 2015-01-04 1.79 1373.95 57.42 153.88 0.00 1162.65 1162.65 0.00 0.00 organic 2015 Albany
2 2015-01-04 1.00 435021.49 364302.39 23821.16 82.15 46815.79 16707.15 30108.64 0.00 conventional 2015 Atlanta
3 2015-01-04 1.76 3846.69 1500.15 938.35 0.00 1408.19 1071.35 336.84 0.00 organic 2015 Atlanta
4 2015-01-04 1.08 788025.06 53987.31 552906.04 39995.03 141136.68 137146.07 3990.61 0.00 conventional 2015 Baltimore/Washington
... ... ... ... ... ... ... ... ... ... ... ... ... ...
30016 2020-05-17 1.58 2271254.00 150100.00 198457.00 5429.00 1917250.00 1121691.00 795559.00 0.00 organic 2020 Total U.S.
30017 2020-05-17 1.09 8667913.24 2081824.04 1020965.12 33410.85 5531562.87 2580802.48 2817078.77 133681.62 conventional 2020 West
30018 2020-05-17 1.71 384158.00 23455.00 39738.00 1034.00 319932.00 130051.00 189881.00 0.00 organic 2020 West
30019 2020-05-17 0.89 1240709.05 430203.10 126497.28 21104.42 662904.25 395909.35 265177.09 1817.81 conventional 2020 West Tex/New Mexico
30020 2020-05-17 1.58 36881.00 1147.00 1243.00 2645.00 31846.00 25621.00 6225.00 0.00 organic 2020 West Tex/New Mexico

30021 rows × 13 columns

avocados.head()
date average_price total_volume 4046 4225 4770 total_bags small_bags large_bags xlarge_bags type year geography
0 2015-01-04 1.22 40873.28 2819.50 28287.42 49.90 9716.46 9186.93 529.53 0.0 conventional 2015 Albany
1 2015-01-04 1.79 1373.95 57.42 153.88 0.00 1162.65 1162.65 0.00 0.0 organic 2015 Albany
2 2015-01-04 1.00 435021.49 364302.39 23821.16 82.15 46815.79 16707.15 30108.64 0.0 conventional 2015 Atlanta
3 2015-01-04 1.76 3846.69 1500.15 938.35 0.00 1408.19 1071.35 336.84 0.0 organic 2015 Atlanta
4 2015-01-04 1.08 788025.06 53987.31 552906.04 39995.03 141136.68 137146.07 3990.61 0.0 conventional 2015 Baltimore/Washington
avocados.shape
(30021, 13)
avocados.dtypes
date              object
average_price    float64
total_volume     float64
4046             float64
4225             float64
4770             float64
total_bags       float64
small_bags       float64
large_bags       float64
xlarge_bags      float64
type              object
year               int64
geography         object
dtype: object
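
Note that the date column was read in as plain strings (dtype object). If we wanted real dates, for example for time-based filtering or nicer plot axes, one possible refinement would be pd.to_datetime(); shown here without modifying the data frame:

# parse the date strings into proper datetime values
pd.to_datetime(avocados['date']).head()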

Subsetting Data using Boolean Indexing#

avocados.geography
0                      Albany
1                      Albany
2                     Atlanta
3                     Atlanta
4        Baltimore/Washington
                 ...         
30016              Total U.S.
30017                    West
30018                    West
30019     West Tex/New Mexico
30020     West Tex/New Mexico
Name: geography, Length: 30021, dtype: object
avocados.geography == 'Boston'
0        False
1        False
2        False
3        False
4        False
         ...  
30016    False
30017    False
30018    False
30019    False
30020    False
Name: geography, Length: 30021, dtype: bool
avocados[avocados.geography == 'Boston']
date average_price total_volume 4046 4225 4770 total_bags small_bags large_bags xlarge_bags type year geography
8 2015-01-04 1.02 491738.00 7193.87 396752.18 128.82 87663.13 87406.84 256.29 0.00 conventional 2015 Boston
9 2015-01-04 1.83 2192.13 8.66 939.43 0.00 1244.04 1244.04 0.00 0.00 organic 2015 Boston
116 2015-01-11 1.10 437771.89 5548.11 320577.36 121.81 111524.61 111192.88 331.73 0.00 conventional 2015 Boston
117 2015-01-11 1.94 2217.82 12.82 956.07 0.00 1248.93 1248.93 0.00 0.00 organic 2015 Boston
224 2015-01-18 1.23 401331.33 4383.76 287778.52 132.53 109036.52 108668.74 367.78 0.00 conventional 2015 Boston
... ... ... ... ... ... ... ... ... ... ... ... ... ...
29706 2020-05-03 2.10 37432.00 109.00 2350.00 0.00 34972.00 32895.00 2077.00 0.00 organic 2020 Boston
29813 2020-05-10 1.43 1124288.41 27003.57 716906.87 597.73 379780.24 281283.46 94554.55 3942.23 conventional 2020 Boston
29814 2020-05-10 2.11 42554.00 98.00 2670.00 0.00 39786.00 38008.00 1778.00 0.00 organic 2020 Boston
29921 2020-05-17 1.50 941667.98 22010.56 574986.18 533.76 344137.48 237353.97 102717.96 4065.55 conventional 2020 Boston
29922 2020-05-17 2.04 44906.00 74.00 2495.00 0.00 42337.00 40460.00 1877.00 0.00 organic 2020 Boston

556 rows × 13 columns

avocados_boston = avocados[avocados.geography == 'Boston']
avocados_boston.head(10)
date average_price total_volume 4046 4225 4770 total_bags small_bags large_bags xlarge_bags type year geography
8 2015-01-04 1.02 491738.00 7193.87 396752.18 128.82 87663.13 87406.84 256.29 0.0 conventional 2015 Boston
9 2015-01-04 1.83 2192.13 8.66 939.43 0.00 1244.04 1244.04 0.00 0.0 organic 2015 Boston
116 2015-01-11 1.10 437771.89 5548.11 320577.36 121.81 111524.61 111192.88 331.73 0.0 conventional 2015 Boston
117 2015-01-11 1.94 2217.82 12.82 956.07 0.00 1248.93 1248.93 0.00 0.0 organic 2015 Boston
224 2015-01-18 1.23 401331.33 4383.76 287778.52 132.53 109036.52 108668.74 367.78 0.0 conventional 2015 Boston
225 2015-01-18 2.00 2209.34 4.22 834.34 0.00 1370.78 1370.78 0.00 0.0 organic 2015 Boston
332 2015-01-25 1.17 409343.56 4730.77 288094.65 45.98 116472.16 115538.05 934.11 0.0 conventional 2015 Boston
333 2015-01-25 2.01 1948.28 3.14 760.39 0.00 1184.75 1184.75 0.00 0.0 organic 2015 Boston
440 2015-02-01 1.22 490022.14 6082.88 373452.06 272.81 110214.39 109900.93 313.46 0.0 conventional 2015 Boston
441 2015-02-01 1.78 2943.85 12.18 1308.94 0.00 1622.73 1622.73 0.00 0.0 organic 2015 Boston
avocados_boston_copy = avocados[avocados.geography == 'Boston'].copy()
avocados_boston_copy.head(10)
date average_price total_volume 4046 4225 4770 total_bags small_bags large_bags xlarge_bags type year geography
8 2015-01-04 1.02 491738.00 7193.87 396752.18 128.82 87663.13 87406.84 256.29 0.0 conventional 2015 Boston
9 2015-01-04 1.83 2192.13 8.66 939.43 0.00 1244.04 1244.04 0.00 0.0 organic 2015 Boston
116 2015-01-11 1.10 437771.89 5548.11 320577.36 121.81 111524.61 111192.88 331.73 0.0 conventional 2015 Boston
117 2015-01-11 1.94 2217.82 12.82 956.07 0.00 1248.93 1248.93 0.00 0.0 organic 2015 Boston
224 2015-01-18 1.23 401331.33 4383.76 287778.52 132.53 109036.52 108668.74 367.78 0.0 conventional 2015 Boston
225 2015-01-18 2.00 2209.34 4.22 834.34 0.00 1370.78 1370.78 0.00 0.0 organic 2015 Boston
332 2015-01-25 1.17 409343.56 4730.77 288094.65 45.98 116472.16 115538.05 934.11 0.0 conventional 2015 Boston
333 2015-01-25 2.01 1948.28 3.14 760.39 0.00 1184.75 1184.75 0.00 0.0 organic 2015 Boston
440 2015-02-01 1.22 490022.14 6082.88 373452.06 272.81 110214.39 109900.93 313.46 0.0 conventional 2015 Boston
441 2015-02-01 1.78 2943.85 12.18 1308.94 0.00 1622.73 1622.73 0.00 0.0 organic 2015 Boston
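
The difference between the two data frames above is that avocados_boston may be either a view or a copy of avocados (pandas does not guarantee which), whereas avocados_boston_copy is an explicit, independent copy. Writing to the explicit copy is therefore unambiguous and will not trigger the SettingWithCopyWarning we saw earlier. A small sketch (the price_cents column is purely hypothetical):

# adding a column to the explicit copy never touches the original avocados frame
avocados_boston_copy['price_cents'] = avocados_boston_copy['average_price'] * 100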
np.mean(avocados_boston.average_price[avocados_boston.year == 2019])
1.5385576923076922
mean_2019 = np.mean(avocados_boston.average_price[avocados_boston.year == 2019])
print("The average price for avocados in the Boston area in the year 2019 was: $", round(mean_2019, 2))
The average price for avocados in the Boston area in the year 2019 was: $ 1.54
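
Boolean conditions can also be combined with & (and) and | (or), as long as each condition is wrapped in parentheses. A small sketch (the organic_boston_2019 name is just illustrative):

# organic avocados sold in Boston during 2019
organic_boston_2019 = avocados[(avocados.geography == 'Boston') &
                               (avocados.type == 'organic') &
                               (avocados.year == 2019)]
organic_boston_2019.average_price.mean()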

Creating Plots#

plt.plot(avocados_boston.date, avocados_boston.average_price)
plt.show()
plt.figure(figsize = (20, 8))
plt.plot(avocados_boston.date, avocados_boston.average_price, color = 'green', linestyle = '--', marker = 'o')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()
avocados_boston.plot(x = 'date', y = 'average_price', figsize = (18, 8), kind='line', color = 'green')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()
avocados_boston[avocados_boston.year == 2019].plot(x = 'date', y = 'average_price', figsize = (18, 8), kind='line', color = 'green')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()
plt.hist(avocados.average_price)
plt.xlabel('Price')
plt.show()
sns.histplot(avocados.average_price, color = 'r', kde = True)
<Axes: xlabel='average_price', ylabel='Count'>
sns.histplot(avocados.average_price[avocados.year == 2019], color = 'r', kde = True)
<Axes: xlabel='average_price', ylabel='Count'>
sns.histplot(avocados.average_price[avocados.geography == 'Boston'], color = 'r', kde = True)
<Axes: xlabel='average_price', ylabel='Count'>
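
Seaborn can also split a distribution by a categorical column via the hue argument, for example to compare conventional and organic prices. A sketch, not part of the original notebook:

# overlay the price distributions of conventional and organic avocados
sns.histplot(data = avocados, x = 'average_price', hue = 'type', kde = True)
plt.show()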

Combining Datasets and Long vs Wide Data#

pop = pd.read_csv('data/population.csv')
pop
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 Aruba ABW Population, total SP.POP.TOTL 54211.0 55438.0 56225.0 56695.0 57032.0 57360.0 ... 101669.0 102046.0 102560.0 103159.0 103774.0 104341.0 104872.0 105366.0 105845.0 106314.0
1 Afghanistan AFG Population, total SP.POP.TOTL 8996973.0 9169410.0 9351441.0 9543205.0 9744781.0 9956320.0 ... 29185507.0 30117413.0 31161376.0 32269589.0 33370794.0 34413603.0 35383128.0 36296400.0 37172386.0 38041754.0
2 Angola AGO Population, total SP.POP.TOTL 5454933.0 5531472.0 5608539.0 5679458.0 5735044.0 5770570.0 ... 23356246.0 24220661.0 25107931.0 26015780.0 26941779.0 27884381.0 28842484.0 29816748.0 30809762.0 31825295.0
3 Albania ALB Population, total SP.POP.TOTL 1608800.0 1659800.0 1711319.0 1762621.0 1814135.0 1864791.0 ... 2913021.0 2905195.0 2900401.0 2895092.0 2889104.0 2880703.0 2876101.0 2873457.0 2866376.0 2854191.0
4 Andorra AND Population, total SP.POP.TOTL 13411.0 14375.0 15370.0 16412.0 17469.0 18549.0 ... 84449.0 83747.0 82427.0 80774.0 79213.0 78011.0 77297.0 77001.0 77006.0 77142.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
259 Kosovo XKX Population, total SP.POP.TOTL 947000.0 966000.0 994000.0 1022000.0 1050000.0 1078000.0 ... 1775680.0 1791000.0 1807106.0 1818117.0 1812771.0 1788196.0 1777557.0 1791003.0 1797085.0 1794248.0
260 Yemen, Rep. YEM Population, total SP.POP.TOTL 5315355.0 5393036.0 5473671.0 5556766.0 5641597.0 5727751.0 ... 23154855.0 23807588.0 24473178.0 25147109.0 25823485.0 26497889.0 27168210.0 27834821.0 28498687.0 29161922.0
261 South Africa ZAF Population, total SP.POP.TOTL 17099840.0 17524533.0 17965725.0 18423161.0 18896307.0 19384841.0 ... 51216964.0 52004172.0 52834005.0 53689236.0 54545991.0 55386367.0 56203654.0 57000451.0 57779622.0 58558270.0
262 Zambia ZMB Population, total SP.POP.TOTL 3070776.0 3164329.0 3260650.0 3360104.0 3463213.0 3570464.0 ... 13605984.0 14023193.0 14465121.0 14926504.0 15399753.0 15879361.0 16363507.0 16853688.0 17351822.0 17861030.0
263 Zimbabwe ZWE Population, total SP.POP.TOTL 3776681.0 3905034.0 4039201.0 4178726.0 4322861.0 4471177.0 ... 12697723.0 12894316.0 13115131.0 13350356.0 13586681.0 13814629.0 14030390.0 14236745.0 14439018.0 14645468.0

264 rows × 64 columns

pop.drop(labels = ['Indicator Name', 'Indicator Code'], axis = 1, inplace = True)
pop
Country Name Country Code 1960 1961 1962 1963 1964 1965 1966 1967 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 Aruba ABW 54211.0 55438.0 56225.0 56695.0 57032.0 57360.0 57715.0 58055.0 ... 101669.0 102046.0 102560.0 103159.0 103774.0 104341.0 104872.0 105366.0 105845.0 106314.0
1 Afghanistan AFG 8996973.0 9169410.0 9351441.0 9543205.0 9744781.0 9956320.0 10174836.0 10399926.0 ... 29185507.0 30117413.0 31161376.0 32269589.0 33370794.0 34413603.0 35383128.0 36296400.0 37172386.0 38041754.0
2 Angola AGO 5454933.0 5531472.0 5608539.0 5679458.0 5735044.0 5770570.0 5781214.0 5774243.0 ... 23356246.0 24220661.0 25107931.0 26015780.0 26941779.0 27884381.0 28842484.0 29816748.0 30809762.0 31825295.0
3 Albania ALB 1608800.0 1659800.0 1711319.0 1762621.0 1814135.0 1864791.0 1914573.0 1965598.0 ... 2913021.0 2905195.0 2900401.0 2895092.0 2889104.0 2880703.0 2876101.0 2873457.0 2866376.0 2854191.0
4 Andorra AND 13411.0 14375.0 15370.0 16412.0 17469.0 18549.0 19647.0 20758.0 ... 84449.0 83747.0 82427.0 80774.0 79213.0 78011.0 77297.0 77001.0 77006.0 77142.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
259 Kosovo XKX 947000.0 966000.0 994000.0 1022000.0 1050000.0 1078000.0 1106000.0 1135000.0 ... 1775680.0 1791000.0 1807106.0 1818117.0 1812771.0 1788196.0 1777557.0 1791003.0 1797085.0 1794248.0
260 Yemen, Rep. YEM 5315355.0 5393036.0 5473671.0 5556766.0 5641597.0 5727751.0 5816247.0 5907874.0 ... 23154855.0 23807588.0 24473178.0 25147109.0 25823485.0 26497889.0 27168210.0 27834821.0 28498687.0 29161922.0
261 South Africa ZAF 17099840.0 17524533.0 17965725.0 18423161.0 18896307.0 19384841.0 19888250.0 20406864.0 ... 51216964.0 52004172.0 52834005.0 53689236.0 54545991.0 55386367.0 56203654.0 57000451.0 57779622.0 58558270.0
262 Zambia ZMB 3070776.0 3164329.0 3260650.0 3360104.0 3463213.0 3570464.0 3681955.0 3797873.0 ... 13605984.0 14023193.0 14465121.0 14926504.0 15399753.0 15879361.0 16363507.0 16853688.0 17351822.0 17861030.0
263 Zimbabwe ZWE 3776681.0 3905034.0 4039201.0 4178726.0 4322861.0 4471177.0 4623351.0 4779827.0 ... 12697723.0 12894316.0 13115131.0 13350356.0 13586681.0 13814629.0 14030390.0 14236745.0 14439018.0 14645468.0

264 rows × 62 columns

pop_long = pop.melt(id_vars = ['Country Name', 'Country Code'], var_name = 'Year', value_name = 'Population')
pop_long
Country Name Country Code Year Population
0 Aruba ABW 1960 54211.0
1 Afghanistan AFG 1960 8996973.0
2 Angola AGO 1960 5454933.0
3 Albania ALB 1960 1608800.0
4 Andorra AND 1960 13411.0
... ... ... ... ...
15835 Kosovo XKX 2019 1794248.0
15836 Yemen, Rep. YEM 2019 29161922.0
15837 South Africa ZAF 2019 58558270.0
15838 Zambia ZMB 2019 17861030.0
15839 Zimbabwe ZWE 2019 14645468.0

15840 rows × 4 columns

gdp = pd.read_csv('data/gdp.csv').drop(labels = ['Indicator Name', 'Indicator Code'], axis = 1)
gdp
Country Name Country Code 1960 1961 1962 1963 1964 1965 1966 1967 ... 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
0 Aruba ABW NaN NaN NaN NaN NaN NaN NaN NaN ... 2.390503e+09 2.549721e+09 2.534637e+09 2.701676e+09 2.765363e+09 2.919553e+09 2.965922e+09 3.056425e+09 NaN NaN
1 Afghanistan AFG 5.377778e+08 5.488889e+08 5.466667e+08 7.511112e+08 8.000000e+08 1.006667e+09 1.400000e+09 1.673333e+09 ... 1.585657e+10 1.780429e+10 2.000160e+10 2.056107e+10 2.048489e+10 1.990711e+10 1.936264e+10 2.019176e+10 1.948438e+10 1.910135e+10
2 Angola AGO NaN NaN NaN NaN NaN NaN NaN NaN ... 8.379950e+10 1.117900e+11 1.280530e+11 1.367100e+11 1.457120e+11 1.161940e+11 1.011240e+11 1.221240e+11 1.013530e+11 9.463542e+10
3 Albania ALB NaN NaN NaN NaN NaN NaN NaN NaN ... 1.192693e+10 1.289077e+10 1.231983e+10 1.277622e+10 1.322814e+10 1.138685e+10 1.186120e+10 1.301969e+10 1.514702e+10 1.527808e+10
4 Andorra AND NaN NaN NaN NaN NaN NaN NaN NaN ... 3.449967e+09 3.629204e+09 3.188809e+09 3.193704e+09 3.271808e+09 2.789870e+09 2.896679e+09 3.000181e+09 3.218316e+09 3.154058e+09
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
259 Kosovo XKX NaN NaN NaN NaN NaN NaN NaN NaN ... 5.835874e+09 6.701698e+09 6.499807e+09 7.074778e+09 7.396705e+09 6.442916e+09 6.719172e+09 7.245707e+09 7.942962e+09 7.926108e+09
260 Yemen, Rep. YEM NaN NaN NaN NaN NaN NaN NaN NaN ... 3.090675e+10 3.272642e+10 3.540134e+10 4.041524e+10 4.320647e+10 3.697620e+10 2.808468e+10 2.456133e+10 2.759126e+10 NaN
261 South Africa ZAF 7.575397e+09 7.972997e+09 8.497997e+09 9.423396e+09 1.037400e+10 1.133440e+10 1.235500e+10 1.377739e+10 ... 3.753490e+11 4.164190e+11 3.963330e+11 3.668290e+11 3.509050e+11 3.176210e+11 2.963570e+11 3.495540e+11 3.682890e+11 3.514320e+11
262 Zambia ZMB 7.130000e+08 6.962857e+08 6.931429e+08 7.187143e+08 8.394286e+08 1.082857e+09 1.264286e+09 1.368000e+09 ... 2.026556e+10 2.345952e+10 2.550306e+10 2.804551e+10 2.715065e+10 2.124335e+10 2.095476e+10 2.586814e+10 2.700524e+10 2.306472e+10
263 Zimbabwe ZWE 1.052990e+09 1.096647e+09 1.117602e+09 1.159512e+09 1.217138e+09 1.311436e+09 1.281750e+09 1.397002e+09 ... 1.204166e+10 1.410192e+10 1.711485e+10 1.909102e+10 1.949552e+10 1.996312e+10 2.054868e+10 2.204090e+10 2.431156e+10 2.144076e+10

264 rows × 62 columns

gdp_long = gdp.melt(id_vars = ['Country Name', 'Country Code'], var_name = 'Year', value_name = 'GDP')
gdp_long
Country Name Country Code Year GDP
0 Aruba ABW 1960 NaN
1 Afghanistan AFG 1960 5.377778e+08
2 Angola AGO 1960 NaN
3 Albania ALB 1960 NaN
4 Andorra AND 1960 NaN
... ... ... ... ...
15835 Kosovo XKX 2019 7.926108e+09
15836 Yemen, Rep. YEM 2019 NaN
15837 South Africa ZAF 2019 3.514320e+11
15838 Zambia ZMB 2019 2.306472e+10
15839 Zimbabwe ZWE 2019 2.144076e+10

15840 rows × 4 columns

countries = pop_long.merge(gdp_long.drop(labels = ['Country Name'], axis = 1), 
                           on = ['Country Code', 'Year'], how = 'inner')
countries
Country Name Country Code Year Population GDP
0 Aruba ABW 1960 54211.0 NaN
1 Afghanistan AFG 1960 8996973.0 5.377778e+08
2 Angola AGO 1960 5454933.0 NaN
3 Albania ALB 1960 1608800.0 NaN
4 Andorra AND 1960 13411.0 NaN
... ... ... ... ... ...
15835 Kosovo XKX 2019 1794248.0 7.926108e+09
15836 Yemen, Rep. YEM 2019 29161922.0 NaN
15837 South Africa ZAF 2019 58558270.0 3.514320e+11
15838 Zambia ZMB 2019 17861030.0 2.306472e+10
15839 Zimbabwe ZWE 2019 14645468.0 2.144076e+10

15840 rows × 5 columns

countries['GDP per capita'] = countries['GDP'] / countries['Population']
countries
Country Name Country Code Year Population GDP GDP per capita
0 Aruba ABW 1960 54211.0 NaN NaN
1 Afghanistan AFG 1960 8996973.0 5.377778e+08 59.773194
2 Angola AGO 1960 5454933.0 NaN NaN
3 Albania ALB 1960 1608800.0 NaN NaN
4 Andorra AND 1960 13411.0 NaN NaN
... ... ... ... ... ... ...
15835 Kosovo XKX 2019 1794248.0 7.926108e+09 4417.509940
15836 Yemen, Rep. YEM 2019 29161922.0 NaN NaN
15837 South Africa ZAF 2019 58558270.0 3.514320e+11 6001.406804
15838 Zambia ZMB 2019 17861030.0 2.306472e+10 1291.343357
15839 Zimbabwe ZWE 2019 14645468.0 2.144076e+10 1463.985910

15840 rows × 6 columns
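
As a quick sanity check, one might look at the entries with the highest GDP per capita in 2019. Note that the year values are strings, since they started life as column names in the wide table; this sketch assumes that fact:

# ten largest GDP-per-capita entries for 2019 (includes regional aggregates)
countries[countries['Year'] == '2019'].nlargest(10, 'GDP per capita')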

# pick ten distinct country codes at random (sample without replacement)
random_countries = np.random.choice(countries['Country Code'].unique(), 10, replace = False)
countries_select = countries[countries['Country Code'].isin(random_countries)]
countries_select
Country Name Country Code Year Population GDP GDP per capita
0 Aruba ABW 1960 5.421100e+04 NaN NaN
3 Albania ALB 1960 1.608800e+06 NaN NaN
6 United Arab Emirates ARE 1960 9.241800e+04 NaN NaN
66 Euro area EMU 1960 2.652039e+08 2.448960e+11 923.425216
67 Eritrea ERI 1960 1.007590e+06 NaN NaN
... ... ... ... ... ... ...
15718 Luxembourg LUX 2019 6.198960e+05 7.110492e+10 114704.594171
15757 Other small states OSS 2019 3.136041e+07 4.384680e+11 13981.578747
15775 Romania ROU 2019 1.935654e+07 2.500770e+11 12919.506705
15778 South Asia SAS 2019 1.835777e+09 3.597970e+12 1959.916976
15803 Chad TCD 2019 1.594688e+07 1.131495e+10 709.540310

600 rows × 6 columns

for name, data in countries_select.groupby('Country Name'):
    data.plot(x = 'Year', y = 'GDP per capita', label = name, figsize = (18, 8), ax = plt.gca())
plt.show()
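
Finally, melt() took us from wide to long format. The reverse trip, from long back to wide, can be done with pivot(); a minimal sketch using the population data (pop_wide is just an illustrative name):

# reshape the long population table back into one column per year
pop_wide = pop_long.pivot(index = 'Country Name', columns = 'Year', values = 'Population')
pop_wide.head()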