Python for Data Analysis and Visualization#
A Tufts University Data Lab Workshop
Written by Uku-Kaspar Uustalu
Python resources: go.tufts.edu/python
Questions: datalab-support@elist.tufts.edu
Feedback: uku-kaspar.uustalu@tufts.edu
Importing Libraries#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 3
1 import numpy as np
2 import pandas as pd
----> 3 import matplotlib.pyplot as plt
4 import seaborn as sns
ModuleNotFoundError: No module named 'matplotlib'
Quick Overview of Matplotlib#
Matplotlib works in a layered fashion. First you define your plot using plt.plot(x, y, ...)
, then you can use additional plt
methods to add more layers to your plot or modify its appearance. Finally, you use plt.show()
to show the plot or plt.savefig()
to save it to an external file. Let’s see how Matplotlib works in practice by creating some trigonometric plots.
x = np.linspace(0, 2 * np.pi, num = 20)
y = np.sin(x)
plt.plot(x, y)
plt.show()
plt.plot()
takes additional arguments that modify the appearance of the plot. See the documentation for details: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html
# we can specify the style of the plot using named arguments
plt.plot(x, y, color = 'red', linestyle = '--', marker = 'o')
plt.show()
# or we could use a shorthand string
plt.plot(x, y, 'r--o')
plt.show()
We can easily add additional layers and stylistic elements to the plot.
plt.plot(x, y, 'r--o')
plt.plot(x, np.cos(x), 'b-*')
plt.title('Sin and Cos')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(['sin', 'cos'])
plt.show()
Note that if we only supply one array as an input to plt.plot()
, it uses the values of the array as y
values and uses the indices of the array as x
values.
plt.plot([2, 3, 6, 4, 8, 9, 5, 7, 1])
plt.show()
If we want to create a figure with several subplots, we can use plt.subplots()
to create a grid of subplots. It takes the dimensions of the subplot grid as input plt.subplots(rows, columns)
and returns tow objects. The first is a figure object and the second is a NumPy array containing the subplots. In Matplotlib, subplots are often called axes.
# create a more fine-grained array to work with
a = np.linspace(0, 2 * np.pi, num = 100)
# create a two-by-two grid for our subplots
fig, ax = plt.subplots(2, 2)
# create subplots
ax[0, 0].plot(a, np.sin(a)) # upper-left
ax[0, 1].plot(a, np.cos(a)) # upper-right
ax[1, 0].plot(a, np.tan(a)) # bottom-left
ax[1, 1].plot(a, -a) # bottom-right
# show figure
plt.show()
A more MATLAB-esque way of creating subplots would be to use the alternative plt.subplot()
method. Using this method, you can define subplot using a three-number combination plt.subplot(rows, columns, index)
. The indexes of the subplots defined using this method increase in row-major order and, in true MATLAB fashion, begin with one.
plt.subplot(2, 2, 1) # upper-left
plt.plot(a, np.sin(a))
plt.subplot(2, 2, 2) # upper-right
plt.plot(a, np.cos(a))
plt.subplot(2, 2, 3) # bottom-left
plt.plot(a, np.tan(a))
plt.subplot(2, 2, 4) # bottom-right
plt.plot(a, -a)
plt.show()
Working with Messy Data#
grades = pd.read_csv('data/grades.csv')
grades
Name | Exam 1 | exam2 | Exam_3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94 | 95 | 86 |
1 | Mary Johnson | 89 | 92 | 96 | 82 |
2 | Robert Williams | 88 | 72 | absent | 91 |
3 | Jennifer Jones | 92 | excused | 94 | 99 |
4 | Linda Wilson | 84 | 92 | 89 | 94 |
print(grades)
Name Exam 1 exam2 Exam_3 exam4
0 John Smith 98 94 95 86
1 Mary Johnson 89 92 96 82
2 Robert Williams 88 72 absent 91
3 Jennifer Jones 92 excused 94 99
4 Linda Wilson 84 92 89 94
Cleaning Column Names#
grades.rename(str.lower, axis = 'columns')
name | exam 1 | exam2 | exam_3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94 | 95 | 86 |
1 | Mary Johnson | 89 | 92 | 96 | 82 |
2 | Robert Williams | 88 | 72 | absent | 91 |
3 | Jennifer Jones | 92 | excused | 94 | 99 |
4 | Linda Wilson | 84 | 92 | 89 | 94 |
grades
Name | Exam 1 | exam2 | Exam_3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94 | 95 | 86 |
1 | Mary Johnson | 89 | 92 | 96 | 82 |
2 | Robert Williams | 88 | 72 | absent | 91 |
3 | Jennifer Jones | 92 | excused | 94 | 99 |
4 | Linda Wilson | 84 | 92 | 89 | 94 |
grades = grades.rename(str.lower, axis = 'columns')
grades
name | exam 1 | exam2 | exam_3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94 | 95 | 86 |
1 | Mary Johnson | 89 | 92 | 96 | 82 |
2 | Robert Williams | 88 | 72 | absent | 91 |
3 | Jennifer Jones | 92 | excused | 94 | 99 |
4 | Linda Wilson | 84 | 92 | 89 | 94 |
grades.rename(columns = {'exam 1': 'exam1', 'exam_3': 'exam3'}, inplace = True)
grades
name | exam1 | exam2 | exam3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94 | 95 | 86 |
1 | Mary Johnson | 89 | 92 | 96 | 82 |
2 | Robert Williams | 88 | 72 | absent | 91 |
3 | Jennifer Jones | 92 | excused | 94 | 99 |
4 | Linda Wilson | 84 | 92 | 89 | 94 |
Indexing and Datatypes#
grades.dtypes
name object
exam1 int64
exam2 object
exam3 object
exam4 int64
dtype: object
grades['name']
0 John Smith
1 Mary Johnson
2 Robert Williams
3 Jennifer Jones
4 Linda Wilson
Name: name, dtype: object
grades.name
0 John Smith
1 Mary Johnson
2 Robert Williams
3 Jennifer Jones
4 Linda Wilson
Name: name, dtype: object
grades[['name']]
name | |
---|---|
0 | John Smith |
1 | Mary Johnson |
2 | Robert Williams |
3 | Jennifer Jones |
4 | Linda Wilson |
grades['name'][0]
'John Smith'
grades['name'][1]
'Mary Johnson'
grades.name[1]
'Mary Johnson'
grades['exam1'][1]
89
grades.exam2[1]
'92'
grades['exam3'][1]
'96'
grades.exam4[1]
82
grades
name | exam1 | exam2 | exam3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94 | 95 | 86 |
1 | Mary Johnson | 89 | 92 | 96 | 82 |
2 | Robert Williams | 88 | 72 | absent | 91 |
3 | Jennifer Jones | 92 | excused | 94 | 99 |
4 | Linda Wilson | 84 | 92 | 89 | 94 |
print(type(grades['exam3'][0]))
print(type(grades['exam3'][1]))
print(type(grades['exam3'][2]))
<class 'str'>
<class 'str'>
<class 'str'>
Assigning Values and Working with Missing Data#
grades['exam3'][2] = 0
C:\Users\bzhou01\AppData\Local\Temp\ipykernel_22684\2777045509.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
grades['exam3'][2] = 0
Oh no, a really scary warning! What is happening?
Because Python uses something called pass-by-object-reference and does a lot of optimization in the background, the end user (that is you) has little to no control over whether thay are referencing the original object or a copy. This warning is just Pandas letting us know that when using chained indexing to write a value, the behaviour is undefined, meaning that pandas cannot be sure wheter you are are writing to the original data frame or a temporary copy.
To learn more: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
grades
name | exam1 | exam2 | exam3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94 | 95 | 86 |
1 | Mary Johnson | 89 | 92 | 96 | 82 |
2 | Robert Williams | 88 | 72 | 0 | 91 |
3 | Jennifer Jones | 92 | excused | 94 | 99 |
4 | Linda Wilson | 84 | 92 | 89 | 94 |
Phew, this time we got lucky. However, with a differet data frame the same approach might actually write the changes to a temporary copy and leave the original data frame unchanged. Chained indexing is dangerous and you should avoid using it to write values. What should we use instead?
There are a lot of options: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
To write a singe value, use
.at[row, column]
To write a range of values (or a single value), use
.loc[row(s), column(s)]
Note that .at
and .loc
use row and column labels. The numbers 0, 1, 2, 3, and 4 that we see in front of the rows are actually row labels. By default, row labels match row indexes in pandas. However, quite often you will work with rows that have actual labels. Sometimes those labels might be numeric and resemble indexes, which leads to confusion and error. Hence, if you want to make sure you are using indexes, not labels, use .iat
and .iloc
instead.
grades.at[2, 'exam3']
0
grades.loc[3, 'exam2'] = np.NaN
grades
name | exam1 | exam2 | exam3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94 | 95 | 86 |
1 | Mary Johnson | 89 | 92 | 96 | 82 |
2 | Robert Williams | 88 | 72 | 0 | 91 |
3 | Jennifer Jones | 92 | NaN | 94 | 99 |
4 | Linda Wilson | 84 | 92 | 89 | 94 |
grades.dtypes
name object
exam1 int64
exam2 object
exam3 object
exam4 int64
dtype: object
grades['exam2'][0]
'94'
grades.exam3[0]
'95'
grades['exam2'] = pd.to_numeric(grades['exam2'])
grades['exam3'] = pd.to_numeric(grades.exam3)
grades
name | exam1 | exam2 | exam3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94.0 | 95 | 86 |
1 | Mary Johnson | 89 | 92.0 | 96 | 82 |
2 | Robert Williams | 88 | 72.0 | 0 | 91 |
3 | Jennifer Jones | 92 | NaN | 94 | 99 |
4 | Linda Wilson | 84 | 92.0 | 89 | 94 |
grades.dtypes
name object
exam1 int64
exam2 float64
exam3 int64
exam4 int64
dtype: object
Aggregating Data#
grades['sum'] = grades['exam1'] + grades['exam2'] + grades['exam3'] + grades['exam4']
grades
name | exam1 | exam2 | exam3 | exam4 | sum | |
---|---|---|---|---|---|---|
0 | John Smith | 98 | 94.0 | 95 | 86 | 373.0 |
1 | Mary Johnson | 89 | 92.0 | 96 | 82 | 359.0 |
2 | Robert Williams | 88 | 72.0 | 0 | 91 | 251.0 |
3 | Jennifer Jones | 92 | NaN | 94 | 99 | NaN |
4 | Linda Wilson | 84 | 92.0 | 89 | 94 | 359.0 |
grades.drop('sum', axis = 'columns', inplace = True)
grades
name | exam1 | exam2 | exam3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94.0 | 95 | 86 |
1 | Mary Johnson | 89 | 92.0 | 96 | 82 |
2 | Robert Williams | 88 | 72.0 | 0 | 91 |
3 | Jennifer Jones | 92 | NaN | 94 | 99 |
4 | Linda Wilson | 84 | 92.0 | 89 | 94 |
grades['sum'] = grades.sum(axis = 'columns')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[77], line 1
----> 1 grades['sum'] = grades.sum(axis = 'columns')
File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11312, in DataFrame.sum(self, axis, skipna, numeric_only, min_count, **kwargs)
11303 @doc(make_doc("sum", ndim=2))
11304 def sum(
11305 self,
(...)
11310 **kwargs,
11311 ):
> 11312 result = super().sum(axis, skipna, numeric_only, min_count, **kwargs)
11313 return result.__finalize__(self, method="sum")
File C:\JumboConda\lib\site-packages\pandas\core\generic.py:12078, in NDFrame.sum(self, axis, skipna, numeric_only, min_count, **kwargs)
12070 def sum(
12071 self,
12072 axis: Axis | None = 0,
(...)
12076 **kwargs,
12077 ):
> 12078 return self._min_count_stat_function(
12079 "sum", nanops.nansum, axis, skipna, numeric_only, min_count, **kwargs
12080 )
File C:\JumboConda\lib\site-packages\pandas\core\generic.py:12061, in NDFrame._min_count_stat_function(self, name, func, axis, skipna, numeric_only, min_count, **kwargs)
12058 elif axis is lib.no_default:
12059 axis = 0
> 12061 return self._reduce(
12062 func,
12063 name=name,
12064 axis=axis,
12065 skipna=skipna,
12066 numeric_only=numeric_only,
12067 min_count=min_count,
12068 )
File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11204, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
11200 df = df.T
11202 # After possibly _get_data and transposing, we are now in the
11203 # simple case where we can use BlockManager.reduce
> 11204 res = df._mgr.reduce(blk_func)
11205 out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
11206 if out_dtype is not None and out.dtype != "boolean":
File C:\JumboConda\lib\site-packages\pandas\core\internals\managers.py:1459, in BlockManager.reduce(self, func)
1457 res_blocks: list[Block] = []
1458 for blk in self.blocks:
-> 1459 nbs = blk.reduce(func)
1460 res_blocks.extend(nbs)
1462 index = Index([None]) # placeholder
File C:\JumboConda\lib\site-packages\pandas\core\internals\blocks.py:377, in Block.reduce(self, func)
371 @final
372 def reduce(self, func) -> list[Block]:
373 # We will apply the function and reshape the result into a single-row
374 # Block with the same mgr_locs; squeezing will be done at a higher level
375 assert self.ndim == 2
--> 377 result = func(self.values)
379 if self.values.ndim == 1:
380 res_values = result
File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11136, in DataFrame._reduce.<locals>.blk_func(values, axis)
11134 return np.array([result])
11135 else:
> 11136 return op(values, axis=axis, skipna=skipna, **kwds)
File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:85, in disallow.__call__.<locals>._f(*args, **kwargs)
81 raise TypeError(
82 f"reduction operation '{f_name}' not allowed for this dtype"
83 )
84 try:
---> 85 return f(*args, **kwargs)
86 except ValueError as e:
87 # we want to transform an object array
88 # ValueError message to the more typical TypeError
89 # e.g. this is normally a disallowed function on
90 # object arrays that contain strings
91 if is_object_dtype(args[0]):
File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
401 if datetimelike and mask is None:
402 mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
406 if datetimelike:
407 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)
File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:477, in maybe_operate_rowwise.<locals>.newfunc(values, axis, **kwargs)
474 results = [func(x, **kwargs) for x in arrs]
475 return np.array(results)
--> 477 return func(values, axis=axis, **kwargs)
File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:646, in nansum(values, axis, skipna, min_count, mask)
643 elif dtype.kind == "m":
644 dtype_sum = np.dtype(np.float64)
--> 646 the_sum = values.sum(axis, dtype=dtype_sum)
647 the_sum = _maybe_null_out(the_sum, axis, mask, values.shape, min_count=min_count)
649 return the_sum
File C:\JumboConda\lib\site-packages\numpy\core\_methods.py:49, in _sum(a, axis, dtype, out, keepdims, initial, where)
47 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
48 initial=_NoValue, where=True):
---> 49 return umr_sum(a, axis, dtype, out, keepdims, initial, where)
TypeError: can only concatenate str (not "int") to str
grades
name | exam1 | exam2 | exam3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94.0 | 95 | 86 |
1 | Mary Johnson | 89 | 92.0 | 96 | 82 |
2 | Robert Williams | 88 | 72.0 | 0 | 91 |
3 | Jennifer Jones | 92 | NaN | 94 | 99 |
4 | Linda Wilson | 84 | 92.0 | 89 | 94 |
A Better Way of Working with Messy Data#
del grades
grades = pd.read_csv('data/grades.csv', na_values = 'excused')
grades
Name | Exam 1 | exam2 | Exam_3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94.0 | 95 | 86 |
1 | Mary Johnson | 89 | 92.0 | 96 | 82 |
2 | Robert Williams | 88 | 72.0 | absent | 91 |
3 | Jennifer Jones | 92 | NaN | 94 | 99 |
4 | Linda Wilson | 84 | 92.0 | 89 | 94 |
grades.rename(str.lower, axis = 'columns', inplace = True)
grades.rename(columns = {'exam 1': 'exam1', 'exam_3': 'exam3'}, inplace = True)
grades
name | exam1 | exam2 | exam3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94.0 | 95 | 86 |
1 | Mary Johnson | 89 | 92.0 | 96 | 82 |
2 | Robert Williams | 88 | 72.0 | absent | 91 |
3 | Jennifer Jones | 92 | NaN | 94 | 99 |
4 | Linda Wilson | 84 | 92.0 | 89 | 94 |
grades.dtypes
name object
exam1 int64
exam2 float64
exam3 object
exam4 int64
dtype: object
grades['exam3'] = pd.to_numeric(grades['exam3'], errors = 'coerce')
grades
name | exam1 | exam2 | exam3 | exam4 | |
---|---|---|---|---|---|
0 | John Smith | 98 | 94.0 | 95.0 | 86 |
1 | Mary Johnson | 89 | 92.0 | 96.0 | 82 |
2 | Robert Williams | 88 | 72.0 | NaN | 91 |
3 | Jennifer Jones | 92 | NaN | 94.0 | 99 |
4 | Linda Wilson | 84 | 92.0 | 89.0 | 94 |
grades.dtypes
name object
exam1 int64
exam2 float64
exam3 float64
exam4 int64
dtype: object
grades['exam3'] = grades['exam3'].fillna(0)
grades['name'][2]
'Robert Williams'
grades['mean'] = grades.mean(axis = 'columns')
grades
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[130], line 1
----> 1 grades['mean'] = grades.mean(axis = 'columns')
2 grades
3 grades["name"]
File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11335, in DataFrame.mean(self, axis, skipna, numeric_only, **kwargs)
11327 @doc(make_doc("mean", ndim=2))
11328 def mean(
11329 self,
(...)
11333 **kwargs,
11334 ):
> 11335 result = super().mean(axis, skipna, numeric_only, **kwargs)
11336 if isinstance(result, Series):
11337 result = result.__finalize__(self, method="mean")
File C:\JumboConda\lib\site-packages\pandas\core\generic.py:11992, in NDFrame.mean(self, axis, skipna, numeric_only, **kwargs)
11985 def mean(
11986 self,
11987 axis: Axis | None = 0,
(...)
11990 **kwargs,
11991 ) -> Series | float:
> 11992 return self._stat_function(
11993 "mean", nanops.nanmean, axis, skipna, numeric_only, **kwargs
11994 )
File C:\JumboConda\lib\site-packages\pandas\core\generic.py:11949, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
11945 nv.validate_func(name, (), kwargs)
11947 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 11949 return self._reduce(
11950 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
11951 )
File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11204, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
11200 df = df.T
11202 # After possibly _get_data and transposing, we are now in the
11203 # simple case where we can use BlockManager.reduce
> 11204 res = df._mgr.reduce(blk_func)
11205 out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
11206 if out_dtype is not None and out.dtype != "boolean":
File C:\JumboConda\lib\site-packages\pandas\core\internals\managers.py:1459, in BlockManager.reduce(self, func)
1457 res_blocks: list[Block] = []
1458 for blk in self.blocks:
-> 1459 nbs = blk.reduce(func)
1460 res_blocks.extend(nbs)
1462 index = Index([None]) # placeholder
File C:\JumboConda\lib\site-packages\pandas\core\internals\blocks.py:377, in Block.reduce(self, func)
371 @final
372 def reduce(self, func) -> list[Block]:
373 # We will apply the function and reshape the result into a single-row
374 # Block with the same mgr_locs; squeezing will be done at a higher level
375 assert self.ndim == 2
--> 377 result = func(self.values)
379 if self.values.ndim == 1:
380 res_values = result
File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11136, in DataFrame._reduce.<locals>.blk_func(values, axis)
11134 return np.array([result])
11135 else:
> 11136 return op(values, axis=axis, skipna=skipna, **kwds)
File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
145 result = alt(values, axis=axis, skipna=skipna, **kwds)
146 else:
--> 147 result = alt(values, axis=axis, skipna=skipna, **kwds)
149 return result
File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
401 if datetimelike and mask is None:
402 mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
406 if datetimelike:
407 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)
File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:719, in nanmean(values, axis, skipna, mask)
716 dtype_count = dtype
718 count = _get_counts(values.shape, mask, axis, dtype=dtype_count)
--> 719 the_sum = values.sum(axis, dtype=dtype_sum)
720 the_sum = _ensure_numeric(the_sum)
722 if axis is not None and getattr(the_sum, "ndim", False):
File C:\JumboConda\lib\site-packages\numpy\core\_methods.py:49, in _sum(a, axis, dtype, out, keepdims, initial, where)
47 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
48 initial=_NoValue, where=True):
---> 49 return umr_sum(a, axis, dtype, out, keepdims, initial, where)
TypeError: can only concatenate str (not "int") to str
grades.loc[:, 'max'] = grades.max(axis = 'columns')
grades.loc['mean'] = grades.mean(axis = 'rows')
grades.loc['max', :] = grades.max(axis = 'rows')
grades
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[87], line 1
----> 1 grades.loc[:, 'max'] = grades.max(axis = 'columns')
2 grades.loc['mean'] = grades.mean(axis = 'rows')
3 grades.loc['max', :] = grades.max(axis = 'rows')
File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11298, in DataFrame.max(self, axis, skipna, numeric_only, **kwargs)
11290 @doc(make_doc("max", ndim=2))
11291 def max(
11292 self,
(...)
11296 **kwargs,
11297 ):
> 11298 result = super().max(axis, skipna, numeric_only, **kwargs)
11299 if isinstance(result, Series):
11300 result = result.__finalize__(self, method="max")
File C:\JumboConda\lib\site-packages\pandas\core\generic.py:11976, in NDFrame.max(self, axis, skipna, numeric_only, **kwargs)
11969 def max(
11970 self,
11971 axis: Axis | None = 0,
(...)
11974 **kwargs,
11975 ):
> 11976 return self._stat_function(
11977 "max",
11978 nanops.nanmax,
11979 axis,
11980 skipna,
11981 numeric_only,
11982 **kwargs,
11983 )
File C:\JumboConda\lib\site-packages\pandas\core\generic.py:11949, in NDFrame._stat_function(self, name, func, axis, skipna, numeric_only, **kwargs)
11945 nv.validate_func(name, (), kwargs)
11947 validate_bool_kwarg(skipna, "skipna", none_allowed=False)
> 11949 return self._reduce(
11950 func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
11951 )
File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11204, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
11200 df = df.T
11202 # After possibly _get_data and transposing, we are now in the
11203 # simple case where we can use BlockManager.reduce
> 11204 res = df._mgr.reduce(blk_func)
11205 out = df._constructor_from_mgr(res, axes=res.axes).iloc[0]
11206 if out_dtype is not None and out.dtype != "boolean":
File C:\JumboConda\lib\site-packages\pandas\core\internals\managers.py:1459, in BlockManager.reduce(self, func)
1457 res_blocks: list[Block] = []
1458 for blk in self.blocks:
-> 1459 nbs = blk.reduce(func)
1460 res_blocks.extend(nbs)
1462 index = Index([None]) # placeholder
File C:\JumboConda\lib\site-packages\pandas\core\internals\blocks.py:377, in Block.reduce(self, func)
371 @final
372 def reduce(self, func) -> list[Block]:
373 # We will apply the function and reshape the result into a single-row
374 # Block with the same mgr_locs; squeezing will be done at a higher level
375 assert self.ndim == 2
--> 377 result = func(self.values)
379 if self.values.ndim == 1:
380 res_values = result
File C:\JumboConda\lib\site-packages\pandas\core\frame.py:11136, in DataFrame._reduce.<locals>.blk_func(values, axis)
11134 return np.array([result])
11135 else:
> 11136 return op(values, axis=axis, skipna=skipna, **kwds)
File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:147, in bottleneck_switch.__call__.<locals>.f(values, axis, skipna, **kwds)
145 result = alt(values, axis=axis, skipna=skipna, **kwds)
146 else:
--> 147 result = alt(values, axis=axis, skipna=skipna, **kwds)
149 return result
File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:404, in _datetimelike_compat.<locals>.new_func(values, axis, skipna, mask, **kwargs)
401 if datetimelike and mask is None:
402 mask = isna(values)
--> 404 result = func(values, axis=axis, skipna=skipna, mask=mask, **kwargs)
406 if datetimelike:
407 result = _wrap_results(result, orig_values.dtype, fill_value=iNaT)
File C:\JumboConda\lib\site-packages\pandas\core\nanops.py:1092, in _nanminmax.<locals>.reduction(values, axis, skipna, mask)
1087 return _na_for_min_count(values, axis)
1089 values, mask = _get_values(
1090 values, skipna, fill_value_typ=fill_value_typ, mask=mask
1091 )
-> 1092 result = getattr(values, meth)(axis)
1093 result = _maybe_null_out(result, axis, mask, values.shape)
1094 return result
File C:\JumboConda\lib\site-packages\numpy\core\_methods.py:41, in _amax(a, axis, out, keepdims, initial, where)
39 def _amax(a, axis=None, out=None, keepdims=False,
40 initial=_NoValue, where=True):
---> 41 return umr_maximum(a, axis, None, out, keepdims, initial, where)
TypeError: '>=' not supported between instances of 'str' and 'int'
Working with Real Data#
avocados = pd.read_csv('data/avocado.csv')
avocados
date | average_price | total_volume | 4046 | 4225 | 4770 | total_bags | small_bags | large_bags | xlarge_bags | type | year | geography | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-01-04 | 1.22 | 40873.28 | 2819.50 | 28287.42 | 49.90 | 9716.46 | 9186.93 | 529.53 | 0.00 | conventional | 2015 | Albany |
1 | 2015-01-04 | 1.79 | 1373.95 | 57.42 | 153.88 | 0.00 | 1162.65 | 1162.65 | 0.00 | 0.00 | organic | 2015 | Albany |
2 | 2015-01-04 | 1.00 | 435021.49 | 364302.39 | 23821.16 | 82.15 | 46815.79 | 16707.15 | 30108.64 | 0.00 | conventional | 2015 | Atlanta |
3 | 2015-01-04 | 1.76 | 3846.69 | 1500.15 | 938.35 | 0.00 | 1408.19 | 1071.35 | 336.84 | 0.00 | organic | 2015 | Atlanta |
4 | 2015-01-04 | 1.08 | 788025.06 | 53987.31 | 552906.04 | 39995.03 | 141136.68 | 137146.07 | 3990.61 | 0.00 | conventional | 2015 | Baltimore/Washington |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
30016 | 2020-05-17 | 1.58 | 2271254.00 | 150100.00 | 198457.00 | 5429.00 | 1917250.00 | 1121691.00 | 795559.00 | 0.00 | organic | 2020 | Total U.S. |
30017 | 2020-05-17 | 1.09 | 8667913.24 | 2081824.04 | 1020965.12 | 33410.85 | 5531562.87 | 2580802.48 | 2817078.77 | 133681.62 | conventional | 2020 | West |
30018 | 2020-05-17 | 1.71 | 384158.00 | 23455.00 | 39738.00 | 1034.00 | 319932.00 | 130051.00 | 189881.00 | 0.00 | organic | 2020 | West |
30019 | 2020-05-17 | 0.89 | 1240709.05 | 430203.10 | 126497.28 | 21104.42 | 662904.25 | 395909.35 | 265177.09 | 1817.81 | conventional | 2020 | West Tex/New Mexico |
30020 | 2020-05-17 | 1.58 | 36881.00 | 1147.00 | 1243.00 | 2645.00 | 31846.00 | 25621.00 | 6225.00 | 0.00 | organic | 2020 | West Tex/New Mexico |
30021 rows × 13 columns
avocados.head()
date | average_price | total_volume | 4046 | 4225 | 4770 | total_bags | small_bags | large_bags | xlarge_bags | type | year | geography | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015-01-04 | 1.22 | 40873.28 | 2819.50 | 28287.42 | 49.90 | 9716.46 | 9186.93 | 529.53 | 0.0 | conventional | 2015 | Albany |
1 | 2015-01-04 | 1.79 | 1373.95 | 57.42 | 153.88 | 0.00 | 1162.65 | 1162.65 | 0.00 | 0.0 | organic | 2015 | Albany |
2 | 2015-01-04 | 1.00 | 435021.49 | 364302.39 | 23821.16 | 82.15 | 46815.79 | 16707.15 | 30108.64 | 0.0 | conventional | 2015 | Atlanta |
3 | 2015-01-04 | 1.76 | 3846.69 | 1500.15 | 938.35 | 0.00 | 1408.19 | 1071.35 | 336.84 | 0.0 | organic | 2015 | Atlanta |
4 | 2015-01-04 | 1.08 | 788025.06 | 53987.31 | 552906.04 | 39995.03 | 141136.68 | 137146.07 | 3990.61 | 0.0 | conventional | 2015 | Baltimore/Washington |
avocados.shape
(30021, 13)
avocados.dtypes
date object
average_price float64
total_volume float64
4046 float64
4225 float64
4770 float64
total_bags float64
small_bags float64
large_bags float64
xlarge_bags float64
type object
year int64
geography object
dtype: object
Subsetting Data using Boolean Indexing#
avocados.geography
0 Albany
1 Albany
2 Atlanta
3 Atlanta
4 Baltimore/Washington
...
30016 Total U.S.
30017 West
30018 West
30019 West Tex/New Mexico
30020 West Tex/New Mexico
Name: geography, Length: 30021, dtype: object
avocados.geography == 'Boston'
0 False
1 False
2 False
3 False
4 False
...
30016 False
30017 False
30018 False
30019 False
30020 False
Name: geography, Length: 30021, dtype: bool
avocados[avocados.geography == 'Boston']
date | average_price | total_volume | 4046 | 4225 | 4770 | total_bags | small_bags | large_bags | xlarge_bags | type | year | geography | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 2015-01-04 | 1.02 | 491738.00 | 7193.87 | 396752.18 | 128.82 | 87663.13 | 87406.84 | 256.29 | 0.00 | conventional | 2015 | Boston |
9 | 2015-01-04 | 1.83 | 2192.13 | 8.66 | 939.43 | 0.00 | 1244.04 | 1244.04 | 0.00 | 0.00 | organic | 2015 | Boston |
116 | 2015-01-11 | 1.10 | 437771.89 | 5548.11 | 320577.36 | 121.81 | 111524.61 | 111192.88 | 331.73 | 0.00 | conventional | 2015 | Boston |
117 | 2015-01-11 | 1.94 | 2217.82 | 12.82 | 956.07 | 0.00 | 1248.93 | 1248.93 | 0.00 | 0.00 | organic | 2015 | Boston |
224 | 2015-01-18 | 1.23 | 401331.33 | 4383.76 | 287778.52 | 132.53 | 109036.52 | 108668.74 | 367.78 | 0.00 | conventional | 2015 | Boston |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
29706 | 2020-05-03 | 2.10 | 37432.00 | 109.00 | 2350.00 | 0.00 | 34972.00 | 32895.00 | 2077.00 | 0.00 | organic | 2020 | Boston |
29813 | 2020-05-10 | 1.43 | 1124288.41 | 27003.57 | 716906.87 | 597.73 | 379780.24 | 281283.46 | 94554.55 | 3942.23 | conventional | 2020 | Boston |
29814 | 2020-05-10 | 2.11 | 42554.00 | 98.00 | 2670.00 | 0.00 | 39786.00 | 38008.00 | 1778.00 | 0.00 | organic | 2020 | Boston |
29921 | 2020-05-17 | 1.50 | 941667.98 | 22010.56 | 574986.18 | 533.76 | 344137.48 | 237353.97 | 102717.96 | 4065.55 | conventional | 2020 | Boston |
29922 | 2020-05-17 | 2.04 | 44906.00 | 74.00 | 2495.00 | 0.00 | 42337.00 | 40460.00 | 1877.00 | 0.00 | organic | 2020 | Boston |
556 rows × 13 columns
avocados_boston = avocados[avocados.geography == 'Boston']
avocados_boston.head(10)
date | average_price | total_volume | 4046 | 4225 | 4770 | total_bags | small_bags | large_bags | xlarge_bags | type | year | geography | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 2015-01-04 | 1.02 | 491738.00 | 7193.87 | 396752.18 | 128.82 | 87663.13 | 87406.84 | 256.29 | 0.0 | conventional | 2015 | Boston |
9 | 2015-01-04 | 1.83 | 2192.13 | 8.66 | 939.43 | 0.00 | 1244.04 | 1244.04 | 0.00 | 0.0 | organic | 2015 | Boston |
116 | 2015-01-11 | 1.10 | 437771.89 | 5548.11 | 320577.36 | 121.81 | 111524.61 | 111192.88 | 331.73 | 0.0 | conventional | 2015 | Boston |
117 | 2015-01-11 | 1.94 | 2217.82 | 12.82 | 956.07 | 0.00 | 1248.93 | 1248.93 | 0.00 | 0.0 | organic | 2015 | Boston |
224 | 2015-01-18 | 1.23 | 401331.33 | 4383.76 | 287778.52 | 132.53 | 109036.52 | 108668.74 | 367.78 | 0.0 | conventional | 2015 | Boston |
225 | 2015-01-18 | 2.00 | 2209.34 | 4.22 | 834.34 | 0.00 | 1370.78 | 1370.78 | 0.00 | 0.0 | organic | 2015 | Boston |
332 | 2015-01-25 | 1.17 | 409343.56 | 4730.77 | 288094.65 | 45.98 | 116472.16 | 115538.05 | 934.11 | 0.0 | conventional | 2015 | Boston |
333 | 2015-01-25 | 2.01 | 1948.28 | 3.14 | 760.39 | 0.00 | 1184.75 | 1184.75 | 0.00 | 0.0 | organic | 2015 | Boston |
440 | 2015-02-01 | 1.22 | 490022.14 | 6082.88 | 373452.06 | 272.81 | 110214.39 | 109900.93 | 313.46 | 0.0 | conventional | 2015 | Boston |
441 | 2015-02-01 | 1.78 | 2943.85 | 12.18 | 1308.94 | 0.00 | 1622.73 | 1622.73 | 0.00 | 0.0 | organic | 2015 | Boston |
avocados_boston_copy = avocados[avocados.geography == 'Boston'].copy()
avocados_boston_copy.head(10)
date | average_price | total_volume | 4046 | 4225 | 4770 | total_bags | small_bags | large_bags | xlarge_bags | type | year | geography | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
8 | 2015-01-04 | 1.02 | 491738.00 | 7193.87 | 396752.18 | 128.82 | 87663.13 | 87406.84 | 256.29 | 0.0 | conventional | 2015 | Boston |
9 | 2015-01-04 | 1.83 | 2192.13 | 8.66 | 939.43 | 0.00 | 1244.04 | 1244.04 | 0.00 | 0.0 | organic | 2015 | Boston |
116 | 2015-01-11 | 1.10 | 437771.89 | 5548.11 | 320577.36 | 121.81 | 111524.61 | 111192.88 | 331.73 | 0.0 | conventional | 2015 | Boston |
117 | 2015-01-11 | 1.94 | 2217.82 | 12.82 | 956.07 | 0.00 | 1248.93 | 1248.93 | 0.00 | 0.0 | organic | 2015 | Boston |
224 | 2015-01-18 | 1.23 | 401331.33 | 4383.76 | 287778.52 | 132.53 | 109036.52 | 108668.74 | 367.78 | 0.0 | conventional | 2015 | Boston |
225 | 2015-01-18 | 2.00 | 2209.34 | 4.22 | 834.34 | 0.00 | 1370.78 | 1370.78 | 0.00 | 0.0 | organic | 2015 | Boston |
332 | 2015-01-25 | 1.17 | 409343.56 | 4730.77 | 288094.65 | 45.98 | 116472.16 | 115538.05 | 934.11 | 0.0 | conventional | 2015 | Boston |
333 | 2015-01-25 | 2.01 | 1948.28 | 3.14 | 760.39 | 0.00 | 1184.75 | 1184.75 | 0.00 | 0.0 | organic | 2015 | Boston |
440 | 2015-02-01 | 1.22 | 490022.14 | 6082.88 | 373452.06 | 272.81 | 110214.39 | 109900.93 | 313.46 | 0.0 | conventional | 2015 | Boston |
441 | 2015-02-01 | 1.78 | 2943.85 | 12.18 | 1308.94 | 0.00 | 1622.73 | 1622.73 | 0.00 | 0.0 | organic | 2015 | Boston |
np.mean(avocados_boston.average_price[avocados_boston.year == 2019])
1.5385576923076922
mean_2019 = np.mean(avocados_boston.average_price[avocados_boston.year == 2019])
print("The avereage price for avocados in the Boston area in the year 2019 was: $", round(mean_2019, 2))
The avereage price for avocados in the Boston area in the year 2019 was: $ 1.54
Creating Plots#
plt.plot(avocados_boston.date, avocados_boston.average_price)
plt.show()
![../../_images/d7cb3ef1ed3cdfdd80bd3a6beb842f9fc5c4f4e96e0f2c83c690d7d535310264.png](../../_images/d7cb3ef1ed3cdfdd80bd3a6beb842f9fc5c4f4e96e0f2c83c690d7d535310264.png)
plt.figure(figsize = (20, 8))
plt.plot(avocados_boston.date, avocados_boston.average_price, color = 'green', linestyle = '--', marker = 'o')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()
![../../_images/55dbcc89abf727dbd70e46a290234ef631410fe0efa51c7c7e874567af7461c4.png](../../_images/55dbcc89abf727dbd70e46a290234ef631410fe0efa51c7c7e874567af7461c4.png)
avocados_boston.plot(x = 'date', y = 'average_price', figsize = (18, 8), kind='line', color = 'green')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()
![../../_images/8d9503a80c284667c45358f367ae50044a57ac1e063a2463a75870048e30bca1.png](../../_images/8d9503a80c284667c45358f367ae50044a57ac1e063a2463a75870048e30bca1.png)
avocados_boston[avocados_boston.year == 2019].plot(x = 'date', y = 'average_price', figsize = (18, 8), kind='line', color = 'green')
plt.xlabel("Date")
plt.ylabel("Avocado Price [$]")
plt.title("Avocado Prices in Boston")
plt.show()
![../../_images/aacbca0917e34588a28e9f5ee0e71870f66608259a2f6570199d60174acde7cf.png](../../_images/aacbca0917e34588a28e9f5ee0e71870f66608259a2f6570199d60174acde7cf.png)
plt.hist(avocados.average_price)
plt.xlabel('Price')
plt.show()
![../../_images/b1557ff81c716c1005f1f2dd71d148e2899efb26e5e18586b15cc5de2daaf3bb.png](../../_images/b1557ff81c716c1005f1f2dd71d148e2899efb26e5e18586b15cc5de2daaf3bb.png)
sns.histplot(avocados.average_price, color = 'r', kde = True)
<Axes: xlabel='average_price', ylabel='Count'>
![../../_images/daa24bbe8933da534723593fbabc083193c059d0f4fb9a246409aa6c6908ed04.png](../../_images/daa24bbe8933da534723593fbabc083193c059d0f4fb9a246409aa6c6908ed04.png)
sns.histplot(avocados.average_price[avocados.year == 2019], color = 'r', kde = True)
<Axes: xlabel='average_price', ylabel='Count'>
![../../_images/5d759301a5bf7ccdb7b35c5e006d80c328c278eee8630e6fa8c816f0e878c7e6.png](../../_images/5d759301a5bf7ccdb7b35c5e006d80c328c278eee8630e6fa8c816f0e878c7e6.png)
sns.histplot(avocados.average_price[avocados.geography == 'Boston'], color = 'r', kde = True)
<Axes: xlabel='average_price', ylabel='Count'>
![../../_images/3e4305a2547675181e2a04b410f2b37319ca70a20b30e5b034832c1ced0871fd.png](../../_images/3e4305a2547675181e2a04b410f2b37319ca70a20b30e5b034832c1ced0871fd.png)
Combining Datasets and Long vs Wide Data#
pop = pd.read_csv('data/population.csv')
pop
Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | Population, total | SP.POP.TOTL | 54211.0 | 55438.0 | 56225.0 | 56695.0 | 57032.0 | 57360.0 | ... | 101669.0 | 102046.0 | 102560.0 | 103159.0 | 103774.0 | 104341.0 | 104872.0 | 105366.0 | 105845.0 | 106314.0 |
1 | Afghanistan | AFG | Population, total | SP.POP.TOTL | 8996973.0 | 9169410.0 | 9351441.0 | 9543205.0 | 9744781.0 | 9956320.0 | ... | 29185507.0 | 30117413.0 | 31161376.0 | 32269589.0 | 33370794.0 | 34413603.0 | 35383128.0 | 36296400.0 | 37172386.0 | 38041754.0 |
2 | Angola | AGO | Population, total | SP.POP.TOTL | 5454933.0 | 5531472.0 | 5608539.0 | 5679458.0 | 5735044.0 | 5770570.0 | ... | 23356246.0 | 24220661.0 | 25107931.0 | 26015780.0 | 26941779.0 | 27884381.0 | 28842484.0 | 29816748.0 | 30809762.0 | 31825295.0 |
3 | Albania | ALB | Population, total | SP.POP.TOTL | 1608800.0 | 1659800.0 | 1711319.0 | 1762621.0 | 1814135.0 | 1864791.0 | ... | 2913021.0 | 2905195.0 | 2900401.0 | 2895092.0 | 2889104.0 | 2880703.0 | 2876101.0 | 2873457.0 | 2866376.0 | 2854191.0 |
4 | Andorra | AND | Population, total | SP.POP.TOTL | 13411.0 | 14375.0 | 15370.0 | 16412.0 | 17469.0 | 18549.0 | ... | 84449.0 | 83747.0 | 82427.0 | 80774.0 | 79213.0 | 78011.0 | 77297.0 | 77001.0 | 77006.0 | 77142.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
259 | Kosovo | XKX | Population, total | SP.POP.TOTL | 947000.0 | 966000.0 | 994000.0 | 1022000.0 | 1050000.0 | 1078000.0 | ... | 1775680.0 | 1791000.0 | 1807106.0 | 1818117.0 | 1812771.0 | 1788196.0 | 1777557.0 | 1791003.0 | 1797085.0 | 1794248.0 |
260 | Yemen, Rep. | YEM | Population, total | SP.POP.TOTL | 5315355.0 | 5393036.0 | 5473671.0 | 5556766.0 | 5641597.0 | 5727751.0 | ... | 23154855.0 | 23807588.0 | 24473178.0 | 25147109.0 | 25823485.0 | 26497889.0 | 27168210.0 | 27834821.0 | 28498687.0 | 29161922.0 |
261 | South Africa | ZAF | Population, total | SP.POP.TOTL | 17099840.0 | 17524533.0 | 17965725.0 | 18423161.0 | 18896307.0 | 19384841.0 | ... | 51216964.0 | 52004172.0 | 52834005.0 | 53689236.0 | 54545991.0 | 55386367.0 | 56203654.0 | 57000451.0 | 57779622.0 | 58558270.0 |
262 | Zambia | ZMB | Population, total | SP.POP.TOTL | 3070776.0 | 3164329.0 | 3260650.0 | 3360104.0 | 3463213.0 | 3570464.0 | ... | 13605984.0 | 14023193.0 | 14465121.0 | 14926504.0 | 15399753.0 | 15879361.0 | 16363507.0 | 16853688.0 | 17351822.0 | 17861030.0 |
263 | Zimbabwe | ZWE | Population, total | SP.POP.TOTL | 3776681.0 | 3905034.0 | 4039201.0 | 4178726.0 | 4322861.0 | 4471177.0 | ... | 12697723.0 | 12894316.0 | 13115131.0 | 13350356.0 | 13586681.0 | 13814629.0 | 14030390.0 | 14236745.0 | 14439018.0 | 14645468.0 |
264 rows × 64 columns
pop.drop(labels = ['Indicator Name', 'Indicator Code'], axis = 1, inplace = True)
pop
Country Name | Country Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | 54211.0 | 55438.0 | 56225.0 | 56695.0 | 57032.0 | 57360.0 | 57715.0 | 58055.0 | ... | 101669.0 | 102046.0 | 102560.0 | 103159.0 | 103774.0 | 104341.0 | 104872.0 | 105366.0 | 105845.0 | 106314.0 |
1 | Afghanistan | AFG | 8996973.0 | 9169410.0 | 9351441.0 | 9543205.0 | 9744781.0 | 9956320.0 | 10174836.0 | 10399926.0 | ... | 29185507.0 | 30117413.0 | 31161376.0 | 32269589.0 | 33370794.0 | 34413603.0 | 35383128.0 | 36296400.0 | 37172386.0 | 38041754.0 |
2 | Angola | AGO | 5454933.0 | 5531472.0 | 5608539.0 | 5679458.0 | 5735044.0 | 5770570.0 | 5781214.0 | 5774243.0 | ... | 23356246.0 | 24220661.0 | 25107931.0 | 26015780.0 | 26941779.0 | 27884381.0 | 28842484.0 | 29816748.0 | 30809762.0 | 31825295.0 |
3 | Albania | ALB | 1608800.0 | 1659800.0 | 1711319.0 | 1762621.0 | 1814135.0 | 1864791.0 | 1914573.0 | 1965598.0 | ... | 2913021.0 | 2905195.0 | 2900401.0 | 2895092.0 | 2889104.0 | 2880703.0 | 2876101.0 | 2873457.0 | 2866376.0 | 2854191.0 |
4 | Andorra | AND | 13411.0 | 14375.0 | 15370.0 | 16412.0 | 17469.0 | 18549.0 | 19647.0 | 20758.0 | ... | 84449.0 | 83747.0 | 82427.0 | 80774.0 | 79213.0 | 78011.0 | 77297.0 | 77001.0 | 77006.0 | 77142.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
259 | Kosovo | XKX | 947000.0 | 966000.0 | 994000.0 | 1022000.0 | 1050000.0 | 1078000.0 | 1106000.0 | 1135000.0 | ... | 1775680.0 | 1791000.0 | 1807106.0 | 1818117.0 | 1812771.0 | 1788196.0 | 1777557.0 | 1791003.0 | 1797085.0 | 1794248.0 |
260 | Yemen, Rep. | YEM | 5315355.0 | 5393036.0 | 5473671.0 | 5556766.0 | 5641597.0 | 5727751.0 | 5816247.0 | 5907874.0 | ... | 23154855.0 | 23807588.0 | 24473178.0 | 25147109.0 | 25823485.0 | 26497889.0 | 27168210.0 | 27834821.0 | 28498687.0 | 29161922.0 |
261 | South Africa | ZAF | 17099840.0 | 17524533.0 | 17965725.0 | 18423161.0 | 18896307.0 | 19384841.0 | 19888250.0 | 20406864.0 | ... | 51216964.0 | 52004172.0 | 52834005.0 | 53689236.0 | 54545991.0 | 55386367.0 | 56203654.0 | 57000451.0 | 57779622.0 | 58558270.0 |
262 | Zambia | ZMB | 3070776.0 | 3164329.0 | 3260650.0 | 3360104.0 | 3463213.0 | 3570464.0 | 3681955.0 | 3797873.0 | ... | 13605984.0 | 14023193.0 | 14465121.0 | 14926504.0 | 15399753.0 | 15879361.0 | 16363507.0 | 16853688.0 | 17351822.0 | 17861030.0 |
263 | Zimbabwe | ZWE | 3776681.0 | 3905034.0 | 4039201.0 | 4178726.0 | 4322861.0 | 4471177.0 | 4623351.0 | 4779827.0 | ... | 12697723.0 | 12894316.0 | 13115131.0 | 13350356.0 | 13586681.0 | 13814629.0 | 14030390.0 | 14236745.0 | 14439018.0 | 14645468.0 |
264 rows × 62 columns
pop_long = pop.melt(id_vars = ['Country Name', 'Country Code'], var_name = 'Year', value_name = 'Population')
pop_long
Country Name | Country Code | Year | Population | |
---|---|---|---|---|
0 | Aruba | ABW | 1960 | 54211.0 |
1 | Afghanistan | AFG | 1960 | 8996973.0 |
2 | Angola | AGO | 1960 | 5454933.0 |
3 | Albania | ALB | 1960 | 1608800.0 |
4 | Andorra | AND | 1960 | 13411.0 |
... | ... | ... | ... | ... |
15835 | Kosovo | XKX | 2019 | 1794248.0 |
15836 | Yemen, Rep. | YEM | 2019 | 29161922.0 |
15837 | South Africa | ZAF | 2019 | 58558270.0 |
15838 | Zambia | ZMB | 2019 | 17861030.0 |
15839 | Zimbabwe | ZWE | 2019 | 14645468.0 |
15840 rows × 4 columns
gdp = pd.read_csv('data/gdp.csv').drop(labels = ['Indicator Name', 'Indicator Code'], axis = 1)
gdp
Country Name | Country Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2.390503e+09 | 2.549721e+09 | 2.534637e+09 | 2.701676e+09 | 2.765363e+09 | 2.919553e+09 | 2.965922e+09 | 3.056425e+09 | NaN | NaN |
1 | Afghanistan | AFG | 5.377778e+08 | 5.488889e+08 | 5.466667e+08 | 7.511112e+08 | 8.000000e+08 | 1.006667e+09 | 1.400000e+09 | 1.673333e+09 | ... | 1.585657e+10 | 1.780429e+10 | 2.000160e+10 | 2.056107e+10 | 2.048489e+10 | 1.990711e+10 | 1.936264e+10 | 2.019176e+10 | 1.948438e+10 | 1.910135e+10 |
2 | Angola | AGO | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 8.379950e+10 | 1.117900e+11 | 1.280530e+11 | 1.367100e+11 | 1.457120e+11 | 1.161940e+11 | 1.011240e+11 | 1.221240e+11 | 1.013530e+11 | 9.463542e+10 |
3 | Albania | ALB | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.192693e+10 | 1.289077e+10 | 1.231983e+10 | 1.277622e+10 | 1.322814e+10 | 1.138685e+10 | 1.186120e+10 | 1.301969e+10 | 1.514702e+10 | 1.527808e+10 |
4 | Andorra | AND | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.449967e+09 | 3.629204e+09 | 3.188809e+09 | 3.193704e+09 | 3.271808e+09 | 2.789870e+09 | 2.896679e+09 | 3.000181e+09 | 3.218316e+09 | 3.154058e+09 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
259 | Kosovo | XKX | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 5.835874e+09 | 6.701698e+09 | 6.499807e+09 | 7.074778e+09 | 7.396705e+09 | 6.442916e+09 | 6.719172e+09 | 7.245707e+09 | 7.942962e+09 | 7.926108e+09 |
260 | Yemen, Rep. | YEM | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.090675e+10 | 3.272642e+10 | 3.540134e+10 | 4.041524e+10 | 4.320647e+10 | 3.697620e+10 | 2.808468e+10 | 2.456133e+10 | 2.759126e+10 | NaN |
261 | South Africa | ZAF | 7.575397e+09 | 7.972997e+09 | 8.497997e+09 | 9.423396e+09 | 1.037400e+10 | 1.133440e+10 | 1.235500e+10 | 1.377739e+10 | ... | 3.753490e+11 | 4.164190e+11 | 3.963330e+11 | 3.668290e+11 | 3.509050e+11 | 3.176210e+11 | 2.963570e+11 | 3.495540e+11 | 3.682890e+11 | 3.514320e+11 |
262 | Zambia | ZMB | 7.130000e+08 | 6.962857e+08 | 6.931429e+08 | 7.187143e+08 | 8.394286e+08 | 1.082857e+09 | 1.264286e+09 | 1.368000e+09 | ... | 2.026556e+10 | 2.345952e+10 | 2.550306e+10 | 2.804551e+10 | 2.715065e+10 | 2.124335e+10 | 2.095476e+10 | 2.586814e+10 | 2.700524e+10 | 2.306472e+10 |
263 | Zimbabwe | ZWE | 1.052990e+09 | 1.096647e+09 | 1.117602e+09 | 1.159512e+09 | 1.217138e+09 | 1.311436e+09 | 1.281750e+09 | 1.397002e+09 | ... | 1.204166e+10 | 1.410192e+10 | 1.711485e+10 | 1.909102e+10 | 1.949552e+10 | 1.996312e+10 | 2.054868e+10 | 2.204090e+10 | 2.431156e+10 | 2.144076e+10 |
264 rows × 62 columns
gdp_long = gdp.melt(id_vars = ['Country Name', 'Country Code'], var_name = 'Year', value_name = 'GDP')
gdp_long
Country Name | Country Code | Year | GDP | |
---|---|---|---|---|
0 | Aruba | ABW | 1960 | NaN |
1 | Afghanistan | AFG | 1960 | 5.377778e+08 |
2 | Angola | AGO | 1960 | NaN |
3 | Albania | ALB | 1960 | NaN |
4 | Andorra | AND | 1960 | NaN |
... | ... | ... | ... | ... |
15835 | Kosovo | XKX | 2019 | 7.926108e+09 |
15836 | Yemen, Rep. | YEM | 2019 | NaN |
15837 | South Africa | ZAF | 2019 | 3.514320e+11 |
15838 | Zambia | ZMB | 2019 | 2.306472e+10 |
15839 | Zimbabwe | ZWE | 2019 | 2.144076e+10 |
15840 rows × 4 columns
countries = pop_long.merge(gdp_long.drop(labels = ['Country Name'], axis = 1),
on = ['Country Code', 'Year'], how = 'inner')
countries
Country Name | Country Code | Year | Population | GDP | |
---|---|---|---|---|---|
0 | Aruba | ABW | 1960 | 54211.0 | NaN |
1 | Afghanistan | AFG | 1960 | 8996973.0 | 5.377778e+08 |
2 | Angola | AGO | 1960 | 5454933.0 | NaN |
3 | Albania | ALB | 1960 | 1608800.0 | NaN |
4 | Andorra | AND | 1960 | 13411.0 | NaN |
... | ... | ... | ... | ... | ... |
15835 | Kosovo | XKX | 2019 | 1794248.0 | 7.926108e+09 |
15836 | Yemen, Rep. | YEM | 2019 | 29161922.0 | NaN |
15837 | South Africa | ZAF | 2019 | 58558270.0 | 3.514320e+11 |
15838 | Zambia | ZMB | 2019 | 17861030.0 | 2.306472e+10 |
15839 | Zimbabwe | ZWE | 2019 | 14645468.0 | 2.144076e+10 |
15840 rows × 5 columns
countries['GDP per capita'] = countries['GDP'] / countries['Population']
countries
Country Name | Country Code | Year | Population | GDP | GDP per capita | |
---|---|---|---|---|---|---|
0 | Aruba | ABW | 1960 | 54211.0 | NaN | NaN |
1 | Afghanistan | AFG | 1960 | 8996973.0 | 5.377778e+08 | 59.773194 |
2 | Angola | AGO | 1960 | 5454933.0 | NaN | NaN |
3 | Albania | ALB | 1960 | 1608800.0 | NaN | NaN |
4 | Andorra | AND | 1960 | 13411.0 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... |
15835 | Kosovo | XKX | 2019 | 1794248.0 | 7.926108e+09 | 4417.509940 |
15836 | Yemen, Rep. | YEM | 2019 | 29161922.0 | NaN | NaN |
15837 | South Africa | ZAF | 2019 | 58558270.0 | 3.514320e+11 | 6001.406804 |
15838 | Zambia | ZMB | 2019 | 17861030.0 | 2.306472e+10 | 1291.343357 |
15839 | Zimbabwe | ZWE | 2019 | 14645468.0 | 2.144076e+10 | 1463.985910 |
15840 rows × 6 columns
random_countries = np.random.choice(countries['Country Code'].unique(), 10)
countries_select = countries[countries['Country Code'].isin(random_countries)]
countries_select
Country Name | Country Code | Year | Population | GDP | GDP per capita | |
---|---|---|---|---|---|---|
0 | Aruba | ABW | 1960 | 5.421100e+04 | NaN | NaN |
3 | Albania | ALB | 1960 | 1.608800e+06 | NaN | NaN |
6 | United Arab Emirates | ARE | 1960 | 9.241800e+04 | NaN | NaN |
66 | Euro area | EMU | 1960 | 2.652039e+08 | 2.448960e+11 | 923.425216 |
67 | Eritrea | ERI | 1960 | 1.007590e+06 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... |
15718 | Luxembourg | LUX | 2019 | 6.198960e+05 | 7.110492e+10 | 114704.594171 |
15757 | Other small states | OSS | 2019 | 3.136041e+07 | 4.384680e+11 | 13981.578747 |
15775 | Romania | ROU | 2019 | 1.935654e+07 | 2.500770e+11 | 12919.506705 |
15778 | South Asia | SAS | 2019 | 1.835777e+09 | 3.597970e+12 | 1959.916976 |
15803 | Chad | TCD | 2019 | 1.594688e+07 | 1.131495e+10 | 709.540310 |
600 rows × 6 columns
for name, data in countries_select.groupby('Country Name'):
data.plot(x = 'Year', y = 'GDP per capita', label = name, figsize = (18, 8), ax = plt.gca())
plt.show()
![../../_images/1042bab2e75d7e4184cc340b4beeab2e221a31e3ca065b62c269d87d80db01fe.png](../../_images/1042bab2e75d7e4184cc340b4beeab2e221a31e3ca065b62c269d87d80db01fe.png)