Advanced Pandas Techniques
Contents
# Install the necessary dependencies
import os
import sys
import numpy as np
import pandas as pd
!{sys.executable} -m pip install --quiet jupyterlab_myst ipython
5.4.3. Advanced Pandas Techniques#
5.4.3.1. Overview#
In this section, we’ll continue to introduce combining datasets: concat, merge and join along with data aggregation and grouping.
5.4.3.2. Combining datasets: concat, merge and join#
5.4.3.2.1. concat#
Concatenate Pandas objects along a particular axis.
Allows optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
For example:
Combine two Series
.
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2])
0 a
1 b
0 c
1 d
dtype: object
Clear the existing index and reset it in the result by setting the ignore_index
option to True
.
pd.concat([s1, s2], ignore_index=True)
0 a
1 b
2 c
3 d
dtype: object
Add a hierarchical index at the outermost level of the data with the keys
option.
pd.concat([s1, s2], keys=['s1', 's2'])
s1 0 a
1 b
s2 0 c
1 d
dtype: object
Label the index keys you create with the names
option.
pd.concat([s1, s2], keys=['s1', 's2'],
names=['Series name', 'Row ID'])
Series name Row ID
s1 0 a
1 b
s2 0 c
1 d
dtype: object
Combine two DataFrame
objects with identical columns.
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df1
letter | number | |
---|---|---|
0 | a | 1 |
1 | b | 2 |
df2 = pd.DataFrame([['c', 3], ['d', 4]],
columns=['letter', 'number'])
df2
letter | number | |
---|---|---|
0 | c | 3 |
1 | d | 4 |
pd.concat([df1, df2])
letter | number | |
---|---|---|
0 | a | 1 |
1 | b | 2 |
0 | c | 3 |
1 | d | 4 |
Combine DataFrame
objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN
values.
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
columns=['letter', 'number', 'animal'])
df3
letter | number | animal | |
---|---|---|---|
0 | c | 3 | cat |
1 | d | 4 | dog |
pd.concat([df1, df3], sort=False)
letter | number | animal | |
---|---|---|---|
0 | a | 1 | NaN |
1 | b | 2 | NaN |
0 | c | 3 | cat |
1 | d | 4 | dog |
Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.
pd.concat([df1, df3], join="inner")
letter | number | |
---|---|---|
0 | a | 1 |
1 | b | 2 |
0 | c | 3 |
1 | d | 4 |
Combine DataFrame
objects horizontally along the x-axis by passing in axis=1
.
df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
columns=['animal', 'name'])
pd.concat([df1, df4], axis=1)
letter | number | animal | name | |
---|---|---|---|---|
0 | a | 1 | bird | polly |
1 | b | 2 | monkey | george |
Prevent the result from including duplicate index values with the verify_integrity
option.
df5 = pd.DataFrame([1], index=['a'])
df5
0 | |
---|---|
a | 1 |
df6 = pd.DataFrame([2], index=['a'])
df6
0 | |
---|---|
a | 2 |
pd.concat([df5, df6], verify_integrity=True)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[15], line 1
----> 1 pd.concat([df5, df6], verify_integrity=True)
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
325 if len(args) > num_allow_args:
326 warnings.warn(
327 msg.format(arguments=_format_argument_list(allow_args)),
328 FutureWarning,
329 stacklevel=find_stack_level(),
330 )
--> 331 return func(*args, **kwargs)
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:368, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
146 @deprecate_nonkeyword_arguments(version=None, allowed_args=["objs"])
147 def concat(
148 objs: Iterable[NDFrame] | Mapping[HashableT, NDFrame],
(...)
157 copy: bool = True,
158 ) -> DataFrame | Series:
159 """
160 Concatenate pandas objects along a particular axis.
161
(...)
366 1 3 4
367 """
--> 368 op = _Concatenator(
369 objs,
370 axis=axis,
371 ignore_index=ignore_index,
372 join=join,
373 keys=keys,
374 levels=levels,
375 names=names,
376 verify_integrity=verify_integrity,
377 copy=copy,
378 sort=sort,
379 )
381 return op.get_result()
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:563, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
560 self.verify_integrity = verify_integrity
561 self.copy = copy
--> 563 self.new_axes = self._get_new_axes()
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:633, in _Concatenator._get_new_axes(self)
631 def _get_new_axes(self) -> list[Index]:
632 ndim = self._get_result_dim()
--> 633 return [
634 self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
635 for i in range(ndim)
636 ]
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:634, in <listcomp>(.0)
631 def _get_new_axes(self) -> list[Index]:
632 ndim = self._get_result_dim()
633 return [
--> 634 self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
635 for i in range(ndim)
636 ]
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/_libs/properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:697, in _Concatenator._get_concat_axis(self)
692 else:
693 concat_axis = _make_concat_multiindex(
694 indexes, self.keys, self.levels, self.names
695 )
--> 697 self._maybe_check_integrity(concat_axis)
699 return concat_axis
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:705, in _Concatenator._maybe_check_integrity(self, concat_index)
703 if not concat_index.is_unique:
704 overlap = concat_index[concat_index.duplicated()].unique()
--> 705 raise ValueError(f"Indexes have overlapping values: {overlap}")
ValueError: Indexes have overlapping values: Index(['a'], dtype='object')
Append a single row to the end of a DataFrame
object.
df7 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
df7
a | b | |
---|---|---|
0 | 1 | 2 |
new_row = pd.Series({'a': 3, 'b': 4})
new_row
a 3
b 4
dtype: int64
pd.concat([df7, new_row.to_frame().T], ignore_index=True)
a | b | |
---|---|---|
0 | 1 | 2 |
1 | 3 | 4 |
Note
append()
has been deprecated since version 1.4.0: Use concat()
instead.
5.4.3.2.2. merge#
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross-merge, no column specifications to merge on are allowed.
If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
For example:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
'value': [5, 6, 7, 8]})
Merge DataFrames df1
and df2
with specified left and right suffixes appended to any overlapping columns.
df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))
lkey | value_left | rkey | value_right | |
---|---|---|---|---|
0 | foo | 1 | foo | 5 |
1 | foo | 1 | foo | 8 |
2 | foo | 5 | foo | 5 |
3 | foo | 5 | foo | 8 |
4 | bar | 2 | bar | 6 |
5 | baz | 3 | baz | 7 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:800px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aimport%20numpy%20as%20np%0Adf1%20%3D%20pd.DataFrame%28%7B'lkey'%3A%20%5B'foo',%20'bar',%20'baz',%20'foo'%5D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'value'%3A%20%5B1,%202,%203,%205%5D%7D%29%0Adf2%20%3D%20pd.DataFrame%28%7B'rkey'%3A%20%5B'foo',%20'bar',%20'baz',%20'foo'%5D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'value'%3A%20%5B5,%206,%207,%208%5D%7D%29%0Adf1.merge%28df2,%20left_on%3D'lkey',%20right_on%3D'rkey'%29&d=2023-05-27&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
Merge DataFrames df1
and df2
, but raise an exception if the DataFrames have any overlapping columns.
df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[22], line 1
----> 1 df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/frame.py:10090, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
10071 @Substitution("")
10072 @Appender(_merge_doc, indents=2)
10073 def merge(
(...)
10086 validate: str | None = None,
10087 ) -> DataFrame:
10088 from pandas.core.reshape.merge import merge
> 10090 return merge(
10091 self,
10092 right,
10093 how=how,
10094 on=on,
10095 left_on=left_on,
10096 right_on=right_on,
10097 left_index=left_index,
10098 right_index=right_index,
10099 sort=sort,
10100 suffixes=suffixes,
10101 copy=copy,
10102 indicator=indicator,
10103 validate=validate,
10104 )
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/merge.py:124, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
93 @Substitution("\nleft : DataFrame or named Series")
94 @Appender(_merge_doc, indents=0)
95 def merge(
(...)
108 validate: str | None = None,
109 ) -> DataFrame:
110 op = _MergeOperation(
111 left,
112 right,
(...)
122 validate=validate,
123 )
--> 124 return op.get_result(copy=copy)
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/merge.py:775, in _MergeOperation.get_result(self, copy)
771 self.left, self.right = self._indicator_pre_merge(self.left, self.right)
773 join_index, left_indexer, right_indexer = self._get_join_info()
--> 775 result = self._reindex_and_concat(
776 join_index, left_indexer, right_indexer, copy=copy
777 )
778 result = result.__finalize__(self, method=self._merge_type)
780 if self.indicator:
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/merge.py:729, in _MergeOperation._reindex_and_concat(self, join_index, left_indexer, right_indexer, copy)
726 left = self.left[:]
727 right = self.right[:]
--> 729 llabels, rlabels = _items_overlap_with_suffix(
730 self.left._info_axis, self.right._info_axis, self.suffixes
731 )
733 if left_indexer is not None:
734 # Pinning the index here (and in the right code just below) is not
735 # necessary, but makes the `.take` more performant if we have e.g.
736 # a MultiIndex for left.index.
737 lmgr = left._mgr.reindex_indexer(
738 join_index,
739 left_indexer,
(...)
744 use_na_proxy=True,
745 )
File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/merge.py:2458, in _items_overlap_with_suffix(left, right, suffixes)
2455 lsuffix, rsuffix = suffixes
2457 if not lsuffix and not rsuffix:
-> 2458 raise ValueError(f"columns overlap but no suffix specified: {to_rename}")
2460 def renamer(x, suffix):
2461 """
2462 Rename the left and right indices.
2463
(...)
2474 x : renamed column name
2475 """
ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Using how
parameter decide the type of merge to be performed.
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
df1.merge(df2, how='inner', on='a')
a | b | c | |
---|---|---|---|
0 | foo | 1 | 3 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:600px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Adf1%20%3D%20pd.DataFrame%28%7B'a'%3A%20%5B'foo',%20'bar'%5D,%20'b'%3A%20%5B1,%202%5D%7D%29%0Adf2%20%3D%20pd.DataFrame%28%7B'a'%3A%20%5B'foo',%20'baz'%5D,%20'c'%3A%20%5B3,%204%5D%7D%29%0Adf1.merge%28df2,%20how%3D'inner',%20on%3D'a'%29&d=2023-07-15&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
df1.merge(df2, how='left', on='a')
a | b | c | |
---|---|---|---|
0 | foo | 1 | 3.0 |
1 | bar | 2 | NaN |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:630px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Adf1%20%3D%20pd.DataFrame%28%7B'a'%3A%20%5B'foo',%20'bar'%5D,%20'b'%3A%20%5B1,%202%5D%7D%29%0Adf2%20%3D%20pd.DataFrame%28%7B'a'%3A%20%5B'foo',%20'baz'%5D,%20'c'%3A%20%5B3,%204%5D%7D%29%0Adf1.merge%28df2,%20how%3D'left',%20on%3D'a'%29&d=2023-07-15&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
df1 = pd.DataFrame({'left': ['foo', 'bar']})
df2 = pd.DataFrame({'right': [7, 8]})
df1.merge(df2, how='cross')
left | right | |
---|---|---|
0 | foo | 7 |
1 | foo | 8 |
2 | bar | 7 |
3 | bar | 8 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:670px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Adf1%20%3D%20pd.DataFrame%28%7B'left'%3A%20%5B'foo',%20'bar'%5D%7D%29%0Adf2%20%3D%20pd.DataFrame%28%7B'right'%3A%20%5B7,%208%5D%7D%29%0Adf1.merge%28df2,%20how%3D'cross'%29&d=2023-07-15&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
5.4.3.2.3. join#
Join columns of another DataFrame.
Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.
For example:
df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
'B': ['B0', 'B1', 'B2']})
Join DataFrames using their indexes.
df.join(other, lsuffix='_caller', rsuffix='_other')
key_caller | A | key_other | B | |
---|---|---|---|---|
0 | K0 | A0 | K0 | B0 |
1 | K1 | A1 | K1 | B1 |
2 | K2 | A2 | K2 | B2 |
3 | K3 | A3 | NaN | NaN |
4 | K4 | A4 | NaN | NaN |
5 | K5 | A5 | NaN | NaN |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:830px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Adf%20%3D%20pd.DataFrame%28%7B'key'%3A%20%5B'K0',%20'K1',%20'K2',%20'K3',%20'K4',%20'K5'%5D,'A'%3A%20%5B'A0',%20'A1',%20'A2',%20'A3',%20'A4',%20'A5'%5D%7D%29%0Aother%20%3D%20pd.DataFrame%28%7B'key'%3A%20%5B'K0',%20'K1',%20'K2'%5D,'B'%3A%20%5B'B0',%20'B1',%20'B2'%5D%7D%29%0Adf.join%28other,%20lsuffix%3D'_caller',%20rsuffix%3D'_other'%29&d=2023-07-15&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
If we want to join using the key
columns, we need to set key
to be the index in both df
and other
. The joined DataFrame will have key
as its index.
df.set_index('key').join(other.set_index('key'))
A | B | |
---|---|---|
key | ||
K0 | A0 | B0 |
K1 | A1 | B1 |
K2 | A2 | B2 |
K3 | A3 | NaN |
K4 | A4 | NaN |
K5 | A5 | NaN |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:1170px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aimport%20numpy%20as%20np%0Adf%20%3D%20pd.DataFrame%28%7B'key'%3A%20%5B'K0',%20'K1',%20'K2',%20'K3',%20'K4',%20'K5'%5D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'A'%3A%20%5B'A0',%20'A1',%20'A2',%20'A3',%20'A4',%20'A5'%5D%7D%29%0Aother%20%3D%20pd.DataFrame%28%7B'key'%3A%20%5B'K0',%20'K1',%20'K2'%5D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'B'%3A%20%5B'B0',%20'B1',%20'B2'%5D%7D%29%0Adf.set_index%28'key'%29.join%28other.set_index%28'key'%29%29&d=2023-05-27&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
Another option to join using the key columns is to use the on
parameter. DataFrame.join
always uses other
’s index but we can use any column in df
. This method preserves the original DataFrame’s index in the result.
df.join(other.set_index('key'), on='key')
key | A | B | |
---|---|---|---|
0 | K0 | A0 | B0 |
1 | K1 | A1 | B1 |
2 | K2 | A2 | B2 |
3 | K3 | A3 | NaN |
4 | K4 | A4 | NaN |
5 | K5 | A5 | NaN |
Using non-unique key values shows how they are matched.
df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df
key | A | |
---|---|---|
0 | K0 | A0 |
1 | K1 | A1 |
2 | K1 | A2 |
3 | K3 | A3 |
4 | K0 | A4 |
5 | K1 | A5 |
df.join(other.set_index('key'), on='key', validate='m:1')
key | A | B | |
---|---|---|---|
0 | K0 | A0 | B0 |
1 | K1 | A1 | B1 |
2 | K1 | A2 | B1 |
3 | K3 | A3 | NaN |
4 | K0 | A4 | B0 |
5 | K1 | A5 | B1 |
5.4.3.3. Aggregation and grouping#
Group DataFrame
using a mapper or by a Series
of columns.
A groupby
operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
For example:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
'Parrot', 'Parrot'],
'Max Speed': [380., 370., 24., 26.]})
df
df.groupby(['Animal']).mean()
Max Speed | |
---|---|
Animal | |
Falcon | 375.0 |
Parrot | 25.0 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:785px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aimport%20numpy%20as%20np%0Adf%20%3D%20pd.DataFrame%28%7B'Animal'%3A%20%5B'Falcon',%20'Falcon',%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'Parrot',%20'Parrot'%5D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'Max%20Speed'%3A%20%5B380.,%20370.,%2024.,%2026.%5D%7D%29%0Adf%0Adf.groupby%28%5B'Animal'%5D%29.mean%28%29&d=2023-05-27&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
5.4.3.3.1. Hierarchical Indexes#
We can groupby
different levels of a hierarchical index using the level
parameter:
arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
index=index)
df.groupby(level=0).mean()
Max Speed | |
---|---|
Animal | |
Falcon | 370.0 |
Parrot | 25.0 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:810px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aarrays%20%3D%20%5B%5B'Falcon',%20'Falcon',%20'Parrot',%20'Parrot'%5D,%0A%20%20%20%20%20%20%20%20%20%20%5B'Captive',%20'Wild',%20'Captive',%20'Wild'%5D%5D%0Aindex%20%3D%20pd.MultiIndex.from_arrays%28arrays,%20names%3D%28'Animal',%20'Type'%29%29%0Adf%20%3D%20pd.DataFrame%28%7B'Max%20Speed'%3A%20%5B390.,%20350.,%2030.,%2020.%5D%7D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20index%3Dindex%29%0Adf.groupby%28level%3D0%29.mean%28%29&d=2023-07-15&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
df.groupby(level="Type").mean()
Max Speed | |
---|---|
Type | |
Captive | 210.0 |
Wild | 185.0 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:810px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aarrays%20%3D%20%5B%5B'Falcon',%20'Falcon',%20'Parrot',%20'Parrot'%5D,%0A%20%20%20%20%20%20%20%20%20%20%5B'Captive',%20'Wild',%20'Captive',%20'Wild'%5D%5D%0Aindex%20%3D%20pd.MultiIndex.from_arrays%28arrays,%20names%3D%28'Animal',%20'Type'%29%29%0Adf%20%3D%20pd.DataFrame%28%7B'Max%20Speed'%3A%20%5B390.,%20350.,%2030.,%2020.%5D%7D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20index%3Dindex%29%0Adf.groupby%28level%3D%22Type%22%29.mean%28%29&d=2023-07-15&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
We can also choose to include NA in group keys or not by setting dropna
parameter, the default setting is True
.
l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
df.groupby(by=["b"]).sum()
a | c | |
---|---|---|
b | ||
1.0 | 2 | 3 |
2.0 | 2 | 5 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:770px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Al%20%3D%20%5B%5B1,%202,%203%5D,%20%5B1,%20None,%204%5D,%20%5B2,%201,%203%5D,%20%5B1,%202,%202%5D%5D%0Adf%20%3D%20pd.DataFrame%28l,%20columns%3D%5B%22a%22,%20%22b%22,%20%22c%22%5D%29%0Adf.groupby%28by%3D%5B%22b%22%5D%29.sum%28%29%0A&d=2023-07-15&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
df.groupby(by=["b"], dropna=False).sum()
a | c | |
---|---|---|
b | ||
1.0 | 2 | 3 |
2.0 | 2 | 5 |
NaN | 1 | 4 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:750px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aimport%20numpy%20as%20np%0Al%20%3D%20%5B%5B1,%202,%203%5D,%20%5B1,%20None,%204%5D,%20%5B2,%201,%203%5D,%20%5B1,%202,%202%5D%5D%0Adf%20%3D%20pd.DataFrame%28l,%20columns%3D%5B%22a%22,%20%22b%22,%20%22c%22%5D%29%0Adf.groupby%28by%3D%5B%22b%22%5D,%20dropna%3DFalse%29.sum%28%29&d=2023-05-27&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
df.groupby(by="a").sum()
b | c | |
---|---|---|
a | ||
a | 13.0 | 13.0 |
b | 12.3 | 123.0 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:750px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Al%20%3D%20%5B%5B%22a%22,%2012,%2012%5D,%20%5BNone,%2012.3,%2033.%5D,%20%5B%22b%22,%2012.3,%20123%5D,%20%5B%22a%22,%201,%201%5D%5D%0Adf%20%3D%20pd.DataFrame%28l,%20columns%3D%5B%22a%22,%20%22b%22,%20%22c%22%5D%29%0Adf.groupby%28by%3D%22a%22%29.sum%28%29&d=2023-07-15&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
df.groupby(by="a", dropna=False).sum()
b | c | |
---|---|---|
a | ||
a | 13.0 | 13.0 |
b | 12.3 | 123.0 |
NaN | 12.3 | 33.0 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:750px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Al%20%3D%20%5B%5B%22a%22,%2012,%2012%5D,%20%5BNone,%2012.3,%2033.%5D,%20%5B%22b%22,%2012.3,%20123%5D,%20%5B%22a%22,%201,%201%5D%5D%0Adf%20%3D%20pd.DataFrame%28l,%20columns%3D%5B%22a%22,%20%22b%22,%20%22c%22%5D%29%0Adf.groupby%28by%3D%22a%22,%20dropna%3DFalse%29.sum%28%29%0A&d=2023-07-15&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
When using .apply()
, use group_keys
to include or exclude the group keys. The group_keys
argument defaults to True
(include).
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
'Parrot', 'Parrot'],
'Max Speed': [380., 370., 24., 26.]})
df.groupby("Animal", group_keys=True).apply(lambda x: x)
Animal | Max Speed | ||
---|---|---|---|
Animal | |||
Falcon | 0 | Falcon | 380.0 |
1 | Falcon | 370.0 | |
Parrot | 2 | Parrot | 24.0 |
3 | Parrot | 26.0 |
df.groupby("Animal", group_keys=False).apply(lambda x: x)
Animal | Max Speed | |
---|---|---|
0 | Falcon | 380.0 |
1 | Falcon | 370.0 |
2 | Parrot | 24.0 |
3 | Parrot | 26.0 |
5.4.3.4. Pivot table#
Create a spreadsheet-style pivot table as a DataFrame.
The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
"bar", "bar", "bar", "bar"],
"B": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"C": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
"D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
df
A | B | C | D | E | |
---|---|---|---|---|---|
0 | foo | one | small | 1 | 2 |
1 | foo | one | large | 2 | 4 |
2 | foo | one | large | 2 | 5 |
3 | foo | two | small | 3 | 5 |
4 | foo | two | small | 3 | 6 |
5 | bar | one | large | 4 | 6 |
6 | bar | one | small | 5 | 8 |
7 | bar | two | small | 6 | 9 |
8 | bar | two | large | 7 | 9 |
This first example aggregates values by taking the sum.
table = pd.pivot_table(df, values='D', index=['A', 'B'],
columns=['C'], aggfunc=np.sum)
table
C | large | small | |
---|---|---|---|
A | B | ||
bar | one | 4.0 | 5.0 |
two | 7.0 | 6.0 | |
foo | one | 4.0 | 1.0 |
two | NaN | 6.0 |
We can also fill in missing values using the fill_value
parameter.
table = pd.pivot_table(df, values='D', index=['A', 'B'],
columns=['C'], aggfunc=np.sum, fill_value=0)
table
C | large | small | |
---|---|---|---|
A | B | ||
bar | one | 4 | 5 |
two | 7 | 6 | |
foo | one | 4 | 1 |
two | 0 | 6 |
The next example aggregates by taking the mean across multiple columns.
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
aggfunc={'D': np.mean,
'E': np.mean})
table
D | E | ||
---|---|---|---|
A | C | ||
bar | large | 5.500000 | 7.500000 |
small | 5.500000 | 8.500000 | |
foo | large | 2.000000 | 4.500000 |
small | 2.333333 | 4.333333 |
We can also calculate multiple types of aggregations for any given value column.
table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
aggfunc={'D': np.mean,
'E': [min, max, np.mean]})
table
D | E | ||||
---|---|---|---|---|---|
mean | max | mean | min | ||
A | C | ||||
bar | large | 5.500000 | 9 | 7.500000 | 6 |
small | 5.500000 | 9 | 8.500000 | 8 | |
foo | large | 2.000000 | 5 | 4.500000 | 4 |
small | 2.333333 | 6 | 4.333333 | 2 |
5.4.3.5. High-performance Pandas: eval() and query()#
5.4.3.5.1. eval()#
Evaluate a string describing operations on DataFrame columns.
Operates on columns only, not specific rows or elements. This allows eval
to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.
For example:
df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
df
A | B | |
---|---|---|
0 | 1 | 10 |
1 | 2 | 8 |
2 | 3 | 6 |
3 | 4 | 4 |
4 | 5 | 2 |
df.eval('A + B')
0 11
1 10
2 9
3 8
4 7
dtype: int64
The assignment is allowed though by default the original DataFrame
is not modified.
df.eval('C = A + B')
A | B | C | |
---|---|---|---|
0 | 1 | 10 | 11 |
1 | 2 | 8 | 10 |
2 | 3 | 6 | 9 |
3 | 4 | 4 | 8 |
4 | 5 | 2 | 7 |
df
A | B | |
---|---|---|
0 | 1 | 10 |
1 | 2 | 8 |
2 | 3 | 6 |
3 | 4 | 4 |
4 | 5 | 2 |
Use inplace=True
to modify the original DataFrame.
df.eval('C = A + B', inplace=True)
df
A | B | C | |
---|---|---|---|
0 | 1 | 10 | 11 |
1 | 2 | 8 | 10 |
2 | 3 | 6 | 9 |
3 | 4 | 4 | 8 |
4 | 5 | 2 | 7 |
Multiple columns can be assigned using multi-line expressions:
df.eval(
'''
C = A + B
D = A - B
'''
)
A | B | C | D | |
---|---|---|---|---|
0 | 1 | 10 | 11 | -9 |
1 | 2 | 8 | 10 | -6 |
2 | 3 | 6 | 9 | -3 |
3 | 4 | 4 | 8 | 0 |
4 | 5 | 2 | 7 | 3 |
5.4.3.5.2. query()#
Query the columns of a DataFrame with a boolean expression.
For example:
df = pd.DataFrame({
'A': range(1, 6),
'B': range(10, 0, -2),
'C C': range(10, 5, -1)
})
df
A | B | C C | |
---|---|---|---|
0 | 1 | 10 | 10 |
1 | 2 | 8 | 9 |
2 | 3 | 6 | 8 |
3 | 4 | 4 | 7 |
4 | 5 | 2 | 6 |
df.query('A > B')
A | B | C C | |
---|---|---|---|
4 | 5 | 2 | 6 |
The previous expression is equivalent to
df[df.A > df.B]
A | B | C C | |
---|---|---|---|
4 | 5 | 2 | 6 |
For columns with spaces in their name, you can use backtick quoting.
df.query('B == `C C`')
A | B | C C | |
---|---|---|---|
0 | 1 | 10 | 10 |
The previous expression is equivalent to
df[df.B == df['C C']]
A | B | C C | |
---|---|---|---|
0 | 1 | 10 | 10 |
from IPython.display import HTML
display(
HTML(
"""
<div class='full-width docutils' >
<div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
<p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
<div class="pandastutor inner" style="height:660px;">
<iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aimport%20numpy%20as%20np%0Adf%20%3D%20pd.DataFrame%28%7B%0A%20%20%20%20'A'%3A%20range%281,%206%29,%0A%20%20%20%20'B'%3A%20range%2810,%200,%20-2%29,%0A%20%20%20%20'C%20C'%3A%20range%2810,%205,%20-1%29%0A%7D%29%0A%0Adf%5Bdf.B%20%3D%3D%20df%5B'C%20C'%5D%5D&d=2023-07-13&lang=py&v=v1"> </iframe>
</div>
</div>
</div>
"""
)
)
Let's visualize it! 🎥
5.4.3.6. Your turn! 🚀#
5.4.3.6.1. Processing image data#
Recently, very powerful AI models have been developed that allow us to understand images. There are many tasks that can be solved using pre-trained neural networks, or cloud services. Some examples include:
Image Classification, can help you categorize the image into one of the pre-defined classes. You can easily train your own image classifiers using services such as Custom Vision
Object Detection to detect different objects in the image. Services such as computer vision can detect a number of common objects, and you can train Custom Vision model to detect some specific objects of interest.
Face Detection, including Age, Gender and Emotion detection. This can be done via Face API.
All those cloud services can be called using Python SDKs, and thus can be easily incorporated into your data exploration workflow.
Here are some examples of exploring data from Image data sources:
In the blog post How to Learn Data Science without Coding we explore Instagram photos, trying to understand what makes people give more likes to a photo. We first extract as much information from pictures as possible using computer vision, and then use Azure Machine Learning AutoML to build the interpretable model.
In Facial Studies Workshop we use Face API to extract emotions from people on photographs from events, in order to try to understand what makes people happy.
5.4.3.6.2. Assignment#
5.4.3.7. Self study#
In this chapter, we’ve covered many of the basics of using Pandas effectively for data analysis. Still, much has been omitted from our discussion. To learn more about Pandas, we recommend the following resources:
Pandas online documentation: This is the go-to source for complete documentation of the package. While the examples in the documentation tend to be small generated datasets, the description of the options is complete and generally very useful for understanding the use of various functions.
Python for Data Analysis Written by Wes McKinney (the original creator of Pandas), this book contains much more detail on the Pandas package than we had room for in this chapter. In particular, he takes a deep dive into tools for time series, which were his bread and butter as a financial consultant. The book also has many entertaining examples of applying Pandas to gain insight from real-world datasets. Keep in mind, though, that the book is now several years old, and the Pandas package has quite a few new features that this book does not cover (but be on the lookout for a new edition in 2017).
Stack Overflow: Pandas has so many users that any question you have has likely been asked and answered on Stack Overflow. Using Pandas is a case where some Google-Fu is your best friend. Simply go to your favorite search engine and type in the question, problem, or error you’re coming across-more than likely you’ll find your answer on a Stack Overflow page.
Pandas on PyVideo: From PyCon to SciPy to PyData, many conferences have featured tutorials from Pandas developers and power users. The PyCon tutorials in particular tend to be given by very well-vetted presenters.
Using these resources, combined with the walk-through given in this chapter, my hope is that you’ll be poised to use Pandas to tackle any data analysis problem you come across!
5.4.3.8. Acknowledgments#
Thanks for Pandas user guide. It contributes the majority of the content in this chapter.