# Install the necessary dependencies

import os
import sys
import numpy as np
import pandas as pd
!{sys.executable} -m pip install --quiet jupyterlab_myst ipython

5.4.3. Advanced Pandas Techniques#

5.4.3.1. Overview#

In this section, we’ll continue to introduce combining datasets: concat, merge and join along with data aggregation and grouping.

5.4.3.2. Combining datasets: concat, merge and join#

5.4.3.2.1. concat#

Concatenate Pandas objects along a particular axis.
Allows optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

For example:

Combine two Series.

s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2])

  a
  b
  c
  d
dtype: object

Clear the existing index and reset it in the result by setting the ignore_index option to True.

pd.concat([s1, s2], ignore_index=True)

  a
  b
  c
  d
dtype: object

Add a hierarchical index at the outermost level of the data with the keys option.

pd.concat([s1, s2], keys=['s1', 's2'])

s1  0    a
    1    b
s2  0    c
    1    d
dtype: object

Label the index keys you create with the names option.

pd.concat([s1, s2], keys=['s1', 's2'],
          names=['Series name', 'Row ID'])

Series name  Row ID
s1           0         a
             1         b
s2           0         c
             1         d
dtype: object

Combine two DataFrame objects with identical columns.

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df1

	letter	number
0	a	1
1	b	2

df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])
df2

	letter	number
0	c	3
1	d	4

pd.concat([df1, df2])

	letter	number
0	a	1
1	b	2
0	c	3
1	d	4

Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.

df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'])
df3

	letter	number	animal
0	c	3	cat
1	d	4	dog

pd.concat([df1, df3], sort=False)

	letter	number	animal
0	a	1	NaN
1	b	2	NaN
0	c	3	cat
1	d	4	dog

Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.

pd.concat([df1, df3], join="inner")

	letter	number
0	a	1
1	b	2
0	c	3
1	d	4

Combine DataFrame objects horizontally along the x-axis by passing in axis=1.

df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
                   columns=['animal', 'name'])
pd.concat([df1, df4], axis=1)

	letter	number	animal	name
0	a	1	bird	polly
1	b	2	monkey	george

Prevent the result from including duplicate index values with the verify_integrity option.

df5 = pd.DataFrame([1], index=['a'])
df5

	0
a	1

df6 = pd.DataFrame([2], index=['a'])
df6

	0
a	2

pd.concat([df5, df6], verify_integrity=True)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 1
----> 1 pd.concat([df5, df6], verify_integrity=True)

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:368, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    146 @deprecate_nonkeyword_arguments(version=None, allowed_args=["objs"])
    147 def concat(
    148     objs: Iterable[NDFrame] | Mapping[HashableT, NDFrame],
   (...)
    157     copy: bool = True,
    158 ) -> DataFrame | Series:
    159     """
    160     Concatenate pandas objects along a particular axis.
    161 
   (...)
    366     1   3   4
    367     """
--> 368     op = _Concatenator(
    369         objs,
    370         axis=axis,
    371         ignore_index=ignore_index,
    372         join=join,
    373         keys=keys,
    374         levels=levels,
    375         names=names,
    376         verify_integrity=verify_integrity,
    377         copy=copy,
    378         sort=sort,
    379     )
    381     return op.get_result()

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:563, in _Concatenator.__init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    560 self.verify_integrity = verify_integrity
    561 self.copy = copy
--> 563 self.new_axes = self._get_new_axes()

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:633, in _Concatenator._get_new_axes(self)
    631 def _get_new_axes(self) -> list[Index]:
    632     ndim = self._get_result_dim()
--> 633     return [
    634         self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
    635         for i in range(ndim)
    636     ]

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:634, in <listcomp>(.0)
    631 def _get_new_axes(self) -> list[Index]:
    632     ndim = self._get_result_dim()
    633     return [
--> 634         self._get_concat_axis if i == self.bm_axis else self._get_comb_axis(i)
    635         for i in range(ndim)
    636     ]

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/_libs/properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:697, in _Concatenator._get_concat_axis(self)
    692 else:
    693     concat_axis = _make_concat_multiindex(
    694         indexes, self.keys, self.levels, self.names
    695     )
--> 697 self._maybe_check_integrity(concat_axis)
    699 return concat_axis

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/concat.py:705, in _Concatenator._maybe_check_integrity(self, concat_index)
    703 if not concat_index.is_unique:
    704     overlap = concat_index[concat_index.duplicated()].unique()
--> 705     raise ValueError(f"Indexes have overlapping values: {overlap}")

ValueError: Indexes have overlapping values: Index(['a'], dtype='object')

Append a single row to the end of a DataFrame object.

df7 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
df7

	a	b
0	1	2

new_row = pd.Series({'a': 3, 'b': 4})
new_row

a    3
b    4
dtype: int64

pd.concat([df7, new_row.to_frame().T], ignore_index=True)

	a	b
0	1	2
1	3	4

Note

append() has been deprecated since version 1.4.0: Use concat() instead.

5.4.3.2.2. merge#

Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross-merge, no column specifications to merge on are allowed.

If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.

For example:

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.

df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=('_left', '_right'))

	lkey	value_left	rkey	value_right
0	foo	1	foo	5
1	foo	1	foo	8
2	foo	5	foo	5
3	foo	5	foo	8
4	bar	2	bar	6
5	baz	3	baz	7

from IPython.display import HTML

display(
    HTML(
        """


<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:800px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aimport%20numpy%20as%20np%0Adf1%20%3D%20pd.DataFrame%28%7B'lkey'%3A%20%5B'foo',%20'bar',%20'baz',%20'foo'%5D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'value'%3A%20%5B1,%202,%203,%205%5D%7D%29%0Adf2%20%3D%20pd.DataFrame%28%7B'rkey'%3A%20%5B'foo',%20'bar',%20'baz',%20'foo'%5D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'value'%3A%20%5B5,%206,%207,%208%5D%7D%29%0Adf1.merge%28df2,%20left_on%3D'lkey',%20right_on%3D'rkey'%29&d=2023-05-27&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>


"""
    )
)

Let's visualize it! 🎥

Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.

df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[22], line 1
----> 1 df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/frame.py:10090, in DataFrame.merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
@Substitution("")
@Appender(_merge_doc, indents=2)
def merge(
   (...)
   validate: str | None = None,
) -> DataFrame:
   from pandas.core.reshape.merge import merge
> 10090     return merge(
       self,
       right,
       how=how,
       on=on,
       left_on=left_on,
       right_on=right_on,
       left_index=left_index,
       right_index=right_index,
       sort=sort,
       suffixes=suffixes,
       copy=copy,
       indicator=indicator,
       validate=validate,
   )

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/merge.py:124, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
@Substitution("\nleft : DataFrame or named Series")
@Appender(_merge_doc, indents=0)
def merge(
   (...)
   validate: str | None = None,
) -> DataFrame:
   op = _MergeOperation(
       left,
       right,
   (...)
       validate=validate,
   )
--> 124     return op.get_result(copy=copy)

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/merge.py:775, in _MergeOperation.get_result(self, copy)
   self.left, self.right = self._indicator_pre_merge(self.left, self.right)
join_index, left_indexer, right_indexer = self._get_join_info()
--> 775 result = self._reindex_and_concat(
   join_index, left_indexer, right_indexer, copy=copy
)
result = result.__finalize__(self, method=self._merge_type)
if self.indicator:

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/merge.py:729, in _MergeOperation._reindex_and_concat(self, join_index, left_indexer, right_indexer, copy)
left = self.left[:]
right = self.right[:]
--> 729 llabels, rlabels = _items_overlap_with_suffix(
   self.left._info_axis, self.right._info_axis, self.suffixes
)
if left_indexer is not None:
   # Pinning the index here (and in the right code just below) is not
   #  necessary, but makes the `.take` more performant if we have e.g.
   #  a MultiIndex for left.index.
   lmgr = left._mgr.reindex_indexer(
       join_index,
       left_indexer,
   (...)
       use_na_proxy=True,
   )

File /usr/share/miniconda/envs/open-machine-learning-jupyter-book/lib/python3.9/site-packages/pandas/core/reshape/merge.py:2458, in _items_overlap_with_suffix(left, right, suffixes)
lsuffix, rsuffix = suffixes
if not lsuffix and not rsuffix:
-> 2458     raise ValueError(f"columns overlap but no suffix specified: {to_rename}")
def renamer(x, suffix):
   """
   Rename the left and right indices.

   (...)
   x : renamed column name
   """

ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

Using how parameter decide the type of merge to be performed.

df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})

df1.merge(df2, how='inner', on='a')

	a	b	c
0	foo	1	3

from IPython.display import HTML

display(
    HTML(
        """


<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:600px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Adf1%20%3D%20pd.DataFrame%28%7B'a'%3A%20%5B'foo',%20'bar'%5D,%20'b'%3A%20%5B1,%202%5D%7D%29%0Adf2%20%3D%20pd.DataFrame%28%7B'a'%3A%20%5B'foo',%20'baz'%5D,%20'c'%3A%20%5B3,%204%5D%7D%29%0Adf1.merge%28df2,%20how%3D'inner',%20on%3D'a'%29&d=2023-07-15&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>

"""
    )
)

Let's visualize it! 🎥

df1.merge(df2, how='left', on='a')

	a	b	c
0	foo	1	3.0
1	bar	2	NaN

from IPython.display import HTML

display(
    HTML(
        """


<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:630px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Adf1%20%3D%20pd.DataFrame%28%7B'a'%3A%20%5B'foo',%20'bar'%5D,%20'b'%3A%20%5B1,%202%5D%7D%29%0Adf2%20%3D%20pd.DataFrame%28%7B'a'%3A%20%5B'foo',%20'baz'%5D,%20'c'%3A%20%5B3,%204%5D%7D%29%0Adf1.merge%28df2,%20how%3D'left',%20on%3D'a'%29&d=2023-07-15&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>
"""
    )
)

Let's visualize it! 🎥

df1 = pd.DataFrame({'left': ['foo', 'bar']})
df2 = pd.DataFrame({'right': [7, 8]})

df1.merge(df2, how='cross')

	left	right
0	foo	7
1	foo	8
2	bar	7
3	bar	8

from IPython.display import HTML

display(
    HTML(
        """


<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:670px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Adf1%20%3D%20pd.DataFrame%28%7B'left'%3A%20%5B'foo',%20'bar'%5D%7D%29%0Adf2%20%3D%20pd.DataFrame%28%7B'right'%3A%20%5B7,%208%5D%7D%29%0Adf1.merge%28df2,%20how%3D'cross'%29&d=2023-07-15&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>




"""
    )
)

Let's visualize it! 🎥

5.4.3.2.3. join#

Join columns of another DataFrame.
Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

For example:

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})                     

Join DataFrames using their indexes.

df.join(other, lsuffix='_caller', rsuffix='_other')

	key_caller	A	key_other	B
0	K0	A0	K0	B0
1	K1	A1	K1	B1
2	K2	A2	K2	B2
3	K3	A3	NaN	NaN
4	K4	A4	NaN	NaN
5	K5	A5	NaN	NaN

from IPython.display import HTML

display(
    HTML(
        """


<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:830px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Adf%20%3D%20pd.DataFrame%28%7B'key'%3A%20%5B'K0',%20'K1',%20'K2',%20'K3',%20'K4',%20'K5'%5D,'A'%3A%20%5B'A0',%20'A1',%20'A2',%20'A3',%20'A4',%20'A5'%5D%7D%29%0Aother%20%3D%20pd.DataFrame%28%7B'key'%3A%20%5B'K0',%20'K1',%20'K2'%5D,'B'%3A%20%5B'B0',%20'B1',%20'B2'%5D%7D%29%0Adf.join%28other,%20lsuffix%3D'_caller',%20rsuffix%3D'_other'%29&d=2023-07-15&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>




"""
    )
)

Let's visualize it! 🎥

If we want to join using the key columns, we need to set key to be the index in both df and other. The joined DataFrame will have key as its index.

df.set_index('key').join(other.set_index('key'))

	A	B
key
K0	A0	B0
K1	A1	B1
K2	A2	B2
K3	A3	NaN
K4	A4	NaN
K5	A5	NaN

from IPython.display import HTML

display(
    HTML(
        """


<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:1170px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aimport%20numpy%20as%20np%0Adf%20%3D%20pd.DataFrame%28%7B'key'%3A%20%5B'K0',%20'K1',%20'K2',%20'K3',%20'K4',%20'K5'%5D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'A'%3A%20%5B'A0',%20'A1',%20'A2',%20'A3',%20'A4',%20'A5'%5D%7D%29%0Aother%20%3D%20pd.DataFrame%28%7B'key'%3A%20%5B'K0',%20'K1',%20'K2'%5D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'B'%3A%20%5B'B0',%20'B1',%20'B2'%5D%7D%29%0Adf.set_index%28'key'%29.join%28other.set_index%28'key'%29%29&d=2023-05-27&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>




"""
    )
)

Let's visualize it! 🎥

Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in df. This method preserves the original DataFrame’s index in the result.

df.join(other.set_index('key'), on='key')

	key	A	B
0	K0	A0	B0
1	K1	A1	B1
2	K2	A2	B2
3	K3	A3	NaN
4	K4	A4	NaN
5	K5	A5	NaN

Using non-unique key values shows how they are matched.

df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df                   

	key	A
0	K0	A0
1	K1	A1
2	K1	A2
3	K3	A3
4	K0	A4
5	K1	A5

df.join(other.set_index('key'), on='key', validate='m:1')

	key	A	B
0	K0	A0	B0
1	K1	A1	B1
2	K1	A2	B1
3	K3	A3	NaN
4	K0	A4	B0
5	K1	A5	B1

5.4.3.3. Aggregation and grouping#

Group DataFrame using a mapper or by a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

For example:

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df
df.groupby(['Animal']).mean()

	Max Speed
Animal
Falcon	375.0
Parrot	25.0

from IPython.display import HTML

display(
    HTML(
        """

<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:785px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aimport%20numpy%20as%20np%0Adf%20%3D%20pd.DataFrame%28%7B'Animal'%3A%20%5B'Falcon',%20'Falcon',%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'Parrot',%20'Parrot'%5D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20'Max%20Speed'%3A%20%5B380.,%20370.,%2024.,%2026.%5D%7D%29%0Adf%0Adf.groupby%28%5B'Animal'%5D%29.mean%28%29&d=2023-05-27&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>




"""
    )
)

Let's visualize it! 🎥

5.4.3.3.1. Hierarchical Indexes#

We can groupby different levels of a hierarchical index using the level parameter:

arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
          ['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
                  index=index)
df.groupby(level=0).mean()

	Max Speed
Animal
Falcon	370.0
Parrot	25.0

from IPython.display import HTML

display(
    HTML(
        """

<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:810px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aarrays%20%3D%20%5B%5B'Falcon',%20'Falcon',%20'Parrot',%20'Parrot'%5D,%0A%20%20%20%20%20%20%20%20%20%20%5B'Captive',%20'Wild',%20'Captive',%20'Wild'%5D%5D%0Aindex%20%3D%20pd.MultiIndex.from_arrays%28arrays,%20names%3D%28'Animal',%20'Type'%29%29%0Adf%20%3D%20pd.DataFrame%28%7B'Max%20Speed'%3A%20%5B390.,%20350.,%2030.,%2020.%5D%7D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20index%3Dindex%29%0Adf.groupby%28level%3D0%29.mean%28%29&d=2023-07-15&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>



"""
    )
)

Let's visualize it! 🎥

df.groupby(level="Type").mean()

	Max Speed
Type
Captive	210.0
Wild	185.0

from IPython.display import HTML

display(
    HTML(
        """

<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:810px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aarrays%20%3D%20%5B%5B'Falcon',%20'Falcon',%20'Parrot',%20'Parrot'%5D,%0A%20%20%20%20%20%20%20%20%20%20%5B'Captive',%20'Wild',%20'Captive',%20'Wild'%5D%5D%0Aindex%20%3D%20pd.MultiIndex.from_arrays%28arrays,%20names%3D%28'Animal',%20'Type'%29%29%0Adf%20%3D%20pd.DataFrame%28%7B'Max%20Speed'%3A%20%5B390.,%20350.,%2030.,%2020.%5D%7D,%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20index%3Dindex%29%0Adf.groupby%28level%3D%22Type%22%29.mean%28%29&d=2023-07-15&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>



"""
    )
)

Let's visualize it! 🎥

We can also choose to include NA in group keys or not by setting dropna parameter, the default setting is True.

l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
df.groupby(by=["b"]).sum()

	a	c
b
1.0	2	3
2.0	2	5

from IPython.display import HTML

display(
    HTML(
        """

<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:770px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Al%20%3D%20%5B%5B1,%202,%203%5D,%20%5B1,%20None,%204%5D,%20%5B2,%201,%203%5D,%20%5B1,%202,%202%5D%5D%0Adf%20%3D%20pd.DataFrame%28l,%20columns%3D%5B%22a%22,%20%22b%22,%20%22c%22%5D%29%0Adf.groupby%28by%3D%5B%22b%22%5D%29.sum%28%29%0A&d=2023-07-15&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>


"""
    )
)

Let's visualize it! 🎥

df.groupby(by=["b"], dropna=False).sum()

	a	c
b
1.0	2	3
2.0	2	5
NaN	1	4

from IPython.display import HTML

display(
    HTML(
        """

<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:750px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aimport%20numpy%20as%20np%0Al%20%3D%20%5B%5B1,%202,%203%5D,%20%5B1,%20None,%204%5D,%20%5B2,%201,%203%5D,%20%5B1,%202,%202%5D%5D%0Adf%20%3D%20pd.DataFrame%28l,%20columns%3D%5B%22a%22,%20%22b%22,%20%22c%22%5D%29%0Adf.groupby%28by%3D%5B%22b%22%5D,%20dropna%3DFalse%29.sum%28%29&d=2023-05-27&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>

"""
    )
)

Let's visualize it! 🎥

l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
df = pd.DataFrame(l, columns=["a", "b", "c"])
df.groupby(by="a").sum()

	b	c
a
a	13.0	13.0
b	12.3	123.0

from IPython.display import HTML

display(
    HTML(
        """

<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:750px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Al%20%3D%20%5B%5B%22a%22,%2012,%2012%5D,%20%5BNone,%2012.3,%2033.%5D,%20%5B%22b%22,%2012.3,%20123%5D,%20%5B%22a%22,%201,%201%5D%5D%0Adf%20%3D%20pd.DataFrame%28l,%20columns%3D%5B%22a%22,%20%22b%22,%20%22c%22%5D%29%0Adf.groupby%28by%3D%22a%22%29.sum%28%29&d=2023-07-15&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>



"""
    )
)

Let's visualize it! 🎥

df.groupby(by="a", dropna=False).sum()

	b	c
a
a	13.0	13.0
b	12.3	123.0
NaN	12.3	33.0

from IPython.display import HTML

display(
    HTML(
        """
<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:750px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Al%20%3D%20%5B%5B%22a%22,%2012,%2012%5D,%20%5BNone,%2012.3,%2033.%5D,%20%5B%22b%22,%2012.3,%20123%5D,%20%5B%22a%22,%201,%201%5D%5D%0Adf%20%3D%20pd.DataFrame%28l,%20columns%3D%5B%22a%22,%20%22b%22,%20%22c%22%5D%29%0Adf.groupby%28by%3D%22a%22,%20dropna%3DFalse%29.sum%28%29%0A&d=2023-07-15&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>



"""
    )
)

Let's visualize it! 🎥

When using .apply(), use group_keys to include or exclude the group keys. The group_keys argument defaults to True (include).

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df.groupby("Animal", group_keys=True).apply(lambda x: x)

		Animal	Max Speed
Animal
Falcon	0	Falcon	380.0
Falcon	1	Falcon	370.0
Parrot	2	Parrot	24.0
Parrot	3	Parrot	26.0

df.groupby("Animal", group_keys=False).apply(lambda x: x)

	Animal	Max Speed
0	Falcon	380.0
1	Falcon	370.0
2	Parrot	24.0
3	Parrot	26.0

5.4.3.4. Pivot table#

Create a spreadsheet-style pivot table as a DataFrame.

The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large"],
                   "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                   "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
df

	A	B	C	D	E
0	foo	one	small	1	2
1	foo	one	large	2	4
2	foo	one	large	2	5
3	foo	two	small	3	5
4	foo	two	small	3	6
5	bar	one	large	4	6
6	bar	one	small	5	8
7	bar	two	small	6	9
8	bar	two	large	7	9

This first example aggregates values by taking the sum.

table = pd.pivot_table(df, values='D', index=['A', 'B'],
                    columns=['C'], aggfunc=np.sum)
table

	C	large	small
A	B
bar	one	4.0	5.0
bar	two	7.0	6.0
foo	one	4.0	1.0
foo	two	NaN	6.0

We can also fill in missing values using the fill_value parameter.

table = pd.pivot_table(df, values='D', index=['A', 'B'],
                    columns=['C'], aggfunc=np.sum, fill_value=0)
table

	C	large	small
A	B
bar	one	4	5
bar	two	7	6
foo	one	4	1
foo	two	0	6

The next example aggregates by taking the mean across multiple columns.

table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
                    aggfunc={'D': np.mean,
                             'E': np.mean})
table

		D	E
A	C
bar	large	5.500000	7.500000
bar	small	5.500000	8.500000
foo	large	2.000000	4.500000
foo	small	2.333333	4.333333

We can also calculate multiple types of aggregations for any given value column.

table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
                    aggfunc={'D': np.mean,
                             'E': [min, max, np.mean]})
table

		D	E
		mean	max	mean	min
A	C
bar	large	5.500000	9	7.500000	6
bar	small	5.500000	9	8.500000	8
foo	large	2.000000	5	4.500000	4
foo	small	2.333333	6	4.333333	2

5.4.3.5. High-performance Pandas: eval() and query()#

5.4.3.5.1. eval()#

Evaluate a string describing operations on DataFrame columns.

Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.

For example:

df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
df

	A	B
0	1	10
1	2	8
2	3	6
3	4	4
4	5	2

df.eval('A + B')

  11
  10
   9
   8
   7
dtype: int64

The assignment is allowed though by default the original DataFrame is not modified.

df.eval('C = A + B')

	A	B	C
0	1	10	11
1	2	8	10
2	3	6	9
3	4	4	8
4	5	2	7

df

	A	B
0	1	10
1	2	8
2	3	6
3	4	4
4	5	2

Use inplace=True to modify the original DataFrame.

df.eval('C = A + B', inplace=True)
df

	A	B	C
0	1	10	11
1	2	8	10
2	3	6	9
3	4	4	8
4	5	2	7

Multiple columns can be assigned using multi-line expressions:

df.eval(
    '''
    C = A + B
    D = A - B
    '''
)

	A	B	C	D
0	1	10	11	-9
1	2	8	10	-6
2	3	6	9	-3
3	4	4	8	0
4	5	2	7	3

5.4.3.5.2. query()#

Query the columns of a DataFrame with a boolean expression.

For example:

df = pd.DataFrame({
    'A': range(1, 6),
    'B': range(10, 0, -2),
    'C C': range(10, 5, -1)
})
df

	A	B	C C
0	1	10	10
1	2	8	9
2	3	6	8
3	4	4	7
4	5	2	6

df.query('A > B')

	A	B	C C
4	5	2	6

The previous expression is equivalent to

df[df.A > df.B]

	A	B	C C
4	5	2	6

For columns with spaces in their name, you can use backtick quoting.

df.query('B == `C C`')

	A	B	C C
0	1	10	10

The previous expression is equivalent to

df[df.B == df['C C']]

	A	B	C C
0	1	10	10

from IPython.display import HTML

display(
    HTML(
        """
<div class='full-width docutils' >
  <div class="admonition note pandastutor" name="html-admonition" style="margin-right:20%">
    <p class="admonition-title pandastutor">Let's visualize it! 🎥</p>
    <div class="pandastutor inner" style="height:660px;">
      <iframe frameborder="0" scrolling="no" src="https://pandastutor.com/vis.html#code=import%20pandas%20as%20pd%0Aimport%20io%0Aimport%20numpy%20as%20np%0Adf%20%3D%20pd.DataFrame%28%7B%0A%20%20%20%20'A'%3A%20range%281,%206%29,%0A%20%20%20%20'B'%3A%20range%2810,%200,%20-2%29,%0A%20%20%20%20'C%20C'%3A%20range%2810,%205,%20-1%29%0A%7D%29%0A%0Adf%5Bdf.B%20%3D%3D%20df%5B'C%20C'%5D%5D&d=2023-07-13&lang=py&v=v1"> </iframe>
    </div>
  </div>
</div>


"""
    )
)

Let's visualize it! 🎥

5.4.3.6. Your turn! 🚀#

5.4.3.6.1. Processing image data#

Recently, very powerful AI models have been developed that allow us to understand images. There are many tasks that can be solved using pre-trained neural networks, or cloud services. Some examples include:

Image Classification, can help you categorize the image into one of the pre-defined classes. You can easily train your own image classifiers using services such as Custom Vision
Object Detection to detect different objects in the image. Services such as computer vision can detect a number of common objects, and you can train Custom Vision model to detect some specific objects of interest.
Face Detection, including Age, Gender and Emotion detection. This can be done via Face API.

All those cloud services can be called using Python SDKs, and thus can be easily incorporated into your data exploration workflow.

Here are some examples of exploring data from Image data sources:

In the blog post How to Learn Data Science without Coding we explore Instagram photos, trying to understand what makes people give more likes to a photo. We first extract as much information from pictures as possible using computer vision, and then use Azure Machine Learning AutoML to build the interpretable model.
In Facial Studies Workshop we use Face API to extract emotions from people on photographs from events, in order to try to understand what makes people happy.

5.4.3.6.2. Assignment#

Perform more detailed data study for the challenges above

5.4.3.7. Self study#

In this chapter, we’ve covered many of the basics of using Pandas effectively for data analysis. Still, much has been omitted from our discussion. To learn more about Pandas, we recommend the following resources:

Pandas online documentation: This is the go-to source for complete documentation of the package. While the examples in the documentation tend to be small generated datasets, the description of the options is complete and generally very useful for understanding the use of various functions.
Python for Data Analysis Written by Wes McKinney (the original creator of Pandas), this book contains much more detail on the Pandas package than we had room for in this chapter. In particular, he takes a deep dive into tools for time series, which were his bread and butter as a financial consultant. The book also has many entertaining examples of applying Pandas to gain insight from real-world datasets. Keep in mind, though, that the book is now several years old, and the Pandas package has quite a few new features that this book does not cover (but be on the lookout for a new edition in 2017).
Stack Overflow: Pandas has so many users that any question you have has likely been asked and answered on Stack Overflow. Using Pandas is a case where some Google-Fu is your best friend. Simply go to your favorite search engine and type in the question, problem, or error you’re coming across-more than likely you’ll find your answer on a Stack Overflow page.
Pandas on PyVideo: From PyCon to SciPy to PyData, many conferences have featured tutorials from Pandas developers and power users. The PyCon tutorials in particular tend to be given by very well-vetted presenters.

Using these resources, combined with the walk-through given in this chapter, my hope is that you’ll be poised to use Pandas to tackle any data analysis problem you come across!

5.4.3.8. Acknowledgments#

Thanks for Pandas user guide. It contributes the majority of the content in this chapter.

Ocademy Open Machine Learning Book

Advanced Pandas Techniques

Contents

5.4.3. Advanced Pandas Techniques#

5.4.3.1. Overview#

5.4.3.2. Combining datasets: concat, merge and join#

5.4.3.2.1. concat#

5.4.3.2.2. merge#

5.4.3.2.3. join#

5.4.3.3. Aggregation and grouping#

5.4.3.3.1. Hierarchical Indexes#

5.4.3.4. Pivot table#

5.4.3.5. High-performance Pandas: eval() and query()#

5.4.3.5.1. eval()#

5.4.3.5.2. query()#

5.4.3.6. Your turn! 🚀#

5.4.3.6.1. Processing image data#

5.4.3.6.2. Assignment#

5.4.3.7. Self study#

5.4.3.8. Acknowledgments#

	key_caller	A	key_other	B
0	K0	A0	K0	B0
1	K1	A1	K1	B1
2	K2	A2	K2	B2
3	K3	A3	NaN	NaN
4	K4	A4	NaN	NaN
5	K5	A5	NaN	NaN

	key_caller	A	key_other	B
0	K0	A0	K0	B0
1	K1	A1	K1	B1
2	K2	A2	K2	B2
3	K3	A3	NaN	NaN
4	K4	A4	NaN	NaN
5	K5	A5	NaN	NaN

	key_caller	A	key_other	B
0	K0	A0	K0	B0
1	K1	A1	K1	B1
2	K2	A2	K2	B2
3	K3	A3	NaN	NaN
4	K4	A4	NaN	NaN
5	K5	A5	NaN	NaN