# Install the necessary dependencies

import sys
!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython

43.6. NumPy and Pandas#

43.6.1. 1. NumPy#

NumPy is a library for working with tensors, i.e. multi-dimensional arrays. An array holds values of a single underlying type; it is simpler than a DataFrame, but it offers more mathematical operations and incurs less overhead.

43.6.1.1. The basics of NumPy arrays#

Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array.

We’ll cover a few categories of basic array manipulations here, with a short illustrative sketch after the list:

  • Attributes of arrays: Determining the size, shape, memory consumption, and data types of arrays.

  • Indexing of arrays: Getting and setting the value of individual array elements.

  • Slicing of arrays: Getting and setting smaller subarrays within a larger array.

  • Reshaping of arrays: Changing the shape of a given array.

  • Joining and splitting of arrays: Combining multiple arrays into one, and splitting one array into many.
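
As a quick preview of the categories beyond attributes, here is a minimal sketch on a small illustrative array (the variable x below is ours, not part of the lesson):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
print(x[0], x[-1])             # indexing: first and last element
print(x[1:4])                  # slicing: [2 3 4]
print(x.reshape((2, 3)))       # reshaping into a 2x3 grid
print(np.concatenate([x, x]))  # joining two arrays into one
print(np.split(x, 3))          # splitting into three equal subarrays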

43.6.1.1.1. NumPy array attributes#

First let’s discuss some useful array attributes. We’ll start by defining three random arrays, a one-dimensional, two-dimensional, and three-dimensional array.

import numpy as np
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60
print("dtype:", x3.dtype)
dtype: int64
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")
itemsize: 8 bytes
nbytes: 480 bytes

43.6.1.2. Computation on NumPy arrays: universal functions#

Computation on NumPy arrays can be very fast, or it can be very slow. The key to making it fast is to use vectorized operations, generally implemented through NumPy’s universal functions (ufuncs).

import numpy as np
np.random.seed(0)


def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output


values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)
array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)
1.87 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(compute_reciprocals(values))
print(1.0 / values)
[0.16666667 1.         0.25       0.25       0.125     ]
[0.16666667 1.         0.25       0.25       0.125     ]
%timeit (1.0 / big_array)
606 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

43.6.1.3. Exploring NumPy’s ufuncs#

x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division
x     = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]
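
The arithmetic operators above are convenience wrappers around NumPy ufuncs (x + 5 calls np.add, for example). Many more ufuncs are available; a brief sketch of a few, reusing the x defined above:

print("-x     =", -x)             # negation (np.negative)
print("x ** 2 =", x ** 2)         # exponentiation (np.power)
print("x % 2  =", x % 2)          # modulus (np.mod)
print("e^x    =", np.exp(x))      # exponential
print("|x-2|  =", np.abs(x - 2))  # absolute value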

43.6.1.4. Aggregations: min, max, and everything in between#

Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question.

L = np.random.random(100)
sum(L)
50.461758453195614
np.sum(L)
50.46175845319564
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
74.9 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
193 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

43.6.1.4.1. Minimum and maximum#

min(big_array), max(big_array)
(7.071203171893359e-07, 0.9999997207656334)
np.min(big_array), np.max(big_array)
(7.071203171893359e-07, 0.9999997207656334)
%timeit min(big_array)
%timeit np.min(big_array)
54.5 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
175 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
print(big_array.min(), big_array.max(), big_array.sum())
7.071203171893359e-07 0.9999997207656334 500216.8034810001
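
Aggregations can also act along one axis of a multi-dimensional array via the axis argument. A minimal sketch (the matrix M here is illustrative):

M = np.random.random((3, 4))
print(M.sum())         # aggregate over the whole array
print(M.min(axis=0))   # minimum of each column
print(M.max(axis=1))   # maximum of each row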

43.6.1.5. Computation on arrays: broadcasting#

Another means of vectorizing operations is to use NumPy’s broadcasting functionality. Broadcasting is simply a set of rules for applying binary ufuncs (addition, subtraction, multiplication, etc.) on arrays of different sizes.

a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b
array([5, 6, 7])
a + 5
array([5, 6, 7])
M = np.ones((3, 3))
M
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
M + a
array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])
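
Broadcasting also applies when both arrays must be stretched: a row and a column combine into a full grid. A short sketch (redefining a and b for illustration):

a = np.arange(3)                 # shape (3,)
b = np.arange(3)[:, np.newaxis]  # shape (3, 1)
a + b                            # both broadcast to shape (3, 3)
array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4]])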

43.6.1.6. Structured data: NumPy’s structured arrays#

While often our data can be well represented by a homogeneous array of values, sometimes this is not the case. This section demonstrates the use of NumPy’s structured arrays and record arrays, which provide efficient storage for compound, heterogeneous data.

name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={
    'names': ('name', 'age', 'weight'),
    'formats': ('U10', 'i4', 'f8')
})
print(data.dtype)
[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
[('Alice', 25, 55. ) ('Bob', 45, 85.5) ('Cathy', 37, 68. )
 ('Doug', 19, 61.5)]
# Get all names
data['name']
array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')
# Get first row of data
data[0]
('Alice', 25, 55.)
# Get the name from the last row
data[-1]['name']
'Doug'
# Get names where age is under 30
data[data['age'] < 30]['name']
array(['Alice', 'Doug'], dtype='<U10')
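
A record array is a close cousin of the structured array in which fields are accessible as attributes rather than dictionary keys. A brief sketch using the data array defined above:

data_rec = data.view(np.recarray)
data_rec.age  # equivalent to data['age']
array([25, 45, 37, 19], dtype=int32)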

This section on structured and record arrays is purposely at the end of our NumPy coverage, because it leads so well into the next package we will cover: Pandas.

43.6.2. 2. Pandas#

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

43.6.2.1. Introducing Pandas objects#

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

43.6.2.1.1. The Pandas Series object#

import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
data.values
array([0.25, 0.5 , 0.75, 1.  ])
data.index
RangeIndex(start=0, stop=4, step=1)
data[1]
data[1:3]
1    0.50
2    0.75
dtype: float64
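
A Series can also be constructed directly from a Python dictionary, with the keys becoming the index. A minimal sketch (using the state-population figures that reappear in the DataFrame example below):

population = pd.Series({
    'California': 38332521, 'Texas': 26448193, 'New York': 19651127
})
population['California']
38332521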

43.6.2.2. Data indexing and selection#

In the preceding NumPy sections, we looked in detail at methods and tools to access, set, and modify values in NumPy arrays: indexing (e.g., arr[2, 1]), slicing (e.g., arr[:, 1:5]), masking (e.g., arr[arr > 0]), fancy indexing (e.g., arr[0, [1, 5]]), and combinations thereof (e.g., arr[:, [1, 5]]). Here we look at similar means of accessing and modifying values in Pandas Series and DataFrame objects.

43.6.2.2.1. Data selection in Series#

data = pd.Series(
    [0.25, 0.5, 0.75, 1.0],
    index=['a', 'b', 'c', 'd']
)
data
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
data['b']
0.5
'a' in data
True
data.keys()
Index(['a', 'b', 'c', 'd'], dtype='object')
list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1    a
3    b
5    c
dtype: object
data.iloc[1:3]
3    b
5    c
dtype: object
data.loc[1]
'a'

43.6.2.2.2. Data selection in DataFrame#

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.

area = pd.Series({
    'California': 423967, 'Texas': 695662,
    'New York': 141297, 'Florida': 170312,
    'Illinois': 149995
})
pop = pd.Series({
    'California': 38332521, 'Texas': 26448193,
    'New York': 19651127, 'Florida': 19552860,
    'Illinois': 12882135
})
data = pd.DataFrame({'area': area, 'pop': pop})
data
area pop
California 423967 38332521
Texas 695662 26448193
New York 141297 19651127
Florida 170312 19552860
Illinois 149995 12882135
data['area']
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
data.area
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
data.area is data['area']
True

43.6.2.2.3. DataFrame as two-dimensional array#

As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:

data.values
array([[  423967, 38332521],
       [  695662, 26448193],
       [  141297, 19651127],
       [  170312, 19552860],
       [  149995, 12882135]])
data.T
California Texas New York Florida Illinois
area 423967 695662 141297 170312 149995
pop 38332521 26448193 19651127 19552860 12882135
data.iloc[:3, :2]
area pop
California 423967 38332521
Texas 695662 26448193
New York 141297 19651127
data.loc[:'Illinois', :'pop']
area pop
California 423967 38332521
Texas 695662 26448193
New York 141297 19651127
Florida 170312 19552860
Illinois 149995 12882135
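
The familiar NumPy access patterns carry over: masking and fancy indexing work inside loc, and dictionary-style assignment adds a computed column. A hedged sketch extending the data DataFrame:

data['density'] = data['pop'] / data['area']
data.loc[data['density'] > 100, ['pop', 'density']]  # selects New York and Florida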

43.6.2.3. Operating on data in Pandas#

Pandas inherits much of its computational machinery from NumPy, and NumPy’s ufuncs are key to this: applying a ufunc to a Pandas object preserves its index and column labels, as the examples below show.

43.6.2.3.1. Ufuncs: index preservation#

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
0    6
1    3
2    7
3    4
dtype: int64
df = pd.DataFrame(
    rng.randint(0, 10, (3, 4)),
    columns=['A', 'B', 'C', 'D']
)
df
A B C D
0 6 9 2 6
1 7 4 3 7
2 7 2 5 4
np.exp(ser)
0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64
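
Index preservation applies to DataFrame objects as well: a ufunc returns a new DataFrame with the row index and column labels intact. A one-line sketch on the df defined above:

np.sin(df * np.pi / 4)  # result keeps columns A-D and rows 0-2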

43.6.2.4. Handling missing data#

Real-world data is rarely clean and homogeneous.

Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object.

43.6.2.4.1. None: pythonic missing data#

vals1 = np.array([1, None, 3, 4])
vals1
array([1, None, 3, 4], dtype=object)

This dtype=object means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.

for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()
dtype = object
50.6 ms ± 449 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
589 µs ± 5.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
vals1.sum()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 1 vals1.sum()

File /usr/local/lib/python3.9/site-packages/numpy/core/_methods.py:48, in _sum(a, axis, dtype, out, keepdims, initial, where)
     46 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
     47          initial=_NoValue, where=True):
---> 48     return umr_sum(a, axis, dtype, out, keepdims, initial, where)

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

43.6.2.4.2. NaN: missing numerical data#

vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype
dtype('float64')

Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code.

%timeit np.arange(1E6, dtype='float64').sum()
590 µs ± 7.91 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
vals2.sum()
nan
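
Because NaN propagates through ordinary aggregates, NumPy also provides NaN-ignoring counterparts:

np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
(8.0, 1.0, 4.0)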

43.6.2.4.3. NaN and None in Pandas#

pd.Series([1, np.nan, 2, None])
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64
x = pd.Series(range(2), dtype=int)
x
0    0
1    1
dtype: int64
x[0] = None
x
0    NaN
1    1.0
dtype: float64

43.6.2.4.4. Operating on null values#

data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
0    False
1     True
2    False
3     True
dtype: bool
data[data.notnull()]
0        1
2    hello
dtype: object
data.dropna()
0        1
2    hello
dtype: object
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64
data = data.fillna(0)
data
a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
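
Besides filling with a constant, missing entries can be propagated from their neighbors. A minimal sketch on a fresh Series (data above has already been filled):

s = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
s.ffill()  # forward-fill: carry the previous valid value forward
s.bfill()  # back-fill: pull the next valid value backward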

43.6.2.5. Combining datasets: concat and append#

Some of the most interesting studies of data come from combining different data sources.

def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)


# example DataFrame
make_df('ABC', range(3))
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2

43.6.2.5.1. Simple concatenation with pd.concat#

Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a number of options that we’ll discuss momentarily:

pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None,
          levels=None, names=None, verify_integrity=False, sort=False,
          copy=True)
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser1
1    A
2    B
3    C
dtype: object
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
ser2
4    D
5    E
6    F
dtype: object
pd.concat([ser1, ser2])
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
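
pd.concat handles DataFrame objects the same way, and options such as axis and ignore_index control how the pieces combine. A hedged sketch reusing make_df from above:

df_a = make_df('AB', [0, 1])
df_b = make_df('AB', [2, 3])
pd.concat([df_a, df_b])                     # stack rows (the default, axis=0)
pd.concat([df_a, df_b], ignore_index=True)  # discard the original indices
pd.concat([df_a, make_df('CD', [0, 1])], axis=1)  # join columns side by side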

43.6.2.6. Combining datasets: merge and join#

One essential feature offered by Pandas is its high-performance, in-memory join and merge operations.

The pd.merge() function implements a number of types of joins: the one-to-one, many-to-one, and many-to-many joins.

df1 = pd.DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']
})
df1
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
df2 = pd.DataFrame({
    'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
    'hire_date': [2004, 2008, 2012, 2014]
})
df2
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
df3 = pd.merge(df1, df2)
df3
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
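
The merge above is one-to-one because the 'employee' key is unique in both tables. In a many-to-one join, the key repeats on one side and the matched values are duplicated as needed; a brief sketch extending df3 (the df4 table is illustrative):

df4 = pd.DataFrame({
    'group': ['Accounting', 'Engineering', 'HR'],
    'supervisor': ['Carly', 'Guido', 'Steve']
})
pd.merge(df3, df4)  # 'Engineering' matches twice, so its supervisor appears twice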

43.6.3. 3. Your turn! 🚀#

Data processing in Python

43.6.4. 4. References#

  1. Working with data - Numpy

  2. Working with data - Pandas