In [1]:
%%html
<!-- The customized css for the slides -->
<link rel="stylesheet" type="text/css" href="../../assets/styles/basic.css"/>
<link rel="stylesheet" type="text/css" href="../../assets/styles/python-programming-basic.css"/>

In [None]:
# Install the necessary dependencies

import os
import sys
!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython

---
license:
    code: MIT
    content: CC-BY-4.0
github: https://github.com/ocademy-ai/machine-learning
venue: By Ocademy
open_access: true
bibliography:
  - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib
---

# NumPy and Pandas

## 1. NumPy

NumPy is a library for working with tensors, i.e. multi-dimensional arrays. An array holds values of the same underlying type. It is simpler than a DataFrame, but it offers more mathematical operations and carries less overhead.

<div class="alignRight">
  <img class="removeMargin" height="100" src="../images/numpy.png"/>
</div>

### The basics of NumPy arrays

Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas are built around the NumPy array.

We'll cover a few categories of basic array manipulations here:

- Attributes of arrays: Determining the size, shape, memory consumption, and data types of arrays.
- Indexing of arrays: Getting and setting the value of individual array elements.
- Slicing of arrays: Getting and setting smaller subarrays within a larger array.
- Reshaping of arrays: Changing the shape of a given array.
- Joining and splitting of arrays: Combining multiple arrays into one, and splitting one array into many.
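
The categories listed above can be sketched in a few lines. This is only an illustrative sketch: the array `grid` and the other variable names are made up for this example and are not used elsewhere in the notebook.

```python
import numpy as np

# Reshaping: arrange 12 consecutive integers as 3 rows x 4 columns
grid = np.arange(12).reshape(3, 4)

# Indexing: a single element, and a whole row counted from the end
first = grid[0, 0]
last_row = grid[-1]

# Slicing: first two rows, columns 1 and 2
block = grid[:2, 1:3]

# Joining: stack two copies on top of each other (along the row axis)
stacked = np.concatenate([grid, grid], axis=0)

# Splitting: cut the columns at position 2 into a left and a right part
left, right = np.split(grid, [2], axis=1)

print(block)
print(stacked.shape, left.shape, right.shape)
```

Note that a slice such as `block` is a view on `grid`, not a copy, so assigning into it also changes `grid`.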

#### NumPy array attributes

First let's discuss some useful array attributes. We'll start by defining three random arrays: a one-dimensional, a two-dimensional, and a three-dimensional array.

In [2]:
import numpy as np
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

In [3]:
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60


In [4]:
print("dtype:", x3.dtype)

dtype: int64


In [5]:
print("itemsize:", x3.itemsize, "bytes")
print("nbytes:", x3.nbytes, "bytes")

itemsize: 8 bytes
nbytes: 480 bytes


### Computation on NumPy arrays: universal functions

Computation on NumPy arrays can be very fast, or it can be very slow. The key to making it fast is to use vectorized operations, generally implemented through NumPy's universal functions (ufuncs).

In [6]:
import numpy as np
np.random.seed(0)


def compute_reciprocals(values):
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output


values = np.random.randint(1, 10, size=5)
compute_reciprocals(values)

array([0.16666667, 1.        , 0.25      , 0.25      , 0.125     ])

In [7]:
big_array = np.random.randint(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)

1.87 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [8]:
print(compute_reciprocals(values))
print(1.0 / values)

[0.16666667 1.         0.25       0.25       0.125     ]
[0.16666667 1.         0.25       0.25       0.125     ]


In [9]:
%timeit (1.0 / big_array)

606 µs ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


### Exploring NumPy's ufuncs

In [10]:
x = np.arange(4)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division

x     = [0 1 2 3]
x + 5 = [5 6 7 8]
x - 5 = [-5 -4 -3 -2]
x * 2 = [0 2 4 6]
x / 2 = [0.  0.5 1.  1.5]
x // 2 = [0 0 1 1]


### Aggregations: min, max, and everything in between

Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question.

In [11]:
L = np.random.random(100)
sum(L)

50.461758453195614

In [12]:
np.sum(L)

50.46175845319564

In [13]:
big_array = np.random.rand(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)

74.9 ms ± 979 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
193 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


#### Minimum and maximum

In [14]:
min(big_array), max(big_array)

(7.071203171893359e-07, 0.9999997207656334)

In [15]:
np.min(big_array), np.max(big_array)

(7.071203171893359e-07, 0.9999997207656334)

In [16]:
%timeit min(big_array)
%timeit np.min(big_array)

54.5 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
175 µs ± 2.59 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [17]:
print(big_array.min(), big_array.max(), big_array.sum())

7.071203171893359e-07 0.9999997207656334 500216.8034810001
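
Summary statistics are often needed per row or per column rather than over the whole array. The sketch below assumes a small hand-made array called `table` (a name chosen only for this example) and uses the `axis` argument of the aggregation methods.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

# axis=0 collapses the rows, leaving one value per column
col_sums = table.sum(axis=0)

# axis=1 collapses the columns, leaving one value per row
row_max = table.max(axis=1)

print(col_sums)      # [5 7 9]
print(row_max)       # [3 6]
print(table.mean())  # 3.5
```

The `axis` keyword names the dimension that is collapsed, not the one that is returned.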


### Computation on arrays: broadcasting

Another means of vectorizing operations is to use NumPy's broadcasting functionality. Broadcasting is simply a set of rules for applying binary ufuncs (e.g., addition, subtraction, multiplication, etc.) on arrays of different sizes.

In [18]:
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b

array([5, 6, 7])

In [19]:
a + 5

array([5, 6, 7])

In [20]:
M = np.ones((3, 3))
M

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

In [21]:
M + a

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])
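
Broadcasting can also stretch both operands at once, and it fails when the shapes cannot be matched. The following is a small sketch of both cases; the names `row`, `col`, and `total` are chosen only for this example.

```python
import numpy as np

row = np.arange(3)                  # shape (3,)
col = np.arange(3)[:, np.newaxis]   # shape (3, 1)

# Both operands are stretched to the common shape (3, 3) before adding
total = row + col
print(total)

# Trailing dimensions 2 and 3 are incompatible, so NumPy raises an error
try:
    np.ones((3, 2)) + np.arange(3)
    failed = False
except ValueError as err:
    failed = True
    print("broadcast failed:", err)
```

Shapes are compared from the trailing dimension backwards: two dimensions are compatible when they are equal or when one of them is 1.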

### Structured data: NumPyâ€™s structured arrays

While often our data can be well represented by a homogeneous array of values, sometimes this is not the case. This section demonstrates the use of NumPy's structured arrays and record arrays.

In [22]:
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]

In [23]:
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={
    'names': ('name', 'age', 'weight'),
    'formats': ('U10', 'i4', 'f8')
})
print(data.dtype)

[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]


In [24]:
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)

[('Alice', 25, 55. ) ('Bob', 45, 85.5) ('Cathy', 37, 68. )
 ('Doug', 19, 61.5)]


In [25]:
# Get all names
data['name']

array(['Alice', 'Bob', 'Cathy', 'Doug'], dtype='<U10')

In [26]:
# Get first row of data
data[0]

('Alice', 25, 55.)

In [27]:
# Get the name from the last row
data[-1]['name']

'Doug'

In [28]:
# Get names where age is under 30
data[data['age'] < 30]['name']

array(['Alice', 'Doug'], dtype='<U10')
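
Record arrays, mentioned at the start of this section, are a thin layer over structured arrays that exposes fields as attributes. The sketch below rebuilds the same data under the hypothetical name `people` so that it runs on its own.

```python
import numpy as np

# Same compound dtype as above, written as a list of (name, format) pairs
people = np.zeros(4, dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f8')])
people['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
people['age'] = [25, 45, 37, 19]
people['weight'] = [55.0, 85.5, 68.0, 61.5]

# View the same memory as a record array: fields become attributes
people_rec = people.view(np.recarray)
print(people_rec.age)       # same values as people['age']
print(people_rec[0].name)   # attribute access also works on a single record
```

Attribute access is convenient, but it adds some overhead compared with looking a field up by key.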

This section on structured and record arrays is purposely at the end of this chapter, because it leads so well into the next package we will cover: Pandas.

## 2. Pandas

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language.

<div class="alignRight">
  <img class="removeMargin" height="100" src="../images/pandas.png"/>
</div>

### Introducing Pandas objects

At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.

#### The Pandas Series object

In [3]:
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [30]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [31]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [32]:
data[1]
data[1:3]

1    0.50
2    0.75
dtype: float64

### Data indexing and selection

NumPy arrays offer several ways to access, set, and modify values: indexing (e.g., `arr[2, 1]`), slicing (e.g., `arr[:, 1:5]`), masking (e.g., `arr[arr > 0]`), fancy indexing (e.g., `arr[0, [1, 5]]`), and combinations thereof (e.g., `arr[:, [1, 5]]`). Pandas `Series` and `DataFrame` objects support similar patterns.

#### Data selection in Series

In [33]:
data = pd.Series(
    [0.25, 0.5, 0.75, 1.0],
    index=['a', 'b', 'c', 'd']
)
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [34]:
data['b']

0.5

In [35]:
'a' in data

True

In [36]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [37]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [38]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [39]:
data.iloc[1:3]

3    b
5    c
dtype: object

In [40]:
data.loc[1]

'a'
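
When the index itself consists of integers, label-based and position-based access can give different results. The sketch below contrasts `loc` and `iloc` on a Series like the one above; the variable names are chosen only for this example.

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = letters.loc[1]            # the element whose label is 1
by_position = letters.iloc[1]        # the element at position 1

label_slice = letters.loc[1:3]       # label slices include the end label
position_slice = letters.iloc[1:3]   # position slices exclude the end position

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Using `loc` and `iloc` explicitly avoids having to remember which convention plain `[]` indexing applies in a given case.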

#### Data selection in DataFrame

Recall that a `DataFrame` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of `Series` structures sharing the same index. 

In [41]:
area = pd.Series({
    'California': 423967, 'Texas': 695662,
    'New York': 141297, 'Florida': 170312,
    'Illinois': 149995
})
pop = pd.Series({
    'California': 38332521, 'Texas': 26448193,
    'New York': 19651127, 'Florida': 19552860,
    'Illinois': 12882135
})
data = pd.DataFrame({'area': area, 'pop': pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [42]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [43]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [44]:
data.area is data['area']

True

#### DataFrame as two-dimensional array

As mentioned previously, we can also view the `DataFrame` as an enhanced two-dimensional array. We can examine the raw underlying data array using the `values` attribute:

In [45]:
data.values

array([[  423967, 38332521],
       [  695662, 26448193],
       [  141297, 19651127],
       [  170312, 19552860],
       [  149995, 12882135]])

In [46]:
data.T

      California     Texas  New York   Florida  Illinois
area      423967    695662    141297    170312    149995
pop     38332521  26448193  19651127  19552860  12882135


In [47]:
data.iloc[:3, :2]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [48]:
data.loc[:'Illinois', :'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135
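
The dictionary-style and array-style views combine naturally: a new column can be added by assignment, and `loc` accepts a Boolean mask together with a list of column labels. The sketch below rebuilds the same table under the hypothetical name `states`; the `density` column and the threshold of 100 are assumptions made for this example.

```python
import pandas as pd

states = pd.DataFrame({
    'area': {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995},
    'pop': {'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
            'Florida': 19552860, 'Illinois': 12882135},
})

# Dictionary-style assignment adds a column computed element-wise
states['density'] = states['pop'] / states['area']

# Boolean mask on the rows, list of labels on the columns
dense = states.loc[states['density'] > 100, ['pop', 'density']]
print(dense)
```

Only New York and Florida exceed the threshold here, so `dense` has two rows.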


### Operating on data in Pandas

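One of NumPy's strengths is fast element-wise computation, both basic arithmetic and more sophisticated functions. Pandas inherits much of this functionality from NumPy, and the ufuncs introduced earlier are key to it.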

#### Ufuncs: index preservation

In [49]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

0    6
1    3
2    7
3    4
dtype: int64

In [50]:
df = pd.DataFrame(
    rng.randint(0, 10, (3, 4)),
    columns=['A', 'B', 'C', 'D']
)
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4


In [51]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64
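
The same holds for a `DataFrame`, and binary operations additionally align their operands by label. The following is a minimal sketch with made-up data; `frame`, `left`, and `right` are names chosen only for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1, 4], [9, 16]], index=['r1', 'r2'], columns=['A', 'B'])

# A unary ufunc returns a new DataFrame with the same index and columns
roots = np.sqrt(frame)
print(roots)

# Binary operations align on labels; a label found on only one side gives NaN
left = pd.Series([2, 4, 6], index=[0, 1, 2])
right = pd.Series([1, 3, 5], index=[1, 2, 3])
total = left + right
print(total)
```

Here labels 0 and 3 each appear in only one of the two Series, so the corresponding entries of `total` are `NaN`.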

### Handling missing data

Real-world data is rarely clean and homogeneous.

Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point `NaN` value, and the Python `None` object.

#### `None`: pythonic missing data

In [52]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

This `dtype=object` means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.

In [53]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
50.6 ms ± 449 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
589 µs ± 5.26 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)



In [54]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

#### NaN: missing numerical data

In [56]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

dtype('float64')

Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code. 

In [57]:
# Note: `dtype` is still 'int' here, left over from the loop in the earlier cell
%timeit np.arange(1E6, dtype=dtype).sum()
print()

590 µs ± 7.91 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)



In [58]:
vals2.sum()

nan
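
Because `NaN` propagates through ordinary aggregates, NumPy also provides NaN-aware counterparts that skip missing values. A short sketch, using a hypothetical array `vals`:

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

plain_total = vals.sum()        # NaN: any arithmetic with NaN yields NaN
safe_total = np.nansum(vals)    # 8.0: NaN entries are ignored
low, high = np.nanmin(vals), np.nanmax(vals)

print(plain_total, safe_total, low, high)
print(np.nan == np.nan)         # False: NaN never compares equal, even to itself
```

Since `NaN` is not equal to itself, use `np.isnan` rather than `==` to test for it.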

#### NaN and None in Pandas

In [59]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [60]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int64

In [61]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64

#### Operating on null values

In [11]:
data = pd.Series([1, np.nan, 'hello', None])

In [12]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [64]:
data[data.notnull()]

0        1
2    hello
dtype: object

In [65]:
data.dropna()

0        1
2    hello
dtype: object

In [4]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data


a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [10]:
data = data.fillna(0)
data

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64
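
Filling with a constant is not the only option: missing entries can also be filled from their neighbours. The sketch below uses the `ffill` and `bfill` methods on a Series like the one above; the name `gaps` is chosen only for this example.

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

forward = gaps.ffill()    # propagate the previous valid value forward
backward = gaps.bfill()   # propagate the next valid value backward

print(forward.tolist())   # [1.0, 1.0, 2.0, 2.0, 3.0]
print(backward.tolist())  # [1.0, 2.0, 2.0, 3.0, 3.0]
```

A forward fill leaves a leading `NaN` untouched, because there is no earlier value to copy.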

### Combining datasets: concat and append

Some of the most interesting studies of data come from combining different data sources.

In [68]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)


# example DataFrame
make_df('ABC', range(3))

    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2


#### Simple concatenation with pd.concat

Pandas has a function, `pd.concat()`, whose syntax is similar to `np.concatenate` but which accepts a number of additional options. Its signature in pandas 1.x is shown below; the exact keyword list varies between versions:

```
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None,
          levels=None, names=None, verify_integrity=False, sort=False,
          copy=True)
```

In [69]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser1

1    A
2    B
3    C
dtype: object

In [70]:
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
ser2

4    D
5    E
6    F
dtype: object

In [71]:
pd.concat([ser1, ser2])

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
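
Two of the options of `pd.concat()` come up often with DataFrames: `ignore_index` and `join`. The sketch below uses two small hand-made frames, `top` and `bottom` (hypothetical names), that share only the column `B`.

```python
import pandas as pd

top = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
bottom = pd.DataFrame({'B': ['B2', 'B3'], 'C': ['C2', 'C3']})

# Default: union of the columns (missing cells become NaN), original index kept
outer = pd.concat([top, bottom])

# ignore_index=True discards the old labels and builds a fresh 0..n-1 index
renumbered = pd.concat([top, bottom], ignore_index=True)

# join='inner' keeps only the columns present in every input
inner = pd.concat([top, bottom], join='inner')

print(outer)
print(renumbered.index.tolist())
print(inner.columns.tolist())
```

With the default settings the result has duplicate index labels (0, 1, 0, 1), which is one reason `ignore_index=True` is frequently used.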

### Combining datasets: merge and join

One essential feature offered by Pandas is its high-performance, in-memory join and merge operations.

The `pd.merge()` function implements a number of types of joins: the *one-to-one*, *many-to-one*, and *many-to-many* joins.

In [72]:
df1 = pd.DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']
})
df1

  employee        group
0      Bob   Accounting
1     Jake  Engineering
2     Lisa  Engineering
3      Sue           HR


In [2]:
df2 = pd.DataFrame({
    'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
    'hire_date': [2004, 2008, 2012, 2014]
})
df2

  employee  hire_date
0     Lisa       2004
1      Bob       2008
2     Jake       2012
3      Sue       2014


In [74]:
df3 = pd.merge(df1, df2)
df3

  employee        group  hire_date
0      Bob   Accounting       2008
1     Jake  Engineering       2012
2     Lisa  Engineering       2004
3      Sue           HR       2014
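
The example above is a one-to-one join. A many-to-one join works the same way when the key column contains repeated values on one side. In the sketch below, the `leads` table and its `supervisor` values are invented for illustration, and `staff` repeats the employee data under a new name so the example runs on its own.

```python
import pandas as pd

staff = pd.DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'group': ['Accounting', 'Engineering', 'Engineering', 'HR'],
})
leads = pd.DataFrame({
    'group': ['Accounting', 'Engineering', 'HR'],
    'supervisor': ['Carly', 'Guido', 'Steve'],
})

# 'group' is the shared column; naming it with on= makes the key explicit
with_lead = pd.merge(staff, leads, on='group')
print(with_lead)
```

The single 'Engineering' row in `leads` is repeated for both Jake and Lisa, so the result still has four rows.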


## 3. Your turn! 🚀

[Data processing in Python](../../data-science/working-with-data/numpy.html#your-turn)


## 4. References

1. [Working with data - Numpy](https://ocademy-ai.github.io/machine-learning/data-science/working-with-data/numpy.html)
2. [Working with data - Pandas](https://ocademy-ai.github.io/machine-learning/data-science/working-with-data/pandas.html)