42.19. Introduction to probability and statistics#

In this notebook, we will play around with some of the concepts we have previously discussed. Many concepts from probability and statistics are well-represented in major libraries for data processing in Python, such as numpy and pandas.

# install the necessary dependencies
import sys
!{sys.executable} -m pip install --quiet numpy pandas matplotlib scipy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

42.19.1. Random Variables and Distributions#

Let’s start by drawing a sample of 30 values from a uniform distribution over the integers \(0\) to \(9\). We will also compute its mean and variance.

sample = np.random.randint(0, 10, 30)
print(f"Sample: {sample}")
print(f"Mean = {np.mean(sample)}")
print(f"Variance = {np.var(sample)}")

To visually estimate how many different values there are in the sample, we can plot a histogram:

plt.hist(sample)
plt.show()

42.19.2. Analyzing Real Data#

Mean and variance are very important when analyzing real-world data. Let’s load the data about baseball players from the SOCR MLB Height/Weight Data:

df = pd.read_csv("https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/assets/data/SOCR_MLB.tsv",
                 sep='\t', header=None, names=['Name', 'Team', 'Role', 'Height', 'Weight', 'Age'])
df

We are using a package called Pandas here for data analysis. We will talk more about Pandas and working with data in Python later in this course.

Let’s compute average values for age, height and weight:

df[['Age', 'Height', 'Weight']].mean()

Now let’s focus on height, and compute standard deviation and variance:

print(list(df['Height'])[:20])
mean = df['Height'].mean()
var = df['Height'].var()
std = df['Height'].std()
print(f"Mean = {mean}\nVariance = {var}\nStandard Deviation = {std}")

In addition to mean, it makes sense to look at the median value and quartiles. They can be visualized using a box plot:

plt.figure(figsize=(10, 2))
plt.boxplot(df['Height'], vert=False, showmeans=True)
plt.grid(color='gray', linestyle='dotted')
plt.tight_layout()
plt.show()

We can also make box plots of subsets of our dataset, for example, grouped by player role.

df.boxplot(column='Height', by='Role', figsize=(10, 8))
plt.xticks(rotation='vertical')
plt.tight_layout()
plt.show()

Note: This diagram suggests that, on average, first basemen are taller than second basemen. Later we will learn how to test this hypothesis more formally, and how to demonstrate that the observed difference is statistically significant.

Age, height and weight are all continuous random variables. What do you think their distribution is? A good way to find out is to plot the histogram of values:

df['Height'].hist(bins=15, figsize=(10, 6))
plt.suptitle('Height distribution of MLB Players')
plt.xlabel('Height')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

42.19.3. Normal distribution#

Let’s create an artificial sample of heights that follows a normal distribution with the same mean and variance as our real data:

# generate 1000 normally distributed values with the same mean and std as the real heights
generated = np.random.normal(mean, std, 1000)
print(generated[:20])
plt.figure(figsize=(10, 6))
plt.hist(generated, bins=15)
plt.tight_layout()
plt.show()
# a standard normal distribution (mean 0, std 1) with a much larger sample, for comparison
plt.figure(figsize=(10, 6))
plt.hist(np.random.normal(0, 1, 50000), bins=300)
plt.tight_layout()
plt.show()

Since many real-world quantities are approximately normally distributed, we should not use a uniform random number generator to produce such sample data. Here is what happens if we try to generate heights with a uniform distribution (generated by np.random.rand):

wrong_sample = np.random.rand(1000)*2*std+mean-std
plt.figure(figsize=(10, 6))
plt.hist(wrong_sample)
plt.tight_layout()
plt.show()

42.19.4. Confidence intervals#

Let’s now calculate confidence intervals for the weights and heights of baseball players. We will use the code from this Stack Overflow discussion:
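
The function below implements the standard \(t\)-based interval for the mean: for a sample of size \(n\) with sample mean \(\bar{x}\) and standard error \(s/\sqrt{n}\), the interval at confidence level \(c\) is \(\bar{x} \pm t_{(1+c)/2,\,n-1} \cdot s/\sqrt{n}\), where \(t_{(1+c)/2,\,n-1}\) is the corresponding quantile of Student’s \(t\)-distribution with \(n-1\) degrees of freedom.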

import scipy.stats


def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2., n-1)
    return m, h


for p in [0.85, 0.9, 0.95]:
    m, h = mean_confidence_interval(df['Height'].fillna(method='pad'), p)
    print(f"p={p:.2f}, mean = {m:.2f} ± {h:.2f}")

42.19.5. Hypothesis testing#

Let’s explore different roles in our baseball players dataset:

df.groupby('Role').agg(
    {
        'Height': 'mean',
        'Weight': 'mean',
        'Age': 'count'
    }
).rename(columns={'Age': 'Count'})

Let’s test the hypothesis that First Basemen are taller than Second Basemen. The simplest way to do this is to compare the confidence intervals:

for p in [0.85, 0.9, 0.95]:
    m1, h1 = mean_confidence_interval(
        df.loc[df['Role'] == 'First_Baseman', 'Height'], p)
    m2, h2 = mean_confidence_interval(
        df.loc[df['Role'] == 'Second_Baseman', 'Height'], p)
    print(
        f'Conf={p:.2f}, 1st basemen height: {m1-h1:.2f}..{m1+h1:.2f}, 2nd basemen height: {m2-h2:.2f}..{m2+h2:.2f}')

We can see that the intervals do not overlap.
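
We can also check this programmatically. Here is a small sketch that compares the endpoints of the two 95% intervals, reusing mean_confidence_interval from above:

# the intervals overlap only if the lower bound of the taller group
# lies below the upper bound of the shorter group
m1, h1 = mean_confidence_interval(df.loc[df['Role'] == 'First_Baseman', 'Height'], 0.95)
m2, h2 = mean_confidence_interval(df.loc[df['Role'] == 'Second_Baseman', 'Height'], 0.95)
print(f"Intervals overlap: {(m1 - h1) <= (m2 + h2)}")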

A statistically more correct way to prove the hypothesis is to use a Student t-test:

from scipy.stats import ttest_ind

tval, pval = ttest_ind(
    df.loc[df['Role'] == 'First_Baseman', 'Height'],
    df.loc[df['Role'] == 'Second_Baseman', 'Height'],
    equal_var=False)
print(f"T-value = {tval:.2f}\nP-value: {pval}")

The two values returned by the ttest_ind function are:

  • t-value is the normalized difference between the two sample means used by the t-test; it is compared against a critical threshold that depends on the chosen confidence level (see the sketch below).

  • p-value is the probability of observing a difference in means at least this large if the two groups actually had the same mean. In our case it is very low, which is strong evidence that first basemen are taller.
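
To turn these numbers into a decision, we compare them against a chosen significance level. The sketch below assumes a significance level of 0.05 and uses a pooled degrees-of-freedom value purely for illustration (Welch’s t-test, which we used above, estimates its own):

from scipy.stats import t

alpha = 0.05  # an assumed significance level
n1 = (df['Role'] == 'First_Baseman').sum()
n2 = (df['Role'] == 'Second_Baseman').sum()
t_crit = t.ppf(1 - alpha / 2, n1 + n2 - 2)  # two-sided critical value (illustrative)
print(f"Critical t-value at alpha={alpha}: {t_crit:.2f}, observed t-value: {tval:.2f}")
print(f"Reject the equal-means hypothesis: {pval < alpha}")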

42.19.6. Simulating a normal distribution with the central limit theorem#

The pseudo-random generator in Python is designed to give us a uniform distribution. If we want a generator for the normal distribution, we can use the central limit theorem: the mean of a sample of uniformly distributed values is approximately normally distributed. So, to get one normally distributed value, we simply average a uniformly generated sample.

import random


def normal_random(sample_size=100):
    # average sample_size uniform(0, 1) draws: by the central limit theorem
    # this average is approximately normally distributed
    sample = [random.uniform(0, 1) for _ in range(sample_size)]
    return sum(sample)/sample_size


sample = [normal_random() for _ in range(100)]
plt.figure(figsize=(10, 6))
plt.hist(sample)
plt.tight_layout()
plt.show()
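
By the central limit theorem, the mean of \(n\) independent uniform \((0, 1)\) values is approximately normal with mean \(0.5\) and standard deviation \(\sqrt{1/12}/\sqrt{n}\), since a uniform \((0, 1)\) variable has variance \(1/12\). A quick comparison with our generated sample:

n = 100  # the sample_size used inside normal_random
print(f"Theoretical: mean = 0.5, std = {np.sqrt(1 / (12 * n)):.4f}")
print(f"Observed:    mean = {np.mean(sample):.4f}, std = {np.std(sample):.4f}")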

42.19.7. Correlation and evil baseball corp#

Correlation allows us to find relations between data sequences. In our toy example, let’s pretend there is an evil baseball corporation that pays its players according to their height - the taller the player is, the more money he/she gets. Suppose there is a base salary of $1000, and an additional bonus from $0 to $100, depending on height. We will take the real players from MLB, and compute their imaginary salaries:

heights = df['Height']
salaries = 1000 + (heights - heights.min()) / \
    (heights.max() - heights.min()) * 100
print(list(zip(heights, salaries))[:10])

Let’s now compute the covariance and correlation of those sequences. np.cov will give us the so-called covariance matrix, which is an extension of covariance to multiple variables. The element \(M_{ij}\) of the covariance matrix \(M\) is the covariance between input variables \(X_i\) and \(X_j\), and the diagonal element \(M_{ii}\) is the variance of \(X_i\). Similarly, np.corrcoef will give us the correlation matrix.
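
Recall the definitions these functions implement: the covariance of two variables is \(\mathrm{cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]\), and the correlation coefficient is the covariance normalized by the standard deviations, \(\rho_{X,Y} = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)\), which always lies between \(-1\) and \(1\).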

print(f"Covariance matrix:\n{np.cov(heights, salaries)}")
print(f"Covariance = {np.cov(heights, salaries)[0,1]}")
print(f"Correlation = {np.corrcoef(heights, salaries)[0,1]}")

A correlation equal to 1 means that there is a perfect linear relation between the two variables. We can see this linear relation by plotting one value against the other:

plt.figure(figsize=(10, 6))
plt.scatter(heights, salaries)
plt.tight_layout()
plt.show()

Let’s see what happens if the relation is not linear. Suppose that our corporation decided to hide the obvious linear dependency between heights and salaries, and introduced some non-linearity into the formula, such as sin:

salaries = 1000 + np.sin(
    (heights - heights.min()) / (heights.max() - heights.min())) * 100
print(f"Correlation = {np.corrcoef(heights, salaries)[0, 1]}")

In this case, the correlation is slightly smaller, but it is still quite high. Now, to make the relation even less obvious, we might want to add extra randomness by adding a random component to the salary. Let’s see what happens:

salaries = 1000+np.sin((heights - heights.min()) / (heights.max() -
                       heights.min())) * 100 + np.random.random(size=len(heights)) * 20 - 10
print(f"Correlation = {np.corrcoef(heights, salaries)[0, 1]}")
plt.figure(figsize=(10, 6))
plt.scatter(heights, salaries)
plt.tight_layout()
plt.show()

Can you guess why the dots line up into vertical lines like this?

We have observed the correlation between an artificially engineered concept like salary and the observed variable height. Let’s also see if the two observed variables, such as height and weight, correlate too:

np.corrcoef(df['Height'], df['Weight'])

Unfortunately, we did not get any results - only some strange nan values. This is because some of the values in our series are undefined, represented as nan, which makes the result of the operation undefined as well. By looking at the matrix we can see that Weight is the problematic column, since the self-correlation of the Height values was computed successfully.

This example shows the importance of data preparation and cleaning. Without proper data we cannot compute anything.
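
A quick way to confirm which columns contain missing values is to count them per column; a small check using pandas:

# count missing values in each column
df.isna().sum()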

Let’s use the fillna method to fill the missing values, and compute the correlation:

np.corrcoef(df['Height'], df['Weight'].fillna(method='pad'))

There is indeed a correlation, but it is not as strong as in our artificial example. Indeed, if we look at the scatter plot of one value against the other, the relation is much less obvious:

plt.figure(figsize=(10, 6))
plt.scatter(df['Height'], df['Weight'])
plt.xlabel('Height')
plt.ylabel('Weight')
plt.tight_layout()
plt.show()

42.19.8. Conclusion#

In this notebook, we have learnt how to perform basic operations on data to compute statistical functions. We now know how to use the apparatus of probability and statistics to test hypotheses, and how to compute confidence intervals for arbitrary variables given a data sample.

42.19.9. Acknowledgments#

Thanks to Microsoft for creating the open-source course Data Science for Beginners. It inspires the majority of the content in this chapter.