42.48. Gradient descent#

42.48.1. Session Objective#

In previous sessions, we’ve delved into the application of Linear Regression and Logistic Regression models. You may find the code relatively straightforward and intuitive at this point. However, you might be pondering questions like:

  • What exactly occurs when we invoke the .fit() function?

  • Why does the execution of the .fit() function sometimes take a significant amount of time?

This session is designed to provide insight into the functionality of the .fit() method, which is responsible for training machine learning models and fine-tuning model parameters. The underlying technique at play here is known as “Gradient Descent.”

42.48.2. Let’s Explore and Gain Intuition#

To further enhance your understanding and gain a playful insight into Gradient Descent, you can explore the following resources:

  • Gradient Descent Visualization: This GitHub repository offers a visualization of the Gradient Descent algorithm, which can be a valuable resource for understanding the optimization process.

  • Optimization Algorithms Visualization: Explore this visualization to see how different optimization algorithms, including Gradient Descent, work and how they converge to find optimal solutions.

These resources will help you build an intuitive grasp of Gradient Descent and its role in training machine learning models. Enjoy your exploration!

42.48.2.1. Math (feel free to skip if you find it difficult)#

The fundamental concept behind gradient descent is rather straightforward: it involves the gradual adjustment of parameters, such as the slope (\(m\)) and the intercept (\(b\)) in our regression equation $y = mx + b, with the aim of minimizing a cost function. This cost function is typically a metric that quantifies the disparity between our model’s predicted results and the actual values. In regression scenarios, the widely employed cost function is the mean squared error (MSE). When dealing with classification problems, a different set of parameters must be fine-tuned.

The MSE (Mean Squared Error) is mathematically expressed as:

\[ MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y_i})^2 \]

Here, \(y_i\) represents the actual data points, \(\hat{y_i}\) signifies the predictions made by our model (\(mx_i + b\)), and \(n\) denotes the total number of data points.

Our primary challenge is to determine the optimal adjustments to parameters \(m\) and $b” to minimize the MSE effectively.

42.48.2.2. Partial Derivatives#

In our pursuit of minimizing the Mean Squared Error (MSE), we turn to partial derivatives to understand how each individual parameter influences the MSE. The term “partial” signifies that we are taking derivatives with respect to individual parameters, in this case, \(m\) and $b, separately.

Consider the following formula, which closely resembles the MSE, but now we’ve introduced the function \(f(m, b)\) into it. The addition of this function doesn’t significantly alter the essence of the calculation, but it allows us to input specific values for \(m\) and \(b\) to compute the result.

\[f(m, b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - (mx_i+b))^2\]

For the purposes of calculating partial derivatives, we can temporarily disregard the summation and the terms preceding it, focusing solely on the expression \(y - (mx + b)^2\). This expression serves as a better starting point for the subsequent partial derivative calculations.

42.48.2.3. Partial Derivative with Respect to \(m\)#

When we calculate the partial derivative with respect to the parameter \(m," we isolate the parameter \)m” and treat \(b\) as a constant (effectively setting it to 0 for differentiation purposes). To compute this derivative, we utilize the chain rule, which is a fundamental concept in calculus.

The chain rule is expressed as follows:

\[ [f(g(x))]' = f'(g(x)) * g(x)' \quad - \textrm{chain rule} \]

The chain rule is applicable when one function is nested inside another. In this context, the square operation, \(()^2\), is the outer function, while \(y - (mx + b)\) is the inner function. Following the chain rule, we differentiate the outer function, maintain the inner function as it is, and then multiply it by the derivative of the inner function. Let’s break down the steps:

\[ (y - (mx + b))^2 \]
  1. The derivative of \(()^2\) is \(2()\), just like \(x^2\) becomes \(2x\).

  2. We leave the inner function, \(y - (mx + b)\), unaltered.

  3. The derivative of \(y - (mx + b)\) with respect to m is \((0 - x)\), which simplifies to \(-x\). This is because both y and b are treated as constants (their derivatives are zero), and the derivative of mx is simply x.

Now, let’s combine these components:

\[ 2 \cdot (y - (mx+b)) \cdot (-x) \]

For clarity, we can rearrange this expression by moving the factor of \(-x\) to the left:

\[ -2x \cdot (y-(mx+b)) \]

This is the final version of our derivative with respect to \(m\):

\[ \frac{\partial f}{\partial m} = \frac{1}{n}\sum_{i=1}^{n} -2x_i(y_i - (mx_i+b)) \]

Here, \(\frac{df}{dm}\) signifies the partial derivative of function \(f\) (as previously defined) with respect to the parameter \(m\). We can now insert this derivative into our summation to complete the calculation.

42.48.2.4. Partial Derivative with Respect to \(b\)#

The process for computing the partial derivative with respect to the parameter \(b" is analogous to our previous derivation with respect to \)m. We still apply the same rules and utilize the chain rule:

  1. The derivative of \(()^2\) is \(2()\), which corresponds to how \(x^2\) becomes \(2x\).

  2. We leave the inner function, \(y - (mx + b)\), unaltered.

  3. For the derivative of \(y - (mx + b)\) with respect to b, it becomes \((0 - 1)\) or simply $-1.” This is because both y and mx are treated as constants (their derivatives are zero), and the derivative of b is 1.

Now, let’s consolidate these components:

\[ 2 \cdot (y - (mx+b)) \cdot (-1) \]

Simplifying this expression:

\[ -2 \cdot (y-(mx+b)) \]

This is the final version of our derivative with respect to \(b\):

\[ \frac{\partial f}{\partial b} = \frac{1}{n}\sum_{i=1}^{n} -2(y_i - (mx_i+b)) \]

Similarly to the previous case, \(\frac{df}{db}\) represents the partial derivative of function \(f\) with respect to the parameter $b”. Inserting this derivative into our summation concludes the computation.

42.48.2.5. The Final Function#

Before delving into the code, there are a few essential details to address:

  1. Gradient descent is an iterative process, and with each iteration (referred to as an “epoch”), we incrementally reduce the Mean Squared Error (MSE). At each iteration, we apply our derived functions to update the values of parameters \(m\) and \(b\).

  2. Because gradient descent is iterative, we must determine how many iterations to perform, or devise a mechanism to stop the algorithm when it approaches the minimum of the MSE. In essence, we continue iterations until the algorithm no longer improves the MSE, signifying that it has reached a minimum.

  3. An important parameter in gradient descent is the learning rate (\(lr\)). The learning rate governs the pace at which the algorithm moves toward the minimum of the MSE. A smaller learning rate results in slower but more precise convergence, while a larger learning rate may lead to faster convergence but may overshoot the minimum.

In summary, gradient descent primarily involves the process of taking derivatives and applying them iteratively to minimize a function. These derivatives guide us toward optimizing the parameters \(m\) and $b” in order to minimize the Mean Squared Error.

42.48.3. Time to code!#

%matplotlib inline

import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from mpl_toolkits.mplot3d import Axes3D

42.48.4. 1. One-dimensional gradient descent#

One-dimensional gradient descent is a simple optimization algorithm used to find the minimum (or maximum) of a univariate function.

Here are the main steps:

step 1: Choose an initial point on the function curve. For example, we choose an initial x value.

x = 10

step 2: Compute the derivative of the function at the current point. In this case, the derivative of the function is computed as follows:

def df(x):
    return 2 * x + 5

step 3: Update the position along the function curve using the gradient and the learning rate. The learning rate determines the size of the step.

learning_rate = 0.3
x -= learning_rate * df(x)

step 4: Repeat the process for a specified number of iterations.

max_iter = 100
for i in range(max_iter):
    gradient = df(x)
    x -= learning_rate * gradient

step 5: Output the result, which is the minimum value of x that corresponds to the minimum of the function.

print('x_min:', x)

42.48.4.1. Let’s try!#

# Define the target function
def f(x):
    return x ** 2 + 5 * x + 10

# Define the guidance of the target function
def df(x):
    return 2 * x + 5

x = np.linspace(-10, 10, 100)
y = f(x)
plt.plot(x, y, label='f(x)')
# Define gradient descent function
def gradient_descent(learning_rate, max_iter, x_init):
    # step 1:start with an initial value for the input variable, denoted as x_int.
    x = x_init
    x_history = [x]
    for i in range(max_iter):
        # step 2:compute the derivative of the function at the current point.
        gradient = df(x)
        x -= learning_rate * gradient
        # step 3:update the position along the function curve.
        x_history.append(x)
    # step 4:repeat the process.
    return x, x_history
# Set the learning rate and initial value
        # learning rate: the size of the step for gradient descent.
        # max_iter: number of gradient descent iterations.
learning_rate = 0.3
max_iter = 100
x_init = 10

# Run gradient descent function
x_min, x_history = gradient_descent(learning_rate, max_iter, x_init)

# Draw the target function and the process of gradient descent
x = np.linspace(-10, 10, 100)
y = f(x)
y_history = f(np.array(x_history))
plt.plot(x, y, label='f(x)')
plt.plot(x_history, y_history, 'ro-', label='Gradient Descent')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.title('One-Dimensional Gradient Descent')
plt.show()

print('x_min:', x_min)

you can change learning rate and max_iter to see different images.

42.48.5. 2. Two-dimensional gradient descent#

Two-dimensional gradient descent is an optimization algorithm used to find the minimum (or maximum) of a function with two input variables.

Here are the main steps:

step 1: Choose an initial point on the function surface by specifying initial values for both x and y.

x_init = -5
y_init = 5

step 2: Compute the partial derivatives of the function at the current point. These partial derivatives represent the rate of change of the function with respect to x and y.

def dfx(x, y):
    return 2 * x + 5

def dfy(x, y):
    return 2 * y + 2

step 3: Update the positions along the function surface for both x and y using their respective partial derivatives and the learning rate.

learning_rate = 0.1
x -= learning_rate * dfx(x, y)
y -= learning_rate * dfy(x, y)

step 4: Repeat the process for a specified number of iterations.

max_iter = 100
for i in range(max_iter):
    gradient_x = dfx(x, y)
    gradient_y = dfy(x, y)
    x -= learning_rate * gradient_x
    y -= learning_rate * gradient_y

step 5: Output the results, which are the minimum values of x and y that correspond to the minimum of the function.

print('x_min:', x)
print('y_min:', y)

42.48.5.1. Let’s try!#

# Define the target function
def f(x, y):
    return x ** 2 + y ** 2 + 5 * x + 2 * y + 10

# Take the partial derivative of the objective function with respect to x
def dfx(x, y):
    return 2 * x + 5

# Take the partial derivative of the objective function with respect to y
def dfy(x, y):
    return 2 * y + 2
# Define the two-dimensional gradient descent function
def gradient_descent(learning_rate, max_iter, x_init, y_init):
    # step 1:start with an initial value for the two input variables, denoted as (x_init, y_init).
    x = x_init
    y = y_init
    x_history = [x]
    y_history = [y]
    for i in range(max_iter):
        # step 2:compute the partial derivatives of the function at the current point.
        gradient_x = dfx(x, y)
        gradient_y = dfy(x, y)
        x -= learning_rate * gradient_x
        y -= learning_rate * gradient_y
        # step 3:update the positions along the function surface.
        x_history.append(x)
        y_history.append(y)
    # step 4:repeat the process.
    return x, y, x_history, y_history
# Set the learning rate and initial value
learning_rate = 0.1
max_iter = 100
x_init = -5
y_init = 5

# Run two-dimensional gradient descent function
x_min, y_min, x_history, y_history = gradient_descent(learning_rate, max_iter, x_init, y_init)

# Draw the target function and the process of gradient descent
x = np.linspace(-10, 10, 100)
y = np.linspace(-10, 10, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8)
ax.plot(x_history, y_history, f(np.array(x_history), np.array(y_history)), 'ro-', label='Gradient Descent')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('f(x, y)')
ax.legend()
ax.set_title('Two-Dimensional Gradient Descent')
plt.show()

print('x_min:', x_min)
print('y_min:', y_min)