Understanding Gradient Descent

A mathematical introduction to the gradient descent optimization algorithm

Gradient descent is one of the most fundamental optimization algorithms in machine learning. Let’s dive into how it works mathematically.

The Core Concept

The goal of gradient descent is to minimize a cost function J(θ) by iteratively moving in the direction of steepest descent.

The update rule is:

\theta_{new} = \theta_{old} - \alpha \nabla J(\theta)

Where:

  • θ are the parameters we’re optimizing
  • α is the learning rate
  • ∇J(θ) is the gradient of the cost function
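To make the update rule concrete, here is a minimal one-dimensional sketch. The cost J(θ) = θ² and the starting values are arbitrary choices for illustration, not part of any real model:

```python
# Minimize J(theta) = theta^2, whose gradient is dJ/dtheta = 2 * theta.
theta = 4.0  # arbitrary starting point
alpha = 0.1  # learning rate

for _ in range(50):
    # theta_new = theta_old - alpha * gradient
    theta = theta - alpha * 2 * theta
```

Each step multiplies θ by (1 − 2α), so θ shrinks geometrically toward the minimum at 0.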

Example: Linear Regression

For linear regression, our hypothesis is:

h_\theta(x) = \theta_0 + \theta_1 x

And our cost function (Mean Squared Error) is:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

The partial derivatives are:

\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}
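These partial derivatives can be checked numerically against central finite differences of J. The toy data and parameter values below are hypothetical, used only for the check:

```python
import numpy as np

# Hypothetical toy data, purely to verify the derivative formulas
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta0, theta1 = 0.5, -0.5
m = len(y)

def J(t0, t1):
    # Mean squared error cost, as defined above
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)

# Analytic partial derivatives from the formulas above
errors = theta0 + theta1 * x - y
dJ_dtheta0 = errors.mean()
dJ_dtheta1 = (errors * x).mean()

# Central finite-difference approximations
eps = 1e-6
num0 = (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)
num1 = (J(theta0, theta1 + eps) - J(theta0, theta1 - eps)) / (2 * eps)
```

The analytic and numerical values should agree to several decimal places; this kind of gradient check is a standard way to catch sign and indexing mistakes.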

Implementation

Here’s a simple Python implementation (assuming NumPy arrays, with X including a leading column of ones so that θ₀ acts as the intercept):

def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)

    for _ in range(iterations):
        # Compute predictions
        predictions = X.dot(theta)

        # Compute errors
        errors = predictions - y

        # Update parameters
        theta = theta - (alpha / m) * X.T.dot(errors)

    return theta
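As a quick sanity check, we can run the function on hypothetical, noise-free data generated from y = 1 + 2x. The function is repeated here so the snippet runs on its own; note the column of ones in X for the intercept:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    for _ in range(iterations):
        predictions = X.dot(theta)
        errors = predictions - y
        theta = theta - (alpha / m) * X.T.dot(errors)
    return theta

# Hypothetical noise-free data generated from y = 1 + 2x
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])  # bias column for theta_0
y = 1.0 + 2.0 * x

theta = gradient_descent(X, y, np.zeros(2), alpha=0.5, iterations=5000)
# theta should approach [1.0, 2.0], recovering the intercept and slope
```

The learning rate and iteration count here were picked by trial; as the next section discusses, they matter a great deal.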

Learning Rate Considerations

The learning rate α is crucial:

  • Too large: May overshoot the minimum and diverge
  • Too small: Convergence will be very slow
  • Just right: Efficient convergence to the minimum

Finding the right learning rate often requires experimentation!
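The three regimes are easy to see on the one-dimensional cost J(θ) = θ², where each gradient step multiplies θ by (1 − 2α). The specific α values below are illustrative choices:

```python
def run(alpha, steps=100, theta=1.0):
    # Gradient descent on J(theta) = theta^2 (gradient: 2 * theta)
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

too_small = run(0.001)  # barely moves toward the minimum at 0
just_right = run(0.1)   # converges rapidly toward 0
too_large = run(1.5)    # |1 - 2*alpha| = 2 > 1, so theta diverges
```

Divergence occurs exactly when |1 − 2α| > 1, i.e. α > 1 for this cost; real cost functions have their own problem-dependent thresholds.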
