Understanding Gradient Descent

A mathematical introduction to the gradient descent optimization algorithm

Gradient descent is one of the most fundamental optimization algorithms in machine learning. Let’s dive into how it works mathematically.

The Core Concept

The goal of gradient descent is to minimize a cost function J(θ) by iteratively moving in the direction of steepest descent.

The update rule is:

\theta_{new} = \theta_{old} - \alpha \nabla J(\theta)

Where:

  • θ are the parameters we’re optimizing
  • α is the learning rate
  • ∇J(θ) is the gradient of the cost function
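To make the update rule concrete, here is a minimal one-dimensional sketch. The cost J(θ) = θ² and the starting values are arbitrary choices for illustration, not part of any real model:

```python
# Minimize J(theta) = theta^2, whose gradient is dJ/dtheta = 2 * theta.
theta = 4.0  # arbitrary starting point
alpha = 0.1  # learning rate

for _ in range(50):
    # theta_new = theta_old - alpha * gradient
    theta = theta - alpha * 2 * theta
```

Each step multiplies θ by (1 − 2α), so θ shrinks geometrically toward the minimum at 0.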

Example: Linear Regression

For linear regression, our hypothesis is:

h_\theta(x) = \theta_0 + \theta_1 x

And our cost function (Mean Squared Error) is:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

The partial derivatives are:

\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)

\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x^{(i)}
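These partial derivatives can be checked numerically against central finite differences of J. The toy data and parameter values below are hypothetical, used only for the check:

```python
import numpy as np

# Hypothetical toy data, purely to verify the derivative formulas
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta0, theta1 = 0.5, -0.5
m = len(y)

def J(t0, t1):
    # Mean squared error cost, as defined above
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)

# Analytic partial derivatives from the formulas above
errors = theta0 + theta1 * x - y
dJ_dtheta0 = errors.mean()
dJ_dtheta1 = (errors * x).mean()

# Central finite-difference approximations
eps = 1e-6
num0 = (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)
num1 = (J(theta0, theta1 + eps) - J(theta0, theta1 - eps)) / (2 * eps)
```

The analytic and numerical values should agree to several decimal places; this kind of gradient check is a standard way to catch sign and indexing mistakes.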

Implementation

Here’s a simple Python implementation (assuming NumPy arrays, with X including a leading column of ones so that θ₀ acts as the intercept):

def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)

    for _ in range(iterations):
        # Compute predictions
        predictions = X.dot(theta)

        # Compute errors
        errors = predictions - y

        # Update parameters
        theta = theta - (alpha / m) * X.T.dot(errors)

    return theta
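As a quick sanity check, we can run the function on hypothetical, noise-free data generated from y = 1 + 2x. The function is repeated here so the snippet runs on its own; note the column of ones in X for the intercept:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, iterations):
    m = len(y)
    for _ in range(iterations):
        predictions = X.dot(theta)
        errors = predictions - y
        theta = theta - (alpha / m) * X.T.dot(errors)
    return theta

# Hypothetical noise-free data generated from y = 1 + 2x
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])  # bias column for theta_0
y = 1.0 + 2.0 * x

theta = gradient_descent(X, y, np.zeros(2), alpha=0.5, iterations=5000)
# theta should approach [1.0, 2.0], recovering the intercept and slope
```

The learning rate and iteration count here were picked by trial; as the next section discusses, they matter a great deal.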

Learning Rate Considerations

The learning rate α is crucial:

  • Too large: May overshoot the minimum and diverge
  • Too small: Convergence will be very slow
  • Just right: Efficient convergence to the minimum

Finding the right learning rate often requires experimentation!
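The three regimes are easy to see on the one-dimensional cost J(θ) = θ², where each gradient step multiplies θ by (1 − 2α). The specific α values below are illustrative choices:

```python
def run(alpha, steps=100, theta=1.0):
    # Gradient descent on J(theta) = theta^2 (gradient: 2 * theta)
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

too_small = run(0.001)  # barely moves toward the minimum at 0
just_right = run(0.1)   # converges rapidly toward 0
too_large = run(1.5)    # |1 - 2*alpha| = 2 > 1, so theta diverges
```

Divergence occurs exactly when |1 − 2α| > 1, i.e. α > 1 for this cost; real cost functions have their own problem-dependent thresholds.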
