Mastering Gradient Descent

Priyanka Dave
5 min read · Apr 26, 2024

A Comprehensive Guide to Optimizing Machine Learning Models

Table of Contents:

  • What is Gradient Descent?
  • What is a Loss Function?
  • Convex Vs. Non-convex Loss Functions
  • What is Convergence?
  • Understand the behavior of Gradients
  • How does Gradient Descent determine the direction to move?
  • How to update weights using Gradient Descent?
  • Key takeaways

What is Gradient Descent?

  • Gradient descent is an optimization algorithm commonly used in machine learning.
  • Optimization algorithms are used to minimize the error or loss function to improve the model’s predictive accuracy.

What is a Loss Function?

A loss function is a measure of the difference between the model’s predicted values and the actual values.

Let’s take a regression example (a short code sketch follows the list below):

  • Machine learning (ML) and deep learning (DL) algorithms aim to minimize loss while enhancing accuracy.
  • The choice of loss function is critical as it directly impacts the performance of the machine learning model.
  • During the training process, the optimization algorithm adjusts the model parameters iteratively to minimize the loss function.
  • Minimizing the loss function results in a model that makes more accurate predictions on unseen data.
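
As a small illustration, here is a minimal sketch of Mean Squared Error (MSE), a common regression loss; the data values below are made up for this example:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean Squared Error: the average of squared differences
    between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical actual vs. predicted values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.8, 9.4]

print(mse_loss(actual, predicted))  # 0.175 -> smaller value means a better fit
```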

Convex Vs. Non-convex Loss Functions:

1. Convex Loss Functions:

  • Convex loss functions are preferred because they have a unique global minimum, making optimization easier and more reliable.
  • Example: Mean Squared Error (MSE), Mean Absolute Error (MAE), Huber loss.

2. Non-convex Loss Functions:

  • Non-convex loss functions can have multiple local minima, making optimization more challenging because the optimizer might converge to a suboptimal solution depending on the initialization and optimization algorithm.
  • Example: cross-entropy, sigmoid-based, and softmax-based losses as used in deep neural networks (the non-convexity comes from optimizing over the network’s weights); the sketch after this list illustrates how initialization affects which minimum is reached.
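
As a rough illustration of that sensitivity to initialization, the sketch below runs plain gradient descent on a made-up non-convex function, f(x) = x⁴ − 3x² + x (not taken from the article), starting from two different points and ending in two different minima:

```python
def f(x):
    # A made-up non-convex function with two minima
    return x**4 - 3 * x**2 + x

def grad_f(x):
    # Analytical derivative of f
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=500):
    for _ in range(steps):
        x = x - lr * grad_f(x)  # move opposite to the gradient
    return x

# Same algorithm, different initializations, different minima:
print(gradient_descent(x=-2.0))  # ends near the global minimum (~ -1.30)
print(gradient_descent(x=2.0))   # ends near a local minimum   (~  1.13)
```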

What is Convergence?

Convergence in gradient descent means that the algorithm’s iterations lead to parameter values (weights & bias) that either reach a local minimum, global minimum, or a stationary point of the loss function.
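
One common way to detect convergence in practice is to stop once the loss changes by less than a small tolerance between iterations. A minimal sketch, assuming a simple convex loss f(x) = (x − 3)² chosen purely for illustration:

```python
def minimize(x, f, grad_f, lr=0.1, tol=1e-8, max_iters=10_000):
    """Run gradient descent until the change in loss becomes negligible."""
    prev_loss = f(x)
    for _ in range(max_iters):
        x = x - lr * grad_f(x)           # step opposite to the gradient
        loss = f(x)
        if abs(prev_loss - loss) < tol:  # negligible improvement => converged
            break
        prev_loss = loss
    return x

# Example: f(x) = (x - 3)^2, whose minimum is at x = 3
x_min = minimize(x=0.0, f=lambda x: (x - 3) ** 2, grad_f=lambda x: 2 * (x - 3))
print(x_min)  # approaches 3
```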

Understand the behavior of gradients:

Gradient = Slope = Derivative

  • The gradient of a function f(x) is its first derivative, which we write as f’(x).
  • The first derivative (gradient) tells us whether a function is increasing or decreasing, and by how much.

  • A positive gradient tells us that, as x increases, f(x) also increases.
  • A negative gradient tells us that, as x increases, f(x) decreases.
  • A zero gradient does not tell us much on its own: the point could be a local minimum, a local maximum, or neither (for example, f(x) = x³ has a zero gradient at x = 0 yet keeps increasing through that point). The quick check after this list verifies these signs numerically.
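
Here is that quick check, using a central finite-difference approximation of the derivative on the made-up function f(x) = x²:

```python
def numerical_gradient(f, x, h=1e-6):
    # Central finite-difference approximation of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x**2  # simple convex example

print(numerical_gradient(f, 2.0))   # ~  4.0 -> positive: f is increasing at x = 2
print(numerical_gradient(f, -2.0))  # ~ -4.0 -> negative: f is decreasing at x = -2
print(numerical_gradient(f, 0.0))   # ~  0.0 -> stationary point (the minimum here)
```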

How does Gradient Descent determine the direction to move?

For example:

Let’s say we randomly start at x = 0, where f’(x) = 2:

  • Since the gradient is positive (+2), it indicates that moving in the positive direction (increasing x) will increase the value of the loss function f(x).
  • Therefore, to minimize the loss function f(x), you should move in the opposite direction, which means decreasing x (-0.1, -0.2, -0.3, … etc.).

Let’s say we randomly start at x = 1, where f’(x) = -1:

  • Since the derivative is negative (-1), it indicates that the loss function f(x) is decreasing at x = 1, i.e., increasing x reduces the loss.
  • Therefore, to minimize the loss function f(x), you should move against the sign of the gradient, which means increasing x (1.1, 1.2, 1.3, … etc.); the small sketch below captures this rule.
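
In code, both cases reduce to the same rule: step in the direction opposite to the sign of the gradient. A tiny sketch using the gradient values from the examples above (the underlying function itself is not specified in this article):

```python
def step_direction(gradient):
    # Move opposite to the gradient: positive gradient => decrease x,
    # negative gradient => increase x, zero gradient => stay put.
    return -1.0 if gradient > 0 else (1.0 if gradient < 0 else 0.0)

print(step_direction(2.0))   # -1.0 -> decrease x (the x = 0 case, f'(x) = +2)
print(step_direction(-1.0))  # +1.0 -> increase x (the x = 1 case, f'(x) = -1)
```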

How to update weights using Gradient Descent?

  • By subtracting a portion of the gradient from the current value of x (weight).

x(new) = x(old) − α * gradient

where α is the learning rate

  • Let’s consider the previous example, where we randomly started with x = 1 and the gradient was (-1).
  • Let’s take the learning rate as 0.1.
  • Now, to find the new value of x:

x(new) = 1 − (0.1 * (-1)) → x(new) = 1 + 0.1 = 1.1
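
Written as a function, the same update rule looks like this (the gradient value -1 is hard-coded from the example above, since the article does not specify the underlying loss function):

```python
def update_weight(x_old, gradient, learning_rate=0.1):
    """Gradient descent update: x(new) = x(old) - learning_rate * gradient."""
    return x_old - learning_rate * gradient

# The example step: x = 1, f'(x) = -1, learning rate = 0.1
x_new = update_weight(x_old=1.0, gradient=-1.0, learning_rate=0.1)
print(x_new)  # 1.1 -> matches the hand calculation above
```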

  • We will repeat the same process for x = 1.1, where we get f’(x) = -0.31.
  • Choosing a learning rate that is neither too large nor too small is crucial for successful optimization with gradient descent; a small comparison sketch follows this list.
  • The iteration process in Gradient Descent stops when the change in the loss function between iterations becomes negligible, indicating that further iterations are unlikely to significantly improve the solution.
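
Here is the comparison sketch mentioned above, using the made-up loss f(x) = x² with gradient 2x and minimum at x = 0: too small a learning rate barely moves, while too large a rate overshoots and diverges.

```python
def run(lr, x=5.0, steps=20):
    # Gradient descent on f(x) = x^2, whose gradient is 2x
    for _ in range(steps):
        x = x - lr * (2 * x)
    return x

print(run(lr=0.001))  # too small: still far from 0 after 20 steps
print(run(lr=0.1))    # reasonable: close to the minimum at 0
print(run(lr=1.1))    # too large: the iterates oscillate and blow up
```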

A larger magnitude of the gradient suggests a steeper slope, indicating that small changes in the parameter values can lead to significant changes in the loss function.

If the magnitude of the gradient becomes very small, it suggests that the optimization process is close to convergence, as the slope of the loss function is becoming flatter.
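
A variant of the earlier stopping criterion makes this concrete: stop once the magnitude of the gradient falls below a small threshold (again using the made-up convex loss f(x) = (x − 3)², whose gradient is 2(x − 3)):

```python
def minimize_by_gradient_norm(x, grad_f, lr=0.1, tol=1e-6, max_iters=10_000):
    """Stop once the gradient magnitude is (almost) zero, i.e. the slope is flat."""
    for _ in range(max_iters):
        g = grad_f(x)
        if abs(g) < tol:  # flat slope => (near) convergence
            break
        x = x - lr * g
    return x

print(minimize_by_gradient_norm(x=0.0, grad_f=lambda x: 2 * (x - 3)))  # ~ 3.0
```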


Key takeaways:

  • Gradient descent is an optimization algorithm used to minimize a loss function in machine learning.
  • It iteratively adjusts the parameters of a model in the direction opposite to the gradient of the function.
  • The process continues until convergence criteria are met.
  • Choosing an appropriate learning rate is crucial for convergence.

You can explore practical demonstrations in my GitHub repository here.

Thank you!
