Table of Contents
- Definitions
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-Batch Gradient Descent
- Advanced Gradient Descent Variants
Short Summary: Gradient descent is a widely used optimization algorithm for efficiently training machine learning models. There are three primary categories of gradient descent: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent.
Definitions
- \(\theta\) - Model parameters.
- \(J(\theta)\) - Loss function.
- \(\bigtriangledown_\theta J(\theta)\) - Gradient of the loss function w.r.t. the parameters \(\theta\).
- \(\alpha\) - Learning rate.
- \(n\) - Mini-batch size.
Batch Gradient Descent
- The entire dataset is used for updating the model parameters.
- Does not allow for online learning (i.e., on-the-fly updates with new samples).
- Slow and problematic if the entire dataset does not fit into memory.
- Converges to the global minimum for convex functions and to a local minimum for non-convex functions for "reasonable" learning rates.
\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta) \end{equation}
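A minimal NumPy sketch of the update above, using mean-squared-error linear regression as the loss (all names and data are illustrative):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, epochs=100):
    """Full-batch gradient descent for MSE linear regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Gradient of J(theta) computed over the ENTIRE dataset.
        grad = 2.0 / len(X) * X.T @ (X @ theta - y)
        theta -= alpha * grad  # theta = theta - alpha * grad J(theta)
    return theta

# Tiny example: y = 2x, so the optimum is theta ~ [2.0].
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = batch_gradient_descent(X, y)
```

Every epoch touches all samples, which is exactly what makes this variant slow and memory-hungry on large datasets.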
Stochastic Gradient Descent
- A single \((x, y)\) feature-label pair is used for updating the model parameters.
- Allows for online learning and is faster than Batch Gradient Descent, since each update needs only a single sample in memory.
- Loss function fluctuates heavily with each iteration resulting in high variance updates.
- Might fail to converge exactly due to fluctuations, but has a chance of jumping out of a local minimum and converging to a better one.
\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta;{x^{(i)}};{y^{(i)}}) \end{equation}
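The same toy regression problem, sketched with one \((x^{(i)}, y^{(i)})\) pair per update (NumPy assumed; shuffling order per epoch is a common but optional choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(X, y, alpha=0.05, epochs=50):
    """Stochastic gradient descent: one (x, y) pair per parameter update."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):  # visit samples in shuffled order
            # Gradient of the MSE loss from a SINGLE sample i.
            grad = 2.0 * X[i] * (X[i] @ theta - y[i])
            theta -= alpha * grad
    return theta

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = sgd(X, y)
```

On this tiny noiseless problem SGD still converges cleanly; on real data the per-sample gradients differ, which produces the high-variance updates mentioned above.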
Mini-Batch Gradient Descent
- A mini-batch (usually from \(n = 8\) to \(n = 512\) samples) is used for updating the model parameters.
- Allows for online training and is fast as it leverages highly optimized tensor operations.
- Reduces the fluctuations of Stochastic Gradient Descent.
- Does not guarantee good convergence out-of-the-box; in practice it is combined with advanced techniques such as momentum and adaptive learning rates (see below).
\begin{equation} \theta = \theta - \alpha \cdot \bigtriangledown_\theta J(\theta;{x^{(i:i+n)}};{y^{(i:i+n)}}) \end{equation}
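A sketch of the mini-batch variant on the same illustrative regression setup, with the \(x^{(i:i+n)}\) slice realized as a batch of shuffled indices:

```python
import numpy as np

rng = np.random.default_rng(0)

def minibatch_gd(X, y, n=2, alpha=0.05, epochs=100):
    """Mini-batch gradient descent with batch size n."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), n):
            batch = idx[start:start + n]  # the i..i+n mini-batch
            # Averaged gradient over the n samples of the batch,
            # computed as one vectorized tensor operation.
            grad = 2.0 / len(batch) * X[batch].T @ (X[batch] @ theta - y[batch])
            theta -= alpha * grad
    return theta

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
theta = minibatch_gd(X, y)
```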
Advanced Gradient Descent Variants
Momentum
- Helps escape plateaus and ravines, where plain gradient descent tends to get stuck or oscillate around local minima.
- Accelerates in the relevant direction and reduces oscillations in the others.
- The coefficient \(\beta\) is in the \((0, 1)\) interval.
\begin{align} v_t &= {\beta \cdot v_{t-1}} + \alpha \cdot \bigtriangledown_\theta J(\theta)\\ \theta &= \theta - v_t \end{align}
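The two-line update above, sketched as a single step function and applied to a toy quadratic (the function and hyperparameters are illustrative):

```python
def momentum_step(theta, v, grad, alpha=0.1, beta=0.9):
    """One momentum update: v_t = beta * v_{t-1} + alpha * grad; theta = theta - v_t."""
    v = beta * v + alpha * grad
    return theta - v, v

# Minimize J(theta) = theta^2 (gradient 2 * theta), starting from theta = 5.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, 2 * theta)
```

The velocity \(v\) accumulates gradients that point the same way, which is the acceleration effect; gradients that flip sign partially cancel in \(v\), which is the damping of oscillations.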
Nesterov Accelerated Gradient (NAG)
- Momentum could gain "too much speed" in one direction and overshoot past the minimum (on that dimension).
- Slows down before the slope changes: it needs to "understand" its direction and the path ahead.
- Approximates the next position of the parameters as \(\theta - \beta \cdot v_{t-1}\) and calculates the gradient at that look-ahead point instead of at the current parameters.
- The look-ahead is only an approximation of the next position: it applies the momentum term but omits the not-yet-computed gradient term of the full update.
\begin{align} v_t &= \beta \cdot v_{t-1} + \alpha \cdot \bigtriangledown_\theta J(\theta - {\beta \cdot v_{t-1}})\\ \theta &= \theta - v_t \end{align}
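Compared with plain momentum, only one line changes: the gradient is evaluated at the look-ahead point. A sketch on the same toy quadratic (names are illustrative):

```python
def nag_step(theta, v, grad_fn, alpha=0.1, beta=0.9):
    """NAG update: the gradient is evaluated at the look-ahead point theta - beta * v."""
    v = beta * v + alpha * grad_fn(theta - beta * v)
    return theta - v, v

# Minimize J(theta) = theta^2, starting from theta = 5.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = nag_step(theta, v, lambda t: 2 * t)
```

Because the gradient already "sees" where momentum is about to carry the parameters, NAG brakes earlier than plain momentum when it is about to overshoot.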
Adagrad, Adadelta, and RMSProp
- Problem: All parameters share the same learning rate (not ideal for sparse data).
- Uses a different learning rate for each parameter.
- Performs smaller updates for parameters associated with frequently occurring features.
- Performs larger updates for parameters associated with infrequent features.
- Allows for better convergence.
- For the most part, eliminates the need for tuning the learning rate.
- \(G_t\) is the per-parameter sum of the squares of the past gradients; \(\epsilon\) is a small constant that avoids division by zero.
\begin{equation} \theta_i = \theta_i - \frac{\alpha}{\sqrt{G_t + \epsilon}} \cdot \bigtriangledown_\theta J(\theta_i) \end{equation}
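A minimal Adagrad sketch (NumPy assumed, hyperparameters illustrative). The toy objective deliberately has very different curvature per parameter, so each one receives a different effective learning rate:

```python
import numpy as np

def adagrad(grad_fn, theta0, alpha=0.5, eps=1e-8, steps=2000):
    """Adagrad: per-parameter step size alpha / sqrt(G_t + eps)."""
    theta = np.array(theta0, dtype=float)
    G = np.zeros_like(theta)  # per-parameter sum of squared past gradients
    for _ in range(steps):
        g = grad_fn(theta)
        G += g ** 2
        theta -= alpha / np.sqrt(G + eps) * g
    return theta

# Minimize J(theta) = theta_0^2 + 10 * theta_1^2: the second parameter's
# larger gradients accumulate faster in G, shrinking its step size sooner.
theta = adagrad(lambda t: np.array([2 * t[0], 20 * t[1]]), [5.0, 5.0])
```

The ever-growing \(G_t\) is also Adagrad's weakness: step sizes shrink monotonically, which Adadelta and RMSProp address by replacing the sum with a decaying average.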
Adam, AdaMax, and Nadam
- Adam (Adaptive Moment Estimation) is RMSProp + Momentum.
- AdaMax (Stable Adam) solves some of the numerical stability and bias issues in Adam.
- Uses the \(L_{\infty}\) norm of past gradients instead of the \(L_2\) norm.
- Nadam (Nesterov-accelerated Adaptive Moment Estimation) is Adam + NAG.
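A sketch of the Adam update combining the two ideas above (momentum-style first moment, RMSProp-style second moment), following the standard update rules on an illustrative toy quadratic:

```python
import numpy as np

def adam(grad_fn, theta0, alpha=0.1, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=1000):
    """Adam: bias-corrected first- and second-moment estimates of the gradient."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)  # first moment (decaying mean of gradients)
    v = np.zeros_like(theta)  # second moment (decaying mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # correct the bias toward zero
        v_hat = v / (1 - beta2 ** t)  # from the zero initialization
        theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimize J(theta) = theta^2, starting from theta = 5.
theta = adam(lambda t: 2 * t, np.array([5.0]))
```

Note how the very first step has magnitude close to \(\alpha\) regardless of the gradient's scale: the \(m / \sqrt{v}\) ratio roughly normalizes the update, which is one reason Adam needs little learning-rate tuning.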