Short Summary: Exploding and vanishing gradients are phenomena that typically occur in deep neural networks (DNNs) with many layers. They arise primarily because backpropagation multiplies per-layer gradients together via the chain rule, so a long chain of factors can grow or shrink exponentially.
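To make the chain-product intuition concrete, here is a rough schematic (the notation \(h_l\), \(W_1\), \(\mathcal{L}\) is introduced here for illustration): for an \(L\)-layer network with hidden states \(h_l\) and loss \(\mathcal{L}\), the gradient reaching an early weight matrix \(W_1\) contains one Jacobian factor per layer,

\[
\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial h_L} \left( \prod_{l=2}^{L} \frac{\partial h_l}{\partial h_{l-1}} \right) \frac{\partial h_1}{\partial W_1}.
\]

If the factors in the product are consistently larger than 1 in norm, the gradient explodes; if they are consistently smaller than 1, it vanishes.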
Addressing Exploding and Vanishing Gradients
- Gradient Clipping
- Clip the gradient, by global norm or by value, at a fixed threshold during backpropagation (see the clipping sketch after this list)
- Activation Function
- Use \(\text{ReLU}(x) = \max(0, x)\) or \(\text{Swish}(x) = x \cdot \sigma(x)\) instead of saturating activations such as \(\sigma(x) = \dfrac{1}{1 + e^{-x}}\) or \(\tanh(x) = \dfrac{e^x - e^{-x}}{e^{x} + e^{-x}}\), whose derivatives shrink toward zero for large \(|x|\) (see the activation sketch after this list)
- Skip Connections or Residual Connections
- Skip Connections: allow information to bypass several intermediate layers
- Residual Connections: add a block's input directly to its output (\(y = F(x) + x\)), so information bypasses the layers inside the block (see the residual-block sketch after this list)
- Both approaches give gradients a short, direct path back to earlier layers, so the backward signal is far less distorted by the long chain of multiplications
- Batch Normalization
- Normalizes each layer's inputs to roughly zero mean and unit variance per mini-batch, keeping activations (and the gradients that flow through them) in a stable range (sketched after this list)
- Model architecture
- A shallower network (fewer layers) may work better: fewer layers means fewer multiplications in the gradient chain and, consequently, a lower risk of vanishing/exploding gradients
- Lowering learning rate
- Can limit the damage from exploding gradients (smaller steps when gradients blow up), but the network will take longer to learn. Adaptive optimizers like AdamW scale step sizes per parameter and are less sensitive to the exact learning rate (though they still take one, see the optimizer sketch after this list), so manual learning-rate tuning is usually a last resort.
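A minimal PyTorch-style sketch of gradient clipping inside one training step. The model, data, optimizer, and the threshold `max_norm=1.0` are hypothetical choices for illustration only, not recommendations.

```python
import torch

# Hypothetical model, optimizer, and batch, purely for illustration
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Rescale gradients so their global norm does not exceed the threshold
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```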
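A sketch of swapping saturating activations for non-saturating ones in a PyTorch `nn.Sequential`; the layer widths are arbitrary. Note that PyTorch exposes Swish (with \(\beta = 1\)) under the name `nn.SiLU`.

```python
import torch.nn as nn

# Sigmoid/tanh saturate, so their derivatives shrink toward zero in deep stacks
saturating = nn.Sequential(
    nn.Linear(64, 64), nn.Sigmoid(),
    nn.Linear(64, 64), nn.Tanh(),
)

# ReLU and SiLU (Swish) keep a non-vanishing gradient for positive inputs
non_saturating = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.SiLU(),
)
```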
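A sketch of a residual block, assuming a simple fully connected body; the key line is the `x + self.body(x)` addition, which gives gradients a direct path around the block.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Adds the block's input back to its output: y = F(x) + x."""

    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity term passes gradients straight through, bypassing self.body
        return x + self.body(x)
```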
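A sketch of inserting batch normalization between the layers of a small fully connected network; the layer sizes are arbitrary.

```python
import torch.nn as nn

# BatchNorm1d re-centers and re-scales each layer's inputs per mini-batch,
# keeping activations (and the gradients flowing through them) in a stable range
net = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
```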
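A sketch of configuring AdamW with an explicitly lowered base learning rate; the model and the values `lr=1e-4` and `weight_decay=0.01` are illustrative assumptions, not tuned recommendations.

```python
import torch

model = torch.nn.Linear(10, 1)  # hypothetical model for illustration

# AdamW adapts per-parameter step sizes, but still takes a base learning rate;
# lowering it (e.g. 1e-4 instead of the common 1e-3 default) trades speed for stability
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
```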