- What Are Vanishing and Exploding Gradients?
- Mathematical Principles Behind Vanishing and Exploding Gradients
- Vanishing and Exploding Gradients in CNNs
- Vanishing and Exploding Gradients in RNNs
- Practical Tips
- Conclusion
- References
In deep learning, vanishing gradients and exploding gradients are two common and challenging issues, especially prominent when training deep neural networks such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). This post explains in detail the principles, effects, and mitigation strategies for these phenomena in CNNs and RNNs, complemented with mathematical formulas to enhance understanding.
What Are Vanishing and Exploding Gradients?
During neural network training, backpropagation is used to compute gradients of the loss function with respect to parameters at each layer. These gradients are used to update parameters and minimize the loss. However, when a network becomes very deep or has a complex structure, gradients may shrink toward zero (vanishing gradients) or grow uncontrollably large (exploding gradients) as they propagate through the layers.
Vanishing Gradient
A vanishing gradient refers to a situation where, during backpropagation, gradients gradually shrink as they pass through layers, resulting in parameters near the input layer barely being updated. This prevents the network from learning effective representations, causing training stagnation.
Exploding Gradient
Exploding gradients occur when gradients grow larger as they propagate backward, eventually becoming excessively large. This leads to unstable parameter updates and even numerical overflow, making training infeasible.
Mathematical Principles Behind Vanishing and Exploding Gradients
Consider a deep feedforward network with $L$ layers, where each layer computes $h^{(k)} = f\!\left(W^{(k)} h^{(k-1)} + b^{(k)}\right)$. By the chain rule, the gradient of the loss $\mathcal{L}$ with respect to the weights of an early layer $l$ is

$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial h^{(L)}} \left( \prod_{k=l+1}^{L} \frac{\partial h^{(k)}}{\partial h^{(k-1)}} \right) \frac{\partial h^{(l)}}{\partial W^{(l)}},$$

where each per-layer Jacobian is $\frac{\partial h^{(k)}}{\partial h^{(k-1)}} = \mathrm{diag}\!\left(f'\!\left(z^{(k)}\right)\right) W^{(k)}$, with $z^{(k)} = W^{(k)} h^{(k-1)} + b^{(k)}$ the pre-activation. When the norms of these Jacobians are consistently smaller than 1, their product shrinks exponentially with depth (vanishing gradients); when they are consistently larger than 1, it grows exponentially (exploding gradients).
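To make the exponential behaviour concrete, here is a minimal numeric sketch; the per-layer factors 0.9 and 1.1 are illustrative assumptions standing in for Jacobian norms, not measured values.

```python
# Minimal sketch: multiplying a gradient by 50 per-layer factors that sit
# slightly below or above 1 shrinks or blows it up exponentially.
depth = 50
grad_small = 1.0   # stands in for a gradient whose per-layer Jacobian norm is 0.9
grad_large = 1.0   # stands in for a gradient whose per-layer Jacobian norm is 1.1

for _ in range(depth):
    grad_small *= 0.9   # norm < 1 -> vanishing
    grad_large *= 1.1   # norm > 1 -> exploding

print(f"after {depth} layers: {grad_small:.2e} vs {grad_large:.2e}")
# after 50 layers: 5.15e-03 vs 1.17e+02
```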
Vanishing and Exploding Gradients in CNNs
Introduction to Convolutional Neural Networks (CNNs)
CNNs are primarily used for grid-like data such as images. Their core component is the convolutional layer, which extracts local features using filters and captures higher-level representations by stacking multiple layers.
Gradient Issues in CNNs
Although CNNs face fewer gradient problems than RNNs due to their local connectivity and absence of long-term temporal dependencies, very deep CNNs can still suffer from vanishing gradients. For example, classic VGG networks stack more than ten convolutional layers and can already show vanishing gradients during training.
Mitigation Strategies
- Using Appropriate Activation Functions
  ReLU and its variants help alleviate vanishing gradients because their derivative is 1 in the positive region, so gradients are not attenuated as they pass through the activation.
- Weight Initialization
  Proper initialization (He or Xavier) keeps the variance of activations and gradients roughly constant across layers.
  - Xavier initialization (for Sigmoid/Tanh): $W \sim \mathcal{N}\!\left(0, \dfrac{2}{n_{\text{in}} + n_{\text{out}}}\right)$
  - He initialization (for ReLU): $W \sim \mathcal{N}\!\left(0, \dfrac{2}{n_{\text{in}}}\right)$
- Batch Normalization (BN)
  BN normalizes layer inputs to keep activations within a reasonable range, stabilizing gradient flow:
  $$\hat{x} = \frac{x - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}}$$
  where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}^{2}$ are the mean and variance of the batch data, respectively, and $\epsilon$ is a small constant that prevents division by zero. The normalized value is then scaled and shifted by learnable parameters.
- Residual Connections
  Skip connections allow gradients to flow directly through the network, greatly reducing vanishing gradients:
  $$y = F(x) + x$$
  where $F(x)$ is the output of a small stack of layers and $x$ is the block input. A code sketch combining these strategies follows this list.
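As a rough illustration of how these strategies fit together, the sketch below defines a small residual block with He initialization, Batch Normalization, and ReLU. PyTorch is assumed here purely for concreteness; the article itself does not prescribe a framework.

```python
# Sketch of a residual CNN block combining He init, BatchNorm, ReLU, and a skip connection.
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)   # keeps activations in a stable range
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)     # derivative is 1 in the positive region

        # He (Kaiming) initialization, matched to the ReLU activation.
        for conv in (self.conv1, self.conv2):
            nn.init.kaiming_normal_(conv.weight, nonlinearity="relu")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)             # y = F(x) + x: the skip path lets gradients bypass F


# Quick shape check on random data
block = ResidualBlock(channels=16)
print(block(torch.randn(2, 16, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```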
Vanishing and Exploding Gradients in RNNs
Introduction to Recurrent Neural Networks (RNNs)
RNNs are designed for sequential data such as natural language and time series. Their recurrent structure allows accumulation of temporal information but also makes them prone to vanishing and exploding gradients, especially over long sequences.
Gradient Issues in RNNs
In a standard RNN, the hidden state is updated at each time step as

$$h_t = \tanh\!\left(W_{hh} h_{t-1} + W_{xh} x_t + b_h\right).$$

Backpropagation Through Time (BPTT) unrolls the network over the sequence, so the gradient of the loss with respect to an early hidden state $h_k$ involves the product

$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} \mathrm{diag}\!\left(1 - h_i^{2}\right) W_{hh}.$$

Multiplying many such factors across time steps causes the gradient to decrease or increase exponentially with the sequence length.
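The same effect can be observed numerically. The sketch below multiplies the per-step Jacobians of a randomly initialized tanh RNN; the weight scale (0.3) is an arbitrary illustrative choice, and the exact numbers depend on the random seed.

```python
# Sketch: track the norm of the product of RNN Jacobians diag(1 - h_i^2) @ W_hh.
import numpy as np

rng = np.random.default_rng(0)
hidden = 64
W_hh = rng.normal(0.0, 0.3 / np.sqrt(hidden), size=(hidden, hidden))  # small recurrent weights
W_xh = rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, hidden))

h = np.zeros(hidden)
jac_product = np.eye(hidden)
for t in range(1, 51):
    h = np.tanh(W_hh @ h + W_xh @ rng.normal(size=hidden))
    jac_product = np.diag(1.0 - h**2) @ W_hh @ jac_product
    if t % 10 == 0:
        print(t, np.linalg.norm(jac_product, 2))
# With small recurrent weights the norm collapses toward zero (vanishing);
# scaling W_hh up by a few times makes it grow without bound instead (exploding).
```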
Mitigation Strategies
- Gradient Clipping
  Limit the gradient norm to prevent exploding gradients. For example, when the L2 norm of the gradient exceeds a threshold $\tau$, rescale it back to that threshold (see the code sketch after this list):
  $$g \leftarrow \tau \, \frac{g}{\lVert g \rVert_{2}} \quad \text{if } \lVert g \rVert_{2} > \tau.$$
- Using More Advanced RNN Units
  LSTM and GRU use gating mechanisms to control information flow and preserve long-term dependencies. An LSTM unit contains an input gate, a forget gate, and an output gate, with the following equations:
  $$\begin{aligned}
  f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) \\
  i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) \\
  \tilde{C}_t &= \tanh\!\left(W_C [h_{t-1}, x_t] + b_C\right) \\
  C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
  o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) \\
  h_t &= o_t \odot \tanh\!\left(C_t\right)
  \end{aligned}$$
  Because the cell state $C_t$ is updated largely additively through these gates, the LSTM can effectively maintain long-term dependencies and reduce the risk of vanishing gradients.
- Weight Initialization and Regularization
  Appropriate initialization and regularization help stabilize training.
- Alternative Activation Functions
  Although less common, some RNN variants use ReLU or other activations to help gradient propagation.
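To make the first strategy concrete, the training-step fragment below applies gradient clipping before the optimizer update. PyTorch is assumed, and the LSTM model, random tensors, and loss are illustrative stand-ins rather than part of the original article.

```python
# Sketch: one training step with gradient clipping (PyTorch assumed, random stand-in data).
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)  # gated unit instead of a vanilla RNN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.randn(8, 100, 32)        # (batch, seq_len, features)
target = torch.randn(8, 100, 64)   # matches the LSTM's hidden size

output, _ = model(x)
loss = criterion(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their global L2 norm does not exceed the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```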
Practical Tips
- Choose LSTM or GRU over standard RNNs for long-term dependency tasks.
- Initialize weights according to the activation function (He or Xavier).
- Use Batch Normalization in CNNs to improve stability and speed.
- Apply gradient clipping in RNN training to prevent explosions.
- Regularly monitor gradient distributions to detect problems early; the sketch below shows one way to log them.
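One simple way to monitor gradients is to log per-parameter gradient norms right after the backward pass. The helper below is a hypothetical sketch, again assuming a PyTorch model.

```python
# Sketch: call this after loss.backward() to spot vanishing (norms near 0)
# or exploding (very large norms) gradients per parameter tensor.
def log_gradient_norms(model):
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name}: {param.grad.norm().item():.3e}")
```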
Conclusion
Vanishing and exploding gradients are fundamental issues in deep learning. Understanding their origins and mitigation strategies is essential for building stable and effective neural networks. Through careful architecture design, activation selection, initialization, normalization, and gradient clipping, one can greatly reduce these problems and improve model performance. This article aims to help you better understand and address gradient issues to build more robust deep learning models.
References
- Understanding the difficulty of training deep feedforward neural networks — Glorot & Bengio
- Deep Residual Learning for Image Recognition — He et al.
- Long Short-Term Memory — Hochreiter & Schmidhuber
