Short Summary: Techniques for optimizing and compressing models, covering mixed-precision arithmetic, quantization, and pruning.
32-Bit, 16-Bit, and Mixed Precision Arithmetic
- Train with 32-bit floating point arithmetic.
- Train with half-precision (16-bit) floating point arithmetic.
- Train with mixed precision floating point arithmetic (see the sketch after this list).
  - Uses both 16-bit and 32-bit math: most operations run in 16-bit, while a master copy of the weights stays in 32-bit for numerical stability.
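
A minimal PyTorch sketch of one mixed-precision training step, assuming a CUDA device is available; the toy model, batch shapes, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()  # scales the loss so fp16 gradients do not underflow

inputs = torch.randn(32, 128, device="cuda")          # placeholder batch
targets = torch.randint(0, 10, (32,), device="cuda")  # placeholder labels

optimizer.zero_grad()
with autocast():                       # ops run in fp16 where it is safe
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()          # backward pass on the scaled loss
scaler.step(optimizer)                 # unscales grads, skips step on inf/NaN
scaler.update()                        # adjusts the scale factor for next step
```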
Quantization
- Performs computations and stores tensors at bit widths lower than 32-bit floating point.
- Use post-training quantization (see the first sketch after this list).
  - Train the model in floating point first, then convert it to lower bit widths such as 8-bit integers.
- Use quantization-aware training (see the second sketch after this list).
  - Emulates inference-time quantization during training by inserting fake-quantization operations.
  - Creates a model that downstream tools can then convert into an actually quantized model.
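
A minimal sketch of post-training quantization in PyTorch, using the dynamic variant; the toy model is a placeholder, and only the `nn.Linear` weights are converted to int8:

```python
import torch
import torch.nn as nn

# Assume this model has already been trained in floating point.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Dynamic post-training quantization: weights are stored as int8,
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```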
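
And a sketch of eager-mode quantization-aware training in PyTorch, assuming the `fbgemm` backend; the toy network is a placeholder, and the QuantStub/DeQuantStub mark where tensors cross the float/int8 boundary:

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 at input
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 at output

    def forward(self, x):
        return self.dequant(self.fc2(self.relu(self.fc1(self.quant(x)))))

model = ToyNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)  # inserts fake-quant ops

# ... run the usual training loop here; fake-quant emulates int8 rounding ...

model.eval()
int8_model = torch.quantization.convert(model)  # the actually quantized model
```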
Model Parameters
- Use pruning techniques to remove insignificant weights (see the sketch after this list).
  - Prune the units with the lowest L1-norm (other "reasonable" norms could also work).
  - Prune the units at random.
  - In decision trees, pruning reduces complexity by removing sections of the tree.
- Carefully evaluate the model to see whether some layers can be removed entirely (fewer parameters).
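
A minimal sketch of both pruning variants using `torch.nn.utils.prune`; the model, the choice of layers, and the pruning fractions are placeholders, and structured pruning over `dim=0` is used here so that whole output units are removed:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Structured L1 pruning: zero out the 25% of output units (rows of the
# weight matrix) with the lowest L1-norm.
prune.ln_structured(model[0], name="weight", amount=0.25, n=1, dim=0)

# Random structured pruning of another layer, for comparison.
prune.random_structured(model[2], name="weight", amount=0.25, dim=0)

# Fold the masks into the weights so the pruning becomes permanent.
prune.remove(model[0], "weight")
prune.remove(model[2], "weight")
```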