Short Summary: Techniques for optimizing and compressing models, covering mixed-precision arithmetic, quantization, and pruning.
32-Bit, 16-Bit, and Mixed Precision Arithmetic
- Train with 32-bit floating point arithmetic.
- Train with half-precision (16-bit) floating point arithmetic.
- Train with mixed precision floating point arithmetic (see the sketch after this list).
  - Uses both 16-bit and 32-bit math: most operations run in 16-bit, while a master copy of the weights stays in 32-bit for numerical stability.
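
A minimal PyTorch sketch of one mixed-precision training step, assuming a CUDA device is available; the toy model, batch shapes, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
scaler = GradScaler()  # scales the loss so fp16 gradients do not underflow

inputs = torch.randn(32, 128, device="cuda")          # placeholder batch
targets = torch.randint(0, 10, (32,), device="cuda")  # placeholder labels

optimizer.zero_grad()
with autocast():                       # ops run in fp16 where it is safe
    loss = loss_fn(model(inputs), targets)
scaler.scale(loss).backward()          # backward pass on the scaled loss
scaler.step(optimizer)                 # unscales grads, skips step on inf/NaN
scaler.update()                        # adjusts the scale factor for next step
```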
Quantization
- Performs computations and stores tensors at bit widths lower than 32-bit floating point.
- Use post-training quantization (see the first sketch after this list).
  - Train the model in floating point first, then convert it to lower bit widths such as 8-bit integers.
- Use quantization-aware training (see the second sketch after this list).
  - Emulates inference-time quantization during training by inserting fake-quantization operations.
  - Creates a model that downstream tools can then convert into an actually quantized model.
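
A minimal sketch of post-training quantization in PyTorch, using the dynamic variant; the toy model is a placeholder, and only the `nn.Linear` weights are converted to int8:

```python
import torch
import torch.nn as nn

# Assume this model has already been trained in floating point.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Dynamic post-training quantization: weights are stored as int8,
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```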
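
And a sketch of eager-mode quantization-aware training in PyTorch, assuming the `fbgemm` backend; the toy network is a placeholder, and the QuantStub/DeQuantStub mark where tensors cross the float/int8 boundary:

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 at input
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 at output

    def forward(self, x):
        return self.dequant(self.fc2(self.relu(self.fc1(self.quant(x)))))

model = ToyNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig("fbgemm")
torch.quantization.prepare_qat(model, inplace=True)  # inserts fake-quant ops

# ... run the usual training loop here; fake-quant emulates int8 rounding ...

model.eval()
int8_model = torch.quantization.convert(model)  # the actually quantized model
```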
Model Parameters
- Use pruning techniques to remove insignificant weights (see the sketch after this list).
  - Prune the units with the lowest L1-norm (other "reasonable" norms could also work).
  - Prune the units at random.
  - In decision trees, pruning reduces complexity by removing sections of the tree.
- Carefully evaluate the model to see whether some layers can be removed entirely (fewer parameters).
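
A minimal sketch of both pruning variants using `torch.nn.utils.prune`; the model, the choice of layers, and the pruning fractions are placeholders, and structured pruning over `dim=0` is used here so that whole output units are removed:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Structured L1 pruning: zero out the 25% of output units (rows of the
# weight matrix) with the lowest L1-norm.
prune.ln_structured(model[0], name="weight", amount=0.25, n=1, dim=0)

# Random structured pruning of another layer, for comparison.
prune.random_structured(model[2], name="weight", amount=0.25, dim=0)

# Fold the masks into the weights so the pruning becomes permanent.
prune.remove(model[0], "weight")
prune.remove(model[2], "weight")
```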