Quantization is the process of reducing the numerical precision of a model’s weights and activations (e.g., converting from 32-bit floats to 8-bit integers) to decrease memory footprint and improve inference speed with minimal impact on model accuracy.
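The mapping from floats to low-precision integers can be sketched as follows. This is a minimal illustration of symmetric per-tensor post-training quantization using NumPy; the function names are illustrative, not from any particular library:

```python
import numpy as np

def quantize_int8(weights):
    # Symmetric per-tensor quantization: map the float32 range
    # [-max|w|, +max|w|] linearly onto the int8 range [-127, 127].
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float32 values (used at inference time or
    # to measure the accuracy impact of quantization).
    return q.astype(np.float32) * scale

w = np.array([0.52, -1.30, 0.07, 0.91], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Each int8 weight uses 4x less memory than its float32 original,
# and the round-trip error is at most half the step size (scale / 2).
```

Real toolchains (such as those in the sources below) add refinements like per-channel scales, asymmetric zero-points, and calibration over representative data, but the core idea is this linear rescaling.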
Sources:
- TensorFlow Model Optimization: Post-Training Quantization
- NVIDIA Developer: Quantization Techniques for DNNs