# Weights and Activation Quantization (FP8)

In this exercise, we will use a notebook to investigate how LLM weights and activations can be quantized to FP8 for memory savings and inference acceleration. This quantization method allows for:

- A 2x reduction in model memory requirements
- Up to a 1.6x improvement in throughput with minimal impact on accuracy

🚨 Currently, only Hopper and Ada Lovelace GPUs are officially supported for W8A8 (FP8 weights and activations). Ampere GPUs are supported for W8A16 (weight-only FP8) using Marlin kernels.

## Quantization Process

The quantization process involves the following steps (a minimal code sketch illustrating them is included at the end of this section):

1. Load the Model: Load the pre-trained LLM.
2. Quantize the Model: Convert the model weights and activations to FP8 format.
   - Uses round-to-nearest (RTN) quantization with the FP8_DYNAMIC scheme
   - No calibration data is needed
3. Evaluate the Model: Evaluate the quantized model's accuracy.

🚨 After quantizing the model, the GPU memory may not be freed. Restart the kernel before evaluating the model to ensure you have enough GPU RAM available.

## Exercise: Quantize the Model with llm-compressor

1. Go to the workbench created in the previous section (Section 2).
2. From the neural-magic-workshop/lab-materials/03 folder, open the notebook called weight_activation_quantization_fp8.ipynb and follow the instructions.
3. When done, you can close the notebook and head to the next page.

🚨 Once you complete all the quantization exercises and no longer need the workbench, make sure you stop it so that the associated GPU is freed and can be used to serve the model.
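For reference, below is a minimal sketch of the FP8 dynamic quantization flow with llm-compressor. The model ID and save directory are placeholders, and the exact import paths can differ between llm-compressor versions; the notebook in the workbench remains the authoritative version of this exercise.

```python
# Minimal sketch of FP8 dynamic (W8A8) quantization with llm-compressor.
# Assumptions: MODEL_ID and SAVE_DIR are placeholders; import paths may
# vary slightly depending on the llm-compressor version installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder model

# 1. Load the pre-trained model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 2. Quantize weights and activations to FP8 using round-to-nearest (RTN).
#    The FP8_DYNAMIC scheme computes activation scales at runtime, so no
#    calibration data is required.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)

# Save the compressed model so it can be evaluated and served later.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```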
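After restarting the kernel to free GPU memory, a quick sanity check of the quantized model can be run with vLLM before the full accuracy evaluation in the notebook. This is only an illustrative sketch; the model path below assumes the save directory from the example above.

```python
# Hypothetical sanity check of the FP8 model with vLLM.
# Run this after restarting the kernel so the GPU memory used during
# quantization has been released.
from vllm import LLM, SamplingParams

llm = LLM(model="Meta-Llama-3-8B-Instruct-FP8-Dynamic")  # path from the sketch above
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["What is FP8 quantization?"], params)
print(outputs[0].outputs[0].text)
```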