Weight and Activation Quantization (INT8)

In this exercise, we will use a notebook to investigate how LLM weights and activations can be quantized to INT8 for memory savings and faster inference. For example, an 8B-parameter model stored in BF16 occupies roughly 16 GB; in INT8 it shrinks to about 8 GB. This quantization method is particularly useful for:

  • Reducing model size and memory footprint

  • Speeding up inference while largely preserving accuracy

🚨 Workshop time is limited. Please complete only one of the INT exercises (either INT4 or INT8), followed by FP8, which is fast because it does not require calibration data. You may therefore choose to skip either this section or Section 3.1.

Quantization Process

The quantization process involves the following steps; a minimal code sketch follows the list:

  1. Load the Model: Load the pre-trained LLM.

  2. Prepare a Calibration Dataset: Gather a small set of representative text samples used to calibrate the activation quantization scales.

  3. Quantize the Model: Convert the model weights and activations to INT8 format.

    • Using SmoothQuant (to shift activation outliers into the weights) and GPTQ (to quantize the layers)

  4. Evaluate the Model: Measure the quantized model's accuracy and compare it against the original.
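
The sketch below shows what these steps look like with llm-compressor, modeled on the project's documented `oneshot` API. The model ID, calibration dataset, and sample counts are illustrative assumptions, not the notebook's exact values, and the `oneshot` import path varies between releases.

```python
# Minimal W8A8 sketch with llm-compressor; model ID, dataset, and
# parameters below are illustrative, not the notebook's exact values.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot  # newer releases: from llmcompressor import oneshot

# SmoothQuant first migrates activation outliers into the weights,
# then GPTQ quantizes weights and activations to INT8 (W8A8),
# leaving the lm_head in higher precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(scheme="W8A8", targets="Linear", ignore=["lm_head"]),
]

# Steps 1-3 in one call: load the model, run the calibration data
# through it, apply the recipe, and save the compressed result.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

For step 4, the saved checkpoint can then be evaluated (after the kernel restart noted below) with a harness such as lm-evaluation-harness, comparing its scores against the original model's.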

🚨 After quantizing the model, GPU memory may not be freed automatically. Restart the kernel before evaluating the model to ensure enough GPU RAM is available.
*(Screenshot: restarting the kernel)*
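
If you want to try reclaiming memory in place first, a best-effort cleanup like the sketch below (standard PyTorch calls, nothing notebook-specific; it assumes your model variable is named `model`) sometimes helps, but cached and fragmented allocations often survive it, which is why restarting the kernel is the reliable option.

```python
# Best-effort GPU cleanup; a kernel restart remains the reliable fallback.
import gc

import torch

del model  # drop the Python reference to the quantized model (assumes it is named `model`)
gc.collect()  # collect unreachable objects still holding CUDA tensors
torch.cuda.empty_cache()  # release cached blocks back to the driver
print(f"Still allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
```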

Exercise: Quantize the Model with llm-compressor

Go to the workbench created in the previous section (Section 2). In the neural-magic-workshop/lab-materials/03 folder, open the notebook called weight_activation_quantization.ipynb and follow the instructions.

*(Screenshot: the weight_activation_quantization.ipynb notebook)*

When done, you can close the notebook and head to the next page.

🚨 Once you have completed all the quantization exercises and no longer need the workbench, make sure you stop it so that the associated GPU is freed and can be used to serve the model.
*(Screenshot: the finished workbench)*

*(Screenshot: stopping the workbench)*