Weight-Only Quantization (INT4)

In this exercise, we will use the workbench created in Section 2 to investigate how LLM weights can be quantized to INT4 for memory savings and inference acceleration. This quantization method is particularly useful for:

  • Reducing model size

  • Maintaining low latency in workloads with low queries per second (QPS)

🚨 Workshop time is limited. Please complete only one of the INT exercises (either INT4 or INT8) plus FP8, which is fast because it requires no calibration data. In other words, you may skip either this section or Section 3.2.

Quantization Process

The quantization process involves the following steps:

  1. Load the Model: Load the pre-trained LLM.

  2. Prepare the Calibration Dataset: GPTQ needs a small set of sample prompts to calibrate the quantization scales.

  3. Quantize the Model: Convert the model weights to INT4 using the GPTQ algorithm (see the sketch after this list).

  4. Evaluate the Model: Measure the quantized model's accuracy.
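
For orientation before opening the notebook, here is a minimal sketch of steps 1 to 3 with llm-compressor, based on its documented one-shot API. Import paths can differ slightly between llm-compressor versions, and the model ID, calibration dataset, and hyperparameters below are illustrative placeholders rather than the notebook's exact values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Placeholder model; the workshop notebook tells you which checkpoint to use.
MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Step 1: load the pre-trained model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Step 3 recipe: weight-only INT4 via GPTQ. "W4A16" means 4-bit weights with
# 16-bit activations; the lm_head layer is left unquantized for accuracy.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Step 2 happens inside oneshot(): the named calibration dataset is downloaded
# and tokenized, then the GPTQ calibration pass runs over it.
oneshot(
    model=model,
    dataset="open_platypus",        # example calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save the compressed checkpoint for later evaluation and serving.
model.save_pretrained("TinyLlama-1.1B-W4A16-gptq", save_compressed=True)
tokenizer.save_pretrained("TinyLlama-1.1B-W4A16-gptq")
```

The `save_compressed=True` flag stores the weights in the compressed-tensors format, which inference servers such as vLLM can load directly.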

🚨 After quantizing the model, GPU memory may not be freed automatically. Restart the kernel before evaluating the model to make sure enough GPU RAM is available.
(Screenshot: restarting the notebook kernel)
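
After restarting, you can confirm from a fresh notebook cell that the memory was actually released; a minimal check using PyTorch:

```python
import torch

# Free vs. total device memory in bytes, as reported by the CUDA driver.
free, total = torch.cuda.mem_get_info()
print(f"GPU memory free: {free / 1024**3:.1f} GiB / {total / 1024**3:.1f} GiB")
```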

Exercise: Quantize the Model with llm-compressor

Go to the workbench created in Section 2. From the neural-magic-workshop/lab-materials/03 folder, open the notebook weight_only_quantization.ipynb and follow the instructions.

(Screenshot: the weight_only_quantization.ipynb notebook in the workbench)
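
The notebook covers step 4, the accuracy evaluation. For a rough idea of what such a check looks like, here is a sketch using the lm-evaluation-harness Python API; the checkpoint path, task, and sample limit are assumptions for illustration, not the notebook's exact setup:

```python
import lm_eval

# Score the compressed checkpoint produced by the quantization step.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TinyLlama-1.1B-W4A16-gptq,dtype=auto",
    tasks=["gsm8k"],
    num_fewshot=5,
    limit=100,  # subsample to keep the run short in a workshop setting
)
print(results["results"])
```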

When done, you can close the notebook and head to the next page.

🚨 Once you have completed all the quantization exercises and no longer need the workbench, stop it so that the associated GPU is freed and can be used to serve the model.
(Screenshots: marking the workbench as done, then stopping it)