# Hands-On AI Optimization with OpenShift AI

## Lab Overview

This lab illustrates how to use Neural Magic's model optimization techniques (llm-compressor) on OpenShift AI to fit your model to your hardware. You will learn how to achieve significant performance improvements and cost reductions through model sparsification and quantization. The information, code, pipelines, and techniques it contains illustrate what a first prototype could look like.

## Lab Walkthrough

This lab provides a structured approach to understanding and using the tools and techniques essential for AI model optimization. Below is an overview of the key activities you will engage in.

### Get Familiar with OpenShift AI

Explore the capabilities of OpenShift AI, which will serve as the foundation for your data science projects:

- **Create Data Science Projects:** Establish your own data science project to organize and manage your work effectively.
- **Create Data Connections:** Connect various data sources, ensuring that your project has access to the necessary information.
- **Create Pipeline Servers:** Set up a pipeline server, which is essential for managing and executing your workflows efficiently.
- **Create Workbenches:** Access your workbench, a dedicated space for reviewing content, running experiments, and collaborating with peers.
- **Define Pipelines:** Design the pathways through which your data will flow, optimizing each step of the process.

### Optimize Models with llm-compressor

Use llm-compressor to optimize your models. You will apply the optimization techniques directly within your workbenches and pipelines to improve model performance.

### Evaluate Models with lm-eval

Assess the effectiveness of your models using lm-eval. The evaluation provides insight into their performance and areas for improvement.
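Before moving on to deployment, it helps to have intuition for what the lab's quantization notebooks do to a model's weights. The sketch below implements naive symmetric int4 quantization in plain Python. It is an illustration only, not llm-compressor's actual API: real recipes add per-channel or per-group scales, calibration data, and packed storage, and all names here are hypothetical.

```python
# Naive symmetric int4 weight quantization (illustration only).
# int4 values span [-8, 7]; a single scale maps floats onto that grid.

def quantize_int4(weights):
    """Map float weights to integers in [-8, 7] with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 7  # 7 = largest positive int4 value
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int4 grid."""
    return [v * scale for v in q]

weights = [0.42, -1.3, 0.07, 0.9]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)

# Each weight now needs 4 bits instead of 16 or 32, at the cost of a small
# reconstruction error -- the accuracy/footprint trade-off that the
# lm-eval step in this lab is there to measure.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The same idea extends to the int8 and fp8 schemes used later in the lab: a wider grid means a smaller reconstruction error but less compression, which is why the lab quantizes with all three schemes and evaluates each with lm-eval.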
### Deploy Models with vLLM (ServingRuntime)

Deploy your models using vLLM, with and without optimization. Compare the performance of your base model against the optimized version to understand the impact of your efforts.

This walkthrough is designed to equip you with the skills and knowledge to start your own AI optimization projects. We encourage you to engage fully with each step as you progress through the workshop.

## Disclaimer

This lab is an example of what a customer could use to optimize their models for enhanced inferencing while reducing hardware costs using OpenShift AI. The lab uses "small" large language models (LLMs) to speed up the process, but the same techniques apply to larger models.

## Repository

The repository https://github.com/luis5tb/neural-magic-workshop/ contains the workshop materials and examples for creating optimized models with Neural Magic solutions on OpenShift. This workshop is part of the OpenShift Commons initiative.

### Repository Structure

- **Lab content:** in the `content` folder, under the `modules/ROOT/pages` directory.
- **Lab material:** the `lab-materials/` directory contains the materials for each section:
  - Section 2 (folder `02`): sample pipeline YAML file: `test_pipeline.yaml`
  - Section 3 (folder `03`): Jupyter notebooks for model optimization techniques and evaluation:
    - Weight quantization: int4
    - Weight and activation quantization: fp8 and int8
  - Section 4 (folder `04`): Python script for testing deployed models: `request.py`
  - Section 5 (folder `05`): pipeline source file (`quantization_pipeline.py`) for optimizing models while maintaining accuracy; it also includes evaluation steps.

## Prerequisites

- OpenShift AI
- NVIDIA GPUs

## Workshop Timetable

This is a tentative timetable for the materials presented today.

| Time | Section | Type | Description |
|---|---|---|---|
| 0-15 min | Intro Talk & AI Optimization Overview | Presentation | What is OpenShift AI? (portfolio overview); why optimize AI models? (challenges with GPUs, benefits of compression); introduction to vLLM & Neural Magic |
| 15-25 min | Workshop Setup | Hands-On | Guide participants through demo.redhat.com setup; explain the workshop environment and MinIO storage |
| 25-55 min | Workbench Creation & Model Quantization | Hands-On | Step-by-step manual quantization of a model: int4, int8, fp8 |
| 55-75 min | Deploy Base and Optimized Model | Hands-On | Deploy a base AI model using an OpenShift AI ServingRuntime; deploy a pre-optimized model from MinIO; compare performance vs. the base model |
| 75-85 min | Pipeline Creation & Deployment | Hands-On | Create an AI pipeline backed by MinIO; import a pipeline YAML and trigger the pipeline with parameters |
| 85-90 min | Q&A + Wrap-Up | Hands-On | Recap learnings; open discussion on real-world AI optimization |

## Contributing

If you are interested in contributing to this project, see the GitHub repository: https://github.com/luis5tb/neural-magic-workshop/

## 1.1 What are LLMs?