# Hands-On AI Optimization with OpenShift AI

## Lab Overview

This lab illustrates how to use Neural Magic's model optimization techniques (llm-compressor) on OpenShift AI to fit your model to your hardware. You will learn how to achieve significant performance improvements and cost reductions through model sparsification and quantization. The information, code, pipelines, and techniques it contains illustrate what a first prototype could look like.

## Lab Walkthrough

This lab provides a structured approach to understanding and using the tools and techniques essential for AI model optimization. Below is an overview of the key activities you will engage in.

### Get Familiar with OpenShift AI

Explore the capabilities of OpenShift AI, which will serve as the foundation for your data science projects:

- **Create Data Science Projects:** Establish your own data science project to organize and manage your work effectively.
- **Create Data Connections:** Connect various data sources, ensuring that your project has access to the necessary information.
- **Create Pipeline Servers:** Set up a pipeline server, which is essential for managing and executing your workflows efficiently.
- **Create Workbenches:** Access your workbench, a dedicated space for reviewing content, running experiments, and collaborating with peers.
- **Define Pipelines:** Design the pathways through which your data will flow, optimizing each step of the process.

### Optimize Models with llm-compressor

Use llm-compressor to optimize your models. You will apply the optimization techniques directly within your workbenches and pipelines to improve model performance.

### Evaluate Models with lm-eval

Assess the effectiveness of your models using lm-eval. The evaluation provides insight into their performance and areas for improvement.
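Before moving on to deployment, it helps to have intuition for what the lab's quantization notebooks do to a model's weights. The sketch below implements naive symmetric int4 quantization in plain Python. It is an illustration only, not llm-compressor's actual API: real recipes add per-channel or per-group scales, calibration data, and packed storage, and all names here are hypothetical.

```python
# Naive symmetric int4 weight quantization (illustration only).
# int4 values span [-8, 7]; a single scale maps floats onto that grid.

def quantize_int4(weights):
    """Map float weights to integers in [-8, 7] with one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 7  # 7 = largest positive int4 value
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int4 grid."""
    return [v * scale for v in q]

weights = [0.42, -1.3, 0.07, 0.9]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)

# Each weight now needs 4 bits instead of 16 or 32, at the cost of a small
# reconstruction error -- the accuracy/footprint trade-off that the
# lm-eval step in this lab is there to measure.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The same idea extends to the int8 and fp8 schemes used later in the lab: a wider grid means a smaller reconstruction error but less compression, which is why the lab quantizes with all three schemes and evaluates each with lm-eval.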
### Deploy Models with vLLM (ServingRuntime)

Deploy your models using vLLM, with and without optimization. Compare the performance of your base model against the optimized version to understand the impact of your efforts.

This walkthrough is designed to equip you with the skills and knowledge to start your own AI optimization projects. We encourage you to engage fully with each step as you progress through the workshop.

## Disclaimer

This lab is an example of what a customer could use to optimize their models for enhanced inferencing while reducing hardware costs using OpenShift AI. The lab uses "small" large language models (LLMs) to speed up the process, but the same techniques apply to larger models.

## Repository

The repository https://github.com/luis5tb/neural-magic-workshop/ contains the workshop materials and examples for creating optimized models with Neural Magic solutions on OpenShift. This workshop is part of the OpenShift Commons initiative.

### Repository Structure

- **Lab content:** in the `content` folder, under the `modules/ROOT/pages` directory.
- **Lab material:** the `lab-materials/` directory contains the materials for each section:
  - Section 2 (folder `02`): sample pipeline YAML file: `test_pipeline.yaml`
  - Section 3 (folder `03`): Jupyter notebooks for model optimization techniques and evaluation:
    - Weight quantization: int4
    - Weight and activation quantization: fp8 and int8
  - Section 4 (folder `04`): Python script for testing deployed models: `request.py`
  - Section 5 (folder `05`): pipeline source file (`quantization_pipeline.py`) for optimizing models while maintaining accuracy; it also includes evaluation steps.

## Prerequisites

- OpenShift AI
- NVIDIA GPUs

## Workshop Timetable

This is a tentative timetable for the materials presented today.

| Time | Section | Type | Description |
|---|---|---|---|
| 0-15 min | Intro Talk & AI Optimization Overview | Presentation | What is OpenShift AI? (portfolio overview); why optimize AI models? (challenges with GPUs, benefits of compression); introduction to vLLM & Neural Magic |
| 15-25 min | Workshop Setup | Hands-On | Guide participants through demo.redhat.com setup; explain the workshop environment and MinIO storage |
| 25-55 min | Workbench Creation & Model Quantization | Hands-On | Step-by-step manual quantization of a model: int4, int8, fp8 |
| 55-75 min | Deploy Base and Optimized Model | Hands-On | Deploy a base AI model using an OpenShift AI ServingRuntime; deploy a pre-optimized model from MinIO; compare performance vs. the base model |
| 75-85 min | Pipeline Creation & Deployment | Hands-On | Create an AI pipeline backed by MinIO; import a pipeline YAML and trigger the pipeline with parameters |
| 85-90 min | Q&A + Wrap-Up | Hands-On | Recap learnings; open discussion on real-world AI optimization |

## Contributing

If you are interested in contributing to this project, see the GitHub repository: https://github.com/luis5tb/neural-magic-workshop/

## 1.1 What are LLMs?