# Neural Magic Solutions

Neural Magic, established in 2018 by MIT professor Nir Shavit and research scientist Alex Matveev, is a Series A company specializing in AI model optimization and accelerated inference serving. Headquartered in Somerville, MA, the company focuses on enabling enterprise deployment of open-source machine learning models across edge, datacenter, and cloud environments.

## Company Overview

Neural Magic is a pioneering AI software company focused on optimizing and accelerating machine learning models for efficient deployment across various computing environments, including cloud, data centers, and edge devices. Its core innovations are sparsification, quantization, and efficient inference serving, which allow AI models to run smoothly on standard CPUs and GPUs without sacrificing performance.

## Neural Magic Product Suite

### DeepSparse: High-Performance AI Inference on CPUs

DeepSparse is Neural Magic's inference engine that maximizes CPU efficiency for AI workloads. It leverages sparsification and quantization to accelerate model inference without specialized hardware.

- AI inference engine leveraging sparsification for efficient execution
- Optimized for computer vision (CV), natural language processing (NLP), and large language models (LLMs)
- Seamlessly integrates with Red Hat OpenShift AI
- Achieves GPU-class performance on commodity CPU hardware

Learn more about DeepSparse

### LLM-Compressor: Specialized LLM Optimization

LLM-Compressor is a Transformers-compatible library for applying various compression algorithms to large language models (LLMs) for optimized deployment with vLLM.

Key capabilities:

- One-command compression of popular LLMs
- Comprehensive set of quantization algorithms for weight-only and activation quantization
- Seamless integration with Hugging Face models and repositories
- safetensors-based file format compatible with vLLM
- Large model support via accelerate
- Support for various compression techniques (pruning, quantization, distillation)
- Maintains model quality while reducing size and computational requirements
- Integrates with popular LLM frameworks and supports a wide range of models

Supported formats:

- Activation quantization: W8A8 (int8 and fp8)
- Mixed precision: W4A16, W8A16
- 2:4 semi-structured and unstructured sparsity

Supported algorithms:

- Simple PTQ (Post-Training Quantization)
- GPTQ (Generative Pretrained Transformer Quantization)
- SmoothQuant
- SparseGPT

Optimization options:

- W4A16: Uses GPTQ to compress weights to 4 bits; recommended for any GPU type
- W8A8-INT8: Channel-wise weight quantization with dynamic per-token activation quantization
- W8A8-FP8: For NVIDIA GPUs with compute capability 8.9 or higher (Ada Lovelace and Hopper)
- 2:4 sparsity with FP8: Semi-structured sparsity where two of every four contiguous weights are set to zero

Compression results:

- Up to 90% reduction in model size
- 2-10x inference speedup on CPU
- Minimal accuracy loss with optimized techniques

Explore LLM-Compressor on GitHub
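The one-command workflow described above can be sketched in a few lines. The following is a minimal example modeled on the project's documented `oneshot` API; the model name, calibration dataset, and output directory are illustrative, and exact import paths can vary between releases.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# W4A16 recipe: apply GPTQ to every Linear layer except the output head.
recipe = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])

# One-shot post-training compression: load the model, calibrate on a small
# dataset, quantize the weights, and save a vLLM-compatible safetensors
# checkpoint to output_dir.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model
    dataset="open_platypus",                     # illustrative calibration set
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```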
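Because the checkpoint is saved in a vLLM-compatible safetensors format, it can then be loaded for offline batch inference without a conversion step. A short sketch, assuming the output directory from the previous example:

```python
from vllm import LLM, SamplingParams

# Point vLLM at the compressed checkpoint directory; the quantization
# metadata is read from the safetensors files automatically.
llm = LLM(model="TinyLlama-1.1B-Chat-v1.0-W4A16")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What does sparsification do to a neural network?"], params)
print(outputs[0].outputs[0].text)
```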
### vLLM Integration: Efficient Large Language Model Processing

vLLM is a high-performance, memory-efficient inference and serving engine for large language models (LLMs). Neural Magic provides optimized integration with vLLM to enable fast and cost-effective LLM deployment.

Core technologies:

- PagedAttention: Efficiently manages attention key and value memory, significantly reducing memory usage
- Continuous batching: Dynamically processes incoming requests without waiting for batch formation
- CUDA/HIP graph optimization: Accelerates model execution with optimized GPU computation graphs
- Optimized kernels: Includes integration with FlashAttention and FlashInfer for maximum performance
- Chunked prefill: Processes long contexts more efficiently by breaking them into manageable chunks

Performance advantages:

- State-of-the-art serving throughput compared to other LLM serving solutions
- Up to 24x higher throughput than naive implementations
- Reduced latency through optimized memory management
- Support for speculative decoding to accelerate generation

Hardware support:

- NVIDIA GPUs (primary platform)
- AMD CPUs and GPUs (ROCm)
- Intel CPUs and GPUs (XPU)
- PowerPC CPUs
- Google TPUs
- AWS Neuron (Inferentia and Trainium)
- Habana Gaudi accelerators

Model support:

- Transformer-based LLMs (Llama, Mistral, Falcon, etc.)
- Mixture-of-Experts models (Mixtral, DeepSeek-V2/V3)
- Embedding models (E5-Mistral)
- Multi-modal LLMs (LLaVA)
- Comprehensive support for most popular Hugging Face models

Deployment features:

- OpenAI-compatible API server for easy integration (see the serving sketch at the end of this page)
- Tensor parallelism and pipeline parallelism for distributed inference
- Streaming output support
- Automatic prefix caching for improved throughput
- Multi-LoRA adapter support for model customization

Integration benefits:

- Seamless deployment on Red Hat OpenShift AI
- Containerized deployment with Docker and Kubernetes
- Compatible with popular frameworks like LangChain and LlamaIndex
- Supports both offline batch inference and online serving

Learn more about vLLM:

- vLLM Documentation
- vLLM GitHub Repository
- Neural Magic's vLLM integration

## Neural Magic and Red Hat AI

Neural Magic's AI optimizations complement Red Hat OpenShift AI and RHEL AI, enabling:

- Optimized LLM deployments across hybrid cloud environments
- Cost-effective AI inferencing without expensive GPUs
- Seamless integration with containerized environments for scalable AI workloads

## Use Cases

- Enterprise LLM deployment: Run large language models efficiently on existing CPU infrastructure
- Edge AI: Deploy optimized models on edge devices with limited resources
- Cost-effective scaling: Expand AI capabilities without proportional increases in infrastructure costs
- Real-time applications: Enable faster response times for time-sensitive AI applications

## Additional Resources

- Neural Magic Homepage
- Neural Magic Blog
- DeepSparse GitHub
- LLM-Compressor GitHub
- vLLM GitHub
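The OpenAI-compatible server listed under deployment features can be exercised with the standard `openai` Python client. A minimal sketch, assuming a locally running server on vLLM's default port; the model name is illustrative:

```python
# Assumes a vLLM server was started separately, for example with:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API on port 8000 by default;
# the API key is unused unless the server was started with one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "In one sentence, what is PagedAttention?"}],
)
print(response.choices[0].message.content)
```

Because the API surface matches OpenAI's, existing clients and frameworks such as LangChain and LlamaIndex can target the server by changing only the base URL.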