# Neural Magic Solutions

Neural Magic, established in 2018 by MIT professor Nir Shavit and research scientist Alex Matveev, is a Series A company specializing in AI model optimization and accelerated inference serving. Headquartered in Somerville, MA, the company focuses on enabling enterprise deployment of open-source machine learning models across edge, datacenter, and cloud environments.

## Company Overview

Neural Magic is a pioneering AI software company focused on optimizing and accelerating machine learning models for efficient deployment across various computing environments, including cloud, data centers, and edge devices. Its core innovations are sparsification, quantization, and efficient inference serving, which allow AI models to run smoothly on standard CPUs and GPUs without sacrificing performance.

## Neural Magic Product Suite

### DeepSparse: High-Performance AI Inference on CPUs

DeepSparse is Neural Magic's inference engine that maximizes CPU efficiency for AI workloads. It leverages sparsification and quantization to accelerate model inference without specialized hardware.

- AI inference engine leveraging sparsification for efficient execution
- Optimized for computer vision (CV), natural language processing (NLP), and large language models (LLMs)
- Seamlessly integrates with Red Hat OpenShift AI
- Achieves GPU-class performance on commodity CPU hardware

Learn more about DeepSparse

### LLM-Compressor: Specialized LLM Optimization

LLM-Compressor is a Transformers-compatible library for applying various compression algorithms to large language models (LLMs) for optimized deployment with vLLM.

Key capabilities:

- One-command compression of popular LLMs
- Comprehensive set of quantization algorithms for weight-only and activation quantization
- Seamless integration with Hugging Face models and repositories
- safetensors-based file format compatible with vLLM
- Large model support via accelerate
- Support for various compression techniques (pruning, quantization, distillation)
- Maintains model quality while reducing size and computational requirements
- Integrates with popular LLM frameworks and supports a wide range of models

Supported formats:

- Activation quantization: W8A8 (int8 and fp8)
- Mixed precision: W4A16, W8A16
- 2:4 semi-structured and unstructured sparsity

Supported algorithms:

- Simple PTQ (Post-Training Quantization)
- GPTQ (Generative Pretrained Transformer Quantization)
- SmoothQuant
- SparseGPT

Optimization options:

- W4A16: Uses GPTQ to compress weights to 4 bits; recommended for any GPU type
- W8A8-INT8: Channel-wise weight quantization with dynamic per-token activation quantization
- W8A8-FP8: For NVIDIA GPUs with compute capability 8.9 or higher (Ada Lovelace and Hopper)
- 2:4 sparsity with FP8: Semi-structured sparsity where two of every four contiguous weights are set to zero

Compression results:

- Up to 90% reduction in model size
- 2-10x inference speedup on CPU
- Minimal accuracy loss with optimized techniques

Explore LLM-Compressor on GitHub
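The one-command workflow described above can be sketched in a few lines. The following is a minimal example modeled on the project's documented `oneshot` API; the model name, calibration dataset, and output directory are illustrative, and exact import paths can vary between releases.

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# W4A16 recipe: apply GPTQ to every Linear layer except the output head.
recipe = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])

# One-shot post-training compression: load the model, calibrate on a small
# dataset, quantize the weights, and save a vLLM-compatible safetensors
# checkpoint to output_dir.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model
    dataset="open_platypus",                     # illustrative calibration set
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```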
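Because the checkpoint is saved in a vLLM-compatible safetensors format, it can then be loaded for offline batch inference without a conversion step. A short sketch, assuming the output directory from the previous example:

```python
from vllm import LLM, SamplingParams

# Point vLLM at the compressed checkpoint directory; the quantization
# metadata is read from the safetensors files automatically.
llm = LLM(model="TinyLlama-1.1B-Chat-v1.0-W4A16")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["What does sparsification do to a neural network?"], params)
print(outputs[0].outputs[0].text)
```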
### vLLM Integration: Efficient Large Language Model Processing

vLLM is a high-performance, memory-efficient inference and serving engine for large language models (LLMs). Neural Magic provides optimized integration with vLLM to enable fast and cost-effective LLM deployment.

Core technologies:

- PagedAttention: Efficiently manages attention key and value memory, significantly reducing memory usage
- Continuous batching: Dynamically processes incoming requests without waiting for batch formation
- CUDA/HIP graph optimization: Accelerates model execution with optimized GPU computation graphs
- Optimized kernels: Includes integration with FlashAttention and FlashInfer for maximum performance
- Chunked prefill: Processes long contexts more efficiently by breaking them into manageable chunks

Performance advantages:

- State-of-the-art serving throughput compared to other LLM serving solutions
- Up to 24x higher throughput than naive implementations
- Reduced latency through optimized memory management
- Support for speculative decoding to accelerate generation

Hardware support:

- NVIDIA GPUs (primary platform)
- AMD CPUs and GPUs (ROCm)
- Intel CPUs and GPUs (XPU)
- PowerPC CPUs
- Google TPUs
- AWS Neuron (Inferentia and Trainium)
- Habana Gaudi accelerators

Model support:

- Transformer-based LLMs (Llama, Mistral, Falcon, etc.)
- Mixture-of-Experts models (Mixtral, DeepSeek-V2/V3)
- Embedding models (E5-Mistral)
- Multi-modal LLMs (LLaVA)
- Comprehensive support for most popular Hugging Face models

Deployment features:

- OpenAI-compatible API server for easy integration (see the serving sketch at the end of this page)
- Tensor parallelism and pipeline parallelism for distributed inference
- Streaming output support
- Automatic prefix caching for improved throughput
- Multi-LoRA adapter support for model customization

Integration benefits:

- Seamless deployment on Red Hat OpenShift AI
- Containerized deployment with Docker and Kubernetes
- Compatible with popular frameworks like LangChain and LlamaIndex
- Supports both offline batch inference and online serving

Learn more about vLLM:

- vLLM Documentation
- vLLM GitHub Repository
- Neural Magic's vLLM integration

## Neural Magic and Red Hat AI

Neural Magic's AI optimizations complement Red Hat OpenShift AI and RHEL AI, enabling:

- Optimized LLM deployments across hybrid cloud environments
- Cost-effective AI inferencing without expensive GPUs
- Seamless integration with containerized environments for scalable AI workloads

## Use Cases

- Enterprise LLM deployment: Run large language models efficiently on existing CPU infrastructure
- Edge AI: Deploy optimized models on edge devices with limited resources
- Cost-effective scaling: Expand AI capabilities without proportional increases in infrastructure costs
- Real-time applications: Enable faster response times for time-sensitive AI applications

## Additional Resources

- Neural Magic Homepage
- Neural Magic Blog
- DeepSparse GitHub
- LLM-Compressor GitHub
- vLLM GitHub
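The OpenAI-compatible server listed under deployment features can be exercised with the standard `openai` Python client. A minimal sketch, assuming a locally running server on vLLM's default port; the model name is illustrative:

```python
# Assumes a vLLM server was started separately, for example with:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API on port 8000 by default;
# the API key is unused unless the server was started with one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "In one sentence, what is PagedAttention?"}],
)
print(response.choices[0].message.content)
```

Because the API surface matches OpenAI's, existing clients and frameworks such as LangChain and LlamaIndex can target the server by changing only the base URL.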