The era of Artificial Intelligence is defined by the rapid evolution and adoption of Large Language Models (LLMs). As these powerful models move from research labs to production environments, the challenge shifts from training cutting-edge models to efficiently serving them at scale.
This deep dive into LLM inference optimization explores six leading runtimes crucial for seamless large language model deployment. We’ll uncover how each engine tackles the critical bottlenecks of speed, cost, and memory, primarily through sophisticated Key-Value (KV) cache management. For anyone involved in AI model serving, understanding these distinctions is vital for achieving optimal performance and user experience.
Decoding LLM Runtimes: The Quest for Efficient AI Model Serving
Large language models are now limited less by training and more by how fast and cheaply we can serve tokens under real traffic. This efficiency hinges on three implementation details: how the runtime batches requests, how it overlaps prefill and decode operations, and critically, how it stores and reuses the KV cache. Different engines make distinct tradeoffs on these axes, which directly impact metrics like tokens per second, P50/P99 latency, and GPU memory usage.
This article provides a comprehensive comparison of six prominent runtimes frequently encountered in production stacks, offering insights into their design philosophies and performance characteristics.
1. vLLM: PagedAttention for High Throughput
Design
vLLM distinguishes itself with its innovative PagedAttention mechanism. Rather than storing each sequence’s KV cache in a single, large contiguous buffer, PagedAttention partitions the KV cache into fixed-size blocks. An indirection layer then allows each sequence to point to a list of these blocks. This design offers several significant advantages:
- Very low KV fragmentation, with reported waste often less than 4%, a stark contrast to the 60–80% seen in naïve allocators.
- High GPU utilization through continuous batching, ensuring the GPU is always busy.
- Native support for prefix sharing and KV reuse at the block level, which is highly efficient for multi-turn conversations or similar prompts.
Recent iterations of vLLM have further enhanced its capabilities by adding KV quantization (FP8) and integrating FlashAttention-style kernels for accelerated attention computations.
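As a quick illustration, here is a minimal offline-inference sketch using vLLM's Python API; the model name is a placeholder, and flags such as `enable_prefix_caching` and `kv_cache_dtype` are assumed from recent vLLM releases, so check your version's documentation before copying it.

```python
# Minimal vLLM sketch: continuous batching is implicit; prefix caching and
# FP8 KV cache are opt-in flags (names assumed from recent vLLM releases).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,                # reuse KV blocks for shared prefixes
    kv_cache_dtype="fp8",                      # shrink the KV cache (assumed flag)
    gpu_memory_utilization=0.90,               # fraction of VRAM for weights + KV blocks
)

params = SamplingParams(temperature=0.7, max_tokens=128)

# Two prompts sharing a system prefix: the second request can reuse the
# cached KV blocks of that prefix instead of recomputing them.
system = "You are a concise assistant.\n"
outputs = llm.generate(
    [system + "Explain PagedAttention.", system + "Explain continuous batching."],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```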
Performance
Published evaluations highlight vLLM’s throughput advantage:

- vLLM achieves 14–24× higher throughput than standard Hugging Face Transformers and 2.2–3.5× higher than early versions of TGI for LLaMA models on NVIDIA GPUs.
KV and Memory Behavior
PagedAttention provides a KV layout that is both GPU-friendly and resistant to fragmentation. The introduction of FP8 KV quantization further reduces the KV cache size, improving decode throughput, especially when compute operations are not the primary bottleneck. This makes vLLM a strong contender for efficient LLM inference optimization.
Where it Fits
vLLM serves as an excellent default, high-performance engine for general LLM serving backends. It delivers impressive throughput, good Time-To-First-Token (TTFT), and robust hardware flexibility, making it a go-to for many deployments.
2. TensorRT LLM: NVIDIA-Optimized Low Latency
Design
TensorRT LLM is a compilation-based engine built atop NVIDIA TensorRT. It generates highly optimized, fused kernels specific to each model and shape, exposing an executor API utilized by frameworks such as Triton Inference Server. Its KV subsystem is particularly explicit and feature-rich:
- Paged KV cache for efficient memory management.
- Quantized KV cache (INT8, FP8) with ongoing advancements.
- Circular buffer KV cache for managing long sequences efficiently.
- Extensive KV cache reuse capabilities, including offloading KV to CPU and reusing it across prompts to significantly reduce TTFT.
NVIDIA reports that CPU-based KV reuse can reduce time to first token by up to 14× on H100 and even more on GH200 in specific scenarios, showcasing its potential for extreme latency reduction.
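For orientation, the sketch below shows roughly how block reuse is enabled through TensorRT LLM's high-level Python API; the class and argument names (`KvCacheConfig`, `enable_block_reuse`, `free_gpu_memory_fraction`) are assumed from recent releases and may differ in yours, so treat it as a sketch rather than a definitive recipe.

```python
# Hedged sketch of enabling paged KV reuse via TensorRT LLM's high-level API.
# Class and argument names are assumed from recent releases and may differ.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,        # reuse KV blocks across prompts with shared prefixes
    free_gpu_memory_fraction=0.85,  # memory budget for the paged KV pool
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder; engines are built per model
    kv_cache_config=kv_cache_config,
)

outputs = llm.generate(
    ["Summarize the benefits of KV cache reuse."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```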
Performance
TensorRT LLM is highly tunable, so published results vary widely with configuration. Common patterns from public comparisons and vendor benchmarks include:
- Very low single-request latency on NVIDIA GPUs when engines are meticulously compiled for the exact model and configuration.
- At moderate concurrency, it can be tuned either for low TTFT or for high throughput. At very high concurrency, throughput-optimized profiles may push P99 latency up due to aggressive batching strategies.
KV and Memory Behavior
The combination of paged and quantized KV caches offers strong control over memory use and bandwidth. Its executor and memory APIs empower developers to design sophisticated cache-aware routing policies at the application layer, further enhancing large language model deployment efficiency.
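One pattern this enables is cache-aware routing at the application layer: pin requests that share a prompt prefix to the replica most likely to already hold that prefix's KV blocks. The sketch below is a hypothetical router, not a TensorRT LLM API; the prefix hashing and replica bookkeeping are illustrative assumptions.

```python
# Hypothetical application-layer cache-aware router (not a TensorRT LLM API).
# Requests whose prompts share a prefix are pinned to the same replica so the
# paged KV blocks for that prefix stay warm on one GPU.
import hashlib

class PrefixAffinityRouter:
    def __init__(self, replicas, prefix_chars=256):
        self.replicas = replicas            # e.g. ["replica-0", "replica-1"]
        self.prefix_chars = prefix_chars

    def route(self, prompt: str) -> str:
        # Approximate the shared prefix by the first N characters; a real system
        # would hash token IDs after tokenization.
        prefix = prompt[: self.prefix_chars]
        digest = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
        return self.replicas[digest % len(self.replicas)]

router = PrefixAffinityRouter(["replica-0", "replica-1", "replica-2"])
system = "You are a helpful assistant. " * 20   # long shared system prompt
print(router.route(system + "user question A"))
print(router.route(system + "user question B"))  # same replica: shared prefix
```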
Where it Fits
TensorRT LLM is ideal for latency-critical workloads and NVIDIA-only environments where teams can invest in specialized engine builds and per-model tuning for maximum optimization.
3. Hugging Face TGI v3: Long Contexts and Integrated Serving
Design
Hugging Face Text Generation Inference (TGI) is a server-focused stack providing a robust ecosystem for LLM deployment. Key features include:
- A Rust-based HTTP and gRPC server for high performance and reliability.
- Continuous batching, streaming output, and integrated safety hooks.
- Backends for PyTorch and TensorRT, plus tight integration with the Hugging Face Hub.
TGI v3 introduces a significant advancement with its new long context pipeline, featuring:
- Chunked prefill for processing extremely long inputs without excessive memory use.
- Prefix KV caching, so that long conversation histories are not recomputed on each request, a critical feature for interactive chat (see the client-side sketch after this list).
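To see why prefix caching matters for chat, consider a client that resends the full conversation history on every turn. The hedged sketch below talks to a TGI v3 server through its OpenAI-compatible Messages API; the endpoint URL is a placeholder, and the server is assumed to have prefix caching enabled (the default in v3).

```python
# Hedged client-side sketch: with TGI v3 prefix caching on the server, resending
# the growing conversation history mostly hits cached KV blocks, so TTFT stays
# low even as the history grows. Endpoint URL is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")  # TGI Messages API

history = [{"role": "system", "content": "You are a support assistant."}]

def ask(user_msg: str) -> str:
    history.append({"role": "user", "content": user_msg})
    resp = client.chat.completions.create(
        model="tgi",              # TGI accepts a placeholder model name here
        messages=history,         # full history; the shared prefix is not recomputed
        max_tokens=256,
    )
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("My deployment fails with an OOM error, what should I check?"))
print(ask("Would FP8 KV cache quantization help?"))  # long prefix reused server-side
```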
Performance
For conventional prompts, recent third-party evaluations show:
- vLLM often edges out TGI slightly on raw tokens per second at high concurrency, partly due to PagedAttention, though the gap is negligible in many setups.
- Crucially, TGI v3 processes approximately 3× more tokens and is up to 13× faster than vLLM on long prompts under setups where very long histories and prefix caching are enabled, making it exceptional for chat applications.
Latency Profile
- P50 latency for short and mid-length prompts is comparable to vLLM when both are tuned with continuous batching.
- For long chat histories, where prefill typically dominates in naïve pipelines, TGI v3’s reuse of earlier tokens provides a massive win in TTFT and P50 latency.
KV and Memory Behavior
TGI leverages KV caching with paged attention-style kernels and significantly reduces memory footprint through chunking of prefill and other runtime optimizations. It integrates quantization through libraries like bitsandbytes and GPTQ, supporting various hardware backends.
Where it Fits
TGI v3 is an excellent choice for production stacks already integrated with Hugging Face, especially for chat-style workloads with long histories where prefix caching yields substantial real-world performance gains.
4. LMDeploy: TurboMind for Maximum Throughput
Design
LMDeploy, part of the InternLM ecosystem, is a comprehensive toolkit for LLM compression and deployment. It offers two primary inference engines:
- TurboMind: High-performance CUDA kernels specifically optimized for NVIDIA GPUs.
- PyTorch engine: A flexible fallback for broader compatibility.
Key runtime features designed for peak performance include:
- Persistent, continuous batching for sustained high throughput.
- A blocked KV cache with an advanced manager for allocation and reuse, similar in concept to PagedAttention but with distinct internal layout.
- Dynamic split and fuse for attention blocks.
- Tensor parallelism for scaling large models across multiple GPUs.
- Comprehensive weight-only and KV quantization, including AWQ and online INT8 / INT4 KV quantization, essential for memory-constrained scenarios.
LMDeploy claims up to 1.8× higher request throughput than vLLM, attributing this to its persistent batching, blocked KV cache, and highly optimized kernels.
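As a concrete starting point, the sketch below spins up a TurboMind pipeline with 4-bit AWQ weights and online INT8 KV quantization; the model ID is a placeholder, and argument names such as `quant_policy` and `cache_max_entry_count` follow recent LMDeploy releases, so they may differ in yours.

```python
# Hedged LMDeploy sketch: TurboMind backend with 4-bit AWQ weights and online
# INT8 KV quantization. Argument names follow recent LMDeploy releases and may
# differ between versions; the model ID is a placeholder.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    model_format="awq",           # 4-bit AWQ weight quantization
    quant_policy=8,               # online INT8 KV cache quantization (4 => INT4)
    cache_max_entry_count=0.8,    # fraction of free VRAM for the blocked KV cache
    tp=1,                         # tensor parallel degree
)

pipe = pipeline("Qwen/Qwen2.5-7B-Instruct-AWQ", backend_config=engine_cfg)
responses = pipe(["What does a blocked KV cache buy us?",
                  "Why does KV quantization raise throughput?"])
for r in responses:
    print(r.text)
```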
Performance
Evaluations demonstrate LMDeploy’s prowess:
- For 4-bit Llama-style models on A100 GPUs, LMDeploy can achieve higher tokens per second than vLLM under comparable latency constraints, particularly at high concurrency.
- It also reports that 4-bit inference is approximately 2.4× faster than FP16 for supported models, a crucial factor in practical LLM inference optimization.
Latency
- Single-request TTFT is competitive with other optimized GPU engines when configured without extreme batch limits.
- Under heavy concurrency, its persistent batching combined with the blocked KV cache allows LMDeploy to sustain high throughput without TTFT collapse.
KV and Memory Behavior
The blocked KV cache manages KV memory in fixed-size blocks, similar in spirit to vLLM’s PagedAttention but with its own internal layout. Combined with its robust weight and KV quantization support, this makes LMDeploy well suited to deploying large models on memory-constrained GPUs.
Where it Fits
LMDeploy is an excellent fit for NVIDIA-centric deployments that prioritize maximum throughput, where teams are comfortable leveraging TurboMind and LMDeploy’s specific tooling.
5. SGLang: RadixAttention for Structured AI Workloads
Design
SGLang is a dual-purpose solution:
- A Domain Specific Language (DSL) for building structured LLM programs, such as agents, RAG workflows, and tool-use pipelines, allowing for more predictable and efficient generation.
- A runtime that implements RadixAttention, a novel KV reuse mechanism that shares prefixes using a radix tree structure, rather than simple block hashes.
RadixAttention’s key innovation is its ability to:
- Store KV caches for numerous requests in a prefix tree, keyed by tokens.
- Enable exceptionally high KV hit rates when many calls share prefixes, which is common in few-shot prompts, multi-turn chat, and complex tool chains (see the DSL sketch after this list).
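The hedged sketch below shows the DSL side of this: every call shares the same few-shot prefix, so after the first request RadixAttention can serve that prefix's KV from its radix tree. Function names follow SGLang's frontend (`sgl.function`, `sgl.gen`, `run_batch`), and the endpoint URL is a placeholder.

```python
# Hedged SGLang sketch: many calls share the same few-shot prefix, so
# RadixAttention serves the prefix's KV from its radix tree after the first call.
import sglang as sgl

FEW_SHOT = (
    "Classify the sentiment of each review as positive or negative.\n"
    "Review: 'Great battery life.' -> positive\n"
    "Review: 'Screen cracked in a week.' -> negative\n"
)

@sgl.function
def classify(s, review):
    s += FEW_SHOT                            # shared prefix across all calls
    s += "Review: '" + review + "' -> "
    s += sgl.gen("label", max_tokens=4)      # only the suffix is new work

# Point the DSL at a running SGLang server (placeholder address).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# run_batch issues many calls; all of them hit the cached few-shot prefix.
states = classify.run_batch([{"review": r} for r in
                             ["Fast shipping", "Stopped working after a day"]])
print([st["label"] for st in states])
```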
Performance
Key insights from SGLang’s performance include:
- SGLang achieves up to 6.4× higher throughput and up to 3.7× lower latency than baseline systems like vLLM and LMQL on structured workloads.
- These improvements are most pronounced in scenarios with heavy prefix reuse, such as multi-turn conversational agents or evaluation benchmarks with repeated contexts.
Reported KV cache hit rates range from approximately 50% to 99%, with cache-aware schedulers achieving near-optimal hit rates on measured benchmarks.
KV and Memory Behavior
RadixAttention is built upon paged attention-style kernels, focusing intensely on intelligent reuse beyond mere allocation. SGLang also integrates well with hierarchical context caching systems that can move KV between GPU and CPU for extremely long sequences, though such systems are often implemented as separate projects.
Where it Fits
SGLang is perfectly suited for agentic systems, tool pipelines, and heavy RAG (Retrieval Augmented Generation) applications where many calls share large prompt prefixes and where advanced KV reuse at the application level can yield significant performance benefits.
6. DeepSpeed Inference / ZeRO Inference: Scaling to Giant Models
Design
DeepSpeed provides two critical components for inference at extreme scale:
- DeepSpeed Inference: Optimized transformer kernels coupled with tensor and pipeline parallelism.
- ZeRO Inference / ZeRO Offload: Techniques designed to offload model weights, and in some configurations, the KV cache, to CPU or NVMe storage. This allows for running exceptionally large models on GPUs with limited memory.
ZeRO Inference focuses on:
- Keeping minimal or no model weights resident in GPU memory.
- Streaming tensors from CPU or NVMe storage as needed during inference.
- Targeting throughput and maximum model size rather than ultra-low latency (a minimal configuration sketch follows this list).
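A minimal configuration sketch, assuming a Hugging Face causal LM and DeepSpeed's ZeRO stage 3 JSON schema, looks roughly like this; the model name, batch size, and offload target are placeholders, and real deployments need more care around dtype and generation settings.

```python
# Hedged ZeRO-Inference sketch: ZeRO stage 3 with parameters offloaded to CPU
# (or NVMe), so a model larger than GPU memory can still generate tokens.
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"  # placeholder large model
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # required key even for inference
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},  # or "nvme" + nvme_path
    },
    "fp16": {"enabled": True},
}

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler).
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tokenizer("DeepSpeed ZeRO-Inference lets us", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```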
Performance
Consider the ZeRO Inference OPT 30B example on a single V100 32GB GPU:
- Full CPU offload achieves approximately 43 tokens per second.
- Full NVMe offload reaches about 30 tokens per second.
- Both configurations are 1.3–2.4× faster than partial offload setups because full offload enables larger batch sizes by freeing up GPU memory.
While these token-per-second numbers might seem modest compared to GPU-resident LLM runtimes on A100 or H100, it’s crucial to remember they apply to a model that would otherwise not fit natively within 32GB of GPU memory. A recent I/O characterization confirms that offload-based systems, like DeepSpeed, become primarily bottlenecked by small 128 KiB reads from storage.
KV and Memory Behavior
Model weights and sometimes KV blocks are systematically offloaded to CPU or SSD to accommodate models exceeding GPU capacity. While TTFT and P99 latency are higher compared to pure GPU engines, the invaluable tradeoff is the ability to run very large models that would otherwise be impossible on the given hardware.
Where it Fits
DeepSpeed Inference / ZeRO Inference is indispensable for offline or batch inference tasks, or for low QPS (queries per second) services where model size takes precedence over latency and GPU count is limited.
Comparative Overview: Choosing Your LLM Inference Runtime
The table below qualitatively summarizes the main tradeoffs and ideal use cases for each runtime, highlighting their unique strengths in AI model serving.
| Runtime | Main Design Idea | Relative Strength | KV Strategy | Typical Use Case |
|---|---|---|---|---|
| vLLM | PagedAttention, continuous batching | High tokens per second at given TTFT | Paged KV blocks, FP8 KV support | General purpose GPU serving, multi-hardware |
| TensorRT LLM | Compiled kernels on NVIDIA + KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV, reuse and offload | NVIDIA only, latency sensitive |
| TGI v3 | HF serving layer with long prompt path | Strong long prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quant | Up to 1.8× vLLM throughput in vendor tests | Blocked KV cache, weight and KV quant | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4× throughput and 3.7× lower latency on structured workloads | Radix tree KV reuse over prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU CPU NVMe offload for huge models | Enables large models on small GPU; throughput oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |
Choosing the Right Runtime for Your Production System
For a production system, the choice often simplifies to a few key patterns based on your specific needs and infrastructure:
- For a strong default engine with minimal custom work: Start with vLLM. It provides excellent throughput, reasonable Time-To-First-Token (TTFT), and robust KV handling across common hardware, making it a versatile choice for many.
- For NVIDIA-committed environments needing fine-grained control over latency and KV: Opt for TensorRT LLM, likely deployed behind Triton Inference Server or TGI. Be prepared to invest in model-specific engine builds and extensive tuning for peak performance.
- If your stack is already on Hugging Face and long chat histories are critical: TGI v3 is your best bet. Its long prompt pipeline and intelligent prefix caching are exceptionally effective for conversational AI traffic, drastically improving user experience.
- To achieve maximum throughput per GPU with quantized models: Consider LMDeploy with TurboMind and its blocked KV cache, particularly effective for 4-bit Llama family models on NVIDIA hardware.
- When building agents, tool chains, or heavy RAG systems: Leverage SGLang and strategically design prompts to maximize KV reuse via RadixAttention. This is crucial for optimizing structured AI programs.
- If you must run very large models on limited GPU resources: DeepSpeed Inference / ZeRO Inference is essential. Accept higher latency in exchange for the ability to deploy models that would otherwise be impossible, treating the GPU as a throughput engine augmented by SSD storage.
Ultimately, all these advanced engines underscore a converging truth: the KV cache is the primary bottleneck resource in LLM inference. The winning runtimes are those that treat the KV cache as a first-class data structure—to be paged, quantized, reused, and even offloaded—rather than merely a large tensor allocated in GPU memory. This sophisticated management is the key to unlocking the full potential of GPU acceleration for LLMs.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
FAQ
Question 1: What is the KV cache and why is it so important for LLM inference?
The KV (Key-Value) cache stores the ‘keys’ and ‘values’ computed during the attention mechanism for previously processed tokens in a sequence. In Transformer-based LLMs, for each token in a sequence, attention weights are calculated by comparing a ‘query’ with ‘keys’ from all preceding tokens. These ‘keys’ and ‘values’ are then used to create a weighted sum that forms the output of the attention layer.
The KV cache is critical because it prevents redundant computations. Without it, every new token generated would require recomputing the keys and values for all previous tokens in the sequence, leading to a drastic increase in computational load and latency, especially for longer contexts. By caching these, the model only needs to compute keys and values for the *new* token, significantly speeding up the decoding process. It directly impacts performance metrics like tokens per second and latency, making efficient KV cache management central to LLM inference optimization.
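To make that concrete, here is a back-of-the-envelope sizing for an illustrative 7B-parameter, Llama-style configuration (32 layers, 32 KV heads, head dimension 128, FP16 cache); the numbers are for intuition only and shrink with grouped-query attention or KV quantization.

```python
# Back-of-the-envelope KV cache sizing for an illustrative Llama-7B-like config:
# 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes), keys + values (factor 2).
layers, kv_heads, head_dim, bytes_per_elem = 32, 32, 128, 2

def kv_cache_bytes(seq_len: int, batch: int = 1) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * seq_len * batch

# ~0.5 MiB per token, so one 4096-token sequence needs ~2 GiB of KV cache,
# and 16 concurrent 4k sequences need ~32 GiB -- before counting model weights.
print(kv_cache_bytes(1) / 2**20, "MiB per token")
print(kv_cache_bytes(4096) / 2**30, "GiB for one 4k sequence")
print(kv_cache_bytes(4096, batch=16) / 2**30, "GiB for 16 concurrent 4k sequences")
```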
Unique Tip: The KV cache is becoming even more critical for multimodal LLMs and models with extremely long context windows (like Google Gemini 1.5 Pro with its 1 million token context). Efficiently managing a KV cache of this magnitude often involves techniques like sparse attention, advanced compression, or hierarchical caching between GPU and host memory to prevent it from becoming an overwhelming memory bottleneck.
Question 2: How do these LLM runtimes address the challenges of serving models with very long context windows?
Serving LLMs with very long context windows presents significant memory and computational challenges because the KV cache grows linearly with context length. Runtimes tackle this in several innovative ways:
- Prefix Sharing and Reuse: Runtimes like TGI v3 and SGLang (with RadixAttention) are designed to identify and reuse common prefixes across multiple requests or turns in a conversation. This means if several queries start with the same large prompt, the KV cache for that common prefix is computed only once and shared, saving significant computation and memory.
- Chunked Prefill: TGI v3 uses chunked prefill, where long input sequences are processed in smaller segments rather than all at once. This reduces peak memory usage during the initial prefill stage.
- Offloading and Quantization: DeepSpeed’s ZeRO Inference offloads model weights and sometimes KV cache blocks to CPU or even NVMe SSDs, allowing models that would otherwise exceed GPU memory to run. Quantization (e.g., FP8, INT8 for KV cache in vLLM, TensorRT LLM, LMDeploy) reduces the memory footprint of the KV cache itself, making it feasible to store more tokens in GPU memory.
- Paged KV Caches: Most modern runtimes (vLLM, TensorRT LLM, LMDeploy) implement paged KV caches, which allocate memory in fixed-size blocks. This virtualized memory management ensures efficient utilization of GPU memory, minimizing fragmentation and allowing more flexible allocation for varying context lengths.
These strategies collectively enable more efficient large language model deployment even with demanding, long-context applications.

