Comparing the Top 6 Inference Runtimes for LLM Serving in 2025

By Andy · November 10, 2025


The era of Artificial Intelligence is defined by the rapid evolution and adoption of Large Language Models (LLMs). As these powerful models move from research labs to production environments, the challenge shifts from training cutting-edge models to efficiently serving them at scale.

This deep dive into LLM inference optimization explores six leading runtimes crucial for seamless large language model deployment. We’ll uncover how each engine tackles the critical bottlenecks of speed, cost, and memory, primarily through sophisticated Key-Value (KV) cache management. For anyone involved in AI model serving, understanding these distinctions is vital for achieving optimal performance and user experience.

Decoding LLM Runtimes: The Quest for Efficient AI Model Serving

Large language models are now limited less by training and more by how fast and cheaply we can serve tokens under real traffic. This efficiency hinges on three implementation details: how the runtime batches requests, how it overlaps prefill and decode operations, and critically, how it stores and reuses the KV cache. Different engines make distinct tradeoffs on these axes, which directly impact metrics like tokens per second, P50/P99 latency, and GPU memory usage.
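To make the memory side of this concrete, here is a back-of-the-envelope calculation of per-sequence KV cache size for a hypothetical Llama-style model with grouped-query attention. The model dimensions are illustrative, not benchmarks from any of the runtimes below.

```python
# Rough per-sequence KV cache size:
#   2 (K and V) * layers * kv_heads * head_dim * context_length * bytes_per_element
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * context_len * dtype_bytes

# Hypothetical 8B-class model: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
per_token = kv_cache_bytes(32, 8, 128, 1)
print(f"KV per token:      {per_token / 1024:.0f} KiB")                      # ~128 KiB
print(f"KV at 32k context: {kv_cache_bytes(32, 8, 128, 32_768) / 2**30:.1f} GiB")
print(f"Same, FP8 KV:      {kv_cache_bytes(32, 8, 128, 32_768, 1) / 2**30:.1f} GiB")
```

At a few gigabytes per long sequence, a handful of concurrent requests can consume more VRAM than the model weights themselves, which is why KV layout, quantization, and reuse dominate the comparisons that follow.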

This article provides a comprehensive comparison of six prominent runtimes frequently encountered in production stacks, offering insights into their design philosophies and performance characteristics.

1. vLLM: PagedAttention for High Throughput

Design

vLLM distinguishes itself with its innovative PagedAttention mechanism. Rather than storing each sequence’s KV cache in a single, large contiguous buffer, PagedAttention partitions the KV cache into fixed-size blocks. An indirection layer then allows each sequence to point to a list of these blocks. This design offers several significant advantages:

  • Very low KV fragmentation, with reported waste often less than 4%, a stark contrast to the 60–80% seen in naïve allocators.
  • High GPU utilization through continuous batching, ensuring the GPU is always busy.
  • Native support for prefix sharing and KV reuse at the block level, which is highly efficient for multi-turn conversations or similar prompts.

Recent iterations of vLLM have further enhanced its capabilities by adding KV quantization (FP8) and integrating FlashAttention-style kernels for accelerated attention computations.
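A minimal offline-serving sketch with vLLM's Python API follows, assuming a recent vLLM release. The model name is a placeholder, and the FP8 KV cache flag requires a GPU and build that support it.

```python
from vllm import LLM, SamplingParams

# Placeholder model; swap in whatever you actually serve.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    kv_cache_dtype="fp8",          # FP8 KV cache (needs supporting hardware/build)
    enable_prefix_caching=True,    # block-level KV reuse across shared prompt prefixes
    gpu_memory_utilization=0.90,   # fraction of VRAM handed to the paged KV allocator
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in two sentences."], params)
print(outputs[0].outputs[0].text)
```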

Performance

Evaluations consistently highlight vLLM’s superior performance:

  • vLLM achieves 14–24× higher throughput than standard Hugging Face Transformers and 2.2–3.5× higher than early versions of TGI for LLaMA models on NVIDIA GPUs.

KV and Memory Behavior

PagedAttention provides a KV layout that is both GPU-friendly and resistant to fragmentation. The introduction of FP8 KV quantization further reduces the KV cache size, improving decode throughput, especially when compute operations are not the primary bottleneck. This makes vLLM a strong contender for efficient LLM inference optimization.
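For intuition, the core PagedAttention bookkeeping can be sketched in a few lines of plain Python: a pool of fixed-size physical blocks plus a per-sequence block table. This is a toy model of the idea, not vLLM's implementation, but it shows why waste stays under one block per sequence and why prefix sharing is cheap.

```python
BLOCK_SIZE = 16  # tokens per KV block

class ToyKVPool:
    """Toy paged-KV bookkeeping: physical blocks + per-sequence block tables."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n % BLOCK_SIZE == 0:               # last block full (or no block yet)
            table.append(self.free.pop())     # grab one block; waste < 1 block per sequence
        self.lengths[seq_id] = n + 1

    def fork(self, src, dst):
        """Share a prefix: the new sequence points at the same physical blocks."""
        self.tables[dst] = list(self.tables[src])
        self.lengths[dst] = self.lengths[src]

pool = ToyKVPool(num_blocks=1024)
for _ in range(40):
    pool.append_token("chat-1")               # 40 tokens -> 3 blocks of 16
pool.fork("chat-1", "chat-2")                 # second turn reuses the prefix blocks
print(pool.tables["chat-1"], pool.tables["chat-2"])
```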

Where it Fits

vLLM serves as an excellent default, high-performance engine for general LLM serving backends. It delivers impressive throughput, good Time-To-First-Token (TTFT), and robust hardware flexibility, making it a go-to for many deployments.

2. TensorRT LLM: NVIDIA-Optimized Low Latency

Design

TensorRT LLM is a compilation-based engine built atop NVIDIA TensorRT. It generates highly optimized, fused kernels specific to each model and shape, exposing an executor API utilized by frameworks such as Triton Inference Server. Its KV subsystem is particularly explicit and feature-rich:

  • Paged KV cache for efficient memory management.
  • Quantized KV cache (INT8, FP8) with ongoing advancements.
  • Circular buffer KV cache for managing long sequences efficiently.
  • Extensive KV cache reuse capabilities, including offloading KV to CPU and reusing it across prompts to significantly reduce TTFT.

NVIDIA reports that CPU-based KV reuse can reduce time to first token by up to 14× on H100 and even more on GH200 in specific scenarios, showcasing its potential for extreme latency reduction.
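TensorRT LLM also ships a high-level Python LLM API alongside the compiled-engine workflow. The sketch below assumes a recent release; the model name is a placeholder, and the exact KvCacheConfig field names (block reuse, free-memory fraction) are assumptions to verify against your installed version.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Field names below are assumptions to check against your tensorrt_llm version.
kv_cfg = KvCacheConfig(
    enable_block_reuse=True,        # reuse KV blocks across requests sharing a prefix
    free_gpu_memory_fraction=0.85,  # VRAM fraction reserved for the paged KV cache
)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", kv_cache_config=kv_cfg)
out = llm.generate(["Summarize paged KV reuse in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```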

Performance

TensorRT LLM is highly tunable, leading to variable results. However, common patterns from public comparisons and vendor benchmarks include:

  • Very low single-request latency on NVIDIA GPUs when engines are meticulously compiled for the exact model and configuration.
  • At moderate concurrency, it can be tuned either for low TTFT or for high throughput. At very high concurrency, throughput-optimized profiles may push P99 latency up due to aggressive batching strategies.

KV and Memory Behavior

The combination of paged and quantized KV caches offers strong control over memory use and bandwidth. Its executor and memory APIs empower developers to design sophisticated cache-aware routing policies at the application layer, further enhancing large language model deployment efficiency.
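As a concrete illustration of what cache-aware routing at the application layer can look like, the toy router below hashes the shared prompt prefix and pins requests to the replica most likely to already hold that prefix's KV blocks. It is engine-agnostic pseudologic written for this article, not part of the TensorRT LLM API, and the replica URLs are placeholders.

```python
import hashlib

class PrefixAffinityRouter:
    """Toy cache-aware router: send requests with the same prompt prefix to the same replica."""
    def __init__(self, replicas, prefix_chars=1024):
        self.replicas = replicas
        self.prefix_chars = prefix_chars

    def pick(self, prompt: str) -> str:
        # Approximate "same KV prefix" by hashing the first N characters of the prompt.
        prefix = prompt[: self.prefix_chars]
        h = int(hashlib.sha256(prefix.encode()).hexdigest(), 16)
        return self.replicas[h % len(self.replicas)]

router = PrefixAffinityRouter(["http://trtllm-0:8000", "http://trtllm-1:8000"])
print(router.pick("SYSTEM: You are a support bot...\nUSER: reset my password"))
```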

Where it Fits

TensorRT LLM is ideal for latency-critical workloads and NVIDIA-only environments where teams can invest in specialized engine builds and per-model tuning for maximum optimization.

3. Hugging Face TGI v3: Long Contexts and Integrated Serving

Design

Hugging Face Text Generation Inference (TGI) is a server-focused stack providing a robust ecosystem for LLM deployment. Key features include:

  • A Rust-based HTTP and gRPC server for high performance and reliability.
  • Continuous batching, streaming output, and integrated safety hooks.
  • Backends for PyTorch and TensorRT, plus tight integration with the Hugging Face Hub.

TGI v3 introduces a significant advancement with its new long context pipeline, featuring:

  • Chunked prefill for processing extremely long inputs without excessive memory use.
  • Prefix KV caching, ensuring that long conversation histories are not recomputed on each request, a critical feature for interactive AI.

Performance

For conventional prompts, recent third-party evaluations show:

  • vLLM often edges out TGI slightly on raw tokens per second at high concurrency thanks to PagedAttention, though the difference is negligible in many setups.
  • Crucially, TGI v3 processes approximately 3× more tokens and is up to 13× faster than vLLM on long prompts under setups where very long histories and prefix caching are enabled, making it exceptional for chat applications.

Latency Profile

  • P50 latency for short and mid-length prompts is comparable to vLLM when both are tuned with continuous batching.
  • For long chat histories, where prefill typically dominates in naïve pipelines, TGI v3’s reuse of earlier tokens provides a massive win in TTFT and P50 latency.

KV and Memory Behavior

TGI leverages KV caching with paged attention-style kernels and significantly reduces memory footprint through chunking of prefill and other runtime optimizations. It integrates quantization through libraries like bitsandbytes and GPTQ, supporting various hardware backends.
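A minimal client-side sketch against a running TGI v3 endpoint (server launch not shown; the localhost URL is a placeholder) using the Hugging Face InferenceClient. Resending the same long chat history on every turn is exactly the pattern that TGI v3's prefix KV caching accelerates.

```python
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # placeholder TGI v3 endpoint

long_history = "SYSTEM: You are a helpful assistant.\n" + "\n".join(
    f"USER: turn {i}\nASSISTANT: ..." for i in range(50)
)

# Repeated calls share the long history; prefix caching avoids re-running prefill on it.
for question in ["What did I ask in turn 3?", "Summarize the conversation."]:
    stream = client.text_generation(
        long_history + f"\nUSER: {question}\nASSISTANT:",
        max_new_tokens=128,
        stream=True,
    )
    print("".join(token for token in stream))
```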

Where it Fits

TGI v3 is an excellent choice for production stacks already integrated with Hugging Face, especially for chat-style workloads with long histories where prefix caching yields substantial real-world performance gains.

4. LMDeploy: TurboMind for Maximum Throughput

Design

LMDeploy, part of the InternLM ecosystem, is a comprehensive toolkit for LLM compression and deployment. It offers two primary inference engines:

  • TurboMind: High-performance CUDA kernels specifically optimized for NVIDIA GPUs.
  • PyTorch engine: A flexible fallback for broader compatibility.

Key runtime features designed for peak performance include:

  • Persistent, continuous batching for sustained high throughput.
  • A blocked KV cache with an advanced manager for allocation and reuse, similar in concept to PagedAttention but with distinct internal layout.
  • Dynamic split and fuse for attention blocks.
  • Tensor parallelism for scaling large models across multiple GPUs.
  • Comprehensive weight-only and KV quantization, including AWQ and online INT8 / INT4 KV quantization, essential for memory-constrained scenarios.

LMDeploy claims up to 1.8× higher request throughput than vLLM, attributing this to its persistent batching, blocked KV cache, and highly optimized kernels.
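A minimal LMDeploy sketch using the TurboMind backend with online INT8 KV quantization follows. The model name is a placeholder, and the config fields shown (quant_policy, cache_max_entry_count) should be checked against your LMDeploy version.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# quant_policy=8 requests online INT8 KV quantization (4 would request INT4);
# cache_max_entry_count is the fraction of free VRAM given to the blocked KV cache.
engine_cfg = TurbomindEngineConfig(quant_policy=8, cache_max_entry_count=0.8)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)  # placeholder model
responses = pipe(["What does a blocked KV cache buy you?",
                  "Why does KV quantization help at high concurrency?"])
for r in responses:
    print(r.text)
```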

Performance

Evaluations demonstrate LMDeploy’s prowess:

  • For 4-bit Llama-style models on A100 GPUs, LMDeploy can achieve higher tokens per second than vLLM under comparable latency constraints, particularly at high concurrency.
  • It also reports that 4-bit inference is approximately 2.4× faster than FP16 for supported models, a crucial factor in practical LLM inference optimization.

Latency

  • Single-request TTFT is competitive with other optimized GPU engines when configured without extreme batch limits.
  • Under heavy concurrency, its persistent batching combined with the blocked KV cache allows LMDeploy to sustain high throughput without TTFT collapse.

KV and Memory Behavior

The blocked KV cache manages KV chunks in a grid, similar in spirit to vLLM’s PagedAttention, but with a unique internal layout for efficiency. Its robust support for weight and KV quantization is specifically tailored for deploying large models on constrained GPUs, maximizing resource utilization.

Where it Fits

LMDeploy is an excellent fit for NVIDIA-centric teams that prioritize maximum throughput and are comfortable leveraging TurboMind and LMDeploy’s specific tooling.

5. SGLang: RadixAttention for Structured AI Workloads

Design

SGLang is a dual-purpose solution:

  • A Domain Specific Language (DSL) for building structured LLM programs, such as agents, RAG workflows, and tool-use pipelines, allowing for more predictable and efficient generation.
  • A runtime that implements RadixAttention, a novel KV reuse mechanism that shares prefixes using a radix tree structure, rather than simple block hashes.

RadixAttention’s key innovation is its ability to:

  • Store KV caches for numerous requests in a prefix tree, keyed by tokens.
  • Enable exceptionally high KV hit rates when many calls share prefixes, which is common in few-shot prompts, multi-turn chat, or complex tool chains.

Performance

Key insights from SGLang’s performance include:

  • SGLang achieves up to 6.4× higher throughput and up to 3.7× lower latency than baseline systems like vLLM and LMQL on structured workloads.
  • These improvements are most pronounced in scenarios with heavy prefix reuse, such as multi-turn conversational agents or evaluation benchmarks with repeated contexts.

Reported KV cache hit rates range from approximately 50% to 99%, with cache-aware schedulers achieving near-optimal hit rates on measured benchmarks.
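The DSL makes the shared prefix explicit, which is what RadixAttention exploits to reach those hit rates. The sketch below assumes an SGLang server is already running at the placeholder endpoint; the prompts are illustrative.

```python
import sglang as sgl

@sgl.function
def triage(s, ticket):
    # The system prompt and few-shot example form a large shared prefix;
    # RadixAttention stores its KV once and reuses it for every call below.
    s += sgl.system("You are a support triage bot. Categories: billing, bug, feature.")
    s += sgl.user("Ticket: App crashes on launch.")
    s += sgl.assistant("bug")
    s += sgl.user(f"Ticket: {ticket}")
    s += sgl.assistant(sgl.gen("category", max_tokens=8))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))  # placeholder endpoint
states = triage.run_batch([{"ticket": t} for t in
                           ["I was charged twice.", "Please add dark mode."]])
print([st["category"] for st in states])
```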

KV and Memory Behavior

RadixAttention is built upon paged attention-style kernels, focusing intensely on intelligent reuse beyond mere allocation. SGLang also integrates well with hierarchical context caching systems that can move KV between GPU and CPU for extremely long sequences, though such systems are often implemented as separate projects.

Where it Fits

SGLang is perfectly suited for agentic systems, tool pipelines, and heavy RAG (Retrieval Augmented Generation) applications where many calls share large prompt prefixes and where advanced KV reuse at the application level can yield significant performance benefits.

6. DeepSpeed Inference / ZeRO Inference: Scaling to Giant Models

Design

DeepSpeed provides two critical components for inference at extreme scale:

  • DeepSpeed Inference: Optimized transformer kernels coupled with tensor and pipeline parallelism.
  • ZeRO Inference / ZeRO Offload: Techniques designed to offload model weights, and in some configurations, the KV cache, to CPU or NVMe storage. This allows for running exceptionally large models on GPUs with limited memory.

ZeRO Inference focuses on:

  • Keeping minimal or no model weights resident in GPU memory.
  • Streaming tensors from CPU or NVMe storage as needed during inference.
  • Targeting throughput and maximizing model size rather than ultra-low latency.

Performance

Consider the ZeRO Inference OPT 30B example on a single V100 32GB GPU:

  • Full CPU offload achieves approximately 43 tokens per second.
  • Full NVMe offload reaches about 30 tokens per second.
  • Both configurations are 1.3–2.4× faster than partial offload setups because full offload enables larger batch sizes by freeing up GPU memory.

While these token-per-second numbers might seem modest compared to GPU-resident LLM runtimes on A100 or H100, it’s crucial to remember they apply to a model that would otherwise not fit natively within 32GB of GPU memory. A recent I/O characterization confirms that offload-based systems, like DeepSpeed, become primarily bottlenecked by small 128 KiB reads from storage.

KV and Memory Behavior

Model weights and sometimes KV blocks are systematically offloaded to CPU or SSD to accommodate models exceeding GPU capacity. While TTFT and P99 latency are higher compared to pure GPU engines, the invaluable tradeoff is the ability to run very large models that would otherwise be impossible on the given hardware.
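A hedged sketch of the ZeRO-Inference pattern with the Hugging Face integration: weights are declared as ZeRO stage 3 parameters offloaded to NVMe before the model is loaded, so they stream from storage during generation. The model name and NVMe path are placeholders, and the DeepSpeed docs should be consulted for tuned I/O settings.

```python
# Run with: deepspeed --num_gpus 1 this_script.py
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,   # required key even for inference-only use
}

# Must be constructed *before* from_pretrained so weights stream straight to the offload target.
dschf = HfDeepSpeedConfig(ds_config)

model = AutoModelForCausalLM.from_pretrained("facebook/opt-30b", torch_dtype=torch.float16)
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

tok = AutoTokenizer.from_pretrained("facebook/opt-30b")
inputs = tok("Offloading lets a 30B model run on a 32GB GPU because",
             return_tensors="pt").to("cuda")
print(tok.decode(engine.module.generate(**inputs, max_new_tokens=32)[0]))
```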

Where it Fits

DeepSpeed Inference / ZeRO Inference is indispensable for offline or batch inference tasks, or for low QPS (queries per second) services where model size takes precedence over latency and GPU count is limited.

Comparative Overview: Choosing Your LLM Inference Runtime

The table below qualitatively summarizes the main tradeoffs and ideal use cases for each runtime, highlighting their unique strengths in AI model serving.

| Runtime | Main Design Idea | Relative Strength | KV Strategy | Typical Use Case |
| --- | --- | --- | --- | --- |
| vLLM | PagedAttention, continuous batching | High tokens per second at a given TTFT | Paged KV blocks, FP8 KV support | General-purpose GPU serving, multi-hardware |
| TensorRT LLM | Compiled kernels on NVIDIA + KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV, reuse and offload | NVIDIA only, latency sensitive |
| TGI v3 | HF serving layer with long-prompt path | Strong long-prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF-centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quantization | Up to 1.8× vLLM throughput in vendor tests | Blocked KV cache, weight and KV quantization | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4× throughput and 3.7× lower latency on structured workloads | Radix-tree KV reuse over prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU/CPU/NVMe offload for huge models | Enables large models on small GPUs; throughput oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |

Choosing the Right Runtime for Your Production System

For a production system, the choice often simplifies to a few key patterns based on your specific needs and infrastructure:

  • For a strong default engine with minimal custom work: Start with vLLM. It provides excellent throughput, reasonable Time-To-First-Token (TTFT), and robust KV handling across common hardware, making it a versatile choice for many.
  • For NVIDIA-committed environments needing fine-grained control over latency and KV: Opt for TensorRT LLM, likely deployed behind Triton Inference Server or TGI. Be prepared to invest in model-specific engine builds and extensive tuning for peak performance.
  • If your stack is already on Hugging Face and long chat histories are critical: TGI v3 is your best bet. Its long prompt pipeline and intelligent prefix caching are exceptionally effective for conversational AI traffic, drastically improving user experience.
  • To achieve maximum throughput per GPU with quantized models: Consider LMDeploy with TurboMind and its blocked KV cache, particularly effective for 4-bit Llama family models on NVIDIA hardware.
  • When building agents, tool chains, or heavy RAG systems: Leverage SGLang and strategically design prompts to maximize KV reuse via RadixAttention. This is crucial for optimizing structured AI programs.
  • If you must run very large models on limited GPU resources: DeepSpeed Inference / ZeRO Inference is essential. Accept higher latency in exchange for the ability to deploy models that would otherwise be impossible, treating the GPU as a throughput engine augmented by SSD storage.

Ultimately, all these advanced engines underscore a converging truth: the KV cache is the primary bottleneck resource in LLM inference. The winning runtimes are those that treat the KV cache as a first-class data structure—to be paged, quantized, reused, and even offloaded—rather than merely a large tensor allocated in GPU memory. This sophisticated management is the key to unlocking the full potential of GPU acceleration for LLMs.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


FAQ

Question 1: What is the KV cache and why is it so important for LLM inference?

The KV (Key-Value) cache stores the ‘keys’ and ‘values’ computed during the attention mechanism for previously processed tokens in a sequence. In Transformer-based LLMs, for each token in a sequence, attention weights are calculated by comparing a ‘query’ with ‘keys’ from all preceding tokens. These ‘keys’ and ‘values’ are then used to create a weighted sum that forms the output of the attention layer.

The KV cache is critical because it prevents redundant computations. Without it, every new token generated would require recomputing the keys and values for all previous tokens in the sequence, leading to a drastic increase in computational load and latency, especially for longer contexts. By caching these, the model only needs to compute keys and values for the *new* token, significantly speeding up the decoding process. It directly impacts performance metrics like tokens per second and latency, making efficient KV cache management central to LLM inference optimization.
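The effect is easy to see with the Hugging Face transformers API: the first forward pass returns past_key_values for the whole prompt, and each later step feeds only the newest token plus that cache. A minimal sketch with a small placeholder model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # small placeholder model; the mechanics are the same for larger LLMs
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

ids = tok("The KV cache matters because", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, use_cache=True)             # prefill: keys/values for every prompt token
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(-1, keepdim=True)
    for _ in range(10):                          # decode: only the new token is computed
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```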

Unique Tip: The KV cache is becoming even more critical for multimodal LLMs and models with extremely long context windows (like Google Gemini 1.5 Pro with its 1 million token context). Efficiently managing a KV cache of this magnitude often involves techniques like sparse attention, advanced compression, or hierarchical caching between GPU and host memory to prevent it from becoming an overwhelming memory bottleneck.

Question 2: How do these LLM runtimes address the challenges of serving models with very long context windows?

Serving LLMs with very long context windows presents significant memory and computational challenges because the KV cache grows linearly with context length. Runtimes tackle this in several innovative ways:

  • Prefix Sharing and Reuse: Runtimes like TGI v3 and SGLang (with RadixAttention) are designed to identify and reuse common prefixes across multiple requests or turns in a conversation. This means if several queries start with the same large prompt, the KV cache for that common prefix is computed only once and shared, saving significant computation and memory.
  • Chunked Prefill: TGI v3 uses chunked prefill, where long input sequences are processed in smaller segments rather than all at once. This reduces peak memory usage during the initial prefill stage.
  • Offloading and Quantization: DeepSpeed’s ZeRO Inference offloads model weights and sometimes KV cache blocks to CPU or even NVMe SSDs, allowing models that would otherwise exceed GPU memory to run. Quantization (e.g., FP8, INT8 for KV cache in vLLM, TensorRT LLM, LMDeploy) reduces the memory footprint of the KV cache itself, making it feasible to store more tokens in GPU memory.
  • Paged KV Caches: Most modern runtimes (vLLM, TensorRT LLM, LMDeploy) implement paged KV caches, which allocate memory in fixed-size blocks. This virtualized memory management ensures efficient utilization of GPU memory, minimizing fragmentation and allowing more flexible allocation for varying context lengths.

These strategies collectively enable more efficient large language model deployment even with demanding, long-context applications.
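A toy sketch of the chunked-prefill idea from the list above: each chunk only needs attention scores against the tokens cached so far plus the chunk itself, so the peak score-matrix size is bounded by the chunk size rather than the full prompt length. The numbers are illustrative.

```python
# Toy illustration of chunked prefill: process a long prompt in fixed-size chunks so the
# per-step attention working set stays bounded while the KV cache grows to full length.
PROMPT_TOKENS = 32_768
CHUNK = 8_192

kv_len = 0
for start in range(0, PROMPT_TOKENS, CHUNK):
    chunk_len = min(CHUNK, PROMPT_TOKENS - start)
    kv_len += chunk_len
    # Causal attention: the chunk attends to everything cached so far plus itself.
    print(f"chunk at {start:>6}: score matrix {chunk_len} x {kv_len}")
print(f"monolithic prefill would build a {PROMPT_TOKENS} x {PROMPT_TOKENS} score matrix at once")
```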



