
The State of FP8 KV-Cache and Attention Quantization in vLLM
Long-context LLM serving is increasingly memory-bound: for standard full-attention decoders, the KV cache often dominates GPU memory at 128k+ contexts, and each decode step must read a large fraction of that cache back from HBM, so decode latency is governed by memory bandwidth rather than compute. Quantizing the cache (and, increasingly, the attention computation itself) to FP8 roughly halves both the footprint and the bytes moved per generated token relative to FP16.
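To make the scale concrete, here is a rough back-of-the-envelope sketch (not taken from vLLM itself) that estimates per-sequence KV-cache size. The configuration is an assumption for illustration, roughly a Llama-3-70B-style model with 80 layers, 8 grouped KV heads, and head dimension 128; the formula itself is the standard one.

```python
# Back-of-the-envelope KV-cache sizing (illustrative assumptions, not vLLM internals).

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 80,      # assumed: Llama-3-70B-style depth
                   n_kv_heads: int = 8,     # assumed: GQA with 8 KV heads
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    # Factor of 2 accounts for storing both keys and values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

ctx = 128 * 1024  # 128k-token context
fp16 = kv_cache_bytes(ctx, bytes_per_elem=2)  # FP16/BF16: 2 bytes per element
fp8 = kv_cache_bytes(ctx, bytes_per_elem=1)   # FP8: 1 byte per element

print(f"FP16 KV cache @128k: {fp16 / 2**30:.1f} GiB per sequence")
print(f"FP8  KV cache @128k: {fp8 / 2**30:.1f} GiB per sequence")
```

Under these assumptions a single 128k-token sequence needs on the order of 40 GiB of KV cache in FP16, a large share of an 80 GiB accelerator, and halving the element width halves both that footprint and the bytes attention must read on every decode step.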