
71ms per forward pass. budget is 35ms. the hardware told me before i wrote a single line of code.

Building a serving system for video world models. The math forced every decision before I named a single abstraction.

April 18, 2026

That's the moment the entire architecture fell out. Not from design documents. Not from a whiteboard session with the team. From doing the math on a napkin and realizing that every decision I was about to make was already made.

Let me explain what I built and why the constraints forced it.

I've been working on a serving system for video world models -- specifically for robotics, where the model runs in a closed loop with physical hardware and latency isn't a nice-to-have. The robot needs a prediction of what happens next if it moves left. It needs it in under 50ms. It needs it continuously, at scale, across a fleet.

The first thing I had to accept: everything I knew about LLM serving is wrong for this workload.

In an LLM you generate one token per iteration. The KV cache grows by one entry each step. The sequence gets longer. The bottleneck moves from compute to memory as context grows. vLLM's PagedAttention was designed for exactly this shape. It is a 1D problem with a 1D solution.

A video world model is a Diffusion Transformer operating on a 3D latent. The pipeline: camera frames → 3D VAE encoder → latent tensor of shape (T, H, W, C) → denoising loop → action head → motor commands. 16 frames of 256×256 video become roughly 16,384 latent tokens. The DiT runs one full forward pass over the entire latent per denoising step, N steps in total. You are not growing a sequence. You are refining a fixed-size 3D object, N times. The KV cache isn't the bottleneck in the LLM sense. The bottleneck is that you are doing N full forward passes and each one has to be under budget.
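As a sanity check on that token count -- assuming a VAE with 8× spatial downsampling and no temporal downsampling, which are my fill-ins since the post only states the result:

```python
# Back-of-envelope latent token count. The downsampling factors are
# assumptions consistent with 16 frames of 256x256 -> ~16,384 tokens.
def latent_tokens(frames, height, width, spatial_down=8, temporal_down=1):
    t = frames // temporal_down
    h = height // spatial_down
    w = width // spatial_down
    return t * h * w

print(latent_tokens(16, 256, 256))  # 16 * 32 * 32 = 16384
```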

Here is the math that forced everything.

Target: 50ms end-to-end P99. I budgeted 35ms for actual model compute after accounting for network, VAE encode, conditioning, action decode, and serialization. The overhead is 15ms on a well-optimized path. That leaves 35ms.

A 3B parameter DiT doing one forward pass over 16K tokens at FP8 on an H100: roughly 98 TFLOPs. At 70% real utilization -- not the spec sheet number, the real one -- that's 71ms per step.

71ms. Budget is 35ms. Before I had named a single abstraction.
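The napkin math is reproducible. This sketch uses 2·params·tokens for the dense matmul FLOPs (attention terms omitted) and ~1979 TFLOPS as the H100 SXM FP8 dense peak -- both are my fill-ins for numbers the text leaves implicit:

```python
# Reproducing the napkin math: a 3B-param DiT, one forward pass over
# 16K tokens, FP8 on an H100 at 70% of real (not spec-sheet) peak.
params = 3e9
tokens = 16_384
flops_per_pass = 2 * params * tokens           # ~9.8e13, i.e. ~98 TFLOPs
peak_fp8 = 1.979e15                            # H100 SXM FP8 dense, FLOP/s
real_util = 0.70
latency_ms = flops_per_pass / (peak_fp8 * real_util) * 1e3
print(round(latency_ms, 1))  # ~71.0 ms per forward pass
```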

That calculation forces three non-negotiable commitments in sequence:

You must distill. Classical diffusion needs 20-50 denoising steps. 2 steps at 15ms each is 30ms -- tight but workable. Consistency distillation or rectified flow shortcuts to get from 20 steps to 2. There is no other path. The math doesn't negotiate.

You must shard. Hitting 15ms per step means a ~4.7× speedup over a single GPU. Tensor parallelism across 4 H100s connected by NVLink, with ~15-20% all-reduce overhead, gets you to roughly 3.3×. Tight. FP8 everywhere, KV cache included. Still tight. Works.

You must exploit causal caching across chunks. The world model generates video in chunks -- frames 0-15, then conditioned on those to generate 16-31, and so on. Within a single chunk's denoising loop, the KV cache for all prior chunks' tokens doesn't change. Only the current chunk's tokens are being denoised. That means across your 2 denoising steps, you only recompute KV for the tokens you are actively working on. For a robot that's been running 30 seconds, you might have 10× more context tokens than current-chunk tokens. This saves ~80% of attention work. It's the single highest-leverage optimization in the system and it falls entirely out of the structure of the problem.
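Under a simple model where only current-chunk tokens act as queries against cached context KV -- an idealization; real kernels recover somewhat less, which is consistent with the ~80% figure above:

```python
# Idealized attention-work savings from causal KV caching across chunks.
# Full recompute costs (context+chunk)^2 query-key interactions; with
# cached context KV and only current-chunk queries it costs
# chunk * (context+chunk).
def attention_savings(context_tokens, chunk_tokens):
    total = context_tokens + chunk_tokens
    full = total * total            # every token queries every token
    cached = chunk_tokens * total   # only current-chunk queries
    return 1 - cached / full

print(round(attention_savings(10 * 1024, 1024), 2))  # ~0.91 at a 10:1 ratio
```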

The attention kernel is where the interesting engineering lives.

vLLM's PagedAttention stores KV blocks as 1D slices: 16 tokens × n_heads × head_dim. The block table is flat. The attention kernel reads blocks sequentially.

Video attention is 3D. The natural block shape is a 3D tile -- say, 4 timesteps × 8×8 spatial patches = 256 tokens per block. The block table is now 3D: (sequence_id, t_idx, y_idx, x_idx) → physical_block_ptr. And the access pattern is structured by how the attention factorizes.

For spatial attention within frame t, you need all blocks where the time component is t -- a plane through the 3D block grid. For temporal attention at spatial position (y, x), you need all blocks along the time axis at that column. For windowed 3D attention, you need a neighborhood cube. These are fundamentally different access patterns from a flat sequence, and they don't map cleanly onto any existing attention kernel.
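The three access patterns can be sketched against a toy 3D block table. Everything here is illustrative -- function names and the dict-based table are mine, not the actual kernel's data structures:

```python
# A 3D block table maps (t_idx, y_idx, x_idx) -> physical block id.
def spatial_blocks(table, t):
    """All blocks in the plane at time index t (spatial attention)."""
    return [blk for (ti, y, x), blk in table.items() if ti == t]

def temporal_blocks(table, y, x):
    """All blocks along the time axis at column (y, x) (temporal attention)."""
    return [blk for (t, yi, xi), blk in table.items() if yi == y and xi == x]

def windowed_blocks(table, t, y, x, r=1):
    """Neighborhood cube of radius r around (t, y, x) (windowed 3D attention)."""
    return [blk for (ti, yi, xi), blk in table.items()
            if abs(ti - t) <= r and abs(yi - y) <= r and abs(xi - x) <= r]

# A tiny 2x2x2 block grid:
table = {(t, y, x): i for i, (t, y, x) in enumerate(
    (t, y, x) for t in range(2) for y in range(2) for x in range(2))}
print(len(spatial_blocks(table, 0)))      # 4 blocks: the t=0 plane
print(len(temporal_blocks(table, 1, 1)))  # 2 blocks: one time column
```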

I wrote a custom Triton kernel that takes a 3D block table and an access pattern descriptor -- spatial, temporal, or windowed -- and computes attention over only the blocks the pattern touches. The memory layout matters enormously: blocks that will be read together need to be physically adjacent in HBM for coalesced reads. I allocate a contiguous HBM arena per (sequence, time-range) tuple and lay out spatial blocks within each time slab along a Z-order curve, so spatial neighbors are memory neighbors.
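A minimal Morton (Z-order) encoding for 2D block coordinates shows why this works -- neighboring blocks land near each other in the linear arena. This is an illustration, not the kernel's actual layout code:

```python
# Morton/Z-order: interleave the bits of (y, x) to get a 1D index that
# keeps spatial neighbors close in memory.
def morton2(y, x, bits=8):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits -> even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits -> odd positions
    return code

print([morton2(0, x) for x in range(4)])  # [0, 1, 4, 5]
print(morton2(1, 1))                      # 3 -- the 2x2 tile fills 0..3
```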

This also enables three things vLLM can't do:

Eviction in 3D -- when context grows beyond the memory window, evict whole time slabs rather than trying to maintain 1D causality. Matches how the model actually uses context.

Mixed precision by distance -- recent blocks in FP8, distant blocks in FP4. vLLM can't express "recency" in its block representation because there is no such concept in a 1D sequence.

Scene-level sharing -- two robots in the same environment genuinely share early-frame KV when the scene is static. Copy-on-write from vLLM carries over, but the block equivalence check is at the 3D level.
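The first of those, time-slab eviction, is simple enough to sketch. Slab ids, block ids, and the function name are all hypothetical:

```python
# Evict whole time slabs, oldest first, when context exceeds the
# memory window -- rather than maintaining 1D causality block by block.
from collections import OrderedDict

def evict_slabs(slabs, max_slabs):
    """slabs: OrderedDict of time_idx -> list of block ids, oldest first."""
    evicted = []
    while len(slabs) > max_slabs:
        t, blocks = slabs.popitem(last=False)  # drop the oldest slab whole
        evicted.extend(blocks)
    return evicted

slabs = OrderedDict((t, [t * 10 + i for i in range(4)]) for t in range(6))
freed = evict_slabs(slabs, max_slabs=4)
print(len(freed), list(slabs))  # 8 freed blocks; slabs 2..5 remain
```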

The scheduler is diffusion-aware in a way nothing in open source is yet.

Standard continuous batching works because every active sequence is doing the same thing each iteration: decode one token. You can pack sequences at different generation positions into one batch trivially.

Diffusion breaks this. Different requests are at different denoising steps. The model is conditioned on the step index via AdaLN -- you can't mash step-2 and step-5 requests into one forward pass without handling step conditioning per sequence.

The approach: step-homogeneous micro-batching with SLO-aware admission. I maintain N queues, one per denoising step. Each GPU replica picks the queue whose requests are closest to SLO breach, drains as many as fit in memory, runs that forward pass, advances each request. Requests at the final step exit. The rest move to the next queue.
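The queue-selection logic can be sketched as follows. Names and the 15ms step time are illustrative; each queue is assumed to be kept sorted by deadline (EDF):

```python
# Step-homogeneous micro-batching: one queue per denoising step; the
# replica picks the queue whose head request has the least slack.
def pick_queue(queues, now, step_ms=15.0):
    """queues[i] holds deadlines (ms) of requests waiting at step i."""
    best, best_slack = None, float("inf")
    for step, q in enumerate(queues):
        if not q:
            continue
        remaining_work = (len(queues) - step) * step_ms  # steps left to run
        slack = q[0] - now - remaining_work
        if slack < best_slack:
            best, best_slack = step, slack
    return best

# Two-step model: the step-0 request is due at t=60ms, the step-1
# request at t=40ms -- the step-1 request is closest to breach.
print(pick_queue([[60.0], [40.0]], now=0.0))  # 1
```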

The admission controller does deadline math. Request arrives with a 40ms deadline. 2 steps × 15ms = 30ms of work. Queue depth suggests 15ms before it starts. 45ms total. Misses deadline. Shed to the 1-step model. The robot client gets the quality tier in the response and can decide whether to retry or accept the lower-quality prediction. This is EDF scheduling applied to an ML workload, and to my knowledge no production inference stack does it.
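That admission arithmetic, as a function (names and tiers are illustrative):

```python
# Admit at the 2-step tier only if queue wait plus remaining compute
# fits inside the deadline; otherwise shed to the 1-step tier.
def admit(deadline_ms, steps, step_ms, queue_wait_ms):
    finish = queue_wait_ms + steps * step_ms
    return "2-step" if finish <= deadline_ms else "1-step"

# The example from the text: 40ms deadline, 2 x 15ms of work, 15ms of
# queue wait -> 45ms total, misses, shed to the 1-step model.
print(admit(deadline_ms=40, steps=2, step_ms=15, queue_wait_ms=15))  # 1-step
print(admit(deadline_ms=50, steps=2, step_ms=15, queue_wait_ms=15))  # 2-step
```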

One more: mixed step batching. If the model uses AdaLN for timestep conditioning, you can batch requests at different denoising steps by broadcasting different timestep embeddings per sequence. Same forward pass, different conditioning. ~20-30% utilization gain. StreamDiffusion does this for image generation. Nobody has shipped it for video world models.
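Why AdaLN makes mixed-step batching cheap: the step conditioning is just a per-sequence scale/shift lookup, so different steps can share one forward pass. A toy scalar "model" with hypothetical names:

```python
# Mixed-step batching sketch: each sequence in the batch gets the
# scale/shift for its own denoising step -- same forward pass,
# different conditioning per sequence.
def adaln_forward(hidden, timesteps, step_embeddings):
    out = []
    for h, t in zip(hidden, timesteps):
        scale, shift = step_embeddings[t]   # per-sequence AdaLN params
        out.append(h * scale + shift)
    return out

# One request at step 2, one at step 5, batched together:
step_embeddings = {2: (2.0, 1.0), 5: (0.5, 0.0)}
print(adaln_forward([1.0, 1.0], [2, 5], step_embeddings))  # [3.0, 0.5]
```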

The disaggregated topology has four independently-scaling pools.

VAE encoder on L40S -- small, compute-bound, cheap. Never burn H100 time on this.

DiT denoising on H100/H200 with NVLink. 4-GPU TP groups. KV cache lives here.

Conditioning pool for text prompts, action histories, camera parameters. CPU or small GPUs. Pre-encodes conditioning and ships it to the DiT pool via RDMA in under 1ms.

Decoder pool -- polymorphic. Robotics customers run the action head (tiny, often on CPU). Video streaming customers run the VAE decoder back to pixels. Same DiT backbone, different decoder.

The KV cache memory hierarchy: L1 in HBM (last ~5 seconds, FP8), L2 in host DRAM (last ~30 seconds, FP4, paged back in ~50ms), L3 in distributed cache (full session history, used for session resumption and training data). Weights live in HBM always. No weight offloading. You cannot afford the latency.
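The hierarchy reads naturally as a tier table plus an age lookup. The horizons and precisions for L1/L2 are the ones stated above; the L3 precision is left unspecified because the text doesn't state one:

```python
# KV cache tiers: (name, age horizon in seconds, precision).
TIERS = [
    ("L1-HBM",  5.0,          "fp8"),  # last ~5s, hot
    ("L2-DRAM", 30.0,         "fp4"),  # last ~30s, paged back in ~50ms
    ("L3-dist", float("inf"), None),   # full session history
]

def tier_for(age_s):
    for name, horizon, dtype in TIERS:
        if age_s <= horizon:
            return name, dtype
    raise ValueError("unreachable")

print(tier_for(2.0))    # ('L1-HBM', 'fp8')
print(tier_for(12.0))   # ('L2-DRAM', 'fp4')
print(tier_for(120.0))  # ('L3-dist', None)
```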

I'll be honest about where I landed.

Sub-50ms P99 at 1-2K concurrent robots on 128 H100s is achievable. 10K concurrent is aspirational -- it requires either a smaller base model or a much bigger cluster. 85% GPU utilization is achievable with the scheduler and disaggregation. 99.99% availability is a 12-month engineering project on its own, and it mostly comes from the degradation paths -- when queue depth spikes, route new requests to the 1-step model, then to a previous-generation distilled model on smaller GPUs -- not from making any single component more reliable.

The genuinely novel pieces are the spatiotemporal attention kernel and the diffusion-aware scheduler. Neither has a good open-source equivalent yet. Everything else is good engineering applied to a new workload.

(The claim of "first system" is wrong. 1X, Figure, and NVIDIA's Cosmos teams have internal versions of this. What would make a new system matter is the open interface and multi-tenant economics -- "vLLM for world models" is the right framing, not "fastest inference in the world.")

the math told me the architecture before i designed anything.

71ms per forward pass. 35ms budget. distill, shard, cache -- in that order, not optional, not negotiable.

the hardware was already done with the design meeting before i scheduled it.

if you're building robotics infrastructure or working on world model serving and want to talk through any of this, write to me. the spatiotemporal attention kernel is the part that took the longest and the part i'm most interested in feedback on.

P.S. The phase structure matters more than any individual technical decision. Phase 1 is always "single replica, no paging, measured." Most teams skip this and pay for it forever because they don't know the true cost structure before they build the optimization. Two months of benchmarking a working but dumb system will tell you more than four months of building a clever one without a baseline.

i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.

no spam. no sequence. just the note, when it exists.