the frame budget is 16 milliseconds. it does not negotiate.
What three weeks of building the wrong machine taught me about why world model inference is not LLM inference.
January 9, 2026

It was 11:17pm. I had been staring at a world model serving stack for three weeks trying to make it behave like vLLM. It didn't. It kept breaking in ways that took me a week each to understand.
Week three is when I finally admitted the problem.
I was building the wrong machine.
Not because world models are harder. Because they are a different problem entirely. And I had spent three weeks applying LLM inference intuition to something that shares a transformer backbone and almost nothing else.
Here is what I learned. Slowly. The expensive way.
An LLM generates tokens. Discrete. Small. One per forward pass. The user tolerates 100 milliseconds between tokens. Maybe 150. The stream feels slow but the application survives.
A world model generates frames. A single frame at 720p is roughly 2,500 visual patches encoded into continuous latent space. Not discrete. Not small. And a diffusion-based world model does not generate a frame in one forward pass. It runs 25 denoising steps per frame. Twenty-five full forward passes through the transformer. To produce one frame.
The latency the user tolerates: 16.67 milliseconds. At 60fps.
That is not a soft preference. It is a wall. A world model that takes 50ms per frame runs at 20fps. Players feel it immediately. 100ms per frame is 10fps. The interactive experience breaks. Not degrades. Breaks.
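The arithmetic, as a sketch. The 25 steps are from above; everything else is just division:

```python
# Frame-budget arithmetic for a 25-step diffusion world model at 60 fps.
TARGET_FPS = 60
DENOISING_STEPS = 25

frame_budget_ms = 1000 / TARGET_FPS                      # ~16.67 ms per frame
per_step_budget_ms = frame_budget_ms / DENOISING_STEPS   # ~0.67 ms per forward pass

def fps_at(ms_per_frame: float) -> float:
    return 1000 / ms_per_frame

# The failure modes from above:
# fps_at(50)  -> 20 fps: players feel it immediately.
# fps_at(100) -> 10 fps: the interactive experience breaks.
```

0.67 milliseconds per full transformer forward pass. That is the number an LLM serving engineer has never had to hit.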
An LLM can get slower as the context grows. Users notice, but the application keeps working.
A world model that gets slower as the session progresses is a game that becomes unplayable over time. The latency SLO is hard in a way that almost nothing in LLM serving is.
I did not understand this when I started. I do now.
The KV cache was where I wasted the most time. It looked like the same problem. It wasn't.
In a language model, the KV cache stores the key and value projections for every token the model has seen. It grows linearly with sequence length. PagedAttention treats it like virtual memory. SGLang's RadixAttention trees it for prefix sharing across requests. You can evict old tokens aggressively. Losing some cached context makes the output slightly worse. The application tolerates it.
I tried to apply the same eviction logic to a world model's temporal cache.
The world model started generating rooms that changed color mid-session. Objects that had been on the left appeared on the right. A door that the user had opened closed itself three seconds later.
"But you can just keep more of the—" No. The attention cost grows quadratically with history if you keep everything. At 60fps over 10 seconds, you have 600 frames of latent history. You cannot attend over all of it within the frame budget.
The answer the research arrived at is a rolling KV cache. Fixed-size window. New frames appended. Oldest frames evicted. O(TL) instead of O(T²). The model learns to work within this bounded context. But here is the part I missed: the rolling cache only works if the model was trained with it. If you take a model trained on full history and serve it with a rolling cache, the distribution mismatch breaks temporal coherence. The cache design is a training decision, not an inference decision.
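A minimal sketch of the rolling window, assuming one cache entry per frame. The class and names are illustrative, not from any particular serving stack:

```python
from collections import deque

class RollingKVCache:
    """Fixed-size temporal KV cache: new frames appended, oldest evicted.

    Hypothetical sketch. Each entry stands in for the K/V tensors of one
    frame's latent patches.
    """
    def __init__(self, window_frames: int):
        # deque with maxlen evicts the oldest entry automatically on append
        self.window = deque(maxlen=window_frames)

    def append(self, frame_kv):
        self.window.append(frame_kv)

    def context(self):
        # Attention sees at most window_frames frames: O(TL), not O(T^2).
        return list(self.window)

cache = RollingKVCache(window_frames=8)
for t in range(600):  # 10 seconds at 60 fps
    cache.append(f"kv_frame_{t}")
# Only the most recent 8 frames survive.
```

The deque does the eviction; the hard part, as the text says, is that the model has to be trained against this exact window.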
I learned this at 1am on a Tuesday by watching a generated forest turn into a generated ocean over 40 seconds of play. Nothing in my vLLM experience prepared me to debug that.
Then there is exposure bias. This is the one that nobody from the LLM world talks about because LLMs mostly don't have it.
When you train a world model with teacher forcing, you give it perfect, ground-truth frames as context. Frame 1 is real. Frame 2 is generated conditioned on real frame 1. Frame 3 is generated conditioned on real frame 2. The model learns to predict from clean inputs.
At inference, frame 1 is real. Frame 2 is generated from frame 1. Frame 3 is generated from frame 2, which already has small errors. Frame 4 from frame 3, which has slightly larger errors. Each step, the model is conditioning on a context it never saw during training: its own imperfect outputs. The errors compound.
By frame 30, you have visual collapse. Motion stagnation. Scene freezing. The model generates the same frame repeatedly because the accumulated errors have pushed the latent trajectory into a degenerate attractor.
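A toy numeric illustration of the compounding. The per-step error and amplification factor are made up; only the shape of the curve matters:

```python
# Exposure bias, reduced to one number: each autoregressive step conditions
# on the previous *generated* frame, so error compounds multiplicatively.
def rollout_error(n_frames: int, per_step_error: float, amplification: float) -> float:
    err = 0.0
    for _ in range(n_frames):
        # Condition on own imperfect output: carry the old error forward, amplified.
        err = err * amplification + per_step_error
    return err

# Teacher forcing: context is always clean, so error never accumulates.
teacher_forced = rollout_error(30, per_step_error=0.01, amplification=0.0)
# Free running: a small amplification per step compounds over 30 frames.
free_running = rollout_error(30, per_step_error=0.01, amplification=1.2)
```

With these made-up numbers, 30 frames of free running accumulates over a thousand times the teacher-forced error. The real failure mode is not a clean geometric series, but the direction is the same.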
This does not happen in LLM inference. Not like this. The discrete token space and the scale of language pretraining make LLMs robust to their own errors in a way that world models are not.
The fix is not an inference optimization. It is a training paradigm change. Self-Forcing, NeurIPS 2025 Spotlight, trains the model on its own generated rollouts with KV caching running during training. The model learns to recover from its own errors. It is supervised on the quality of the entire generated sequence, not frame by frame against ground truth. After training this way, the model at inference is already familiar with the kind of imperfect context it will see. The errors still exist. They stop compounding.
"But can't you just noise the context frames at inference to—" People tried this. It complicates the KV cache design, increases latency, and does not resolve the fundamental distribution mismatch. It is a patch on a structural problem.
The paper that got this right spent six months on the training loop. Not the inference engine. The inference engine is downstream of that decision.
Then I tried to use continuous batching.
Continuous batching is the core of vLLM. New requests arrive asynchronously, are integrated into an existing batch mid-sequence, and the GPU stays saturated across many concurrent users. The optimization is toward throughput: tokens per second across all users simultaneously. The more users you batch, the more efficient the hardware.
I built a continuous batching scheduler for the world model serving stack. It did not help. It made things worse.
Interactive world model inference is one user at a time per world instance. Each user is in a unique world state from the moment they take their first action. There is no prefix sharing between worlds. You cannot batch user A's generated ocean with user B's generated forest. Their latent histories diverged at frame 2. The continuous batching logic adds scheduler overhead to solve a concurrency problem that does not exist in the workload.
The economic pressure inverts completely. An LLM engine asks: how many users can we serve on this hardware simultaneously. A world model engine asks: can this single user's world stay coherent at 60fps for the next ten minutes. Different question. Different machine. Different hardware sizing.
I scrapped the scheduler after two weeks. Built a simpler loop. One session, one forward pass per frame, rolling KV cache, hard 16ms frame budget enforced with a timeout that drops denoising steps if the budget is exceeded. Fewer denoising steps means slightly lower visual quality. Missing the frame budget means the game breaks.
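The budget-enforcement core of that loop can be sketched like this; `denoise_step` is a stand-in for one transformer forward pass, and the structure is mine, not from any published engine:

```python
import time

FRAME_BUDGET_S = 1 / 60  # hard 16.67 ms frame budget

def run_frame(denoise_step, latent, max_steps: int, budget_s: float = FRAME_BUDGET_S):
    """Run up to max_steps denoising steps, bailing out when the frame
    budget is spent. Sketch only: denoise_step(latent) -> latent."""
    start = time.monotonic()
    steps_done = 0
    for _ in range(max_steps):
        latent = denoise_step(latent)
        steps_done += 1
        if time.monotonic() - start >= budget_s:
            break  # drop remaining steps: lose a little quality, keep the frame
    return latent, steps_done
```

The check happens after each step, so a single overlong step still blows the budget once. The point is the priority ordering: the loop never chooses quality over the deadline.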
I chose the frame budget every time. The alternative is a technically sophisticated system that produces an unplayable experience.
The last piece: distillation is not quantization.
In LLM serving, the primary throughput lever is precision reduction. INT8, FP8, INT4. You compress the weights, increase the batch size that fits in VRAM, serve more users per GPU. The quality tradeoff is measured in perplexity or benchmark scores. Usually small enough to accept.
In world model serving, the primary throughput lever is step reduction. You take a model that runs 25 denoising steps per frame and distill it into a model that runs 1 to 4 steps. Distribution Matching Distillation. Consistency distillation. Self-Forcing's best checkpoint runs at 17 frames per second on a single H100 at 480p.
The quality tradeoff is visual. You see it. Users see it. But a world model running at 17fps beats a world model running at 2fps on visual fidelity by a margin no quantization could recover.
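The arithmetic behind that, with illustrative numbers. The per-step latency and the 2x quantization speedup are assumptions for the sake of the comparison, not measurements:

```python
# Per-frame latency ~ steps * per_step_ms. That is why step count is the lever.
per_step_ms = 4.0  # assumed latency of one transformer forward pass

baseline_ms = 25 * per_step_ms       # 25 steps -> 100 ms/frame
quantized_ms = baseline_ms / 2       # generous 2x from FP8 -> 50 ms/frame
distilled_ms = 4 * per_step_ms       # 4-step distilled model -> 16 ms/frame

def fps(ms_per_frame: float) -> float:
    return 1000 / ms_per_frame
# fps(baseline_ms) -> 10, fps(quantized_ms) -> 20, fps(distilled_ms) -> 62.5
```

Quantization doubles a number that needed to grow sixfold. Step reduction attacks the multiplier directly.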
These are not the same lever. The engineer who knows LLM inference deeply and does not know world model inference will reach for quantization first and wonder why the latency is still broken.
I did this. Not proud of it. Three weeks.
here is the thing nobody said clearly before I started.
an llm engine asks how many users can share this hardware.
a world model engine asks whether one user's world holds together for ten minutes.
different question. different bottlenecks. different failures. different fixes.
if you come from vllm and try to build a world model serving stack, you will spend three weeks learning this the same way i did.
or you can read this and spend three weeks on something harder.
the frame budget is 16 milliseconds. it does not negotiate.
i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.
no spam. no sequence. just the note, when it exists.