The jump looked wrong. The physics were real.
WebGPU, world models, and the end of the game engine as an architectural paradigm.
February 22, 2026

I had a game engine open in one tab and a browser running a world model in the other.
The game engine had 847 lines of code to handle physics, collision detection, a scene graph, a rendering pipeline, texture atlases, a frame loop, an input handler, and a state machine for a game that wasn't even playable yet.
The browser tab had a transformer dynamics model predicting the next frame from the previous frame and the action I just took. I pressed the spacebar. The model generated a jump. The jump looked wrong. I pressed it again. The model decided I hadn't jumped. That was nearly all the code there was: a handful of compute shader dispatches per frame. The rest was latent space.
I closed the engine.
Not because it stopped working. Because the architecture it represents has already lost, and most people writing game engines don't know it yet.
Here is what changed and why it matters to anyone who thinks carefully about where compute goes.
World models are not video generators. This is the mistake everyone makes when they first see Genie or Oasis. Video generators produce fixed trajectories. You give them a prompt, they produce a sequence of frames, the sequence is done. You are watching, not interacting. No state. No action. No counterfactual.
World models are different in a precise way. They model the conditional distribution: given the current state of the world and the action you took, what is the next state? That conditionality is everything. It means the model has internalized a physics simulation, a renderer, a game logic engine, and an asset pipeline, all inside its weights, learned from watching humans play. When Genie 2 generates the next frame, it is answering a causality question: "what does this world look like after this action, from this camera angle, with this lighting, given everything that has happened so far?"
The architecture underneath that answer: a video autoencoder compresses each frame into a latent representation. A transformer dynamics model, trained with a causal mask identical to the one used in language models, takes the sequence of past latent frames plus the current action and predicts the next latent frame. A decoder renders the latent back to pixels. The whole thing runs autoregressively, frame by frame, exactly like a language model generates tokens one at a time.
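That loop fits in a handful of lines. Here is a sketch, with `encode`, `predictNextLatent`, and `decode` as toy stand-ins for the learned autoencoder, dynamics transformer, and decoder; the structure is the point, not the math.

```javascript
// Sketch of the autoregressive world-model loop. The three functions
// below are hypothetical stand-ins for learned networks, not real models.
const LATENT_DIM = 4; // toy latent size; a real autoencoder picks this

function encode(frame) {
  // Stand-in for the video autoencoder: pixels -> latent vector.
  return frame.slice(0, LATENT_DIM);
}

function predictNextLatent(history, action) {
  // Stand-in for the causal transformer: next latent conditioned on
  // all past latents plus the current action. The conditioning on the
  // action is the part that makes this a world model, not a video model.
  const last = history[history.length - 1];
  return last.map((x, i) => x * 0.9 + action[i % action.length] * 0.1);
}

function decode(latent) {
  // Stand-in for the decoder: latent vector -> pixels.
  return latent.map((x) => Math.round(x * 255));
}

// Autoregressive rollout: each predicted latent is appended to the
// history and conditions the next prediction, exactly like token
// generation in a language model.
function rollout(firstFrame, actions) {
  const history = [encode(firstFrame)];
  const frames = [];
  for (const action of actions) {
    const next = predictNextLatent(history, action);
    history.push(next);
    frames.push(decode(next));
  }
  return frames;
}
```

Swap the three stubs for real networks and the control flow does not change. That is the whole architectural claim in miniature.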
The game engine, in this framing, is not the software you write. It is the training data the model learned from. Millions of hours of gameplay, physics simulations, rendered environments. The model learned the rules by watching them get applied to pixels. It never saw the code.
And then WebGPU arrived.
Not in the theoretical sense. In the November 2025 sense: Chrome, Firefox, Edge, and Safari all shipping it by default, global coverage hitting 83%, and the entire constraint around what you could run in a browser evaporating almost overnight.
WebGPU is not WebGL with a new syntax. WebGL was a graphics API bolted onto the browser, originally designed for rendering, co-opted for ML via texture hacks and fragment shader abuse. A BERT inference that took 50ms natively took 800ms through WebGL because the abstraction was wrong. WebGPU starts from the GPU primitives: compute shaders with actual buffer access, shared workgroup memory, FP16 support, storage textures that you can write to from compute. It maps directly onto Vulkan, Metal, and DirectX 12 underneath. The browser is no longer a layer of indirection from the hardware. It is, for the first time, a real compute environment.
I wrote my first WGSL compute shader, a matrix multiplication, and dispatched it from JavaScript. The speed was not surprising to me intellectually. I knew the numbers. It was still surprising to feel. The browser tab was running the same matmul I would have written in CUDA. On the same GPU. At comparable throughput.
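For the curious, here is roughly what that looks like. The WGSL is a naive one-thread-per-output-element matmul held as a string, the way you would hand it to `device.createShaderModule`; the JavaScript around it only works out the dispatch geometry. A sketch, not a tuned kernel.

```javascript
// Naive WGSL matmul: one invocation per output element, 16x16 workgroups.
const WG = 16;

const matmulWGSL = /* wgsl */ `
@group(0) @binding(0) var<storage, read> a : array<f32>;
@group(0) @binding(1) var<storage, read> b : array<f32>;
@group(0) @binding(2) var<storage, read_write> c : array<f32>;
struct Dims { m : u32, k : u32, n : u32 }
@group(0) @binding(3) var<uniform> dims : Dims;

@compute @workgroup_size(${WG}, ${WG})
fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
  let row = gid.y;
  let col = gid.x;
  if (row >= dims.m || col >= dims.n) { return; }
  var acc = 0.0;
  for (var i = 0u; i < dims.k; i = i + 1u) {
    acc = acc + a[row * dims.k + i] * b[i * dims.n + col];
  }
  c[row * dims.n + col] = acc;
}`;

// Dispatch geometry: ceil-divide the M x N output by the workgroup tile.
// These are the numbers you hand to pass.dispatchWorkgroups(x, y).
function dispatchSize(m, n) {
  return { x: Math.ceil(n / WG), y: Math.ceil(m / WG) };
}
```

A production kernel would tile through shared workgroup memory and use FP16; the point here is only that the whole thing is an ordinary compute dispatch, no graphics pipeline anywhere.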
The practical consequence: Transformers.js v4 running Llama 8B quantized at 41 tokens per second in a browser tab via WebGPU. ONNX Runtime Web running Stable Diffusion in browser. The Visionary paper, which I spent a weekend reading closely, running an MLP that generates 3D Gaussian Splatting parameters for every frame entirely via ONNX Runtime WebGPU, rendering millions of Gaussians at real-time framerates without a server, without a native app, without anything except a browser and a GPU.
That last one stopped me for longer than I expected.
3D Gaussian Splatting is a neural rendering technique that represents a scene as millions of small, oriented, semi-transparent ellipsoids, each with position, scale, rotation, color, and opacity. The original technique stores these Gaussians as static parameters fit to a fixed scene. The interesting extension, which Visionary is running, generates the Gaussian parameters dynamically from a neural network, frame by frame. The network takes the scene representation and the current timestamp, runs inference, and produces the Gaussian attributes for that frame. The renderer takes those attributes and rasterizes them. Every single frame, the scene geometry is synthesized from latent space. Not loaded from disk. Not queried from a scene graph. Generated.
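Those attributes are concrete enough to do arithmetic on. A back-of-envelope sketch, assuming a minimal layout: plain RGB color rather than the spherical-harmonics coefficients real splat formats often carry, and not Visionary's exact format.

```javascript
// Back-of-envelope layout for one Gaussian, from the attribute list above.
const FLOATS_PER_GAUSSIAN =
  3 +  // position (x, y, z)
  3 +  // scale along each local axis
  4 +  // rotation as a quaternion
  3 +  // color (RGB here; SH coefficients would add many more floats)
  1;   // opacity

const BYTES_PER_FLOAT = 4; // f32

// Size of the buffer the per-frame network has to produce for N Gaussians.
function splatBufferBytes(numGaussians) {
  return numGaussians * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT;
}

// Two million Gaussians, regenerated every frame: ~107 MB of attributes,
// synthesized and consumed on-GPU without ever touching disk.
const perFrameMB = splatBufferBytes(2_000_000) / (1024 * 1024);
```

That number is why "generated, not loaded" matters: at those volumes the only place the data can live is GPU memory, written by one dispatch and read by the next.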
This is what I mean when I say the architecture of the game engine has already lost. The game engine's job was to maintain explicit representations of world state and transform them according to explicit rules. Position. Velocity. Collision geometry. Material parameters. The engine managed all of it. The developer specified it. The renderer consumed it.
In the world model paradigm, none of those explicit representations exist. The world state is a latent vector. The physics are whatever the model internalized from training data. The renderer is the decoder. The developer's job is not to write rules. It is to describe the world, specify what it should look and feel like, and let the model figure out what the latent trajectory through action space should be.
"But the model generates wrong things sometimes." It does. The temporal consistency at the edges breaks. The model confabulates physics it hasn't seen before. Objects morph in ways that Newtonian mechanics would not endorse. Oasis generates a jump that looks wrong.
I know. I watched it happen. I also watched it happen in a browser tab, in real time, with no code I wrote for physics, collision, rendering, scene management, or asset loading. The jump looked wrong. The engine I was using before had 847 lines of code and its jump also looked wrong, for different reasons, for weeks.
The question is not whether world model output is currently perfect. The question is which trajectory closes the gap faster: neural rendering quality compounding with every Genie and Oasis iteration, or traditional engine codebases compounding with every developer year invested in explicit state management.
The neural rendering trajectory has Genie 1 generating 2D environments, Genie 2 generating quasi-3D at 720p, Genie 3 announced in August 2025 generating real-time text-to-world at 720p 24fps with minutes of coherent play. That is two years of iteration from 2D proof of concept to real-time interactive 3D world generation. The traditional engine trajectory has Unreal Engine 5.
I am not saying Unreal Engine 5 is going away next Tuesday. I am saying that the research timeline makes it unambiguous which direction the fundamental architecture is going, and anyone who is still thinking of world models as "interesting demos" is making the same mistake people made about neural networks in 2011: watching the capability and not watching the scaling curve.
The specific combination that I think is most underappreciated by people who know WebGPU but not world models, and by people who know world models but not WebGPU: the compute primitive that enables world model inference in the browser is the compute shader. Specifically, the ability to dispatch arbitrary parallel workloads on the GPU without going through the graphics pipeline. A forward pass through a transformer dynamics model is matrix multiplications, attention operations, and layer norms. All of these are expressible as compute shader dispatches in WGSL. All of them run on the user's GPU at close to native speed. The autoencoder that compresses frames to latent space runs in the browser. The decoder that renders latent back to pixels runs in the browser. The transformer that predicts the next latent from the last latent and the action runs in the browser.
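One way to count those dispatches. Runtimes like ONNX Runtime WebGPU fuse kernels aggressively, so treat this as an illustrative upper bound for an unfused implementation, not a profile.

```javascript
// Illustrative decomposition of one transformer forward pass into
// compute-shader dispatches, one entry per unfused kernel.
const KERNELS_PER_LAYER = [
  "layernorm (pre-attention)",
  "matmul (QKV projection)",
  "attention (scores, softmax, weighted sum)",
  "matmul (attention output projection)",
  "layernorm (pre-MLP)",
  "matmul (MLP up projection)",
  "activation (GELU or similar)",
  "matmul (MLP down projection)",
];

// Dispatches per generated frame for a dynamics model with L layers,
// ignoring the embedding, final norm, and decoder head.
function dispatchesPerFrame(numLayers) {
  return numLayers * KERNELS_PER_LAYER.length;
}
```

A 24-layer dynamics model lands in the low hundreds of dispatches per frame. At 24fps that is thousands of dispatches per second, which is exactly the workload WebGPU's command encoding was designed to make cheap and WebGL's was not.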
No server. No API call. No cloud GPU. The user's GPU runs the world model locally, privately, at real-time framerates on hardware that already exists in the devices they own.
I ran the Visionary architecture locally, through WebGPU, on a machine with a mid-range GPU. The 3D Gaussian renderer hit 60fps on a scene that would have incurred significant CPU overhead on the legacy WebGL path. The MLP inference per frame was under 8 milliseconds. The total frame time, including render, was under 16 milliseconds. 60fps. In a browser tab.
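The budget arithmetic, if you want to check it. The render number here is an assumption, the remainder after inference, not a separate measurement.

```javascript
// The frame budget those numbers are working against.
const TARGET_FPS = 60;
const frameBudgetMs = 1000 / TARGET_FPS; // ~16.67 ms per frame

const mlpInferenceMs = 8; // per-frame Gaussian-parameter inference
const renderMs = 8;       // assumed remainder for rasterizing the splats

const totalMs = mlpInferenceMs + renderMs;
const fitsBudget = totalMs <= frameBudgetMs;
```

16 milliseconds against a 16.67 millisecond budget. No slack to speak of, but it fits, and it fits today.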
I closed the browser. I opened the game engine. I looked at the 847 lines.
I know what this is. It is the last generation of a paradigm that took 40 years to build and is being replaced not by a better version of itself but by a fundamentally different answer to the question of what a game engine is for.
The engine exists to convert developer intent into rendered worlds.
The world model does the same thing. It just learned intent from data instead of implementing it in code.
The developer's job is not disappearing. It is changing. From "write the rules the world follows" to "describe the world you want and curate the data that teaches the model to follow it." That is a different kind of expertise. It is not easier. It is different.
The engineers who will build the most interesting things in the next three years are the ones who understand both sides of this simultaneously. The WebGPU side: compute shaders, WGSL, workgroup memory, buffer layouts, the ONNX Runtime WebGPU execution provider, the actual throughput characteristics of a transformer forward pass dispatched from a browser tab. The world model side: autoregressive latent diffusion, dynamics models, classifier-free guidance, the distillation techniques that get Genie from research speed to real-time, the failure modes of temporal consistency and how they are being attacked.
Most people know one or the other.
The people who know both are the ones building the thing that replaces the game engine. Right now. In browser tabs. With compute shaders and latent spaces and no physics code at all.
The jump looked wrong. The physics were real.
i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.
no spam. no sequence. just the note, when it exists.