
two models shipped this month that broke a rule everyone believed about memory and capability.


April 17, 2026

One runs in a browser tab with no server.

One runs on a single GPU with a 1 million token context window.

Neither should be possible given what we knew six months ago about the relationship between model capability and memory requirements. I want to explain the architecture decisions that made both of them work, because they are solving the same problem from opposite directions and almost nobody has written about them together.

Start with Gemma 4 E2B, released April 2nd. The "E" stands for effective parameters. The model has 5.1 billion total parameters but only 2.3 billion effective ones -- and the distinction is not marketing. It is a specific architectural decision called Per-Layer Embeddings that changes how the memory math works.

Standard transformers have one embedding table. Every token in the vocabulary gets a vector, the same vector at every layer. That table sits in VRAM. The transformer weights sit in VRAM. All of it competes for the same GPU memory budget.

PLE gives every decoder layer its own small embedding table. Each layer gets a secondary embedding signal injected per token -- a different learned representation at layer 1 vs layer 12 vs layer 24. The result is that the model has far richer representational capacity than its 2.3B effective parameter count suggests, because every layer is conditioning on both its weight-based computation and its own learned embedding of the current token.

Here is the part that makes this genuinely weird: those per-layer embedding tables are large -- they account for the difference between 2.3B effective and 5.1B total -- but they are accessed via lookup, not via matrix multiply. A lookup table access on GPU is cheap and parallelizable. And critically, for the on-device use case, those embedding tables can sit in system RAM while the core transformer weights sit in GPU VRAM. The accelerator sees 2.3B parameters. The system memory holds the rest.
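The mechanism is easier to see in code. Here is a toy sketch of per-layer embeddings, with made-up sizes and plain Python lists standing in for CPU-resident tables -- not Gemma's actual implementation, just the shape of the idea:

```python
import random

random.seed(0)

VOCAB, DIM, LAYERS = 1000, 8, 4  # toy sizes, nothing like Gemma's real config

# One small embedding table per decoder layer (the PLE tables).
# In the on-device deployment these can live in system RAM; here,
# ordinary Python lists stand in for that CPU-resident memory.
per_layer_tables = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
    for _ in range(LAYERS)
]

def decoder_layer(hidden, layer_idx, token_id):
    """Toy layer: weight-based computation plus a per-layer embedding lookup."""
    ple = per_layer_tables[layer_idx][token_id]  # cheap table lookup, no matmul
    # stand-in for the layer's actual transformer computation
    return [h * 0.9 + e for h, e in zip(hidden, ple)]

hidden = [0.0] * DIM
for layer in range(LAYERS):
    hidden = decoder_layer(hidden, layer, token_id=42)

# The same token gets a different learned vector at every layer:
assert per_layer_tables[0][42] != per_layer_tables[1][42]
```

The point the sketch makes: the per-layer tables participate in every layer's computation, but only via indexing, which is why they can sit outside the accelerator's memory budget.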

Chrome tabs in 2026 typically have access to roughly 4GB of GPU VRAM. An E2B model with 5.1B total parameters and 4-bit quantization would be ~2.5GB -- right at the edge of what Chrome can hold. But with PLE separating fast-access embedding tables from accelerator-resident transformer weights, the effective VRAM footprint drops well below that line. The E2B ships in a 500MB package for WebGPU deployment. Five hundred megabytes. Running in a browser tab. With 128K context. Doing vision, text, and audio.
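The VRAM arithmetic works out like this (this only shows the VRAM/RAM split at 4-bit; the 500MB WebGPU package involves further compression beyond it):

```python
BYTES_PER_PARAM_4BIT = 0.5  # 4 bits per parameter

total_params = 5.1e9
effective_params = 2.3e9                      # transformer weights, accelerator-resident
ple_params = total_params - effective_params  # embedding tables, offloadable to RAM

all_on_gpu_gb = total_params * BYTES_PER_PARAM_4BIT / 1e9
split_gpu_gb = effective_params * BYTES_PER_PARAM_4BIT / 1e9
split_ram_gb = ple_params * BYTES_PER_PARAM_4BIT / 1e9

print(f"everything in VRAM: {all_on_gpu_gb:.2f} GB")  # ~2.55 GB, at the 4 GB edge
print(f"PLE split, VRAM:    {split_gpu_gb:.2f} GB")   # ~1.15 GB
print(f"PLE split, RAM:     {split_ram_gb:.2f} GB")   # ~1.40 GB
```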

Transformers.js has ONNX weights for it already. The Gemma-Gem Chrome extension runs a full browser agent -- page reading, DOM interaction, form filling, JavaScript execution -- entirely locally, zero network calls, on hardware anyone bought last year.

That is not a demo. That is production.

Now Nemotron 3 Super, released March 11th. 120 billion total parameters. 12 billion active per forward pass. 1 million token context window. Runs on a single H200.

The number that should not be possible: 1 million tokens on a single GPU.

Standard attention compute scales quadratically with context length. Double the context, quadruple the attention FLOPs. The KV cache grows linearly, but at 1 million tokens even linear growth is brutal: a standard transformer's KV cache alone would dwarf the model weights, requiring multiple high-end GPUs just to hold the cache. This is the memory wall that keeps long-context inference on real hardware mostly theoretical.
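To make the wall concrete, here is the KV-cache arithmetic for a hypothetical large dense transformer. The config numbers are illustrative, not any real model's:

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size: a K and a V vector per layer, per head, per position."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Hypothetical config (48 layers, 8 KV heads, head_dim 128, fp16 cache):
full_attn = kv_cache_gb(layers=48, kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"pure-transformer KV cache at 1M tokens: {full_attn:.0f} GB")  # ~197 GB
```

Roughly 197 GB of cache -- more than an H200's 141 GB of HBM before you've loaded a single weight.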

Nemotron 3 Super uses a hybrid architecture: 75% of layers are Mamba-2 state space model layers, 25% are standard attention layers. SSM layers process sequences in linear time. Instead of attending over every previous token, they maintain a compressed recurrent hidden state that gets updated as new tokens arrive. That state is fixed-size regardless of context length. At 1 token, the SSM cache is a certain number of bytes. At 1 million tokens, it is the same number of bytes.
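The fixed-size-state property fits in a few lines. This is a toy decaying recurrence, not Mamba-2's actual gated state-space update -- it only demonstrates the memory shape:

```python
STATE_DIM = 16  # toy recurrent state size

def ssm_step(state, x, decay=0.9):
    """Fold one new token into a fixed-size recurrent state."""
    return [decay * s + x for s in state]

state = [0.0] * STATE_DIM
for token_value in range(1_000):      # process 1,000 tokens...
    state = ssm_step(state, float(token_value))

assert len(state) == STATE_DIM        # ...and the state never grew
```

No matter how many tokens flow through, the memory the layer carries forward is the same handful of numbers.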

The 25% attention layers still pay quadratic attention compute and still accumulate a KV cache that grows with context. But a quarter of the layers paying that cost is substantially less than all of them. The attention layers handle the precise associative recall that pure SSMs struggle with -- finding one specific fact in a haystack of context. The Mamba layers handle the heavy lifting of long-sequence memory. The two complement each other architecturally: SSMs for capacity, attention for precision.
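Extending the earlier cache arithmetic to the hybrid split shows the payoff. Again, the layer counts and head sizes are illustrative placeholders, not Nemotron's published config, and the SSM state size is a stand-in constant:

```python
def attn_kv_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache held by the attention layers only."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

TOTAL_LAYERS = 48                  # hypothetical config
ATTN_LAYERS = TOTAL_LAYERS // 4    # 25% attention, 75% SSM
SSM_STATE_GB = 0.5                 # fixed size at ANY context length (assumed value)

pure = attn_kv_gb(TOTAL_LAYERS, 8, 128, 1_000_000)
hybrid = attn_kv_gb(ATTN_LAYERS, 8, 128, 1_000_000) + SSM_STATE_GB
print(f"pure transformer: {pure:.0f} GB, hybrid: {hybrid:.0f} GB")  # ~197 vs ~50
```

Same context length, roughly a quarter of the cache -- and the SSM share of that number stays flat as context grows.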

The practical result: a 120B parameter model where the KV cache at 128K tokens fits in the memory headroom of a single H200 alongside the weights themselves. At 1M tokens the math gets harder, but the point is the scaling curve is no longer the quadratic cliff it would be for a pure transformer.

On top of this, Nemotron 3 Super is natively pretrained in NVFP4 -- not quantized after training, trained in 4-bit floating point from the start. Post-hoc quantization always introduces accuracy degradation because the model learned at high precision and is then compressed. Native NVFP4 pretraining means the model learned to be accurate under 4-bit arithmetic constraints from the first gradient update. The result is BF16-class accuracy at 4-bit memory and compute cost. On Blackwell, that is a 4x inference speed improvement over FP8 on Hopper.
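A toy round-trip shows why post-hoc 4-bit rounding hurts: the weights the model actually learned get snapped to a 16-level grid it never trained against. This is crude uniform quantization for illustration, not NVFP4's block-scaled FP4 format:

```python
def quantize_4bit(x, scale):
    """Round a weight to one of 16 levels -- a crude stand-in for a 4-bit grid."""
    level = max(-8, min(7, round(x / scale)))
    return level * scale

weights = [0.013, -0.231, 0.502, -0.077, 0.349]  # made-up learned weights
scale = 0.07
roundtrip = [quantize_4bit(w, scale) for w in weights]
errors = [abs(w - q) for w, q in zip(weights, roundtrip)]
print(f"max perturbation: {max(errors):.3f}")
```

Post-hoc, every weight absorbs a perturbation the training process never saw. Native 4-bit pretraining folds that grid into the loss from the first gradient update, so the model learns weights that are already where the grid can represent them.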

It also has LatentMoE -- before tokens reach the expert networks, they are projected into a compressed latent space for routing. This lets the model activate 4x more experts at the same compute cost compared to standard MoE routing. More experts contributing to each token means higher quality per forward pass without proportional VRAM or compute increase.
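The routing idea can be sketched as follows. The projection and scoring details here are my assumption of how a latent-space router could look, not NVIDIA's published design -- the sketch only shows why routing in a compressed space is cheap:

```python
import random
random.seed(1)

MODEL_DIM, LATENT_DIM, EXPERTS, TOP_K = 64, 16, 32, 8  # toy sizes

# Routing happens in a compressed latent space: project the token down,
# then score experts there. Router cost scales with LATENT_DIM, not MODEL_DIM.
down_proj = [[random.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(MODEL_DIM)]
expert_keys = [[random.gauss(0, 1) for _ in range(LATENT_DIM)] for _ in range(EXPERTS)]

def route(token):
    latent = [sum(t * w for t, w in zip(token, col)) for col in zip(*down_proj)]
    scores = [(sum(l * k for l, k in zip(latent, key)), i)
              for i, key in enumerate(expert_keys)]
    return [i for _, i in sorted(scores, reverse=True)[:TOP_K]]

token = [random.gauss(0, 1) for _ in range(MODEL_DIM)]
chosen = route(token)
assert len(chosen) == TOP_K
```

Because scoring happens in the 16-dim latent space instead of the 64-dim model space, the router can afford to score (and select) many more experts for the same budget.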

Plus native multi-token prediction, which functions as built-in speculative decoding without a separate draft model -- the model predicts multiple future tokens per pass inherently, because it was trained that way.
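The acceptance rule that makes this work is the standard speculative-decoding one, sketched below with hardcoded token IDs. With native MTP the draft comes from the same model's extra prediction heads rather than a separate draft model, but the verify-and-accept step looks the same:

```python
def accept_prefix(drafted, verified):
    """Keep drafted tokens while the verifying pass agrees with them,
    plus the first corrected token where they diverge."""
    out = []
    for d, v in zip(drafted, verified):
        out.append(v)
        if d != v:
            break
    return out

# One pass drafted 4 future tokens; the verify pass agreed on the first 2.
drafted  = [17, 3, 88, 9]
verified = [17, 3, 42, 60]
print(accept_prefix(drafted, verified))  # [17, 3, 42] -- 3 tokens from one verify pass
```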

The thing I want to sit with: these two architectures are solving the same root problem from opposite ends.

Gemma 4's PLE is saying: not all parameters need to live on the accelerator. Some parameters -- specifically, embedding tables that are accessed via lookup rather than matrix multiply -- can live in system memory and be pulled into the compute path cheaply. Split the memory hierarchy deliberately, by parameter type, and you buy yourself accelerator headroom for the parameters that actually need to be there.

Nemotron 3's SSM hybrid is saying: not all context needs to grow a quadratic cache. The memory that accumulates as you process longer sequences -- replace most of it with a fixed-size recurrent state, and the memory wall stops being a wall.

Both of them are saying: the assumption that capability scales proportionally with memory footprint is wrong, and we built architecture to prove it.

The conventional wisdom was: bigger model means more memory. More context means more memory. These are true for standard transformers. They are increasingly not true for the architectures shipping in 2026.

What this means for the on-prem and on-device question, which is where most of the interesting deployment decisions are being made right now:

A Gemma 4 E2B running in WebGPU on a user's laptop is inference that costs zero marginal compute, has zero latency for the network hop, has zero data privacy risk, and works offline. The quality ceiling is lower than a 70B cloud model -- but for a substantial class of tasks (document summarization, form extraction, coding assistance, local search), the quality is sufficient and the deployment economics are incomparably better.

A Nemotron 3 Super running on a single H200 on-prem is 12B active parameters, 1M context, frontier reasoning capability, fully air-gapped, for the cost of owning one GPU server. For enterprises where data sovereignty is non-negotiable -- legal, medical, financial, government -- this is the first time a single on-prem GPU can run a model with the context and capability to handle production agentic workloads.

Six months ago neither of these statements was true. The browser inference story was "small models, limited context, toy quality." The single-GPU story was "you can run inference but not frontier-class reasoning at meaningful context lengths."

Both changed in the last 45 days.

the memory wall isn't gone.

it bent.

gemma 4 bent it for the browser by splitting parameters across the memory hierarchy by type. nemotron 3 bent it for on-prem by replacing quadratic context scaling with linear for most of the stack.

two architectures. same insight. the relationship between capability and memory is not fixed -- it is a design choice.

the interesting inference deployments of 2026 are not the ones running on 288-gpu clusters. they are the ones running on hardware you already own, in browsers that cost nothing, doing things that weren't possible three months ago.

P.S. The vLLM chunked prefill interaction with Nemotron 3 Super's SSM layers is a real production gotcha -- SSM layers cannot correctly initialize their recurrent state across chunk boundaries without special handling, so you must pass --no-enable-chunked-prefill until your specific vLLM version has validated support. Enabling chunked prefill on a hybrid SSM-Transformer model without this check is not a performance issue. It is a correctness issue. Your outputs will be wrong and the failure mode is silent. Verify your vLLM version before deploying.
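For concreteness, a serve invocation with the flag from the note above. The flag name comes from the P.S.; the model identifier and other arguments are placeholders -- check your vLLM version's docs before relying on any of this:

```shell
# Disable chunked prefill until your vLLM version has validated
# support for hybrid SSM-Transformer models (correctness, not perf).
vllm serve <your-nemotron-3-super-checkpoint> \
  --max-model-len 1000000 \
  --no-enable-chunked-prefill
```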

i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.

no spam. no sequence. just the note, when it exists.