the transformer isn't dying. it's getting a co-pilot.

Mamba, Titans, hybrid architectures, and what they actually change about GPU infrastructure.

February 2, 2026

I spent the better part of three weeks reading architecture papers trying to understand if Mamba, Titans, and the hybrid models actually change how I think about GPU infrastructure.

The answer is yes. But not in the way most people are describing it.

The takes I keep seeing frame this as a competition -- SSMs vs transformers, new vs old, the death of attention. That framing is wrong and it's making people miss what is actually interesting about what's happening right now. Let me try to say it more carefully.

Start with Mamba, because it's the cleanest case study in what these architectures actually do to hardware.

A transformer generates tokens autoregressively. Each new token requires the model to attend over every previous token -- which means reading the entire KV cache from HBM on every step. The KV cache grows with every token generated. The memory bandwidth requirement grows with it. This is the memory wall I've written about before: decode is memory-bound, the GPU sits idle waiting for data movement, your $30,000 H100 runs at 4% compute utilization.

Mamba replaces the KV cache with a fixed-size hidden state. Instead of storing every previous token and attending over all of them, the SSM compresses the entire sequence history into a constant-size representation that gets updated recurrently. The memory footprint at inference doesn't grow. A 220K token sequence and a 2K token sequence have identical memory requirements at decode time. That is a real architectural advantage. It does not mean the memory problem is solved.
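To make the contrast concrete, here's a back-of-envelope sketch. The dimensions are made-up but plausible values for 7B-class models, not any specific model's config:

```python
# Transformer decode re-reads a KV cache that grows with the sequence;
# a Mamba-style SSM reads a recurrent state whose size ignores seq_len.
# All dims below are illustrative, fp16 assumed.
BYTES = 2  # fp16

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128):
    # K and V per layer, each [seq_len, n_kv_heads, head_dim]
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * BYTES

def ssm_state_bytes(seq_len, n_layers=64, d_inner=5120, d_state=16):
    # seq_len deliberately unused: the state is constant-size
    return n_layers * d_inner * d_state * BYTES

for n in (2_000, 220_000):
    print(f"{n:>7} tokens: KV cache {kv_cache_bytes(n)/1e9:6.1f} GB, "
          f"SSM state {ssm_state_bytes(n)/1e6:5.1f} MB")
```

At 2K tokens the cache is about a gigabyte; at 220K it's over a hundred gigabytes, read from HBM on every generated token. The SSM state stays in the tens of megabytes regardless.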

Here's the thing nobody is saying clearly: the hidden state update is still memory-bound.

You replaced one memory-bound operation with a different memory-bound operation. The SSM state update is an outer-product computation -- loading the state, loading the input, writing the updated state. The arithmetic intensity is low. The GPU is still waiting for memory. The wall moved. It didn't disappear. For sequences where the KV cache was the bottleneck -- very long contexts -- Mamba wins. For shorter sequences where both architectures are within manageable memory budgets, the transformer's precision often wins on quality. You traded one constraint for a different constraint at a different sequence length.
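A quick way to see why the state update stays memory-bound: an outer product of an m-vector and an n-vector does m*n multiplies but moves m + n + m*n elements, so its arithmetic intensity is pinned below one FLOP per element moved no matter how large you make it. A rough sketch:

```python
# Arithmetic intensity (FLOPs per byte) of an outer-product update:
# read both input vectors, write the full m*n output.
def outer_product_intensity(m: int, n: int, bytes_per_elem: int = 2) -> float:
    flops = m * n                              # one multiply per output element
    moved = (m + n + m * n) * bytes_per_elem   # read inputs, write output
    return flops / moved

# Intensity never exceeds 1/bytes_per_elem, far below the ~300 FLOP/byte
# ridge point of an H100-class GPU -- deeply memory-bound at any size.
print(outer_product_intensity(128, 16))
print(outer_product_intensity(4096, 4096))
```

No amount of kernel tuning fixes that; the shape of the computation itself caps the intensity.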

Mamba-3 understands this, which is why it's the first version I think is genuinely interesting from an infrastructure perspective.

The MIMO upgrade -- switching from single-input single-output to multi-input multi-output state updates -- converts the outer-product computation into a matrix multiplication. That is not a small change. Matrix multiplications are what tensor cores are built for. You increased the arithmetic intensity of the state update by restructuring the computation graph. The GPU stops waiting for memory and starts doing math. This is the exact same move FlashAttention made for transformers in 2022 -- not a new algorithm, a hardware-aware reimplementation of an existing algorithm that moves the operation from the memory-bound to the compute-bound regime. Mamba-3 applied that same insight to SSMs. The "cold GPU problem" -- hardware sitting idle during decode because memory movement dominates -- is what Mamba-3 specifically targets.
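The restructuring matters because a matmul's arithmetic intensity grows with the shared inner dimension, which an outer product (inner dimension of one) never gets. A sketch of the general shape, not Mamba-3's actual kernel:

```python
# Intensity of [m,k] @ [k,n]: 2*m*k*n FLOPs against m*k + k*n + m*n
# elements moved. k=1 is the outer-product case; larger k amortizes
# the memory traffic over more math.
def matmul_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    flops = 2 * m * k * n
    moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / moved

for k in (1, 16, 128):
    print(f"k={k:>3}: {matmul_intensity(256, k, 256):6.1f} FLOP/byte")
```

Same inputs, same outputs, but batching the update into a real matmul moves the operation up the roofline and onto the tensor cores.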

That is an infrastructure paper wearing a research paper's clothes.

Titans is weirder and more interesting and I'm still not sure what to do with it from a deployment perspective.

Google's architecture gives the model three types of memory operating simultaneously. Short-term memory is attention -- precise, expensive, limited to the current context window. Long-term memory is a small MLP that updates its weights during the forward pass based on a "surprise metric." Tokens that are unexpected relative to what the model has seen get memorized. Routine, predictable tokens get compressed or discarded. Persistent memory is fixed -- the weights from training that don't change at inference.
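A toy sketch of the mechanism, heavily simplified from the paper's formulation: a linear "memory" updated during the forward pass by a gradient step whose magnitude tracks the prediction error, i.e. the surprise. Everything here is illustrative, not Titans' actual update rule:

```python
import numpy as np

d = 8
k = np.ones(d) / np.sqrt(d)   # unit-norm key for the incoming token
v = np.ones(d)                # value the memory should associate with it
W = np.zeros((d, d))          # long-term memory weights, updated at inference
lr = 0.1

def memory_forward(W, k, v):
    pred = W @ k                    # what memory expects for key k
    err = v - pred                  # "surprise": large when v is unexpected
    W = W + lr * np.outer(err, k)   # gradient step on ||v - W @ k||^2
    return pred, err, W

_, err1, W = memory_forward(W, k, v)   # first sight: maximally surprising
_, err2, W = memory_forward(W, k, v)   # same token again: smaller surprise,
                                       # smaller weight update
```

The surprising token writes itself into the weights; the repeated, predictable one barely touches them. That is the part that looks like training to your scheduler.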

The thing that should stop you: the long-term memory module is running gradient descent at inference time.

A small MLP is updating its own weights on every forward pass based on how surprising the input is. This is not fine-tuning. This is test-time training embedded inside a single inference call. From a GPU scheduling perspective, you now have a workload that looks like training -- weight updates, gradient computations -- happening inside what your infrastructure believes is an inference request. The memory access pattern is different. The compute pattern is different. The thermal profile is different. The standard inference serving assumptions -- fixed model weights, stateless between requests, constant memory footprint per sequence -- none of them hold cleanly.

Titans outperforms GPT-4 on BABILong at a fraction of the parameter count. 2 million token context. Those numbers are real. The deployment question is: what does your inference infrastructure look like when the model is modifying its own weights while serving a request?

I don't have a clean answer. I have a lot of questions about memory isolation between concurrent requests, about what happens to the memory module state between requests from the same user, about whether the surprise-metric learning is deterministic enough to be reproducible. These are not research questions. They are infrastructure questions that nobody has answered publicly yet because nobody has deployed this at scale publicly yet.

The thing I'm most confident about is the hybrid result, because the ablation data is unambiguous.

Nemotron-H replaced 92% of attention layers with Mamba2 blocks. Three times the throughput of LLaMA at comparable size. Jamba 1.5 -- 398 billion total parameters, 94 billion active -- runs 256K context on hardware that couldn't handle that with pure attention. These are not paper-only benchmarks. These are production models from NVIDIA and AI21 with open weights you can run.

The interesting finding is the retrieval ablation. When researchers removed the attention layers entirely from hybrid models and replaced them with Mamba, retrieval accuracy dropped to zero. Not degraded. Zero. Mamba layers contribute nothing to needle-in-a-haystack retrieval. The attention layers are doing the entire job of precise information lookup.

What this means: attention and Mamba are not doing the same thing in these models. They are not interchangeable components where one is more efficient than the other. They are specialized modules solving different subproblems. Mamba handles bulk sequence processing -- compression, pattern recognition across long ranges, maintaining coherent state across hundreds of thousands of tokens. Attention handles precision retrieval -- finding the specific token or fact that matters right now, in the current context. The hybrid architecture is not a compromise between two approaches. It is a specialization that gives each module the workload it's actually good at.

The ratio that keeps appearing in the literature: one attention layer for every seven to ten Mamba layers. That ratio is not arbitrary. It reflects how often precision retrieval is required relative to bulk processing in typical language tasks. Different tasks will want different ratios. Code generation with heavy API lookup might want more attention. Long document summarization might want less. This is a new tunable parameter in model architecture that infrastructure engineers are going to need opinions about.
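In sketch form, the ratio is just a layer-composition knob. The names and helper here are hypothetical, not any real model's API:

```python
# Compose a hybrid stack with one attention layer per `attn_every`
# layers and Mamba blocks everywhere else. Layer types are labels;
# this only illustrates the tunable ratio.
def hybrid_stack(n_layers: int, attn_every: int = 8) -> list:
    return ["attention" if i % attn_every == attn_every - 1 else "mamba"
            for i in range(n_layers)]

stack = hybrid_stack(32, attn_every=8)
print(stack.count("attention"), "attention /", stack.count("mamba"), "mamba")
```

Turn `attn_every` down for retrieval-heavy workloads, up for bulk summarization. The point is that this is now a knob, and someone has to pick its value per deployment.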

The GPU engineer conclusion, stated as plainly as I can:

SSMs moved the memory wall -- they didn't remove it. The work Mamba-3 did on arithmetic intensity is the right direction and it directly parallels the FlashAttention work that transformed transformer inference. The hybrid architectures are real and shipping and the throughput improvements are not marginal. Titans is doing something genuinely different with test-time weight updates and nobody has publicly solved the deployment questions that creates.

The transformer is not being replaced. It's being used more precisely -- at the layers where attention is irreplaceable, combined with architectures that handle everything else more efficiently.

That is a more interesting outcome than one architecture winning.

the roofline model doesn't care what you call the architecture.

memory-bound is memory-bound. compute-bound is compute-bound.

the question for every new architecture is the same question it's always been: where does this operation land on the roofline, and what would it take to move it right.

mamba-3's answer to that question is better than mamba-2's. that's why it matters.

the hardware doesn't know it's supposed to be impressed. it just runs the kernels you give it.

i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.

no spam. no sequence. just the note, when it exists.