the H100 was designed for something most kernels don't do.
Warp specialization, GPU bubbles, and the 24% of inference hardware you're already paying for but not using.
April 5, 2026

I have been trying to explain warp specialization to a colleague for two weeks and I keep failing. Not because the concept is impossible to explain -- it isn't -- but because every explanation I give assumes the listener already knows something they don't, and when I back up to fix that I assume something else they don't, and eventually I'm explaining what a warp is and I've lost the original thread entirely.
Let me try again here, more carefully, because I think this is the most important performance gap in production GPU inference that almost nobody is talking about.
Start with the standard model of GPU execution.
A kernel launches. Threads are organized into warps of 32. Each warp executes the same instruction on different data -- that is what SIMT means, single instruction multiple thread. On every clock cycle, the warp scheduler picks a ready warp and issues its next instruction. A warp is "ready" when the operands for its next instruction are available. When a warp is waiting for data to arrive from HBM -- which takes hundreds of clock cycles -- the scheduler switches to a different warp. This is latency hiding: you tolerate the memory latency by having enough other warps to fill the clock cycles while some warps are waiting.
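The latency-hiding requirement has a back-of-envelope form. The numbers below are hypothetical round figures, not measurements from any specific GPU; the point is the shape of the arithmetic, not the values.

```python
# Back-of-envelope latency hiding: roughly how many resident warps does
# a scheduler need so that memory stalls never leave it idle?
# Numbers are illustrative, not measured on any real GPU.

def warps_to_hide_latency(mem_latency_cycles: int, issue_cycles_per_warp: int) -> int:
    """Little's-law style estimate: while one warp waits on HBM for
    `mem_latency_cycles`, the other warps must supply enough independent
    instructions to fill those cycles, each warp occupying the scheduler
    for `issue_cycles_per_warp` cycles between its own stalls."""
    return -(-mem_latency_cycles // issue_cycles_per_warp)  # ceiling division

# e.g. ~400-cycle HBM latency, ~8 issue cycles of independent work per warp
print(warps_to_hide_latency(400, 8))  # -> 50
```

When the per-warp independent work shrinks (tiny GEMVs) or the latency grows, the required warp count blows past what one SM can keep resident -- which is exactly when the classic model stops hiding anything.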
This model works. It has worked since NVIDIA introduced CUDA in 2007. It is the mental model almost everyone who programs GPUs carries around.
It is also not how the highest-performance kernels on Hopper and Blackwell work.
Hopper shipped with a feature called TMA -- Tensor Memory Accelerator. A dedicated hardware unit that handles bulk data movement from HBM to shared memory asynchronously, independently of the SM's compute units. While the TMA is loading data into shared memory, the compute units can be doing something else.
This creates a new possibility that the standard SIMT model doesn't capture: you can split the warps in a thread block by function. Some warps are designated as producers -- their job is to initiate TMA loads, wait for them to complete, and signal the consumers. Other warps are designated as consumers -- their job is to pull data from shared memory and run WGMMA (Warp Group Matrix Multiply Accumulate) instructions to do actual computation.
Producers and consumers run concurrently within the same thread block, synchronized through asynchronous barriers in shared memory.
This is warp specialization. And when it's implemented correctly, the compute and memory movement overlap completely -- while the consumers are computing on tile N, the producers are already loading tile N+1. The hardware is doing two things at once instead of one. The result is that you approach theoretical peak FLOP utilization even on kernels that are nominally memory-bound.
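The producer/consumer split has a faithful CPU-side analogy: two threads sharing a bounded buffer. This is not GPU code -- the queue stands in for double-buffered shared memory, `put`/`get` blocking stands in for the async barriers, and the loop bodies stand in for TMA loads and WGMMA calls.

```python
# CPU-side analogy of a warp-specialized pipeline (not GPU code):
# a producer "warp" stages tiles into a bounded buffer (the shared-memory
# stand-in) while a consumer "warp" computes on previously staged tiles.
import queue
import threading

N_TILES = 8
STAGES = 2                       # double-buffered "shared memory"
buf: queue.Queue = queue.Queue(maxsize=STAGES)
results = []

def producer():
    for i in range(N_TILES):
        tile = [i] * 4           # stand-in for a TMA load of tile i
        buf.put(tile)            # blocks when both stages are full
    buf.put(None)                # signal: no more tiles

def consumer():
    while (tile := buf.get()) is not None:
        results.append(sum(tile))    # stand-in for WGMMA on the tile

t_p = threading.Thread(target=producer)
t_c = threading.Thread(target=consumer)
t_p.start(); t_c.start()
t_p.join(); t_c.join()
print(results)  # -> [0, 4, 8, 12, 16, 20, 24, 28]
```

The `maxsize=STAGES` bound is the important detail: it is what keeps the producer exactly one to two tiles ahead instead of racing arbitrarily far in front, which is the same role the pipeline depth plays in a real warp-specialized kernel.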
FlashAttention-3 uses this. ThunderKittens is built around it. The Tawa compiler automates it. The "Optimal SWP and WS" paper (December 2025, NVIDIA) formulates the joint optimization of software pipelining and warp specialization as a constraint satisfaction problem and solves it with off-the-shelf solvers -- because the current state of the art for figuring out the right warp split ratios and pipeline depths is "brittle compilation heuristics and fallible human intuition."
That last phrase is from the paper. They used those exact words. The people who designed the hardware are describing the current state of programming it as fallible human intuition.
Here is why this matters for inference specifically.
Decode is memory-bandwidth-bound. Each token generation requires loading the full model weight matrices from HBM to feed a tiny GEMV operation. This is why the H100 at 4% compute utilization is operating correctly -- the compute is not the constraint, the memory bandwidth is.
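The bandwidth-bound claim follows from a one-line roofline argument. The specs below are approximate public H100 SXM numbers and the arithmetic assumes batch-1 decode with bf16 weights; treat it as illustrative, not a benchmark.

```python
# Why batch-1 decode is bandwidth-bound: a GEMV touches every weight
# byte once and does ~2 FLOPs (multiply + add) per weight element.
# Specs are approximate public H100 SXM figures, for illustration only.
HBM_BW = 3.35e12         # bytes/s, HBM3 bandwidth (approx.)
PEAK_BF16 = 989e12       # FLOP/s, dense tensor-core BF16 peak (approx.)
BYTES_PER_WEIGHT = 2     # bf16

intensity = 2 / BYTES_PER_WEIGHT         # FLOPs per byte moved: 1.0
compute_ceiling = HBM_BW * intensity     # FLOP/s reachable at this intensity
utilization = compute_ceiling / PEAK_BF16
print(f"{utilization:.1%} of peak compute")   # well under 1% for batch-1 GEMV
```

At one FLOP per byte, memory bandwidth caps achievable compute orders of magnitude below peak. Batching raises the intensity, which is why measured utilization in production sits in the low single digits rather than at this floor.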
But "memory-bandwidth-bound" does not mean "compute is idle." It means the current kernels are not overlapping memory movement with computation because they are written in the standard SIMT model where every warp does the same thing sequentially. Load. Compute. Load. Compute.
Warp-specialized kernels overlap the two: producers load tile N+1 while consumers compute on tile N. The timeline compresses. The effective throughput relative to the theoretical bandwidth ceiling improves because you are hiding compute latency inside memory latency instead of adding them.
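The timeline compression can be written down directly. Times and tile counts below are arbitrary units, chosen only to show the shape of the two schedules.

```python
# Timeline arithmetic for the overlap. Sequential kernels pay
# load + compute per tile; a producer/consumer pipeline pays one fill
# load, then the slower of the two stages per tile, then one drain
# compute. Units are arbitrary.
def sequential_time(n_tiles: int, t_load: int, t_compute: int) -> int:
    return n_tiles * (t_load + t_compute)

def pipelined_time(n_tiles: int, t_load: int, t_compute: int) -> int:
    # fill with the first load, run n-1 overlapped steps at the pace of
    # the slower stage, drain with the last compute
    return t_load + (n_tiles - 1) * max(t_load, t_compute) + t_compute

n, tl, tc = 100, 10, 4           # memory-bound case: loads dominate
print(sequential_time(n, tl, tc))   # -> 1400
print(pipelined_time(n, tl, tc))    # -> 1004
```

In the memory-bound case the pipelined total converges to just the load time: the compute has disappeared inside the memory latency, which is the whole trick.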
The practical result from the Tawa paper: 5-10% throughput improvement on GEMM kernels from persistent warp specialization alone, without changing the algorithm. 1.58x speedup on autoregressive decoding kernels compared to FlashInfer baseline, on a single B200 GPU.
"But those are small improvements for the complexity invol--" The complexity is why nobody is doing it. The gains are per-kernel and they compound across a full inference pass. A 10% improvement in GEMV throughput during decode is 10% more tokens per second at no additional hardware cost. At Anthropic's $30B revenue scale, 10% more throughput on the same fleet is not a small number.
The second thing I've been reading about is bubbles. Specifically GPU bubbles -- the idle time between kernels in a distributed inference deployment.
In production LLM inference under tensor parallelism (Llama-70B split across multiple GPUs), 24% of GPU time is idle in small bubbles. Not large bubbles from scheduling decisions -- microsecond-scale gaps between kernel launches, caused by device-host synchronization for continuous batching metadata, token transfers for streaming responses, barrier overhead between NCCL collectives.
24% is not a rounding error. If you are running an inference cluster at scale, you are paying for GPUs to sit idle for a quarter of their time because of housekeeping overhead between the kernels that do the actual work.
Hummingbird (January 2026) attacks this by injecting best-effort work into the bubbles. The key observation: DNN inference kernels are mostly idempotent -- if you kill a kernel mid-execution and restart it, you get the same result, because there are no external side effects. This means you can preempt a best-effort kernel the moment a high-priority kernel needs to run, restart it from scratch when the bubble resumes, and lose only the partial work that kernel had done before it was killed.
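The idempotence argument is easy to make concrete. This is a hypothetical toy, not Hummingbird's actual runtime: a "kernel" whose output is a pure function of its input can be aborted at any point and rerun with no externally visible difference.

```python
# Sketch of why idempotence makes kill-and-restart safe: a best-effort
# "kernel" that publishes nothing until it finishes can be aborted
# mid-run and rerun from scratch with an identical result.
# Hypothetical example, not Hummingbird's API.

def best_effort_kernel(data, abort_after=None):
    """Pure reduction; no external side effects until completion."""
    acc = 0
    for i, x in enumerate(data):
        if abort_after is not None and i >= abort_after:
            return None          # preempted: partial work is simply discarded
        acc += x * x
    return acc                   # only a completed run publishes a result

data = list(range(1000))
killed = best_effort_kernel(data, abort_after=137)   # preempted mid-run
retried = best_effort_kernel(data)                   # restarted in the next bubble
print(killed, retried)  # -> None 332833500
```

The property that makes this work is the absence of in-place writes to externally visible state; a kernel that mutated its own input would need checkpointing instead of simple restart.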
The preemption mechanism is the hard part. NVIDIA's CUDA runtime doesn't expose the scheduling queues. Hummingbird wraps each CUDA stream in a virtual host queue that intercepts and buffers kernel launches, and exploits the GPU trap mechanism -- originally designed for debugging -- to kill running kernels at microsecond granularity. A thread block on Hopper/Blackwell runs for about 100-1000 microseconds. The scheduler can preempt at thread-block boundaries without saving warp state.
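The virtual-queue idea can be sketched host-side. This toy only models the interception and priority ordering, not the trap-based kill; it is a hypothetical illustration, not Hummingbird's design.

```python
# Sketch of a virtual host queue: intercept kernel launches into
# per-priority queues so high-priority inference work always drains
# first and best-effort work only runs in the gaps. A preempted
# best-effort kernel is re-enqueued whole (idempotent restart).
# Hypothetical toy, not Hummingbird's actual runtime.
from collections import deque

class VirtualStream:
    def __init__(self):
        self.high = deque()      # latency-critical inference kernels
        self.low = deque()       # best-effort, idempotent kernels
        self.log = []            # execution order, for inspection

    def launch(self, name: str, priority: str = "high") -> None:
        (self.high if priority == "high" else self.low).append(name)

    def run(self) -> None:
        while self.high or self.low:
            if self.high:                      # high priority always wins
                self.log.append(self.high.popleft())
            else:                              # bubble: harvest best-effort work
                self.log.append(self.low.popleft())

s = VirtualStream()
s.launch("decode_step_1")
s.launch("bg_eval_batch", priority="low")
s.launch("decode_step_2")
s.run()
print(s.log)  # -> ['decode_step_1', 'decode_step_2', 'bg_eval_batch']
```

The real system does this under the CUDA runtime, where "high priority always wins" additionally means killing an already-running best-effort kernel via the trap mechanism rather than just reordering a queue.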
The result: high-priority inference SLOs are maintained while best-effort work harvests the gaps. GPU utilization climbs toward 90%+ without adding hardware. The gaps that were paying for nothing are now paying for something.
These two problems -- warp specialization for individual kernel throughput, and bubble harvesting for cluster utilization -- are being solved at the same time, at the same hardware generation, for the same reason: the H100 and B200 architectures introduced enough programmability (TMA, async barriers, WGMMA) that these techniques became possible, and the scale of inference deployments at companies like Anthropic made the performance gaps expensive enough to fix.
The tooling is not there yet. Tawa automates warp specialization via compilation but requires manually specifying which operations are producers and which are consumers -- the fully automated version that takes a compute graph and emits optimal warp-specialized code is still a research problem. Hummingbird requires a custom runtime layer that wraps the CUDA runtime and exploits debugging APIs not intended for production use.
Both will be production tools within 18 months. The papers are already written. The implementations are running. The companies with the engineering resources to productionize them are the companies with the inference scale to make it worth the investment.
Everyone else will get it eventually via vLLM and Dynamo updates.
warp specialization.
producers load. consumers compute. they run at the same time in the same thread block.
this is what the h100's tensor memory accelerator was built to enable.
most kernels don't use it. most engineers have never heard of it.
the gap between "code that runs on the hardware" and "code that runs how the hardware was designed to run" is where the 10x improvements live.
it's always been there. it just requires knowing the hardware well enough to see it.
the interesting thing about the 24% idle number: at 76% useful time, serving the same traffic takes 1/0.76 ≈ 1.3x the hardware. we are already paying for roughly a third more gpus than the work requires. we just haven't built the systems to use what we already bought.
P.S. The ParallelKittens paper from Stanford (November 2025) extends this to multi-GPU kernels -- how to write kernels that span NVLink-connected GPUs with overlapping compute and communication, using the right data movement primitive for the job (copy engine vs TMA vs register-level instructions, each optimal at different message sizes). The data movement decision alone changes performance by 4x depending on which mechanism you pick and what size you're transferring. That paper has a figure that should be in every GPU platform team's internal wiki and almost nobody knows it exists.
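The mechanism-selection decision reduces to a size-based dispatch. The thresholds below are made up for illustration -- the paper's measured crossover points differ and depend on topology -- but the shape of the decision is the point.

```python
# Sketch of size-based transport dispatch like the one ParallelKittens
# motivates: each data-movement mechanism wins in a different transfer-size
# regime. Thresholds are invented for illustration; real crossover points
# must be measured on the actual hardware and interconnect.
def pick_transport(n_bytes: int) -> str:
    if n_bytes < 16 * 1024:           # tiny: thread-issued register-level stores
        return "register-level"
    if n_bytes < 4 * 1024 * 1024:     # mid-size: TMA bulk copies
        return "tma"
    return "copy-engine"              # huge: DMA copy engines

print(pick_transport(1024))           # -> register-level
print(pick_transport(256 * 1024))     # -> tma
print(pick_transport(64 * 1024**2))   # -> copy-engine
```

A 4x swing from picking the wrong branch of a three-way dispatch is exactly the kind of fact that belongs in an internal wiki figure.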
i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.
no spam. no sequence. just the note, when it exists.