4% compute utilization. everything working exactly as it should.
Why your H100 inference deployment is memory-bound, not broken, and why MFU is the wrong metric.
November 18, 2025

It was 9:43am on a Wednesday and I was staring at an H100 running at 4% compute utilization.
Not 40%. Not 14%. Four.
This was a production inference deployment. A real model. Real traffic. I had been told the cluster was "running well." It was generating tokens. Latency was acceptable. Nobody had opened Nsight Compute because everything looked fine from the outside.
I opened Nsight Compute. Everything was not fine.
Here is what I saw. In the decode phase, the warp stall analysis showed over 50% of attention kernel cycles stalled. Not computing. Waiting. The Nsight timeline showed high DRAM read activity running flat across the entire decoding step while compute utilization sat at 4% and occasionally spiked to 19% toward the end of each step before dropping again.
The warps were stalling because they had asked for data from HBM and HBM had not delivered it yet. The 32 threads in each warp advance in lockstep. One thread waits, all 32 wait. They were all waiting. Simultaneously. On almost every clock cycle.
I had an H100. The H100 has 989 TFLOPS of BF16 compute. I was using somewhere under 20 of them.
The machine is not the bottleneck. I am the only person in this story who was confused about that.
Here is the actual bottleneck, and it is not a bug and it is not a misconfiguration. It is physics.
LLM inference has two phases. Prefill: you process the input prompt all at once, every prompt token moving through the transformer stack in parallel. This is a GEMM. Big matrices. High arithmetic intensity. The GPU is doing many FLOPs per byte of data it moves. This is compute-bound. The H100's tensor cores are happy. This is what the H100 was built for.
Then decode begins. You generate output tokens one at a time. Each new token is one vector, not a matrix. A full weight matrix multiplied by one activation vector. A GEMV operation. The arithmetic intensity collapses. You are moving billions of bytes of model weights from HBM to feed a tiny amount of computation.
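Here is a minimal back-of-envelope sketch of that collapse. It assumes BF16 weights (2 bytes per element), ignores activation and KV-cache traffic, and the dimensions are illustrative, not taken from any particular model:

```python
# Rough arithmetic intensity (FLOPs per byte) for one weight matrix.
# Assumes BF16 weights (2 bytes/element); activations and KV cache ignored.

def gemm_intensity(n_tokens: int, d: int) -> float:
    """Prefill: (n_tokens x d) @ (d x d). The weights are read once for all tokens."""
    flops = 2 * n_tokens * d * d          # one multiply-accumulate per output element
    bytes_moved = 2 * d * d               # the weight matrix dominates the traffic
    return flops / bytes_moved            # roughly n_tokens

def gemv_intensity(d: int) -> float:
    """Decode: (1 x d) @ (d x d). The whole weight matrix feeds a single vector."""
    flops = 2 * d * d
    bytes_moved = 2 * d * d
    return flops / bytes_moved            # roughly 1

print(gemm_intensity(n_tokens=2048, d=8192))  # ~2048 FLOPs/byte: compute-bound
print(gemv_intensity(d=8192))                 # ~1 FLOP/byte: memory-bound
```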
The roofline model makes this explicit. Plot arithmetic intensity, FLOPs per byte, on the x-axis. Attainable performance on the y-axis. To the left of the ridge: you are memory-bound, limited by how fast data moves. To the right: compute-bound, limited by how fast math happens. The ridge for an H100 is at about 295 FLOPs per byte.
A decode step has arithmetic intensity of roughly 1 to 5 FLOPs per byte. Depending on batch size. Depending on model size.
You are operating 60 to roughly 300 times to the left of the ridge. You are firmly, deeply, structurally in the memory-bound regime. The 989 TFLOPS of compute is not the constraint. The 3.35 TB/s of HBM bandwidth is the constraint. And you are saturating it with weight loads, not computing anything interesting with most of them.
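Plugging the numbers into the roofline formula makes the ceiling concrete. A sketch, using the H100 SXM peak figures quoted above; your exact ceiling depends on the part and sustained clocks:

```python
# Roofline: attainable throughput = min(peak compute, intensity * peak bandwidth).

PEAK_FLOPS = 989e12       # H100 SXM, dense BF16 tensor core, FLOPs/sec
PEAK_BW    = 3.35e12      # H100 SXM, HBM3, bytes/sec

ridge = PEAK_FLOPS / PEAK_BW          # ~295 FLOPs/byte

def attainable_tflops(intensity: float) -> float:
    return min(PEAK_FLOPS, intensity * PEAK_BW) / 1e12

print(f"ridge point: {ridge:.0f} FLOPs/byte")
for ai in (1, 5, 295):
    print(f"AI={ai:>3}: {attainable_tflops(ai):7.1f} TFLOPS attainable")
# AI=  1 ->   ~3.4 TFLOPS (0.3% of peak, and that is the physical ceiling)
# AI=  5 ->  ~16.8 TFLOPS
# AI=295 -> ~988 TFLOPS (only at the ridge do the tensor cores matter)
```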
"But if you increase the batch size you—" Yes. Larger batches push matmul kernels rightward on the roofline. The matrix multiplications gain arithmetic intensity as the batch dimension grows. More FLOPs per byte. That is correct.
The attention kernel does not move. Its arithmetic intensity stays nearly constant regardless of batch size. You are attending each token over the entire KV cache. The KV cache grows with batch size. You are reading more memory, not doing more math per byte. The attention mechanism stays pinned in the memory-bound regime while the matmuls climb toward the ridge. At large batch sizes you have DRAM saturation as the dominant bottleneck inside attention, and a throughput plateau as the ceiling. Not a compute plateau. A bandwidth plateau.
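A toy model of why, assuming BF16 weights and KV cache and counting only the dominant traffic. The dimensions are illustrative:

```python
# Why batching helps the matmuls but not decode attention (illustrative, BF16 everywhere).

def matmul_intensity(batch: int, d: int) -> float:
    """Batched decode matmul: (batch x d) @ (d x d). The weights amortize over the batch."""
    flops = 2 * batch * d * d
    weight_bytes = 2 * d * d
    act_bytes = 2 * 2 * batch * d            # read input vectors, write output vectors
    return flops / (weight_bytes + act_bytes)

def attention_intensity(batch: int, seq_len: int, d: int) -> float:
    """Decode attention: every sequence reads its own KV cache. Nothing amortizes."""
    flops = batch * 4 * seq_len * d          # QK^T plus PV, summed over heads
    kv_bytes = batch * 2 * 2 * seq_len * d   # K and V, 2 bytes each
    return flops / kv_bytes                  # ~1, independent of batch size

for b in (1, 32, 256):
    print(b, round(matmul_intensity(b, 8192), 1), round(attention_intensity(b, 4096, 8192), 1))
# matmul intensity climbs roughly with batch size; attention stays pinned near 1
```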
I ran the Nsight Compute roofline analysis. The attention kernels were so far left on the chart I had to check the axis scale twice.
The H100 is the wrong machine for this problem. Well, maybe not wrong. It is the right machine for training. It is the right machine for prefill. It is an extraordinarily expensive, severely underutilized machine for autoregressive decode, which is the phase that determines the user's experience.
Groq built a completely different chip to address this. Not faster at compute. Faster at memory bandwidth. Their LPU design prioritizes streaming model weights from on-chip SRAM at hundreds of terabytes per second rather than building more tensor cores. The bet is that decode inference is never going to become compute-bound, so the right hardware choice is to move memory faster, not add more FLOPs.
Cerebras made a similar bet. Wafer-scale SRAM. No HBM. No memory wall. Different constraints, different tradeoffs, same diagnosis: the bottleneck in autoregressive decode is not computation. It is data movement.
The CUDA ecosystem built its entire value proposition on FLOPS. More tensor cores. Higher precision throughput. Bigger matrix multiplications. All of that is correct for training. For prefill. For any workload that is genuinely compute-bound.
Decode is not. Decode is a streaming memory workload wearing a deep learning costume.
This is what makes FlashAttention worth understanding at the kernel level and not just as a box to check in your framework config. FlashAttention does not make attention faster by doing less math. It tiles the attention computation so that intermediate results stay in SRAM instead of being written out to HBM and read back. It fuses the softmax and the matrix multiplications into a single kernel pass. When it is working correctly, HBM bandwidth utilization during attention drops 50 to 80 percent while SM utilization increases. You are doing the same math. You are moving dramatically less data.
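To get a rough sense of how much less, here is a scaling sketch, not the FlashAttention paper's exact I/O analysis. It assumes BF16 tensors, counts only the dominant reads and writes, and idealizes the fused kernel as touching each of Q, K, V and the output in HBM exactly once:

```python
# Approximate HBM traffic per attention layer, in bytes (dominant terms only).

def naive_attention_hbm(seq_len: int, n_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    qkv = 3 * seq_len * n_heads * head_dim * dtype_bytes       # read Q, K, V
    scores = 2 * n_heads * seq_len * seq_len * dtype_bytes     # write, then re-read S = QK^T
    probs = 2 * n_heads * seq_len * seq_len * dtype_bytes      # write, then re-read softmax(S)
    out = seq_len * n_heads * head_dim * dtype_bytes           # write the output
    return qkv + scores + probs + out

def fused_attention_hbm(seq_len: int, n_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    # Idealized fused kernel: tiles of S and the softmax statistics live in SRAM,
    # so only Q, K, V and the output ever cross the HBM boundary.
    return 4 * seq_len * n_heads * head_dim * dtype_bytes

n, h, d = 8192, 64, 128
print(naive_attention_hbm(n, h, d) / fused_attention_hbm(n, h, d))  # roughly 65x less traffic
```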
"But Flash Attention is already enabled in—" Is it? Open Nsight Compute. Look at HBM read bandwidth during your attention kernels. If it is not dropping during attention computation compared to your matmul kernels, it is not working the way you think.
I have found this misconfiguration in three separate production deployments. Not because engineers are careless. Because the inference engine enables FlashAttention by default, the unit tests pass, the latency number is acceptable, and nobody opens the profiler to verify the kernel-level behavior.
The profiler is the instrument. The latency metric is a shadow on the wall. They are not the same thing.
There is one more thing I want to say about MFU and why it is the wrong metric for most inference workloads.
MFU is Model FLOPs Utilization. Achieved FLOPs divided by theoretical peak FLOPs. Expressed as a percentage. It became the standard metric for measuring how well you are using a GPU. For training, it is the right metric. Training is compute-bound. If your MFU is 40%, you are using 40% of available computation.
For decode inference, MFU is measuring the wrong dimension. Decode is memory-bound. Peak FLOPs is not your ceiling. Peak memory bandwidth is. A decode step with 4% MFU and 95% Memory Bandwidth Utilization is a correctly-running decode step that has saturated the actual bottleneck. The 96% of FLOPs you are not using are not wasted. They are simply not relevant to your constraint.
Databricks introduced MBU, Memory Bandwidth Utilization, as a complementary metric for exactly this reason. MBU is achieved memory bandwidth divided by theoretical peak memory bandwidth. When MBU approaches 100% while MFU stays low, you have confirmed that memory bandwidth is your ceiling and your system is operating correctly within that ceiling.
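A sketch of the two metrics side by side, following the definitions above. It assumes BF16 weights streamed once per decoded token and ignores KV-cache and activation traffic; the 70B parameter count and 22 tokens per second are made-up illustrative numbers, not measurements from the deployment in this post:

```python
# MFU vs MBU for a decode-heavy workload (sketch; ignores KV-cache and activation traffic).

PEAK_FLOPS = 989e12      # H100 SXM, dense BF16, FLOPs/sec
PEAK_BW    = 3.35e12     # H100 SXM, HBM3, bytes/sec

def decode_mfu_mbu(params: float, tokens_per_sec: float, bytes_per_param: int = 2):
    achieved_flops = 2 * params * tokens_per_sec                 # ~2 FLOPs per parameter per token
    achieved_bytes = bytes_per_param * params * tokens_per_sec   # weights stream once per token
    return achieved_flops / PEAK_FLOPS, achieved_bytes / PEAK_BW

# 70B parameters in BF16: the bandwidth ceiling alone caps single-stream decode at
# roughly PEAK_BW / (2 bytes * 70e9) ~= 24 tokens/sec, no matter how many TFLOPS sit idle.
mfu, mbu = decode_mfu_mbu(params=70e9, tokens_per_sec=22)
print(f"MFU {mfu:.1%}  MBU {mbu:.1%}")   # ~0.3% MFU, ~92% MBU: not broken, saturated
```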
The teams measuring only MFU in decode inference are running a fuel gauge on a car that is not limited by fuel. They see 4% and think something is broken. The car is running fine. The metric is wrong.
I spent six hours in Nsight Compute on that Wednesday. What I found was not a broken deployment. It was a correctly-running deployment that nobody had ever explained to themselves at the hardware level.
The H100 was doing exactly what physics allowed it to do. 3.35 TB/s of bandwidth. Saturated. Decode tokens streaming out at the rate that bandwidth permits.
The 989 TFLOPS sat idle. Waiting for a workload they were built for.
the number that matters for decode is not tflops.
it is terabytes per second.
if you have never opened nsight compute on your inference deployment, you do not know what is actually happening on your hardware. you know what the dashboard says.
those are not the same thing.
4% compute utilization. everything working exactly as it should.