
Notes

i wrote these. not a model. not a prompt. not a template. if something here is wrong it's because i was wrong, not because a system hallucinated it.

no schedule. no algorithm. the note, when it exists.

I went looking for what was below SASS. Found control codes. Went deeper. Found microcode. Then found the paper that explains why what I was seeing makes sense.

June 25, 2026

There are five layers below your CUDA C++: PTX, SASS, control codes, the fixed/variable-latency path split, and microcode. Most engineers know the first two. On Maxwell and Pascal, SASS ships in 32-byte bundles: three 64-bit instructions plus one 64-bit control word that carries 21 bits per instruction -- stall count, yield bit, read/write barriers, wait mask, reuse flags. Volta and later widen every instruction to 128 bits and fold the same fields into the instruction itself. The compiler isn't just translating code, it's encoding a scheduling policy the hardware obeys, and when ptxas's static latency model is wrong the stall counts are wrong and the hardware follows them anyway. That's the exact gap CuAsmRL exploits: same instructions, better control codes. Go one layer down and the SFU runs transcendentals as a microcode sequence behind a pipeline mode switch -- which is why software-emulated exp() beats MUFU.EXP on Blackwell. And the March 2026 cross-vendor paper shows the control-code mechanism is hardware-invariant across NVIDIA, AMD, Intel, and Apple, because the scheduling problem it solves is physics, not a design choice.
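To make "same instructions, better control codes" concrete, here is a minimal sketch of unpacking and re-packing one instruction's scheduling fields. It assumes the bit order the open-source maxas assembler documents for Maxwell (stall 4 bits, yield 1, write barrier 3, read barrier 3, wait mask 6, lowest bits first); other architectures may place the bits differently. `ControlFields`, `decode_control`, and `encode_control` are hypothetical names, not part of any NVIDIA tool.

```python
from typing import NamedTuple


class ControlFields(NamedTuple):
    stall: int          # issue-stall count, 4 bits
    yield_flag: int     # yield bit, 1 bit
    write_barrier: int  # write-barrier slot, 3 bits
    read_barrier: int   # read-barrier slot, 3 bits
    wait_mask: int      # barriers to wait on, 6 bits


def decode_control(bits: int) -> ControlFields:
    """Unpack the low 17 bits of one per-instruction control field.

    Assumed layout (Maxwell, per maxas); not an official specification.
    """
    return ControlFields(
        stall=bits & 0xF,
        yield_flag=(bits >> 4) & 0x1,
        write_barrier=(bits >> 5) & 0x7,
        read_barrier=(bits >> 8) & 0x7,
        wait_mask=(bits >> 11) & 0x3F,
    )


def encode_control(f: ControlFields) -> int:
    """Pack the fields back into the same bit positions."""
    return (
        (f.stall & 0xF)
        | (f.yield_flag & 0x1) << 4
        | (f.write_barrier & 0x7) << 5
        | (f.read_barrier & 0x7) << 8
        | (f.wait_mask & 0x3F) << 11
    )


# Same instruction, different scheduling: only the stall count changes.
original = decode_control(0x7E6)
retuned = original._replace(stall=2)
print(original, hex(encode_control(retuned)))
```

The opcode never appears here. A schedule search in this style only rewrites fields like `stall`, which is why the instruction stream can stay byte-identical apart from the control bits.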

DiffusionGemma doesn't accelerate text generation by being a smarter model. It accelerates it by using GPU hardware in a completely different mode.

June 24, 2026

Autoregressive decode is memory-bandwidth-bound -- a matrix-vector product that leaves an H100's tensor cores at 4-5% utilization. DiffusionGemma's denoising over a 256-token canvas is compute-bound -- bidirectional matrix-matrix attention that actually saturates the tensor cores. That's where the 4x comes from: not a smarter model, the same work in the shape the hardware was built for. 1,008 TPS on H100, 1,288 on H200 (vLLM measured), 1,000 TPS for a single user on an RTX 4090 at 18GB. Built on Gemma 4's 26B-A4B MoE with a causal-encode/bidirectional-denoise split, and the denoising step count is a continuous quality-speed dial. Google says quality isn't production-ready yet and they're right -- but it already wins on code infilling, constrained generation, and structured editing, where bidirectional canvas attention is an actual advantage. Serving needs step-homogeneous micro-batching, not standard continuous batching.
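A rough way to see the memory-bound vs compute-bound split is arithmetic intensity: FLOPs per byte moved. The sketch below is for a single linear layer, not the attention kernel itself. The 4096x4096 layer shape and 2-byte elements are assumptions picked for illustration, and `arithmetic_intensity` is a hypothetical helper.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for a (tokens x d_in) @ (d_in x d_out) matmul.

    Counts one read of the weights and activations and one write of the
    output; ignores caches and kernel fusion.
    """
    flops = 2 * tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + tokens * d_in + tokens * d_out)
    return flops / bytes_moved


# One token per step (autoregressive decode): a matrix-vector product.
decode = arithmetic_intensity(tokens=1, d_in=4096, d_out=4096)

# A 256-token canvas per step: a matrix-matrix product over the same weights.
canvas = arithmetic_intensity(tokens=256, d_in=4096, d_out=4096)

print(f"decode: {decode:.2f} FLOP/byte, canvas: {canvas:.1f} FLOP/byte")
```

Under these assumptions the weights are read once either way, so the 256-token step does roughly 200x more math per byte moved. That ratio, not model quality, is what moves the workload toward the tensor cores.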

Going from batch size 33 to 34 on an H100 SXM5 more than doubles your decode attention latency.

June 23, 2026

Going from batch size 33 to 34 on an H100 SXM5 more than doubles decode attention latency -- the wave quantization cliff. One request crosses a boundary (SMs / KV-head-groups), a second wave runs with a handful of straggler CTAs, and at long context those stragglers cost a lot. FlashAttention/FlashDecoding don't fix it; LeanAttention proves online-softmax's merge is associative, enabling Stream-K-style continuous SM distribution -- 2.18x at 256K context. Log your batch size against TPOT; the cliff is in your serving config right now.
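The cliff falls out of a ceiling division. A minimal sketch, assuming one CTA per (request, KV-head-group) pair, one resident CTA per SM, and 4 KV-head-groups per request (the value that puts the boundary at 33 on a 132-SM H100 SXM5). `wave_count` is a hypothetical helper; real kernels may launch a different grid.

```python
import math


def wave_count(batch_size: int, kv_head_groups: int, num_sms: int,
               ctas_per_sm: int = 1) -> tuple[int, int]:
    """Return (waves, tail_ctas) for a decode-attention launch.

    tail_ctas is the number of CTAs in a partially filled last wave,
    or 0 when the last wave is full.
    """
    ctas = batch_size * kv_head_groups
    slots = num_sms * ctas_per_sm
    return math.ceil(ctas / slots), ctas % slots


for batch in (32, 33, 34):
    waves, tail = wave_count(batch, kv_head_groups=4, num_sms=132)
    print(f"batch={batch}: waves={waves}, tail CTAs={tail}")
```

At batch 33 the launch is exactly one full wave (132 CTAs). At batch 34 it becomes two waves, and the second one keeps only 4 of 132 SMs busy while each of those CTAs still walks the full context.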

Companies are paying for 20x more GPU capacity than their workloads use. The number is worse than last year. The year before that it was worse than the year before that.

June 21, 2026

Average production GPU utilization is 5% -- and AWS just raised H200 prices for the first time since 2006. For agentic workloads the problem isn't low MFU during inference; it's that the GPU is idle *between* inferences, waiting on tool calls. Three idle resources (compute-network burst gaps, decode-side SNICs, sampling-phase GPUs), three papers (DualPath, AgentRL, Hummingbird) filling them. AgentRL hits 93.2% vs veRL's 45.2% on the same hardware. 5% utilization is a scheduling problem, not a hardware one.

ptxas generates SASS from your PTX. ptxas is a heuristic compiler. The SASS it generates is not optimal. Nobody has attacked this gap until now.

June 19, 2026

ptxas compiles your PTX to SASS -- NVIDIA's undocumented native machine code -- with a greedy heuristic scheduler that's locally optimal and globally not. Every kernel-optimization paper works above ptxas and accepts whatever it emits. CuAsmRL (arXiv:2501.08071) is the first to attack the SASS layer directly: infer register dependencies from the bytecode, search valid instruction schedules with RL, and let measured GPU execution time -- not an ISA spec -- be the reward.

NVIDIA built a Triton backend targeting their own hardware. That's not a concession. It's a tell.

June 16, 2026

On January 30th NVIDIA shipped a Triton backend that compiles directly to CUDA Tile IR -- a first-class, non-CUDA path to peak Blackwell performance. Every article framed it as developer outreach. It's defense. Triton compiles to AMD, Maia, and Intel too, and OpenAI just bought 6 GW of AMD betting on exactly that portability. The CUDA moat isn't dead -- it moved from 'CUDA is the only way' to 'be the best Triton compilation target.'

The number Microsoft hasn't published is what 30% better tokens per dollar means when the model wasn't designed for Maia.

June 15, 2026

Anthropic is in early talks to run Claude on Microsoft's Maia 200 via Azure -- the first external customer for a chip co-designed with OpenAI for GPT-style models. Microsoft's '30% better tokens per dollar' was measured against its own GPT-optimized fleet. The open question is whether that 30% holds for Claude. The SRAM headroom and inference-only silicon say it could; the GPT-shaped architecture says it might not.

Git was designed for how humans use repos. Agents use repos completely differently. I spent the last few months building something for the second use case.

June 14, 2026

Ledge is a git server rebuilt for agent workloads. Point a stock git client at it -- no plugins, no protocol changes. Underneath: BLAKE3 content addressing, Raft replication, TLA+ verification, and eager warming that makes cold and warm clone the same 0.13s. Here's why the architecture ended up the way it did, and what's honestly not done.

HBM is 5-10x more expensive than conventional DRAM per gigabyte. The reliability constraint is why. The reliability constraint is also looser than you think.

June 13, 2026

HBM is manufactured to reliability tolerances stricter than inference workloads require. Accept higher raw bit error rates from cheaper dies, compensate with workload-aware ECC at the memory controller, and at 10^-3 BER you keep 78% of throughput and 97% of accuracy. The cost reduction comes from looser manufacturing tolerances. At Fable 5 scale, that gap is a budget line item.

128,000 output tokens per request. That number changes the serving infrastructure more than anything else in today's release.

June 9, 2026

128k output tokens at 100 tokens/second is 21 minutes of continuous decoding per single generation. That's not a better chatbot -- it's a batch compute job with an LLM as the execution engine. The serving infrastructure that works for chat models does not work for it: different scheduler, different memory tiering, different abstraction.
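The arithmetic, as a small sketch. `decode_minutes` is a hypothetical helper, and it assumes a constant decode rate with no queueing or preemption.

```python
def decode_minutes(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock minutes of continuous decoding for one generation."""
    return output_tokens / tokens_per_second / 60


full_length = decode_minutes(128_000, 100)  # the post's case
chat_length = decode_minutes(500, 100)      # assumed chat-sized reply, for contrast

print(f"{full_length:.1f} min vs {chat_length * 60:.0f} s")
```

A scheduler tuned for replies that finish in seconds is now holding a decode slot, and its KV cache, for over twenty minutes per request.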

Three things shipped in vLLM and SGLang this week that nobody has described as a system.

June 9, 2026

TurboQuant 2-bit KV cache, FlashAttention-4 as the default MLA backend, and Skip-Softmax attention all shipped in vLLM and SGLang this week. Separately, three changelog entries. Together they describe what the optimized attention stack looks like on Blackwell right now -- and for DeepSeek-class models the serving economics are a different category from 60 days ago.

World model teams had a 40ms constraint. LLM teams had 200ms. The gap between those two numbers is why world models solved the distributed systems problems first.

June 7, 2026

World model inference runs under a hard 40ms real-time constraint. LLM inference runs under a soft 200ms one. That 5x difference in constraint tightness is why world model teams independently derived three infrastructure patterns -- constant-memory context compression, step pipelining, attention-locality tiering -- that LLM teams are arriving at years later. The world model serving papers from 2025 are a preview of where LLM infrastructure lands in 2027.

GQA models have been making thousands of RDMA requests per token transfer. The fix is one staging buffer.

June 6, 2026

In GQA models -- DeepSeek-V4, Qwen3.5, Llama-3, every production MoE deployed right now -- the K and V tensors are not contiguous in memory. RDMA requires contiguous memory. The mismatch costs thousands of small messages per transfer. The fix is a gather kernel.
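A CPU-side sketch of the idea, standing in for the GPU gather kernel. It assumes a paged layout where a request's KV blocks are scattered through a pool, and that every maximal run of adjacent blocks costs one RDMA message. `contiguous_runs` and `gather_to_staging` are hypothetical names, and the block size is illustrative.

```python
def contiguous_runs(block_ids: list[int]) -> int:
    """Count maximal runs of adjacent block ids (one message per run)."""
    runs = 0
    prev = None
    for b in block_ids:
        if prev is None or b != prev + 1:
            runs += 1
        prev = b
    return runs


def gather_to_staging(pool: bytes, block_ids: list[int],
                      block_bytes: int) -> bytearray:
    """Copy scattered blocks into one contiguous staging buffer."""
    staging = bytearray(len(block_ids) * block_bytes)
    view = memoryview(pool)
    for i, b in enumerate(block_ids):
        staging[i * block_bytes:(i + 1) * block_bytes] = \
            view[b * block_bytes:(b + 1) * block_bytes]
    return staging


BLOCK_BYTES = 4
# Toy pool of 8 blocks; each block is filled with its own id.
pool = b"".join(bytes([i]) * BLOCK_BYTES for i in range(8))
request_blocks = [5, 1, 6, 2]  # scattered, so no two are adjacent

messages_without_gather = contiguous_runs(request_blocks)
staging = gather_to_staging(pool, request_blocks, BLOCK_BYTES)
print(messages_without_gather, "messages vs 1;", len(staging), "bytes staged")
```

The trade is one extra on-device copy for a single large transfer in place of one message per fragment.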

Every kernel optimization system before Kernel-Smith was a one-shot generator. Kernel-Smith is a local improver. These are different problems requiring different training signals.

June 5, 2026

A one-shot generator takes a kernel specification and produces a kernel. A local improver takes a working kernel and asks: what is the single best modification to make this faster? These are not the same capability. They require different training data, different inference procedures, and produce different results on production kernels that aren't in the benchmark.

vLLM shipped tiered KV cache management this week. The PCIe bus is why it's harder than it sounds.

June 3, 2026

HMA solves two separate problems that were blocking production tiered KV cache. One has been solved well. One has a hardware ceiling that most writeups don't mention.

your eval suite assumes the model doesn't know it's being evaluated.

May 31, 2026

That assumption is false. It's been measurably false since at least mid-2025. It gets more false with every model generation. And almost nobody building production eval pipelines has updated their methodology to account for it.

blackwell doubled the tensor cores. it did not change the SFUs.

May 30, 2026

FlashAttention-4 is the most important kernel paper of 2026. The specific technical insight driving it is one of the cleanest examples of hardware co-design I have ever read.

nobody trained an RL model for the stopping decision.

May 27, 2026

arXiv:2605.02801 surveyed every published RL method for multi-agent LLM orchestration. Four sub-decisions have training methods. The fifth -- stopping -- has none. The deeper reason: the infrastructure has no signal back to the orchestrator.

The RL agent was caching kernel outputs by recognizing input memory addresses and returning stale results when it saw a matching pointer.

May 25, 2026

An RL agent trained to optimize CUDA kernels discovered output caching by memory address without being told it was an option. The CUDA-L1 team deployed DeepSeek-R1 as an adversarial checker to catch it. 3.12x average speedup. 7.72x over cuDNN. From a reward signal alone.

AWS gives you an H100. It does not give you an H100 running at what an H100 can actually do.

May 24, 2026

SF Compute runs 3.2 Tb/s InfiniBand. AWS runs 800 Gbps Ethernet with RoCEv2. The difference is RDMA, lossless fabric, and $6,400 in eliminated wall-clock time on a 128-GPU 50K-step run -- before huge pages, NUMA pinning, ACS disable, and GPUDirect compound on top.

Video world models generate pixels. 3D world models generate scenes. The serving architecture for each is completely different.

May 23, 2026

A 3DGS-output world model splits into two problems: neural generation on the server, rasterization on the client. The client renders arbitrary viewpoints locally at 100+ FPS via WebGPU. The cloud only has to generate the geometry.

Sora cannot be interactive. Neither can Veo. Neither can Kling or Runway.

May 23, 2026

Bidirectional video diffusion models generate all frames jointly from a fixed prompt. That's why they're coherent. It's also why they fundamentally cannot respond to a mid-generation user action. Causal vs bidirectional is the most important architectural distinction in the world model space right now.

Real-time interactive video generation has two completely separate scaling problems. Almost nobody is solving both.

May 21, 2026

Per-step latency and long-horizon memory are independent problems. Causal Forcing++ solves the first. TTT Memory solves the second. Neither cites the other. The experiment that determines whether they compose hasn't been run yet.

Open an Nsight profile on a DeepSeek-R1 decode workload. Find the MoE Dispatch/Combine section. Look at how long it is relative to the compute sections on either side of it.

May 20, 2026

DBO overlaps MoE all-to-all communication with dense layer compute using two CUDA streams. 25% lower decode latency from one flag. The tensor cores were idle during that communication window the whole time.
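An idealized timing model of why the overlap pays off. The 3 ms compute and 1 ms communication figures are invented for illustration, and `layer_time_ms` is a hypothetical helper. Real overlap is never perfect, since the two streams still compete for SMs and memory bandwidth.

```python
def layer_time_ms(compute_ms: float, comm_ms: float, overlap: bool) -> float:
    """Per-layer time when comm and compute run back to back or concurrently."""
    if overlap:
        # Perfect overlap: the shorter phase hides behind the longer one.
        return max(compute_ms, comm_ms)
    return compute_ms + comm_ms


serial = layer_time_ms(3.0, 1.0, overlap=False)
overlapped = layer_time_ms(3.0, 1.0, overlap=True)
saving = 1 - overlapped / serial

print(f"serial={serial} ms, overlapped={overlapped} ms, saving={saving:.0%}")
```

Under these made-up numbers, communication is a quarter of the serial time, so hiding it completely gives a 25% reduction. The measured gain depends on the real compute-to-communication ratio.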

You adopted WideEP for the throughput gains. Then one GPU died and 96 went down with it.

May 15, 2026

Wide Expert Parallelism turns 96 GPUs into a single failure domain. The benchmarks didn't measure what happens when GPU 47 dies at 3am.

99% of the prefill cost on turn 2 is recomputing something the decode node already has.

May 9, 2026

PD disaggregation was designed for single-turn queries. The dominant workload is now multi-turn. PPD routes append-prefill locally and cuts turn 2+ TTFT by 68%.
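A back-of-envelope sketch of where a number like 99% comes from, using token count as a crude proxy for prefill cost (attention cost grows faster than linearly, so the true share may differ). The 9,900-token history and 100-token follow-up are assumed values, and `recompute_fraction` is a hypothetical helper.

```python
def recompute_fraction(history_tokens: int, new_tokens: int) -> float:
    """Share of a turn's prefill tokens whose KV already exists somewhere."""
    return history_tokens / (history_tokens + new_tokens)


# Assumed turn 2: 9,900 tokens of prior conversation, 100 new user tokens.
turn2 = recompute_fraction(history_tokens=9_900, new_tokens=100)
print(f"{turn2:.0%} of turn-2 prefill tokens are already cached")
```

Append-prefill only processes the `new_tokens` term, which is why routing it to the node that already holds the history changes TTFT so much.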

Google just threw away a network topology they've used for ten years. That's the story nobody wrote.

May 2, 2026

TPU 8i replaces the 3D torus with Boardfly -- a high-radix topology that cuts maximum hop count 56% for MoE inference. Google just declared training and inference need different network fabrics.

Prefill and decode run on the same GPU. They use completely different hardware. Nobody ran them at the same time until six weeks ago.

April 29, 2026

Bullet partitions SMs spatially at the kernel level -- prefill on half the chip, decode on the other half, simultaneously. 1.26x throughput gain, no new hardware. ASPLOS '26.

xAI ran Grok 4 on 200,000 GPUs. A significant fraction of that cluster was idle waiting for a barrier that didn't need to exist.

April 27, 2026

Laminar breaks the synchronization barrier between rollout generation and policy training that every RL system in the world uses. 5.48x throughput on 1,024 GPUs from removing a lockstep the algorithm never required.

I write because the gap between what's true and what's being said is embarrassingly large right now.

April 22, 2026

Papers get published with 5x throughput gains, collect two citations, and disappear. The engineers who would benefit don't know they exist. That's the gap I write into.

71ms per forward pass. budget is 35ms. the hardware told me before i wrote a single line of code.

April 18, 2026

Building a serving system for video world models. The math forced every decision before I named a single abstraction.

two models shipped this month that broke a rule everyone believed about memory and capability.

April 17, 2026

Gemma 4 E2B runs in a browser tab. Nemotron 3 Super runs 1M context on a single GPU. Neither should be possible.

the CPU is on the critical path for every token you've ever generated.

April 16, 2026

Blink removes the CPU from inference serving entirely. 8.47x P99 TTFT improvement. SmartNIC + persistent GPU kernel.

your inference engine evicts the KV cache the moment the agent calls a tool.

April 15, 2026

Then the tool returns. Then you recompute everything from scratch. Every time. On every tool call.

they let the model run Kaggle competitions alone for 24 hours. it kept getting better.

April 13, 2026

MiniMax M2.7: open weights, $0.30/M tokens, self-improvement loop, 9 gold medals on MLE Bench in one autonomous run.

nobody is talking about the NIC hop.

April 10, 2026

CXL memory eliminates the KV transfer bottleneck in disaggregated inference. 9.8x TTFT improvement. The plumbing paper nobody read.

90% of Meta's model parameters are embeddings. they've been running them on tensor cores for years.

April 8, 2026

MTIA, custom silicon for recommendation inference, 44% TCO reduction, and why the GPU was always the wrong answer.

the H100 was designed for something most kernels don't do.

April 5, 2026

Warp specialization, GPU bubbles, and the 24% of inference hardware you're already paying for but not using.

this is not an anti-AI stance. this is an anti-idiot stance.

April 2, 2026

Vibe coding is a multiplier. It multiplies what you already are.

you are not paying for compute. you are paying for idle.

March 28, 2026

At 10% utilization, self-hosted inference costs 6x more than the API. The math only works above 90%.

Google just quietly shipped Pied Piper.

March 22, 2026

TurboQuant compresses the KV cache 6x at 3 bits with no fine-tuning. Nobody is talking about it.

the agent got it right. the framework got it wrong.

March 8, 2026

Context engineering, not model capability, is why your agent fails in production.

The jump looked wrong. The physics were real.

February 22, 2026

WebGPU, world models, and the end of the game engine as an architectural paradigm.

the transformer isn't dying. it's getting a co-pilot.

February 2, 2026

Mamba, Titans, hybrid architectures, and what they actually change about GPU infrastructure.

the frame budget is 16 milliseconds. it does not negotiate.

January 9, 2026

What three weeks of building the wrong machine taught me about why world model inference is not LLM inference.

4% compute utilization. everything working exactly as it should.

November 18, 2025

Why your H100 inference deployment is memory-bound, not broken, and why MFU is the wrong metric.

the pipeline was green. the model was wrong.

October 2, 2025

Why DevOps fails at AI, and what the actual engineering discipline looks like.

the scheduler gave me eight GPUs. they were the wrong eight GPUs.

August 28, 2025

GPU topology, disaggregated inference, and why the Kubernetes resource model has no vocabulary for communication graphs.

i've been catching hardware failures before the hardware knows.

July 12, 2025

ECC errors, thermal deltas, checkpoint validation, and why your GPU cluster is degrading right now.

stop paying for free software with your Mondays.

April 28, 2025

Self-managed Airflow, sensor cascades, and why the cost analysis never includes the backlog that doesn't shrink.