Notes

i wrote these. not a model. not a prompt. not a template. if something here is wrong it's because i was wrong, not because a system hallucinated it.

71ms per forward pass. budget is 35ms. the hardware told me before i wrote a single line of code.

April 18, 2026

Building a serving system for video world models. The math forced every decision before I named a single abstraction.
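The budget math that entry gestures at fits in two lines; only the 71 ms and 35 ms figures come from the post, the rest is a sketch.

```python
# The only two numbers that matter before writing any code:
# measured forward-pass latency vs. the per-step serving budget.
forward_pass_ms = 71.0   # measured (from the post)
budget_ms = 35.0         # serving budget (from the post)

# Minimum speedup required before any architecture discussion makes sense.
required_speedup = forward_pass_ms / budget_ms
print(f"need a {required_speedup:.2f}x speedup just to meet budget")
```

No abstraction or scheduler fixes a 2x physics gap; the model itself has to get faster first.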

two models shipped this month that broke a rule everyone believed about memory and capability.

April 17, 2026

Gemma 4 E2B runs in a browser tab. Nemotron 3 Super runs 1M context on a single GPU. Neither should be possible.

the CPU is on the critical path for every token you've ever generated.

April 16, 2026

Blink removes the CPU from inference serving entirely. 8.47x P99 TTFT improvement. SmartNIC + persistent GPU kernel.

your inference engine evicts the KV cache the moment the agent calls a tool.

April 15, 2026

Then the tool returns. Then you recompute everything from scratch. Every time. On every tool call.

they let the model run Kaggle competitions alone for 24 hours. it kept getting better.

April 13, 2026

MiniMax M2.7: open weights, $0.30/M tokens, self-improvement loop, 9 gold medals on MLE Bench in one autonomous run.

nobody is talking about the NIC hop.

April 10, 2026

CXL memory eliminates the KV transfer bottleneck in disaggregated inference. 9.8x TTFT improvement. The plumbing paper nobody read.

90% of Meta's model parameters are embeddings. they've been running them on tensor cores for years.

April 8, 2026

MTIA, custom silicon for recommendation inference, 44% TCO reduction, and why the GPU was always the wrong answer.

the H100 was designed for something most kernels don't do.

April 5, 2026

Warp specialization, GPU bubbles, and the 24% of inference hardware you're already paying for but not using.

this is not an anti-AI stance. this is an anti-idiot stance.

April 2, 2026

Vibe coding is a multiplier. It multiplies what you already are.

you are not paying for compute. you are paying for idle.

March 28, 2026

At 10% utilization, self-hosted inference costs 6x more than the API. The math only works above 90%.
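The shape of that math can be sketched directly. The $0.30/M-token API price and GPU throughput below are hypothetical, chosen so the 10% case reproduces the post's 6x figure; a full analysis adds fixed ops and engineering cost, which is what pushes the real break-even up toward 90%.

```python
# Effective $/M tokens for self-hosted inference: the GPU bills 24/7,
# but tokens only flow while it's busy. All dollar figures are
# hypothetical, picked to match the post's 6x-at-10% claim.

def self_hosted_cost_per_mtok(gpu_hourly_usd, mtok_per_hour_peak, utilization):
    """Hourly GPU cost amortized over the tokens actually served."""
    return gpu_hourly_usd / (mtok_per_hour_peak * utilization)

api_cost = 0.30          # $/M tokens (hypothetical API price)
gpu_hourly = 2.25        # $/hour (hypothetical rental rate)
peak_throughput = 12.5   # M tokens/hour at 100% utilization (assumption)

for util in (0.10, 0.50, 0.90):
    c = self_hosted_cost_per_mtok(gpu_hourly, peak_throughput, util)
    print(f"{util:.0%} utilization: ${c:.2f}/M tok ({c / api_cost:.1f}x API)")
```

This simplified model omits the fixed costs (on-call, upgrades, capacity planning) that make low-utilization self-hosting even worse than the raw ratio suggests.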

Google just quietly shipped Pied Piper.

March 22, 2026

TurboQuant compresses the KV cache 6x at 3 bits with no fine-tuning. Nobody is talking about it.

the agent got it right. the framework got it wrong.

March 8, 2026

Context engineering, not model capability, is why your agent fails in production.

the jump looked wrong. the physics were real.

February 22, 2026

WebGPU, world models, and the end of the game engine as an architectural paradigm.

the transformer isn't dying. it's getting a co-pilot.

February 2, 2026

Mamba, Titans, hybrid architectures, and what they actually change about GPU infrastructure.

the frame budget is 16 milliseconds. it does not negotiate.

January 9, 2026

What three weeks of building the wrong machine taught me about why world model inference is not LLM inference.
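Where the 16 ms comes from is just frame-rate arithmetic; the LLM-side latency figure below is a typical interactive-serving target, an assumption rather than a number from the post.

```python
# A 60 fps interactive world model must clear its budget on every
# frame or the simulation stutters; an LLM can hide latency behind
# token streaming. Only the 60 fps -> ms conversion is first-principles.

target_fps = 60
frame_budget_ms = 1000 / target_fps
print(f"frame budget: {frame_budget_ms:.2f} ms")  # 16.67 ms

llm_token_budget_ms = 50  # typical interactive chat target (assumption)
print(f"{llm_token_budget_ms / frame_budget_ms:.1f}x tighter than a token budget")
```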

4% compute utilization. everything working exactly as it should.

November 18, 2025

Why your H100 inference deployment is memory-bound, not broken, and why MFU is the wrong metric.
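A minimal roofline sketch shows why low utilization at decode time is expected, not broken. The H100 spec numbers are public datasheet values; the batch sizes are illustrative assumptions, not figures from the post.

```python
# Roofline sketch: why autoregressive decode is memory-bound on an H100.
# Peak figures are datasheet values (H100 SXM, BF16 dense, HBM3);
# batch sizes below are illustrative assumptions.

peak_tflops = 989   # peak BF16 dense compute
hbm_tb_s = 3.35     # HBM3 bandwidth

def arithmetic_intensity(batch):
    """FLOPs per byte of weights read during decode: every generated
    token streams all weights once, ~2 FLOPs per parameter per token,
    2 bytes per parameter in BF16."""
    return 2 * batch / 2.0

for batch in (1, 8, 64):
    ai = arithmetic_intensity(batch)
    achievable = min(peak_tflops, hbm_tb_s * ai)  # roofline ceiling
    mfu = achievable / peak_tflops
    print(f"batch {batch:3d}: ~{achievable:6.1f} TFLOPS ceiling, {mfu:.1%} MFU")
```

At batch 1 the bandwidth ceiling sits well under 1% of peak compute, which is why MFU tells you almost nothing about whether a decode-heavy deployment is healthy.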

the pipeline was green. the model was wrong.

October 2, 2025

Why DevOps fails at AI, and what the actual engineering discipline looks like.

the scheduler gave me eight GPUs. they were the wrong eight GPUs.

August 28, 2025

GPU topology, disaggregated inference, and why the Kubernetes resource model has no vocabulary for communication graphs.

i've been catching hardware failures before the hardware knows.

July 12, 2025

ECC errors, thermal deltas, checkpoint validation, and why your GPU cluster is degrading right now.

stop paying for free software with your Mondays.

April 28, 2025

Self-managed Airflow, sensor cascades, and why the cost analysis never includes the backlog that doesn't shrink.