Notes
i wrote these. not a model. not a prompt. not a template. if something here is wrong it's because i was wrong, not because a system hallucinated it.
71ms per forward pass. budget is 35ms. the hardware told me before i wrote a single line of code.
April 18, 2026 · Building a serving system for video world models. The math forced every decision before I named a single abstraction.
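The headline numbers already settle the design question. A minimal sketch of the arithmetic (both figures are from the teaser above):

```python
# Can a 71 ms forward pass fit a 35 ms serving budget? No:
FORWARD_MS = 71.0  # measured per-forward-pass latency
BUDGET_MS = 35.0   # per-frame serving budget

# Minimum speedup required before any scheduling trick can help.
required_speedup = FORWARD_MS / BUDGET_MS
print(f"required speedup: {required_speedup:.2f}x")  # ~2.03x
```

A >2x gap means the single-pass path cannot meet the budget at all; the architecture has to change, not the tuning.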
two models shipped this month that broke a rule everyone believed about memory and capability.
April 17, 2026 · Gemma 4 E2B runs in a browser tab. Nemotron 3 Super runs 1M context on a single GPU. Neither should be possible.
the CPU is on the critical path for every token you've ever generated.
April 16, 2026 · Blink removes the CPU from inference serving entirely. 8.47x P99 TTFT. SmartNIC + persistent GPU kernel.
your inference engine evicts the KV cache the moment the agent calls a tool.
April 15, 2026 · Then the tool returns. Then you recompute everything from scratch. Every time. On every tool call.
they let the model run Kaggle competitions alone for 24 hours. it kept getting better.
April 13, 2026 · MiniMax M2.7: open weights, $0.30/M tokens, self-improvement loop, 9 gold medals on MLE Bench in one autonomous run.
nobody is talking about the NIC hop.
April 10, 2026 · CXL memory eliminates the KV transfer bottleneck in disaggregated inference. 9.8x TTFT improvement. The plumbing paper nobody read.
90% of Meta's model parameters are embeddings. they've been running them on tensor cores for years.
April 8, 2026 · MTIA, custom silicon for recommendation inference, 44% TCO reduction, and why the GPU was always the wrong answer.
the H100 was designed for something most kernels don't do.
April 5, 2026 · Warp specialization, GPU bubbles, and the 24% of inference hardware you're already paying for but not using.
this is not an anti-AI stance. this is an anti-idiot stance.
April 2, 2026 · Vibe coding is a multiplier. It multiplies what you already are.
you are not paying for compute. you are paying for idle.
March 28, 2026 · At 10% utilization, self-hosted inference costs 6x more than the API. The math only works above 90%.
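The idle-cost claim reduces to a one-line model: you rent the GPU by the hour, so the effective $/token scales as 1/utilization. A sketch with assumed numbers (the rental rate and throughput here are illustrative placeholders, not figures from the post):

```python
# Illustrative break-even model for self-hosted inference cost.
# Both constants are assumptions for the sketch:
GPU_COST_PER_HOUR = 4.0       # assumed GPU rental rate, $/hr
PEAK_TOKENS_PER_SEC = 2500.0  # assumed throughput at full load

def cost_per_million_tokens(utilization: float) -> float:
    """Effective $/M tokens when the GPU is busy `utilization`
    fraction of the hour. You pay for the whole hour either way,
    so idle time inflates the rate by exactly 1/utilization."""
    tokens_per_hour = PEAK_TOKENS_PER_SEC * 3600 * utilization
    return GPU_COST_PER_HOUR / tokens_per_hour * 1e6

for u in (0.10, 0.50, 0.90):
    print(f"{u:.0%} utilization: ${cost_per_million_tokens(u):.2f}/M tokens")
```

Whatever the absolute numbers, the ratio is fixed: 10% utilization is 9x the per-token cost of 90%, which is why the comparison against a metered API flips sign.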
Google just quietly shipped Pied Piper.
March 22, 2026 · TurboQuant compresses the KV cache 6x at 3 bits with no fine-tuning. Nobody is talking about it.
the agent got it right. the framework got it wrong.
March 8, 2026 · Context engineering, not model capability, is why your agent fails in production.
The jump looked wrong. The physics were real.
February 22, 2026 · WebGPU, world models, and the end of the game engine as an architectural paradigm.
the transformer isn't dying. it's getting a co-pilot.
February 2, 2026 · Mamba, Titans, hybrid architectures, and what they actually change about GPU infrastructure.
the frame budget is 16 milliseconds. it does not negotiate.
January 9, 2026 · What three weeks of building the wrong machine taught me about why world model inference is not LLM inference.
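The 16 ms figure isn't arbitrary; it falls directly out of a 60 Hz display. A one-liner to show where it comes from:

```python
# Where the non-negotiable 16 ms comes from: a 60 Hz refresh rate.
REFRESH_HZ = 60
frame_budget_ms = 1000.0 / REFRESH_HZ
print(f"frame budget: {frame_budget_ms:.2f} ms")  # 16.67 ms

# An LLM token that arrives late just feels slow.
# A frame that arrives late is a dropped frame. That is the whole
# difference between throughput-oriented and deadline-oriented serving.
```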
4% compute utilization. everything working exactly as it should.
November 18, 2025 · Why your H100 inference deployment is memory-bound, not broken, and why MFU is the wrong metric.
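The "4% and working as intended" claim is a roofline argument: at small batch, every decoded token must stream the full weight set from HBM, so bandwidth, not FLOPs, sets the ceiling. A sketch with assumed numbers (model size and bandwidth are illustrative, not from the post):

```python
# Roofline sketch: why batch-1 decode on an H100 is memory-bound.
# Assumed figures for illustration:
HBM_BW_GBS = 3350.0   # H100 SXM HBM3 bandwidth, ~3.35 TB/s
PARAMS_B = 70.0       # assumed 70B-parameter model
BYTES_PER_PARAM = 2   # fp16/bf16 weights

# Each decoded token reads all weights from HBM at least once,
# so bandwidth / weight-bytes bounds tokens per second.
weights_gb = PARAMS_B * BYTES_PER_PARAM
tokens_per_sec_ceiling = HBM_BW_GBS / weights_gb
print(f"batch-1 decode ceiling: ~{tokens_per_sec_ceiling:.1f} tok/s")

# The matching compute is a tiny fraction of peak FLOPs, which is
# why MFU sits in the single digits even on a healthy deployment.
```

Hitting that ceiling means the hardware is saturated on the resource that actually matters; single-digit MFU is a symptom of the workload, not a bug to fix.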
the pipeline was green. the model was wrong.
October 2, 2025 · Why DevOps fails at AI, and what the actual engineering discipline looks like.
the scheduler gave me eight GPUs. they were the wrong eight GPUs.
August 28, 2025 · GPU topology, disaggregated inference, and why the Kubernetes resource model has no vocabulary for communication graphs.
i've been catching hardware failures before the hardware knows.
July 12, 2025 · ECC errors, thermal deltas, checkpoint validation, and why your GPU cluster is degrading right now.
stop paying for free software with your Mondays.
April 28, 2025 · Self-managed Airflow, sensor cascades, and why the cost analysis never includes the backlog that doesn't shrink.