i've been catching hardware failures before the hardware knows.
ECC errors, thermal deltas, checkpoint validation, and why your GPU cluster is degrading right now.
July 12, 2025

I'm back. It's been months. I don't know exactly how many without counting and I don't want to count because counting would make it a thing.
I didn't stop writing because I ran out of things to say. I stopped because I started saying things that felt performed and I'd rather say nothing than perform. Some of you unsubscribed. That's correct behavior. I would have too.
Anyway. I got really into bread. Not sourdough. Everyone did sourdough, I wasn't going to do sourdough. I got into focaccia, which is more forgiving and also you can put things on top of it and feel like a person who has their life together. I made it probably 20 times over four months. I got good. I made it for people and they said it was good and I believed them because they came back for more.
I am telling you this because it is true and because the alternative is pretending I sat in a dark room thinking about infrastructure for four months, which is partially true but sounds insane.
Anyways.
I want to talk about something that has been annoying me for a long time. Which is how most teams discover hardware failures in GPU clusters.
The answer is: by accident. After the damage is done.
Here is the failure mode nobody talks about.
You are three weeks into a training run. Loss curve looks fine. Checkpoints saving. Job running. And somewhere in the cluster, one GPU is accumulating corrected ECC memory errors. Not enough to crash, not enough to throw an exception, just enough that a small number of activations on the forward pass are wrong. The model is training on slightly corrupted numbers. The corruption is distributed across billions of parameters. There is no obvious signature.
You will not find this until you evaluate the final checkpoint and the numbers look strange. Then you spend four days ruling out everything else. Then someone checks the hardware error logs. Then you find out the GPU has been degrading for two and a half weeks.
Then you rerun three weeks of compute.
In a cluster of 10,000 GPUs running a 3-month job, the probability of at least one hardware failure during that run approaches 100%. The GPU does not announce this. It degrades. Correctly, from its own perspective. It processed the instruction, it returned a value, the value happened to be wrong at the bit level. This is not an edge case. This is Tuesday.
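If "approaches 100%" sounds like hand-waving, the arithmetic is short. The per-GPU failure probability below is a placeholder I made up; substitute whatever your fleet's history actually says:

```python
# Back-of-envelope failure probability for a long run.
# ASSUMPTION: each GPU independently has a 0.1% chance of a
# hardware fault per month. Swap in your own fleet's number.
p_monthly = 0.001
gpus = 10_000
months = 3

p_one_gpu_survives = (1 - p_monthly) ** months   # one GPU, whole run
p_zero_failures = p_one_gpu_survives ** gpus     # entire cluster
print(f"P(zero hardware failures) = {p_zero_failures:.2e}")
# ~9e-14 with these numbers. "Approaches 100%" is an understatement.
```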
The signals I watch now. And I mean actually watch, every morning on long runs.
NVLink error rates per GPU per hour. Not aggregate. Per GPU. A healthy H100 in a healthy cluster should have near-zero corrected errors. When one starts accumulating even single-digit corrected errors per hour, that GPU is probably 72-96 hours from an event that will corrupt a checkpoint or kill a job. The corrected errors are the hardware saying "I caught this one." The question you cannot answer from the logs alone is how many it didn't catch.
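You can poll these counters yourself instead of waiting on a dashboard. A minimal sketch with pynvml (the nvidia-ml-py bindings); run it on an interval and diff successive readings to get the per-hour rate. The CRC flit counter is one of several per-link NVLink error counters NVML exposes:

```python
# Corrected NVLink CRC errors per GPU, accumulated since last reset.
# Diff two readings an hour apart to get the rate described above.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    total = 0
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            total += pynvml.nvmlDeviceGetNvLinkErrorCounter(
                handle, link, pynvml.NVML_NVLINK_ERROR_DL_CRC_FLIT)
        except pynvml.NVMLError:
            break  # link not populated on this GPU
    print(f"GPU {i}: {total} NVLink CRC flit errors")
pynvml.nvmlShutdown()
```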
VRAM thermal deltas between GPUs in the same rack. A GPU running 8 degrees hotter than its neighbors is not necessarily failing. But it is worth watching. The thermal delta is one of the earliest signals that something in the hardware is changing. Worse airflow. A component starting to degrade. A cooling issue that is not yet a crash issue.
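Checking the spread is a few lines. One caveat: plain NVML reads the core temperature sensor; VRAM temperature specifically comes through DCGM's memory-temp field. But the core-temp spread across a node is a reasonable first proxy:

```python
# Flag any GPU running hot relative to the others on the node.
import statistics
import pynvml

pynvml.nvmlInit()
temps = [
    pynvml.nvmlDeviceGetTemperature(
        pynvml.nvmlDeviceGetHandleByIndex(i), pynvml.NVML_TEMPERATURE_GPU)
    for i in range(pynvml.nvmlDeviceGetCount())
]
median = statistics.median(temps)
for i, t in enumerate(temps):
    if t - median >= 8:  # the 8-degree threshold from above; tune it
        print(f"GPU {i}: {t}C, {t - median}C over node median -- watch it")
pynvml.nvmlShutdown()
```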
PCIe link speed and width over time. A GPU that negotiated x16 at startup and is running at x8 two weeks into a job is a GPU whose connection to the system is degrading. Your all-reduce operations are running at half the bandwidth you paid for. Your step time is increasing by a few percent. You are attributing this to variance. It is not variance.
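This one you can check directly: compare the link the GPU is running at now against what it is capable of. Same pynvml bindings:

```python
# Detect PCIe link degradation: current width/generation vs. capability.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    cur_w = pynvml.nvmlDeviceGetCurrPcieLinkWidth(h)
    max_w = pynvml.nvmlDeviceGetMaxPcieLinkWidth(h)
    cur_g = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(h)
    max_g = pynvml.nvmlDeviceGetMaxPcieLinkGeneration(h)
    if cur_w < max_w or cur_g < max_g:
        print(f"GPU {i}: running x{cur_w} gen{cur_g}, "
              f"capable of x{max_w} gen{max_g} -- degraded link")
pynvml.nvmlShutdown()
```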
ECC correctable errors per memory bank. Most teams monitor uncorrectable errors because those throw exceptions. The correctable errors are quieter and earlier. The uncorrectable error is the hardware telling you it failed. The correctable error is the hardware telling you it is going to fail.
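NVML won't give you true per-bank granularity (that level of detail lives in DCGM's row-remapping fields), but it does break corrected counts down by memory location, which is enough to tell "device memory is throwing errors" apart from noise:

```python
# Corrected ECC errors since last reset, by memory location.
import pynvml

pynvml.nvmlInit()
LOCATIONS = {
    "device memory": pynvml.NVML_MEMORY_LOCATION_DEVICE_MEMORY,
    "L2 cache": pynvml.NVML_MEMORY_LOCATION_L2_CACHE,
    "register file": pynvml.NVML_MEMORY_LOCATION_REGISTER_FILE,
}
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    for name, loc in LOCATIONS.items():
        try:
            n = pynvml.nvmlDeviceGetMemoryErrorCounter(
                h, pynvml.NVML_MEMORY_ERROR_TYPE_CORRECTED,
                pynvml.NVML_VOLATILE_ECC, loc)
        except pynvml.NVMLError:
            continue  # location not reported on this part
        if n:
            print(f"GPU {i}: {n} corrected ECC errors, {name}")
pynvml.nvmlShutdown()
```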
DCGM surfaces all of this. Data Center GPU Manager is NVIDIA's telemetry stack; its exporter runs as a daemonset on every GPU node and exposes hardware counters to Prometheus. The counters are there by default. The dashboards for them mostly aren't.
"But our cloud provider monitors the hardwa—" For crashes. Not for degradation. The alert fires when the job dies. You want the alert four days before the job dies. That means querying DCGM health metrics at the per-GPU level with collection intervals tight enough to catch error rate trends, not just snapshots. Most DCGM exporter configs I have seen in the wild are optimized for utilization monitoring. GPU%, memory bandwidth, temperature averages. Not for health monitoring. Those are different configurations and different dashboards and they answer different questions.
Utilization tells you how busy the hardware is.
Health tells you whether the hardware is okay.
I check five dashboards every morning on long runs. Three are utilization. Two are health. The health dashboards have caught four pending failures before they became actual failures in the last year. The utilization dashboards have never caught anything. I keep them because stakeholders want to see GPU% numbers. I keep the health dashboards because I want to keep my jobs.
The thing that still bothers me most is checkpoint validation. Or the absence of it.
Most teams save a checkpoint and assume it is valid because it wrote without an I/O error. An intact file is not the same as a file containing correct weights. Silent memory corruption means the weights were wrong before they were written. The file is fine. The weights are corrupt. The job continues from a corrupt state. The loss continues to move. The wrongness is invisible.
What I do now: checksum every checkpoint, plus a lightweight forward pass on a fixed validation batch compared against a reference from a healthy run. If the outputs diverge beyond a threshold, the checkpoint is suspect and I audit the hardware before continuing.
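In PyTorch terms it is roughly the following. Everything here is calibrated against a run you trust: `fixed_batch` is a batch you froze, `reference_out` is the output of the same-stage model on known-good hardware, and the threshold depends on your model and precision. Those names are mine, not a library's:

```python
# Validate a checkpoint beyond "the file wrote without an I/O error".
import hashlib
import torch

def checkpoint_sha256(path: str) -> str:
    """Checksum of the file itself, for catching corruption on disk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

@torch.no_grad()
def validate_checkpoint(path, model, fixed_batch, reference_out,
                        threshold=1e-3):
    # Assumes the checkpoint is a bare state_dict; adapt to your format.
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()
    divergence = (model(fixed_batch) - reference_out).abs().max().item()
    if divergence > threshold:
        raise RuntimeError(
            f"checkpoint {path} diverges from reference by {divergence:.2e}"
            " -- audit the hardware before resuming")
    return checkpoint_sha256(path)
```

Store the checksum next to the checkpoint. Then when something does go wrong, you can tell corruption on disk apart from weights that were already wrong before the write, which is exactly the distinction that matters above.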
This adds about three minutes per checkpoint. A corrupt checkpoint that makes it to the end of a three-week run and is discovered only at evaluation costs three weeks.
Three minutes or three weeks. That is the choice. Most teams do not realize they are making it.
The job of someone running AI infrastructure at scale is not to react to hardware failures.
It is to see them coming.
The hardware will not tell you. The default dashboard will not tell you. The alert you configured will fire after the job is dead.
The teams building serious training infrastructure right now are not better at recovering from failures. They are better at catching them 72 hours before they become failures. That is the whole delta.
i was gone for months and i thought about focaccia and i thought about ECC error rates and i thought about how those are not that different. both are about catching the problem before it ruins the thing you spent all that time building.
the focaccia was better when i paid attention to the dough.
the training runs were better when i paid attention to the hardware.
if your jobs run longer than a week and you are not watching correctable ECC errors per GPU in real time, you are hoping. you're probably right. until you're not.
P.S. Focaccia tip since I mentioned it: don't skimp on the olive oil in the pan. More than you think. Way more. The bottom should be basically frying. This is not optional. This is the whole thing.
i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.
no spam. no sequence. just the note, when it exists.