the agent got it right. the framework got it wrong.
Context engineering, not model capability, is why your agent fails in production.
March 8, 2026

It was 2:14pm on a Tuesday and I was reading benchmark logs I didn't need to read.
I wasn't looking for anything. I was supposed to be done with this. But something felt off about the results, so I pulled the raw step trace and started going through it manually.
Step 3. The model produced tenure_max=12 and charges_min=70.
I checked against the ground truth. Correct. Exactly correct. The model had solved the problem. I almost closed the tab.
I kept reading.
Step 4. The framework hit a parse failure. Not on the values. On the format the values were wrapped in. The answer was right. The container was wrong. The framework did what frameworks do when they fail to parse. It asked the model to try again.
Step 5. tenure_max=14. charges_min=disabled.
I sat with that for a while.
The model didn't fail. The framework buried the correct answer in an error message, asked the model to reconsider, and the model reconsidered. It produced a confident, coherent, completely wrong answer. The retry mechanism had destroyed a solved problem.
This is the thing nobody is saying clearly enough about agents right now.
The failure is almost never the model.
It's the context the model is reasoning over when it fails.
Everyone building agents in 2026 has the same mental model. Bigger context window means smarter agent. More history means better decisions. Append everything and let the model sort it out.
This is wrong.
The context window is not memory. It is attention. And attention is finite. Not finite in the sense of running out of space. Finite in the sense that every token you add competes with every token already there for a fixed budget of processing.
Anthropic has a name for what happens when that budget dilutes. Context rot. As context length grows, the model's ability to accurately locate and reason about what matters degrades. The critical constraint from step one gets buried under the noise of steps four through forty. The model doesn't forget it. It stops being able to find it in the pile.
Million-token context windows made this worse. Not better. I know that sounds backwards. It is still true. A larger window doesn't give the model more attention to work with. It gives the model more material to spread the same attention across. You don't get a smarter model. You get the same model now responsible for a bigger haystack.
"But the benchmarks show long-context models can find the needle—" That's retrieval. One fact in a long document. Agents don't do retrieval. Agents reason. Across decisions. Across steps. Across a context that accumulates with every action they take.
Retrieval and reasoning are not the same demand. The model can find the fact. It cannot always reason well about that fact in relation to a decision made thirty steps ago, given seventeen other things that happened in between. The window is big enough. The attention isn't.
I ran the logs on a second framework. Same task. Different architecture.
This one treated context as a compiled view instead of an append log. Not what happened in total. What is currently relevant to the next decision. At each step, the agent carried only what the next step actually needed. Everything else lived in external memory and was retrieved when relevant.
Step 3. tenure_max=12. charges_min=70. Correct.
Parse failure. Retry boundary.
The retry stripped the noise. Kept the prior reasoning in view. Asked the model to fix the format, not reconsider the answer. The model fixed the format.
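The difference between the two retry boundaries can be sketched in a few lines. This is a hypothetical illustration, not the code of either framework; the function names and the format-spec parameter are my own invention. The point is what each prompt asks the model to do with a correct answer that failed to parse.

```python
# Hypothetical sketch of two retry boundaries. Neither function is from a
# real framework; names and signatures are illustrative.

def naive_retry_prompt(history: list[str], error: str) -> str:
    # Anti-pattern: the correct answer is buried inside an error message
    # at the bottom of the full history, and "try again" implicitly
    # invites the model to reconsider the answer itself.
    return "\n".join(history) + f"\nERROR: {error}\nPlease try again."

def surgical_retry_prompt(raw_output: str, format_spec: str) -> str:
    # Context surgery: keep the model's answer in view, strip the noise,
    # and constrain the retry to formatting only.
    return (
        "Your previous answer is correct and must not change:\n"
        f"{raw_output}\n\n"
        "It failed to parse. Re-emit the SAME values in this exact format:\n"
        f"{format_spec}"
    )
```

The naive version hands the model a reason to doubt itself. The surgical version hands it a clerical task.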
Same model. Different context management. Different outcome.
That's the entire discipline. The model is the same. The hardware is the same. The task is the same. The thing that determines whether the agent reasons well or reasons over noise is what you chose to put in the window and when you chose to remove it.
Every token you add is a vote against every token already there.
Before you add something to the context, the question is not "might this be useful." It is "is this necessary for the next decision and nothing else." If you can't answer yes with confidence, it doesn't go in. The model does not benefit from the context of everything you did. It benefits from the context of what it needs to do next.
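The compiled-view idea from the second framework, combined with that admission rule, can be sketched as a small external store that rebuilds the window at every step instead of appending to it. This is a minimal sketch under my own assumptions: the tagging scheme, the `last_n` recency window, and the class names are illustrative, not any framework's API.

```python
# A minimal sketch of "context as a compiled view". Everything lives in an
# external store; at each step the window is rebuilt from what the NEXT
# decision needs, plus a few recent steps for local continuity.
# The tag scheme and recency rule are assumptions for illustration.

class Memory:
    def __init__(self):
        self.entries = []  # (step, tag, text)

    def write(self, step: int, tag: str, text: str):
        self.entries.append((step, tag, text))

    def compile_view(self, needed_tags: set[str], last_n: int = 2) -> list[str]:
        # Admit an entry only if it is tagged as necessary for the next
        # decision, or falls inside the recency window. Everything else
        # stays in external memory, retrievable but not in the window.
        latest = max((s for s, _, _ in self.entries), default=0)
        cutoff = latest - last_n
        return [
            text for step, tag, text in self.entries
            if tag in needed_tags or step > cutoff
        ]
```

The constraint from step one survives to step forty because it is re-admitted by tag, not because it happens to still be findable in the pile.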
Multi-agent architectures are an attempt to escape this problem by distributing context across multiple windows instead of accumulating it in one. Each agent gets a clean window. Bounded scope. No history pollution.
The instinct is right. The execution is usually wrong.
Every handoff between agents is a compression event. Agent A finishes and produces a summary for Agent B. Something is lost in that summary. An assumption that was obvious in Agent A's context is not present in what Agent B receives. Agent B makes a decision on an incomplete picture of what Agent A actually did. The coordination overhead is not latency. It is semantic loss at every seam.
The only multi-agent pattern that works reliably in production is not collaboration. It is sequential specialization with typed handoffs. Agent A does a bounded task and produces a structured output with a verified schema. The orchestrator validates the schema. Agent B receives the validated structured output. Not a summary. Not a natural language description. The typed, verified artifact. The handoff is checked. The loss is bounded. The system is debuggable.
That is not the vision in the pitch deck. It is the only version that survives contact with production.
One more thing.
A single misbehaving agent session stuck in a reasoning loop can exhaust your entire daily token budget in minutes. Not your hourly budget. Your daily budget. The cost asymmetry is violent. One short prompt to start it. One hundred thousand tokens per minute once it loops.
You cannot recover from this reactively. By the time you notice, the money is gone.
Hard circuit breakers. Not soft warnings. Hard stops. Max iterations per session enforced in code before execution runs. Global timeout on the full chain. Deduplication on tool calls: before the agent calls a tool, check the last five steps semantically. If the agent is rephrasing the same failed request, block it and terminate. Do not ask the model to handle this. The model is inside the loop. It cannot see the loop from inside it.
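All three stops fit in one small object that sits outside the model. This is a sketch under stated assumptions: the limits are arbitrary defaults, and the deduplication here is exact-match on normalized calls where a real system would more likely compare semantically (embeddings or a fuzzy match) over the last few steps.

```python
# A sketch of hard circuit breakers enforced in code, outside the loop.
# Limits are illustrative defaults; the dedup check is exact-match where
# a production system would likely compare semantically.

import time

class CircuitBreaker:
    def __init__(self, max_steps: int = 25, max_seconds: float = 120.0,
                 dedup_window: int = 5):
        self.steps_left = max_steps
        self.deadline = time.monotonic() + max_seconds
        self.dedup_window = dedup_window
        self.recent_calls = []

    def check(self, tool_name: str, args: dict):
        # Called before every tool call. Raises instead of warning:
        # the model cannot see the loop from inside it.
        self.steps_left -= 1
        if self.steps_left < 0:
            raise RuntimeError("circuit breaker: max iterations exceeded")
        if time.monotonic() > self.deadline:
            raise RuntimeError("circuit breaker: global timeout")
        call = (tool_name, tuple(sorted(args.items())))
        if call in self.recent_calls:
            raise RuntimeError("circuit breaker: duplicate tool call")
        self.recent_calls = (self.recent_calls + [call])[-self.dedup_window:]
```

The breaker terminates the session; it never asks the model to.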
The agents that are working in production right now are not the most autonomous ones.
They are the most carefully bounded ones. Tight context. Typed contracts between components. Explicit resource budgets with hard termination. Tool call deduplication. Retry boundaries with surgical context management.
The agent is not the intelligence in the system.
The context is.
Manage the context and the agent reasons well. Pollute the context and the agent reasons confidently over noise and you get noise back, formatted and structured and completely wrong.
the model had the right answer at step 3.
the framework failed to parse the container. the framework asked the model to reconsider. the model reconsidered.
no reasoning failure. no model failure. one retry boundary with no context surgery.
that's the whole discipline.
the agent didn't lose the answer. the framework buried it in noise and asked again.
i write these when i have something worth saying. no schedule. no algorithm. if you want to know when the next one goes up -- leave your email.
no spam. no sequence. just the note, when it exists.