Is Your Agent a Savant?
18 Nov 2025 · 7 min read

Accuracy matters in operations; a wrong root cause is worse than a slow one. So as we built an agent for incident investigation, we did what any team would: we added structure to improve it. The agent uses a lead-task architecture where a lead agent dispatches task agents, each specialized for log exploration, metric analysis, topology traversal, or code inspection. Over months, we layered on stricter investigation protocols, more sub-agents, and richer observation formats. Accuracy on complex multi-service failures improved.
But a side effect emerged. We tracked the H/E ratio — mean latency on hard (H) scenarios over mean latency on easy (E) ones (single-resource misconfigurations) — over time, holding the difficulty labels fixed across the measurement window so the ratio reflects effort shift, not reclassification. Early in the project it sat around 2.4×. Hard investigations took more than twice as long as easy ones, which is roughly what you’d expect. Over the following months, as we added structure, the ratio drifted toward 1.1×. The agent was becoming a savant, applying the same exhaustive investigation to everything, spending 5+ minutes on problems that warranted 2. On straightforward scenarios, a minimal single-agent baseline reached the same root cause faster, within a percentage point on accuracy.
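For concreteness, the ratio is just two means over labeled runs. A minimal sketch (the function name and data shape are illustrative, not our actual pipeline):

```python
from statistics import mean

def he_ratio(runs):
    """Mean latency on hard scenarios over mean latency on easy ones.

    `runs` is a list of (difficulty, latency_seconds) pairs, with the
    difficulty labels held fixed across the measurement window so a
    moving ratio reflects effort shift, not reclassification.
    """
    hard = [t for d, t in runs if d == "hard"]
    easy = [t for d, t in runs if d == "easy"]
    return mean(hard) / mean(easy)

runs = [("easy", 120), ("easy", 100), ("hard", 260), ("hard", 268)]
print(he_ratio(runs))  # 2.4 -- the healthy early-project shape
```

Trivial to compute, which is the point: the only reason it goes unmeasured is that nobody thinks to plot effort against difficulty.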
In operations, this waste has real cost. Every minute of investigation is a minute of ongoing customer impact: mean time to resolution (MTTR) is the metric that matters. But the lesson generalizes well beyond incident response:
Any harness optimized for the hardest cases will tend toward uniform thoroughness, regardless of what it’s actually looking at.
From inside the protocol, every investigation feels necessary.
The asymmetry nobody notices
The reason this pattern is so easy to miss is that harness improvements pass every test we normally run. They raise accuracy on hard cases. They don’t drop accuracy on easy cases. Evals look great. What they silently shift is how much compute each case consumes, and that shift is invisible unless you’re explicitly measuring effort as a function of task difficulty.
The model-side version of this has a name. Chen et al. (2024) call it overthinking — reasoning models spending excessive compute on trivial problems. OpenAI, in its GPT-5.1 prompting guide, markets the model as “better calibrated to prompt difficulty, consuming far fewer tokens on easy inputs,” implicitly acknowledging that earlier models weren’t. The harness-level version is the same story without a name.
Test-time scaling research characterizes the compute-vs-quality curve. But the two places where it matters most in practice — production agent harnesses and the standard agent benchmarks (SWE-Bench, τ-Bench, GAIA) — mostly report mean accuracy, mean tool calls, mean latency. A mean hides the thing that matters. Once you do plot effort against difficulty, the pattern becomes obvious: as harnesses mature, the gap between easy and hard shrinks. Each new sub-agent, each new protocol step, each new “always check X before concluding Y” adds fixed cost to every trajectory, including the ones that didn’t need it. Structure compounds. Speed on the easy cases is the first thing to go.
Parametric knowledge can’t save you here
The tempting response is: the model should just know. Let the model decide how deep to go.
This doesn’t work, and the reason is structural. A model can reason about environment-specific facts if you put them in context. That’s what context is for. The real problem is that the facts that determine how deep to investigate are spread across topology, tool behavior, recent deploys, prior incidents, and operator preferences, and gathering them on every query is itself an investigation. Unless the harness has somewhere to accumulate those facts and keep them fresh, the model is either reasoning from impoverished context or paying for a full discovery pass just to decide how much to discover.
So when you ask the model to self-regulate depth, you’re asking it to calibrate against a distribution it has never seen and to fetch the evidence it would need to calibrate, at query time. It defaults to the safe behavior. Under a protocol that rewards thoroughness, that means doing everything.
There is genuine progress on training models that adapt their reasoning effort. Recent work (Kleinman et al. 2025) shows that models can be fine-tuned to vary chain-of-thought length continuously in response to a user-supplied “effort” parameter, with the model learning to allocate more thinking to harder problems. But this doesn’t dissolve the problem; it relocates it. The model can now modulate effort given a budget. Something still has to set that budget, per query, before reasoning starts. The model can respond to a dial. Someone still has to turn it.
This is what separates adaptive effort from raw capability. A more capable model investigates better at every depth; an adaptive-effort model modulates within the depth it’s given. Neither, on its own, chooses the right depth.
Where adaptive depth actually comes from
I’ve seen this shape before. In a previous life I spent years working on ad-spend optimization, where we’d plot net profit against ad spend and watch the curve rise, plateau, and eventually bend back down. There was always a green zone where the next dollar still bought something, a red zone where it didn’t, and the whole job was knowing where you were on that curve. Compute in an agent harness behaves the same way. The savant is an agent that’s drifted past the peak, paying for investigation that no longer buys accuracy, and most of what makes a harness good is what keeps it from drifting there.
Keeping an agent at the peak isn’t a property of the model. It’s a property of the harness. Three capabilities keep showing up in everything that’s worked for us, and everything that hasn’t worked has been missing at least one of them:
Validated operational memory. The harness has to accumulate structured knowledge about the environment it operates in — topology, tool behavior, recurring failure patterns — and it has to keep that knowledge honest by re-validating it against the live system. Without this, the agent rediscovers the same facts on every run, and every run looks equally novel. With it, the agent can recognize when a situation is familiar, which is the precondition for going fast. The key word is validated: a memory that drifts out of sync with reality is worse than no memory, because the agent will confidently apply wrong priors.
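As a sketch of what “validated” might mean mechanically (all names here are hypothetical, and the TTL-plus-check scheme is one possible design, not our implementation): each stored fact carries the cheap check that re-establishes it against the live system, and a fact that fails its check is dropped rather than trusted.

```python
import time
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Fact:
    claim: str                    # e.g. "service A calls service B"
    validate: Callable[[], bool]  # cheap probe against the live system
    ttl_s: float = 3600.0         # re-validate after this long
    checked_at: float = field(default_factory=time.monotonic)

class OperationalMemory:
    def __init__(self):
        self.facts: list[Fact] = []

    def recall(self) -> list[str]:
        """Return only claims that are fresh or were just re-validated."""
        kept = []
        for f in self.facts:
            if time.monotonic() - f.checked_at > f.ttl_s:
                if not f.validate():
                    continue  # drifted out of sync: drop, don't trust
                f.checked_at = time.monotonic()
            kept.append(f)
        self.facts = kept
        return [f.claim for f in kept]
```

The design choice that matters is that `recall` never hands back an expired claim unverified: a wrong prior applied confidently is worse than re-discovering the fact.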
A spectrum of execution paths, not a single protocol. If the only path through the harness is “investigate → hypothesize → verify,” the agent will always pay the full cost. Useful harnesses expose multiple paths: propose-and-verify when confidence is high, full exploration when it isn’t, and — at the extreme — no investigation at all when the answer is already known from a related incident. The harness designer’s job is not to make the single path faster. It’s to make sure the agent has a shorter path available when the situation allows.
A mechanism for choosing the path before paying for it. This is the hardest part, and it’s where I think current agent harnesses are weakest. Choosing depth after you’ve already explored is not a saving. The choice has to happen up front, based on what the agent knows about the problem and about itself. Some of this can come from structured representations built during the run: a lightweight knowledge graph of the current incident, confidence estimates over candidate hypotheses, explicit tracking of what’s been ruled out. Some of it has to come from the operator, who often knows at a glance that a problem is routine and can steer the agent past exploration it doesn’t need.
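One way to make that choice explicit is a routing step that runs before any exploration, keyed on how well validated memory matches the incident and on prior confidence in the leading hypothesis. A sketch, with illustrative (untuned) thresholds and hypothetical path names:

```python
def choose_path(known_match: float, top_confidence: float) -> str:
    """Pick an execution path *before* spending investigation compute.

    known_match: how well validated memory matches this incident, 0-1
    top_confidence: prior confidence in the leading hypothesis, 0-1
    Thresholds are illustrative, not tuned values.
    """
    if known_match > 0.9 and top_confidence > 0.9:
        return "answer_from_memory"   # no investigation at all
    if top_confidence > 0.6:
        return "propose_and_verify"   # cheap check of one hypothesis
    return "full_exploration"         # the exhaustive protocol

print(choose_path(0.95, 0.95))  # answer_from_memory
print(choose_path(0.20, 0.70))  # propose_and_verify
print(choose_path(0.10, 0.10))  # full_exploration
```

An operator override slots in naturally here: a human who can see at a glance that the problem is routine simply pins the path, bypassing the heuristic.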
What I’d tell another harness designer
If I were starting over on an agent harness today, three things would change about how I’d build it:
- Measure the H/E latency ratio from day one. If it’s trending toward 1 while accuracy is flat, the harness is accumulating savant behavior faster than it’s learning.
- Treat memory as a first-class component of the harness, not a retrieval add-on. Retrieval-augmented generation gives the agent more text. Validated operational memory gives it the ability to skip steps. These are different things, and only the second one fixes the savant problem.
- Design the harness as a menu of paths, and make the path-selection decision explicit. Any time the harness forces every trajectory through the same protocol, ask what signal would let you short-circuit it, and make that signal part of the input.
The savant problem doesn’t come from a bad model or a bad protocol. It comes from the default assumption that the hardest case defines the right amount of effort. The goal isn’t to make the agent investigate faster. It’s to make the agent investigate less when less is enough, and to know the difference.
That, more than any particular architecture choice, is what separates an agent that’s merely accurate from one that’s actually intelligent about its own effort.
The economic structure of this problem (response curves, marginal returns, the portfolio allocation of compute across agents) is something I’ll come back to in a future post.