The Agentic Harness Trilemma: Good, Fast, Cheap
15 Dec 2025 · 7 min read

Designing AWS frontier agents this year has reminded me of the famous CAP theorem from database classes: consistency, availability, partition tolerance — pick two! With autonomous agents it’s not really an impossibility theorem like CAP, but there is a Pareto frontier across accuracy, latency and cost that every agent must navigate. The interesting question is where on the frontier your use case demands you sit, and whether your architecture is even on the frontier at all, or stuck at a dominated point below it.
In classical LLM evaluation, cost means price per token and latency means time to generate. For autonomous agents, both change. Latency (L) is end-to-end wall-clock time: planning, tool calls, retrieval, retries, and orchestration overhead included. Cost (C) is realized token spend across the complete trajectory, not listed API price. Good (G) is task-level success: did the agent actually accomplish what it was asked to do? With those defined, here is the claim this post is built around:
For any frontier agent built on a fixed base model, the achievable set of (G, L, C) — maximizing G, minimizing L and C — has a Pareto frontier whose shape and position are determined primarily by harness choices: which models get assigned to which roles, when to retrieve, whether to parallelize, how to route by difficulty. These choices are under-discussed relative to model-level scaling, even though for most agent builders they matter more to the final product.
Most of the public conversation about these tradeoffs happens at the model level: scaling laws, reasoning budgets, the o1 / R1 family of models that learn to allocate thinking internally. Those developments are real and they push the frontier outward. But if you are shipping an agent today (OpenClaw, Manus, AWS Frontier Agents), the model is a fixed input to your problem. The knobs you actually control are at the harness level: which models to assign to which roles, when to retrieve, how to parallelize, how to coordinate multiple sub-agents, how to handle tool failures, how to persist state across sessions, how to fall back when something breaks mid-trajectory. A reasoning model gives you a better single call. It does not give you a working autonomous agent. A large and growing fraction of the gap between “model that can reason” and “agent that can act reliably over a real task distribution” is harness, especially for multi-step tool use, long-horizon tasks, and places where model-internal planning is not enough. That harness gap is where the trilemma actually binds.
Model providers cannot make these harness decisions for you (today), because the right tradeoff depends on your application’s utility function. A high-frequency trading agent, a coding agent, and an incident-response agent face the same frontier but should navigate it very differently.
Vertex 1: Cheap and fast, but wrong on hard
The most intuitive vertex. Run every task through a single, cheap model call. No retrieval, no verification, no multi-step reasoning. You get minimal latency and minimal cost; the policy looks fine in aggregate, but its failures concentrate on the hardest tasks.
How much this matters depends on your use case and whether “hard” correlates with what you actually care about. For a high-frequency trading agent, being quick and cheap might consistently outperform slower high-quality reasoning. The hard-to-predict stocks may also not be the ones with large expected moves. Win Fast or Lose Slow (Kang et al. 2025) makes this point directly: in real-time decision tasks, a 20% reduction in response time consistently beats slower but higher-quality reasoning, improving win rates by up to 80% in competitive gaming and daily yields by up to 26.5% in trading.
Vertex 2: Good and (near) fast, but expensive
This vertex has two flavors depending on which property you stress.
Good and near fast, but expensive — parallel ensembles. Want high accuracy with low latency? Run everything in parallel. Deploy N sub-agents simultaneously. Wall-clock time stays close to single-agent runtime (you wait for the slowest, not all N combined), while accuracy improves through ensemble effects (self-consistency, best-of-N, debate). And cost multiplies by N. The agent is buying speed with cost. An incident-response agent might lean into this: resolving a critical sev-1 outage fast enough outweighs any compute bill.
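The parallel-ensemble pattern is easy to sketch with asyncio. Everything here is a hypothetical stand-in: `call_agent` for a real model call, `score` for a verifier or reward model.

```python
import asyncio
import random

# Stand-in for a real model/tool call with variable latency.
async def call_agent(task: str, seed: int) -> str:
    await asyncio.sleep(random.uniform(0.1, 0.3))  # simulate model latency
    return f"answer-{seed}"

# Toy scorer standing in for a verifier or reward model.
def score(answer: str) -> float:
    return float(answer.rsplit("-", 1)[1])

async def best_of_n(task: str, n: int = 4) -> str:
    # Launch N sub-agents concurrently: wall-clock time is roughly the
    # slowest single run, but token cost is roughly N single runs.
    answers = await asyncio.gather(*(call_agent(task, i) for i in range(n)))
    return max(answers, key=score)

result = asyncio.run(best_of_n("resolve sev-1", n=4))
```

The `gather` call is the whole trick: latency is max(N runs) rather than sum(N runs), while cost is sum(N runs) either way.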
Eventually good and fast, but expensive — async refinement. Return a fast preliminary answer now and refine asynchronously in the background. This is analogous to BASE vs ACID in distributed databases: trade strong consistency for availability and speed, accepting that consistency improves over time. You get fast and eventually good, but you pay to run two versions of the agent (a fast drafter and a slow refiner), plus the infrastructure to coordinate them and reconcile revisions with the user.
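A minimal sketch of the draft-then-refine pattern, with hypothetical `fast_draft` and `slow_refine` functions standing in for a cheap model and an expensive one. The caller gets the draft immediately and reconciles the refined answer later.

```python
import asyncio

# Hypothetical cheap drafter: returns immediately.
async def fast_draft(task: str) -> str:
    return f"draft: {task}"

# Hypothetical expensive refiner: runs in the background.
async def slow_refine(task: str, draft: str) -> str:
    await asyncio.sleep(0.2)  # simulate the slow, high-quality pass
    return f"refined: {task}"

async def answer(task: str):
    draft = await fast_draft(task)                    # show the user this now
    refine_job = asyncio.create_task(slow_refine(task, draft))
    return draft, refine_job                          # caller reconciles later

async def main():
    draft, job = await answer("summarize incident")
    # ...the user is already reading `draft`; reconcile when the refiner lands:
    final = await job
    return draft, final

draft, final = asyncio.run(main())
```

The reconciliation step (`await job`) is where the BASE-style eventual consistency shows up: the user may act on the draft before the refinement arrives, and the harness has to handle the revision.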
Vertex 3: Accurate and cheap, but amortized slow
A lot of agent harnesses are built around mechanisms that look like free lunches: memory retrieval, routing classifiers, verification passes. For tasks where the mechanism fires usefully (retrieval finds a relevant runbook, the router catches a hard case, the verifier flags a bad tool call), the agent completes accurately, cheaply, and often much faster than it would have otherwise. That looks like breaking the frontier. But any “look before you leap” step has the same shape: on a single task where it helps, the system appears accurate, faster, and cheaper. There is no per-task penalty. The penalty exists only in expectation: every task pays the overhead of checking, even when the check finds nothing. The expected latency under a routing policy is:

E[L_route] = t_route + p_simple · L_simple + (1 − p_simple) · L_complex

where t_route is the per-task routing overhead, p_simple is the fraction of tasks the router diverts to the cheap path, and L_simple, L_complex are the latencies of the cheap and full pipelines.
Compare to the no-routing baseline, which always runs the full pipeline: E[L_baseline] = L_complex. Routing beats the baseline iff:

t_route < p_simple · (L_complex − L_simple)
That’s the load-bearing inequality. The routing tax t_route is paid on every single task; the benefit (L_complex − L_simple) is realized only on the p_simple fraction where the router actually catches an easy case. For a production agent checking a vector store or markdown memory files on every incident, that’s ~500 ms of overhead on 100% of tasks, helping only a small fraction of them. Plug into the inequality and you will often find you have built a feature that improves hard tasks but makes the amortized system slower across a distribution usually dominated by easy ones.
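Checking the inequality numerically is a one-liner. The figures below are illustrative, not measurements: a 500 ms check, a 2 s easy path, a 10 s full pipeline.

```python
def routing_wins(t_route: float, p_simple: float,
                 l_simple: float, l_complex: float) -> bool:
    """Routing beats the always-run-everything baseline iff the per-task
    routing tax is smaller than the expected latency it saves."""
    return t_route < p_simple * (l_complex - l_simple)

# With only 5% easy tasks the check loses; at 20% it pays for itself.
print(routing_wins(0.5, 0.05, 2.0, 10.0))  # 0.5 < 0.05 * 8 = 0.4 → False
print(routing_wins(0.5, 0.20, 2.0, 10.0))  # 0.5 < 0.20 * 8 = 1.6 → True
```

The break-even point p* = t_route / (L_complex − L_simple) is worth computing for every gate in your harness: below that easy-task fraction, the gate is a net loss.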
The same analysis applies to multi-agent spawning: coordinating N sub-agents adds setup and merge overhead on every invocation, while the specialization benefit only lands on the fraction of tasks complex enough to need it. If the task is easy, do it inline rather than orchestrate a handoff.
This is the vertex I find most interesting, and it connects to a deeper question about how agents should allocate effort. I explore it in detail in the next post.
Navigating the Frontier
The trilemma is not a constraint to solve; it is a surface to navigate. Five lessons from experience:
1. Measure the task distribution before you pick an architecture. The trilemma binds over distributions, not individual tasks. Classify a few hundred production traces by difficulty and frequency before choosing between single-call, cascade, or multi-agent. The right architecture is almost entirely determined by the shape of that distribution, not by the peak capability you designed for. The pattern I keep seeing is teams over-investing in complex handling for tail cases that turn out to be a small share of traffic, and under-investing in the boring cases that dominate it.
2. In my experience, difficulty prediction is the single highest-leverage harness intervention. Damani et al. (ICLR 2025) showed that a probe on LLM hidden representations can predict task difficulty at near-zero overhead, enabling meaningful compute reduction with minimal quality impact. Even a coarse signal (a logistic regression on cheap query features) beats a uniform policy. The expected value of imperfect prediction is almost always worth the cost of running it.
3. Parallelism is a last resort, not a default. Cascade routing (Dekoninck et al. 2025), where a single agent runs first and escalates only when confidence is below threshold, dominates uniform multi-agent policies on every distribution I have measured. The silent failure mode is running N agents on tasks where one would have sufficed; the cost compounds invisibly across production traffic and nobody notices until the bill shows up.
4. Retrieval is often paying its tax without earning its benefit. Gate it. Use a fast signal to decide whether to retrieve rather than querying on every task. TARG’s 70–90% retrieval reduction with no accuracy loss is the positive finding; the corollary is that most agents in production today are paying the retrieval tax many times more often than they benefit from it.
5. Budget on realized trajectory cost, not listed token price. Thinking budgets, retries, and tool-call cascades inflate realized cost unpredictably across runs of the same query. I have seen large variance in spend on identical inputs. Cost instrumentation at the trajectory level, not the per-call level, is table stakes for any production agent.
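The confidence-gated cascade from point 3 can be sketched in a few lines. All names are hypothetical; in a real harness, `cheap_model`’s confidence would come from logprobs, a verifier, or a learned probe rather than a keyword check.

```python
# Stand-in for a cheap model that reports a confidence alongside its answer.
def cheap_model(task: str) -> tuple[str, float]:
    conf = 0.9 if "easy" in task else 0.3  # toy confidence signal
    return f"cheap:{task}", conf

# Stand-in for the expensive full pipeline (retrieval, multi-agent, etc.).
def expensive_pipeline(task: str) -> str:
    return f"expensive:{task}"

def cascade(task: str, threshold: float = 0.7) -> str:
    answer, conf = cheap_model(task)
    if conf >= threshold:
        return answer                  # most traffic stops here
    return expensive_pipeline(task)    # escalate only the hard tail

print(cascade("easy lookup"))      # handled inline by the cheap model
print(cascade("gnarly refactor"))  # escalated to the full pipeline
```

The threshold is the knob that moves you along the frontier: raise it and you buy accuracy with cost; lower it and you buy cost with accuracy on the tail.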
The frontier moves. Better routing, adaptive retrieval, and plan caching push it outward. The practical goal is not to escape the trilemma but to ensure your agent is on the frontier, not stuck at a dominated interior point.