Chapter 3 · Scaling & Trusting the Harness · Lesson 3.5
Cost, Observability & Harness-as-a-Service
The win: know what your harness costs, watch what it actually does, and know when to stop tuning it.
- Chapter 0 · Sprint Zero
- Chapter 1 · The ratchet & the practice loop
- Chapter 2 · Spec-driven development in depth
- 3.1 · Tools: fewer and sharper
- 3.2 · MCP & tool safety
- 3.3 · Filesystem, Git & sandboxes
- 3.4 · Long-horizon autonomy
- 3.5 · Cost, observability & HaaS
What your harness costs
Two people can run the same model and get near-identical output, yet one of them pays a small fortune and the other pays pennies. The difference isn't the model - it's the harness (everything wrapped around the model: prompts, tools, and how you feed it context).
You pay per token - roughly, per chunk of text the model reads or writes. So cost is driven by how much text your harness pushes through it. The three usual culprits:
- Verbose system prompts - a bloated rules file that gets re-sent on every single turn.
- Re-reading specs and plans every turn - the same documents pasted back in again and again instead of being read once.
- Tool sprawl - dozens of tool definitions crammed into context, most of which the task never uses (that's Lesson 3.1: fewer, sharper tools).
One thing that looks like a cost but usually isn't: writing a spec first. Spec-driven work (Lesson 1.4) adds roughly 20-40% more tokens up front, but it usually saves more than that by avoiding rework - the agent doesn't wander off and build the wrong thing twice. Spend a little to read a plan once; save a lot by not redoing the job.
You can't improve what you can't see
Back in Lesson 1.3 we said a good harness makes success silent and failures verbose. That's also an observability rule: you don't need to watch the runs that worked - you watch the ones that broke. Observability just means being able to see what your agent actually did, not what it claimed it did.
To improve a harness you have to see it first: which tools it called, where it looped in circles, and where it failed. Those recorded runs are its trajectories - the play-by-play of a task. Addy Osmani frames harness engineering as observability-driven work: you watch real runs, and the failures you see are what tell you the next fix to make (Agent Harness Engineering). No guessing - the harness tells you where it hurts.
Knowing when to stop tuning
Here's the discipline most people miss: the harness is disposable (Lesson 0.2). A better model, six months from now, may simply not need the scaffolding you built - it will do on its own what you had to prop up. So don't polish scaffolding the next model will outgrow. That's wasted work.
The rule: tune only where today's model is actually weak, and stop the moment the gain is noise. This is the ratchet again - you add a fix only after a real failure, and you remove it once the model no longer needs it. A harness that only ever grows is a harness nobody is pruning.
Harness-as-a-Service (HaaS)
Where is all of this heading? Toward not writing your own loop at all. Instead of wiring an agent together on top of raw model APIs (send text, get text back), you build on a harness SDK - a ready-made runtime like the Claude Agent SDK or the Codex SDK that hands you the loop, the tool-calling, and the context handling out of the box. Osmani calls this the productised end state: harness engineering becomes something you consume, not something you hand-build each time. You still tune the parts that are yours - the rules, the tools, the checks - but the plumbing comes for free.
- Cost per task - what does one real job actually cost in tokens?
- Failure trajectories - where did runs loop, stall, or break?
- Is this scaffolding still earning its keep? - or is the current model past needing it?
Check yourself
The biggest harness cost drivers are -
Cost is tokens pushed through the model: bloated system prompts, re-reading specs/plans every turn, and tool sprawl. A spec adds ~20-40% tokens but usually saves more by avoiding rework.
To improve a harness you must first -
Observability comes first: watch the trajectories - which tools it called, where it looped, where it failed. The failures you can see are what tell you the next fix to make.
You should stop tuning when -
The harness is disposable - a better model may delete your scaffolding. Tune only where today's model is weak, and stop when the improvement is indistinguishable from noise.
Do this now (5 min)
Open your agent's usage or cost view for one recent, real task. Find the single biggest driver:
- a bloated system prompt or rules file re-sent every turn,
- the same files re-read over and over, or
- too many tool calls (or too many tools loaded).
Whatever's biggest, that is your next harness fix. Name it, and you've turned the ratchet one more notch.
You've completed Chapter 2
One breath, both chapters. Chapter 1 was the core loop: the ratchet that turns every failure into a durable fix, placed in its cheapest home - a rule, a hook, a spec, clean context, or a review. Chapter 2 was running that loop for real and safely: sharp tools, trusted tools, a solid filesystem-and-Git substrate, structure for long-horizon jobs, and now watching what it all costs and does. The throughline never changes: the model is the commodity, the harness is your edge - and a good harness stays lean, safe, observable, and disposable. Now go keep ratcheting on real work.
Go deeper
Primary source (read this): Addy Osmani - Agent Harness Engineering, on observability-driven tuning and where harness SDKs are taking the field.
Secondary: the Artificial Analysis Coding Agent Index writeup - Best AI Coding Agents, for the same-model, different-cost numbers.
Wisdom (test it on people): the HumanLayer community - a good place to compare notes on what actually earns its keep in a harness and what to prune.