Chapter 4 · Measuring & Evolving the Harness · Lesson 4.5
Knowing a Change Actually Helped
The win: run your change against your eval set - and keep it only if the improvement clears the noise.
The method: before and after
You changed something in your harness - a new rule, a sharper spec, a different model - and it feels better. Does it feel better, or is it better? The only honest answer comes from your eval set (from Lesson 4.2): run your golden tasks with the old harness and write down the results, make one change, then run the exact same tasks again. Same tasks, same measurement, before and after.
The rule that makes this work is change one thing at a time. If you swap the model and rewrite a rule in the same pass and the score moves, you can't tell which move did it - or whether the two cancelled out. One change per run, or you can't attribute the result to anything.
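A minimal sketch of that before-and-after discipline, assuming the harness can be described as a flat dictionary of settings and that a `run_task(harness, task)` function returns True on a pass. Both are hypothetical stand-ins, not part of any particular tool; swap in whatever your own eval runner provides. The comparison refuses to run unless exactly one setting differs.

```python
from typing import Callable, Dict, List

Harness = Dict[str, str]  # hypothetical: flat settings such as model or rules version


def changed_keys(old: Harness, new: Harness) -> List[str]:
    """Return the settings whose values differ between two harness configs."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


def before_and_after(
    old: Harness,
    new: Harness,
    tasks: List[str],
    run_task: Callable[[Harness, str], bool],
) -> Dict[str, Dict[str, bool]]:
    """Run the same tasks against both harnesses, refusing multi-change comparisons."""
    diff = changed_keys(old, new)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one change, found {len(diff)}: {diff}")
    return {
        "before": {t: run_task(old, t) for t in tasks},
        "after": {t: run_task(new, t) for t in tasks},
    }


def fake_run_task(harness: Harness, task: str) -> bool:
    # Stand-in for a real eval run: pretend the v2 rules fix the "refactor" task.
    return task != "refactor" or harness["rules"] == "v2"


old = {"model": "model-a", "rules": "v1"}
new = {"model": "model-a", "rules": "v2"}  # one change only: the rules
results = before_and_after(old, new, ["bugfix", "refactor"], fake_run_task)
print(results)
```

If `new` also swapped the model, `before_and_after` would raise instead of producing a result you couldn't attribute to either change.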
Beware the noise: small gaps are ties
Now read the numbers carefully, because a lot of what you'll see is noise, not signal. Benchmark scores are often vendor-reported, and small differences in the scaffold around a model shift results on their own - so a one- or two-point move may be nothing at all. There's even evidence that, for a strong model, the effect of swapping the harness can land inside the margin of error, and that a scaffold that helps one model can quietly hurt another (see the Artificial Analysis Coding Agent Index writeup, Firecrawl). Treat small gaps as ties, not wins.
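To get a feel for how wide the noise can be, here is a rough back-of-the-envelope sketch. It assumes pass/fail scoring and treats the two runs as independent samples, which is a simplification because the same tasks appear in both runs. The two-standard-error cutoff is a common rule of thumb, not something this lesson prescribes, and the function names are illustrative.

```python
import math


def diff_standard_error(pass_rate_old: float, pass_rate_new: float, n_tasks: int) -> float:
    """Approximate standard error of the gap between two pass rates on n_tasks each."""
    var_old = pass_rate_old * (1 - pass_rate_old) / n_tasks
    var_new = pass_rate_new * (1 - pass_rate_new) / n_tasks
    return math.sqrt(var_old + var_new)


def is_tie(pass_rate_old: float, pass_rate_new: float, n_tasks: int, z: float = 2.0) -> bool:
    """Treat the gap as a tie when it is no larger than z standard errors (assumed cutoff)."""
    gap = abs(pass_rate_new - pass_rate_old)
    return gap <= z * diff_standard_error(pass_rate_old, pass_rate_new, n_tasks)


# A two-point move on 50 tasks: 70% -> 72%.
se = diff_standard_error(0.70, 0.72, 50)
print(f"standard error of the gap: {se:.3f}")
print("tie" if is_tie(0.70, 0.72, 50) else "real difference")
```

Under these assumptions the standard error of the gap is about nine points, so a two-point move on 50 tasks is nowhere near distinguishable from a tie.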
Watch for regressions
Here's the trap: a change that fixes task A can quietly break task B. That's exactly why your eval set has several tasks and not one - a single golden task would hide the damage. So don't only check that the thing you were fixing improved; check that everything else still passes. A test-and-verify loop catches this cheaply: in one reported result, a test-and-verify approach cut regressions by about 70% (per My Experiments With AI). And when the call is close, a cross-agent review - a second, independent agent - gives you a second opinion on whether the change is a real improvement or just a reshuffle.
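The regression check itself can be sketched in a few lines, assuming each run is recorded as a mapping from task name to pass/fail. The task names and the `find_regressions` helper are made up for illustration.

```python
from typing import Dict, List


def find_regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Return tasks that passed with the old harness but fail (or are missing) with the new one."""
    return sorted(
        task for task, passed in before.items()
        if passed and not after.get(task, False)
    )


# Hypothetical results: the change fixed task_a but broke task_b.
before = {"task_a": False, "task_b": True, "task_c": True}
after = {"task_a": True, "task_b": False, "task_c": True}

print(find_regressions(before, after))
```

Both runs pass two of three tasks, so the total alone would hide the swap. Comparing task by task is what exposes the reshuffle.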
Close the ratchet honestly
Now the decision. Keep the change only if it clearly beats the old harness across the whole set without regressing any other task. If the win sits inside the noise, or it fixed one thing and broke another, revert it - no hard feelings. This is what keeps the ratchet honest: it tightens on evidence, not on hope. A change you kept because it "felt" better is exactly the unearned rule the ratchet is supposed to keep out.
- One change at a time - so any movement can be traced to a single cause.
- Run the whole eval set both ways - old harness and new, same golden tasks.
- Check nothing regressed - not just the task you were fixing; every other task too.
- Keep only if the win clears the noise - a one- or two-point wobble is a tie. Otherwise, revert.
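The checklist above can be sketched as a single decision function. This assumes pass/fail results keyed by task name. The `noise_points` default of 2.0 is only a placeholder echoing the "one- or two-point wobble"; set it from the run-to-run variation you actually observe. `keep_or_revert` is an illustrative name, not an existing tool.

```python
from typing import Dict


def keep_or_revert(
    before: Dict[str, bool],
    after: Dict[str, bool],
    noise_points: float = 2.0,  # assumed threshold, in percentage points
) -> str:
    """Return 'keep' only if nothing regressed and the gain exceeds the noise threshold."""
    if not before:
        raise ValueError("the eval set is empty")
    # Any task that used to pass and no longer does is a regression.
    regressed = [t for t, ok in before.items() if ok and not after.get(t, False)]
    if regressed:
        return "revert"
    old_rate = 100 * sum(before.values()) / len(before)
    new_rate = 100 * sum(after.get(t, False) for t in before) / len(before)
    return "keep" if new_rate - old_rate > noise_points else "revert"


baseline = {"task_a": False, "task_b": True, "task_c": True}

clean_win = {"task_a": True, "task_b": True, "task_c": True}
traded_off = {"task_a": True, "task_b": False, "task_c": True}
no_movement = dict(baseline)

for name, result in [("clean_win", clean_win), ("traded_off", traded_off), ("no_movement", no_movement)]:
    print(name, "->", keep_or_revert(baseline, result))
```

On a set this small a single flipped task moves the pass rate by a third, so the threshold barely matters here; it starts to matter once the eval set is large enough for one- and two-point moves to occur.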
Check yourself
To attribute a result to your change, you should -
If you change two things at once and the score moves, you can't tell which one did it. One change, run the whole eval set before and after - that's the only clean read.
A one- or two-point move on a small sample is -
Benchmark scores are often vendor-reported, and scaffold differences shift results, so small gaps are ties. A small win on a small sample can sit inside the margin of error.
A change that fixes one task but breaks another should be -
A fix for task A can break task B - that's why the eval set holds several tasks. Don't keep it on the strength of one win; revert or rework until it beats the old harness without breaking others.
Do this now (5 min)
Take the three golden tasks you wrote in Lesson 4.2. Then:
- Run all three with your current harness and note the results.
- Make one small change - a rule, a spec tweak, or a model swap.
- Run the same three again.
Decide: real win, or noise? If the new run clearly beat the old results with nothing broken, keep the change. Otherwise, revert.
You've completed Chapter 4
One breath, the whole chapter: you learned to stop trusting vibes, to build an eval set of real golden tasks, to read failures and not just pass rates, to automate the improvement loop so the harness keeps pace with the models, and finally to prove a change helped before you keep it. That last habit is the point of the whole chapter - measurement is what makes the ratchet honest. Chapter 5 takes this out of your own hands and into a team: working across many agents, and sharing one harness so everyone inherits the same earned rules.
Go deeper
Primary source (read this): the Artificial Analysis Coding Agent Index writeup - Best AI coding agents, on why small gaps are ties and why vendor-reported scores need caution.
Secondary: My Experiments With AI - Why the Harness Wins, on the test-and-verify loop that cuts regressions.
Wisdom (test it on people): the HumanLayer community - a good place to compare how other teams decide a change really earned its place.