Chapter 4 · Measuring & Evolving the Harness · Lesson 4.5

Knowing a Change Actually Helped

The win: run your change against your eval set - and keep it only if the improvement clears the noise.

The method: before and after

You changed something in your harness - a new rule, a sharper spec, a different model - and it feels better. Does it feel better, or is it better? The only honest answer comes from your eval set (from Lesson 4.2): run your golden tasks with the old harness and write down the results, make one change, then run the exact same tasks again. Same tasks, same measurement, before and after.
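Here is a minimal sketch of that before-and-after run. Everything in it is an assumption for illustration: the task names, the `run_eval` helper, and the two stand-in harness functions are hypothetical, and a real harness would call your agent and check its output instead of returning a fixed answer.

```python
from typing import Callable, Dict, List

# Hypothetical shape: a harness is any callable that takes a task
# description and returns True if the task passed its check.
Harness = Callable[[str], bool]


def run_eval(harness: Harness, tasks: List[str]) -> Dict[str, bool]:
    """Run every golden task through one harness and record pass/fail per task."""
    return {task: harness(task) for task in tasks}


# Stand-in harnesses for illustration only; swap in real agent runs.
def old_harness(task: str) -> bool:
    return task != "fix flaky test"


def new_harness(task: str) -> bool:
    return True


# Placeholder names for the golden tasks from Lesson 4.2.
golden_tasks = ["add endpoint", "fix flaky test", "rename module"]

before = run_eval(old_harness, golden_tasks)  # write these down first
after = run_eval(new_harness, golden_tasks)   # same tasks, same measurement

for task in golden_tasks:
    print(f"{task}: before={before[task]} after={after[task]}")
```

The point of the shape is that both runs go through the same function over the same task list, so the harness is the only thing that differs.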

The rule that makes this work: change one thing at a time. If you swap the model and rewrite a rule in the same pass and the score moves, you can't tell which change did it - or whether the two cancelled out. One change per run, or you can't attribute the result to anything.
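If your harness settings live in something dictionary-like, you can enforce the one-change rule mechanically before you spend a run. This is a sketch under that assumption; the config keys (`model`, `rules`, `spec`) and both helper names are hypothetical, not part of any particular tool.

```python
_MISSING = object()  # sentinel so a key explicitly set to None still counts as present


def changed_keys(old: dict, new: dict) -> list:
    """Return the sorted keys whose values differ between two harness configs."""
    keys = set(old) | set(new)
    return sorted(k for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING))


def require_single_change(old: dict, new: dict) -> str:
    """Return the one changed key, or raise if zero or several things changed."""
    diff = changed_keys(old, new)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one change, found {len(diff)}: {diff}")
    return diff[0]


# Hypothetical configs: only the rules list differs.
old_config = {"model": "model-a", "rules": ["run tests before commit"], "spec": "v1"}
new_config = {"model": "model-a", "rules": ["run tests before commit", "no TODOs"], "spec": "v1"}

print("changed:", require_single_change(old_config, new_config))
```

If the check raises, split the change into separate runs before measuring anything.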

Beware the noise: small gaps are ties

Now read the numbers carefully, because a lot of what you'll see is noise, not signal. Benchmark scores are often vendor-reported, and small differences in the scaffold around a model shift results on their own - so a one- or two-point move may be nothing at all. There's even evidence that the effect of swapping the harness can land inside the margin of error for a strong model, and that a scaffold that helps one model can quietly hurt another (see the Artificial Analysis Coding Agent Index writeup, via Firecrawl). Treat small gaps as ties, not wins.
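To put a rough number on "inside the noise", one option is the textbook standard error for a difference between two pass rates. This is a back-of-the-envelope sketch, not something the cited sources prescribe: it assumes tasks are independent pass/fail trials, it ignores that the same tasks are run both times, and it is crude on very small sets. The function names and the factor of 2 (roughly a 95% band) are assumptions.

```python
import math


def gap_and_noise(passed_before: int, passed_after: int, n_tasks: int, z: float = 2.0):
    """Return (gap in pass rate, rough noise margin) for a before/after pair."""
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    p_before = passed_before / n_tasks
    p_after = passed_after / n_tasks
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(
        p_before * (1 - p_before) / n_tasks + p_after * (1 - p_after) / n_tasks
    )
    return p_after - p_before, z * se


def clears_noise(passed_before: int, passed_after: int, n_tasks: int) -> bool:
    """True only if the improvement is larger than the rough noise margin."""
    gap, margin = gap_and_noise(passed_before, passed_after, n_tasks)
    return gap > margin


# Example: 14 of 20 tasks passed before, 15 of 20 after.
gap, margin = gap_and_noise(14, 15, 20)
print(f"gap={gap:.2f} margin={margin:.2f} clears_noise={clears_noise(14, 15, 20)}")
```

On 20 tasks, a move from 14 to 15 passes is a 5-point gap against a margin of roughly 28 points, so by this yardstick it is a tie. Repeating the same run several times and looking at the spread is another way to estimate the margin.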

"A small win on a small sample is not a win." - Artificial Analysis Coding Agent Index, via Firecrawl

Watch for regressions

Here's the trap: a change that fixes task A can quietly break task B. That's exactly why your eval set has several tasks and not one - a single golden task would hide the damage. So don't only check that the thing you were fixing improved; check that everything else still passes. A verify-and-test loop catches this cheaply: one reported result had a test-and-verify approach cutting regressions by about 70% (per My Experiments With AI). And when the call is close, a cross-agent review - a second, independent agent - gives you a second opinion on whether the change is a real improvement or just a reshuffle.
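A per-task comparison makes the hidden damage visible. The sketch below assumes each run is recorded as a mapping from task name to pass/fail; the task names and helper names are hypothetical.

```python
from typing import Dict, List


def find_regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Tasks that passed with the old harness but do not pass with the new one."""
    return sorted(t for t, passed in before.items() if passed and not after.get(t, False))


def find_fixes(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Tasks that failed with the old harness and pass with the new one."""
    return sorted(t for t, passed in before.items() if not passed and after.get(t, False))


# Illustrative results: the change fixed task A and broke task B.
before = {"task A": False, "task B": True, "task C": True}
after = {"task A": True, "task B": False, "task C": True}

print("pass count before:", sum(before.values()), "after:", sum(after.values()))
print("fixed:", find_fixes(before, after))
print("regressed:", find_regressions(before, after))
```

Note that the total pass count is identical in both runs here. Only the per-task view shows that one task was traded for another.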

Close the ratchet honestly

Now the decision. Keep the change only if it clearly beats the old harness across the whole set without regressing any other task. If the win sits inside the noise, or it fixed one thing and broke another, revert it - no hard feelings. This is what keeps the ratchet honest: it tightens on evidence, not on hope. A change you kept because it "felt" better is exactly the unearned rule the ratchet is supposed to keep out.
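Written as code, the keep-or-revert rule is small. This is one possible encoding, assuming you already have a pass rate for each run, a noise margin you trust, and a list of regressed tasks; the function name and parameters are hypothetical.

```python
from typing import List


def decide(score_before: float, score_after: float,
           noise_margin: float, regressions: List[str]) -> str:
    """Return "keep" only for a clear win with nothing broken, else "revert"."""
    if regressions:
        return "revert"  # fixed one thing, broke another
    if score_after - score_before <= noise_margin:
        return "revert"  # the win sits inside the noise
    return "keep"


print(decide(0.70, 0.75, 0.10, []))          # small gap, nothing broken
print(decide(0.60, 0.90, 0.10, []))          # clear gap, nothing broken
print(decide(0.60, 0.90, 0.10, ["task B"]))  # clear gap, one regression
```

Both checks default to "revert", which matches the lesson: the burden of proof is on the change, not on the old harness.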

Check yourself

To attribute a result to your change, you should -

If you change two things at once and the score moves, you can't tell which one did it. Make one change and run the whole eval set before and after - that's the only clean read.

A one- or two-point move on a small sample is -

Benchmark scores are often vendor-reported, and scaffold differences shift results, so small gaps are ties. A small win on a small sample can sit inside the margin of error.

A change that fixes one task but breaks another should be -

A fix for task A can break task B - that's why the eval set holds several tasks. Don't keep the change on the strength of one win; revert or rework it until it beats the old harness without breaking any other task.

Do this now (5 min)

Take the three golden tasks you wrote in Lesson 4.2. Then:

  1. Run all three with your current harness and note the results.
  2. Make one small change - a rule, a spec tweak, or a model swap.
  3. Run the same three again.

Decide: real win, or noise? If the new run clearly beat the old results with nothing broken, keep the change. Otherwise, revert.

I'm your teacher - ask freely. Not sure whether your before/after gap is a real win or just noise? Paste both runs and we'll read them together. And this completes Chapter 4 - so you can ask me to open Chapter 5, or revisit any lesson that still feels shaky.

You've completed Chapter 4

One breath, the whole chapter: you learned to stop trusting vibes, to build an eval set of real golden tasks, to read failures and not just pass rates, to automate the improvement loop so the harness keeps pace with the models, and finally to prove a change helped before you keep it. That last habit is the point of the whole chapter - measurement is what makes the ratchet honest. Chapter 5 takes this out of your own hands and into a team: working across many agents, and sharing one harness so everyone inherits the same earned rules.

Go deeper

Primary source (read this): the Artificial Analysis Coding Agent Index writeup - Best AI coding agents, on why small gaps are ties and why vendor-reported scores need caution.

Secondary: My Experiments With AI - Why the Harness Wins, on the test-and-verify loop that cuts regressions.

Wisdom (test it on people): the HumanLayer community - a good place to compare how other teams decide a change really earned its place.