Chapter 4 · Measuring & Evolving the Harness · Lesson 4.2
Building an Eval Set
A handful of your own real tasks with known-good outcomes becomes the yardstick you run after every harness change.
What an eval set is
The last lesson showed why "it feels better" can't tell you whether a harness change helped. This lesson gives you the thing that can: an eval set - a small, fixed set of representative real tasks you rerun after every change. Each item in it is a golden task: one real job with a known-good outcome and a clear done-when, so a pass or a fail is unambiguous. That's the whole idea. Not a suite, not a framework - a short list of jobs you already do, each with a right answer written down next to it.
Measure on your tasks, not public benchmarks
The obvious move is to reach for a public leaderboard. Don't lean on it. Public benchmarks get contaminated - models train on the very problems the benchmark scores - and they get gamed, since the numbers you see are often reported by the vendor selling the model. A high public score may say nothing about how the model does on your work. Five real tasks from your own week are a truer signal than any headline figure.
The Artificial Analysis Coding Agent Index write-up (linked under Go deeper) makes the same point from the other direction: judging an agent on a full, real task beats scoring the model in isolation. Your eval set is exactly that - real tasks, judged whole.
What makes a good golden task
Three properties, and a task needs all three:
- Representative - a real thing you actually do, not a puzzle you invented to be clever.
- Bounded - one clear job, not "rebuild the app". Small enough that its outcome is a single yes or no.
- Objectively checkable - the pass/fail is something a machine or a glance can decide: a test passes, the output matches a shape, a value lands in range.
That third one is the same idea as the done-when from Lesson 1.4: a testable success criterion, not a vibe. "The agent handled it well" isn't checkable; "the returned list is sorted newest-first and non-empty" is. And you build each task the way you build any harness piece - by working backwards from a behaviour you need the agent to deliver, then writing the check that proves it did.
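As a rough illustration, the "sorted newest-first and non-empty" criterion can be written as a check a machine decides. This is a minimal sketch, not a prescribed implementation: the function name `sorted_newest_first` and the assumption that each item is a dict with a `"date"` field are hypothetical.

```python
from datetime import date

def sorted_newest_first(items):
    """Return True only if `items` is non-empty and ordered newest-first.

    Assumption: each item is a dict with a comparable "date" value.
    """
    if not items:
        return False  # an empty list fails the done-when outright
    dates = [item["date"] for item in items]
    # Every date must be at least as recent as the one that follows it.
    return all(a >= b for a, b in zip(dates, dates[1:]))

# Hypothetical agent outputs, for illustration only.
good = [{"date": date(2024, 5, 2)}, {"date": date(2024, 4, 30)}]
bad = [{"date": date(2024, 4, 30)}, {"date": date(2024, 5, 2)}]

print(sorted_newest_first(good))  # True
print(sorted_newest_first(bad))   # False
print(sorted_newest_first([]))    # False
```

The check returns a plain yes or no, so there is nothing left to interpret when a run fails.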
Keep it small and cheap
Three to ten tasks you can rerun in a couple of minutes beat two hundred you never run. The point of the set is that you use it - after every rule change, every new hook, every model bump - so it has to stay cheap enough to run on a whim. Start small.
It grows the same way your rules file grew: one task at a time, each earned. When a new failure mode shows up in real work, you add a golden task that would have caught it. That's the ratchet pointed at your eval set - it only ever tightens, and only after a real failure.
Each golden task fits in three lines:
Input: the real prompt/input you'd give the agent
Known-good: what a correct result looks like
Check: one line that decides pass or fail
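If you'd rather keep the set in code, the same three fields map onto a small record plus a loop. The sketch below is one possible shape, not the lesson's own tooling: `GoldenTask`, `run_eval`, the `slugify-title` task, and `stub_agent` are all hypothetical names, and the stub stands in for whatever call actually drives your harness.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenTask:
    name: str
    input: str                      # the real prompt you'd give the agent
    known_good: str                 # what a correct result looks like, in words
    check: Callable[[str], bool]    # one function that decides pass or fail

def run_eval(tasks, agent):
    """Run each task's input through `agent` and record pass/fail by name."""
    return {task.name: bool(task.check(agent(task.input))) for task in tasks}

# Hypothetical task and a stand-in agent, for illustration only.
tasks = [
    GoldenTask(
        name="slugify-title",
        input="Turn the title 'Hello, World!' into a URL slug.",
        known_good="hello-world",
        check=lambda output: output.strip() == "hello-world",
    ),
]

def stub_agent(prompt):
    return "hello-world\n"  # replace with a call into your real harness

results = run_eval(tasks, stub_agent)
print(results)  # {'slugify-title': True}
```

Keeping the result as a name-to-boolean mapping makes it easy to compare two runs and see which task flipped after a harness change.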
Check yourself
An eval set is best built from -
Measure on the work you actually do. Public benchmarks get contaminated (models train on them) and gamed (vendor-reported), so a public number may not reflect your work; your own real tasks are the truer signal.
A good golden task must be -
Representative, bounded, and objectively checkable - a test passes, output matches a shape, a value is in range. That checkable outcome is the same done-when idea from Lesson 1.4, worked backwards from the behaviour you need.
You add a new golden task when -
The ratchet, applied to your eval set: a new failure mode in real work earns a new task that would have caught it. Keep the set to the 3-10 tasks you'll actually rerun.
Do this now (10 min)
Write three golden tasks from your real work. For each one, write three lines:
- The input you'd give the agent (the actual prompt or request).
- What correct looks like - the known-good outcome.
- The single check that decides pass or fail.
Save them in a file. That file is your eval set - the yardstick you'll run in the next lessons every time you change the harness.
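One low-effort way to make that file usable later is to keep it as plain text, one blank-line-separated block per task, and parse it on demand. The sketch below assumes that layout; the `parse_eval_set` helper and both sample tasks are hypothetical, and the file format itself is an assumption rather than something the lesson mandates.

```python
def parse_eval_set(text):
    """Parse blank-line-separated blocks of 'Input:', 'Known-good:', 'Check:' lines."""
    tasks = []
    for block in text.strip().split("\n\n"):
        task = {}
        for line in block.splitlines():
            # Split on the first colon only, so values may contain colons.
            key, _, value = line.partition(":")
            task[key.strip().lower()] = value.strip()
        missing = {"input", "known-good", "check"} - task.keys()
        if missing:
            raise ValueError(f"task is missing {sorted(missing)}: {block!r}")
        tasks.append(task)
    return tasks

# Hypothetical file contents, for illustration only.
SAMPLE = """\
Input: Add a --dry-run flag to the export script
Known-good: Running with --dry-run prints the plan and writes no files
Check: pytest tests/test_export.py::test_dry_run passes

Input: Return the ten most recent orders from the API
Known-good: A JSON list of ten orders, newest first
Check: the returned list is sorted newest-first and non-empty
"""

eval_set = parse_eval_set(SAMPLE)
print(len(eval_set))  # 2
```

The parser rejects any task that lacks one of the three lines, which keeps half-written entries from slipping into the set.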
Go deeper
Primary source (read this): the Artificial Analysis Coding Agent Index write-up - The best AI coding agents - on why full-stack, real-task evaluation beats isolated model scores.
Wisdom (test it on people): r/ChatGPTCoding - where people compare how models do on the work they actually ship, not the leaderboards.