Chapter 4 · Measuring & Evolving the Harness · Lesson 4.2

Building an Eval Set

A handful of your own real tasks with known-good outcomes becomes the yardstick you run after every harness change.

What an eval set is

The last lesson showed why "it feels better" can't tell you whether a harness change helped. This lesson gives you the thing that can: an eval set - a small, fixed set of representative real tasks you rerun after every change. Each item in it is a golden task: one real job with a known-good outcome and a clear done-when, so a pass or a fail is unambiguous. That's the whole idea. Not a suite, not a framework - a short list of jobs you already do, each with a right answer written down next to it.

Measure on your tasks, not public benchmarks

The obvious move is to reach for a public leaderboard. Don't lean on it. Public benchmarks get contaminated - models train on the very problems the benchmark scores - and they get gamed, since the numbers you see are often reported by the vendor selling the model. A high public score may say nothing about how the model does on your work. Five real tasks from your own week are a truer signal than any headline figure.

Measure on the work you actually do - your own real tasks are a truer signal than a public number a vendor reported about itself.
Source: Artificial Analysis Coding Agent Index, via "The best AI coding agents"

The write-up above makes the same point from the other direction: judging an agent on a full, real task beats scoring the model in isolation. Your eval set is exactly that - real tasks, judged whole.

What makes a good golden task

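Three properties, and a task needs all three:

  1. Representative - a job you actually do, not an invented exercise.
  2. Bounded - one job with a clear scope, cheap enough to rerun.
  3. Objectively checkable - a test passes, the output matches a shape, or a value is in range.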

That third one is the same idea as the done-when from Lesson 1.4: a testable success criterion, not a vibe. "The agent handled it well" isn't checkable; "the returned list is sorted newest-first and non-empty" is. And you build each task the way you build any harness piece - by working backwards from a behaviour you need the agent to deliver, then writing the check that proves it did.
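To make the "sorted newest-first and non-empty" criterion concrete, here is a minimal sketch of it as a check you could run. The function name and the `created` field are assumptions for illustration; your own output will have its own shape.

```python
from datetime import date


def check_sorted_newest_first(items: list[dict]) -> bool:
    """Pass only if the list is non-empty and ordered newest-first.

    Assumes each item carries a hypothetical 'created' date field.
    """
    if not items:
        return False
    dates = [item["created"] for item in items]
    # Every date must be on or after the one that follows it.
    return all(a >= b for a, b in zip(dates, dates[1:]))


good = [{"created": date(2024, 3, 2)}, {"created": date(2024, 1, 15)}]
bad = [{"created": date(2024, 1, 15)}, {"created": date(2024, 3, 2)}]

print(check_sorted_newest_first(good))  # True
print(check_sorted_newest_first(bad))   # False
print(check_sorted_newest_first([]))    # False
```

The check returns a plain True or False, so there is nothing left to interpret once it has run.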

Keep it small and cheap

Three to ten tasks you can rerun in a couple of minutes beat two hundred you never run. The point of the set is that you use it - after every rule change, every new hook, every model bump - so it has to stay cheap enough to run on a whim. Start small.

It grows the same way your rules file grew: one task at a time, each earned. When a new failure mode shows up in real work, you add a golden task that would have caught it. That's the ratchet pointed at your eval set - it only ever tightens, and only after a real failure.

Anatomy of a golden task
Input:        the real prompt/input you'd give the agent
Known-good:   what a correct result looks like
Check:        one line that decides pass or fail
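
One way to hold that three-part anatomy in code is sketched below. It is an illustration, not a prescribed format: `GoldenTask`, `run_eval`, and `fake_agent` are hypothetical names, and `fake_agent` stands in for whatever call actually drives your harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenTask:
    name: str
    input_text: str               # Input: the real prompt you'd give the agent
    known_good: str               # Known-good: what a correct result looks like
    check: Callable[[str], bool]  # Check: one predicate that decides pass or fail


def run_eval(tasks: list[GoldenTask], agent: Callable[[str], str]) -> dict[str, bool]:
    """Run every task through the agent and record pass or fail per task."""
    results: dict[str, bool] = {}
    for task in tasks:
        try:
            results[task.name] = bool(task.check(agent(task.input_text)))
        except Exception:
            # A crash counts as a fail, so one broken task can't stop the run.
            results[task.name] = False
    return results


def fake_agent(prompt: str) -> str:
    # Hypothetical stand-in for your real harness; replace with the real call.
    return "v1.4.2" if "version" in prompt else ""


tasks = [
    GoldenTask(
        name="report-version",
        input_text="What version is pinned in our release notes?",
        known_good="v1.4.2",
        check=lambda out: out.strip() == "v1.4.2",
    ),
    GoldenTask(
        name="non-empty-summary",
        input_text="Summarise yesterday's deploy log.",
        known_good="a non-empty summary",
        check=lambda out: len(out.strip()) > 0,
    ),
]

results = run_eval(tasks, fake_agent)
print(results)
print(f"{sum(results.values())}/{len(results)} passed")  # 1/2 passed
```

Rerunning the same `tasks` list after a harness change and comparing the pass count is the whole measurement; the known-good field is there for the human reading the file, the check is what decides.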

Check yourself

An eval set is best built from -

Measure on the work you actually do. Public benchmarks get contaminated (models train on them) and gamed (vendor-reported), so a public number may not reflect your work; your own real tasks are the truer signal.

A good golden task must be -

Representative, bounded, and objectively checkable - a test passes, output matches a shape, a value is in range. That checkable outcome is the same done-when idea from Lesson 1.4, worked backwards from the behaviour you need.

You add a new golden task when -

The ratchet, applied to your eval set: a new failure mode in real work earns a new task that would have caught it. Keep the set to the 3-10 tasks you'll actually rerun.

Do this now (10 min)

Write three golden tasks from your real work. For each one, write three lines:

  1. The input you'd give the agent (the actual prompt or request).
  2. What correct looks like - the known-good outcome.
  3. The single check that decides pass or fail.

Save them in a file. That file is your eval set - the yardstick you'll run in the next lessons every time you change the harness.
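
If you want that file to be machine-readable, one possible layout is a JSON list with the same three fields. The sketch below assumes a hypothetical `eval_set.json` and two invented check kinds (`equals`, `contains`); the example tasks are placeholders, not real ones, and a plain text file works just as well to start.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical check kinds; add another only when a real task needs it.
CHECKS = {
    "equals": lambda output, value: output.strip() == value,
    "contains": lambda output, value: value in output,
}

REQUIRED_FIELDS = {"input", "known_good", "check"}


def load_eval_set(path: Path) -> list[dict]:
    """Load golden tasks from a JSON file and reject malformed entries early."""
    tasks = json.loads(path.read_text(encoding="utf-8"))
    for index, task in enumerate(tasks):
        missing = REQUIRED_FIELDS - task.keys()
        if missing:
            raise ValueError(f"task {index} is missing {sorted(missing)}")
        kind = task["check"]["kind"]
        if kind not in CHECKS:
            raise ValueError(f"task {index} has unknown check kind {kind!r}")
    return tasks


def passes(task: dict, output: str) -> bool:
    """Apply the task's single check to the agent's output."""
    check = task["check"]
    return CHECKS[check["kind"]](output, check["value"])


# Placeholder tasks for illustration only.
example_tasks = [
    {"input": "What Python version does our CI pin?",
     "known_good": "3.12",
     "check": {"kind": "equals", "value": "3.12"}},
    {"input": "Write the changelog entry for the login fix.",
     "known_good": "An entry under a Fixed heading",
     "check": {"kind": "contains", "value": "Fixed"}},
    {"input": "List the failing tests in the last run.",
     "known_good": "Output names test_checkout_total",
     "check": {"kind": "contains", "value": "test_checkout_total"}},
]

# Write the file to a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_set.json"
    path.write_text(json.dumps(example_tasks, indent=2), encoding="utf-8")
    loaded = load_eval_set(path)

print(len(loaded), "tasks loaded")
print(passes(loaded[0], "3.12\n"))            # True
print(passes(loaded[1], "Added: dark mode"))  # False
```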

I'm your teacher - ask freely. Got a task that's too fuzzy to grade - "make the summaries better", "handle errors well"? Paste it and I'll help you turn it into a golden task with a one-line check you could actually run.

Go deeper

Primary source (read this): the Artificial Analysis Coding Agent Index write-up - The best AI coding agents - on why full-stack, real-task evaluation beats isolated model scores.

Wisdom (test it on people): r/ChatGPTCoding - where people compare how models do on the work they actually ship, not the leaderboards.