Chapter 4 · Measuring & Evolving the Harness · Lesson 4.1
Why Vibes Aren't Enough
"It feels better" is not evidence. To improve a harness, you have to measure it.
Recap: a score describes the pair
Back in Lesson 0.2 you learned that the harness is everything wrapped around the model - prompts, tools, memory, checks. So a benchmark score is never a property of the model alone. It is a property of the model-and-harness pair. Change the wrapper and the number moves, even with the exact same model underneath.
So "it feels smarter" proves nothing
Here is the trap. You tweak a prompt, run the agent once, and it "feels smarter". But that single run can't tell you which of three things happened: the change helped, the change hurt and you got lucky anyway, or the change did nothing and you're reading noise. A gut feeling gives all three the same warm glow. Vibes can't separate them - and if you can't separate them, you can't tell an improvement from a coincidence.
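To make the three cases concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: `run_once` is a hypothetical stand-in for a real agent run, and the pass rates (a 0.60 baseline, then 0.75, 0.45, and 0.60 after the tweak) are made-up numbers, not measurements.

```python
import random

def run_once(true_pass_rate: float, rng: random.Random) -> bool:
    """Hypothetical stand-in for one agent run: passes with a fixed probability."""
    return rng.random() < true_pass_rate

def measured_pass_rate(true_pass_rate: float, n_runs: int, rng: random.Random) -> float:
    """Fraction of n_runs simulated runs that pass."""
    passes = sum(run_once(true_pass_rate, rng) for _ in range(n_runs))
    return passes / n_runs

# Assumed true pass rates AFTER the tweak; the assumed baseline before it is 0.60.
scenarios = {"helped": 0.75, "hurt": 0.45, "did nothing": 0.60}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
results = {}
for name, rate in scenarios.items():
    single = run_once(rate, rng)               # the "run it once" vibe check
    many = measured_pass_rate(rate, 200, rng)  # a repeated measurement
    results[name] = (single, many)
    print(f"{name:12s} single run passed: {single!s:5s}  pass rate over 200 runs: {many:.2f}")
```

Under these assumed rates, a single run still passes 45% of the time in the "hurt" scenario, so one passing run is consistent with all three stories. Only the repeated measurement pulls them apart.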
Failures come in patterns you can count
The good news: agents don't fail in a fog. They fail in repeatable, nameable ways. The MAST paper from UC Berkeley analyzed 1,600+ execution traces and sorted the ways agent systems actually break into a fixed list of failure modes. That matters for one reason: if a failure can be put in a category, it can be counted. And anything you can count, you can measure - which means you can watch it go up or down as you change the harness.
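Counting by category can be sketched in a few lines. The labels below are hypothetical placeholders, not MAST's official failure-mode names, and the two lists are invented data standing in for traces you would label yourself.

```python
from collections import Counter

# Hypothetical labels, one per run (None means the run succeeded).
# These names are illustrative only, not the MAST taxonomy itself.
before = ["repeated_step", "ignored_spec", None, "repeated_step",
          "stopped_early", None, "repeated_step", None]
after = [None, "ignored_spec", None, "repeated_step",
         None, None, "stopped_early", None]

def failure_counts(labels):
    """Count how often each failure label appears, skipping successful runs."""
    return Counter(label for label in labels if label is not None)

counts_before = failure_counts(before)
counts_after = failure_counts(after)

# Print each failure mode's count before and after the harness change.
for mode in sorted(set(counts_before) | set(counts_after)):
    print(f"{mode}: {counts_before[mode]} -> {counts_after[mode]}")
```

Once each failure has a label, "did the change help?" becomes a per-category comparison rather than a feeling about the whole run.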
The chapter's job
Harness engineering is only a ratchet - a mechanism that only ever tightens - if you can tell whether each turn actually tightened anything. Telling requires a yardstick. Without one, you're just spinning the handle and hoping. The next lesson builds that yardstick from scratch. Here is what it buys you:
- It tells helped vs. hurt vs. noise apart, so you keep the changes that work and drop the ones that don't.
- It makes the ratchet trustworthy - every turn is checked, not assumed.
- It turns "feels better" into a number you can compare run to run.
Check yourself
A benchmark score really measures -
A score is a property of the model-and-harness pair. Rebuild only the harness and the number moves (from 52.8% to 66.5% in LangChain's case, a 13.7-point jump) with the same model underneath.
Why can't a vibe judge a change?
A good feeling looks the same whether the change helped, hurt, or did nothing. One run gives all three the same glow, so vibes can't tell an improvement from luck.
MAST shows that agent failures are -
MAST sorted 1,600+ traces into a fixed set of failure modes. If a failure fits a category, it can be counted - and anything countable can be measured as you change the harness.
Do this now (3 min)
Pick one task you run your agent on often. Write the single observable outcome that means "it worked" - a test passes, the output matches a known shape, a metric crosses a threshold. Not "it looked good" - something a machine could check. That one line is the seed of your eval set (the small fixed benchmark you'll build next lesson).
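As one possible shape for that line, here is a minimal sketch of an "output matches a known shape" check. The function name `it_worked` and the expected fields (`summary`, `files_changed`) are assumptions invented for this example. Swap in whatever outcome your own task produces.

```python
import json

def it_worked(agent_output: str) -> bool:
    """Hypothetical check: output must be a JSON object with a non-empty
    'summary' string and an integer 'files_changed' of at least 1."""
    try:
        data = json.loads(agent_output)
    except json.JSONDecodeError:
        return False  # not JSON at all, so it cannot match the shape
    if not isinstance(data, dict):
        return False
    summary = data.get("summary")
    files_changed = data.get("files_changed")
    return (
        isinstance(summary, str)
        and bool(summary.strip())
        and isinstance(files_changed, int)
        and not isinstance(files_changed, bool)  # bool is a subclass of int
        and files_changed >= 1
    )

print(it_worked('{"summary": "Renamed helper", "files_changed": 2}'))
print(it_worked("looks good to me"))
```

The point of the design is that the check returns a plain True or False with no human judgment in the loop, so the same output always gets the same verdict.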
Go deeper
Primary source (read this): LangChain, "The Anatomy of an Agent Harness" - shows the same model swinging double-digit points when only the harness changes.
Wisdom (test it on people): the HumanLayer community - a good place to argue out what "measurably better" should mean for your own harness.