Chapter 4 · Measuring & Evolving the Harness · Lesson 4.3

Reading Failures, Not Pass Rates

A pass rate tells you IF something works; the failures tell you WHY it broke - and WHY is where the next fix lives.

A number hides the why

A pass rate is one number - "72% of your eval tasks passed". Tracked over time, it tells you whether things are getting better, but it says nothing about why the other 28% broke. Two harnesses can both sit at 72% and fail in completely different ways: one keeps ignoring a rule, the other keeps picking the wrong tool. Same score, opposite fixes. So to actually improve the harness you read the failures, not the rate (per Addy Osmani, Agent Harness Engineering).
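To make the "same score, opposite fixes" point concrete, here is a minimal sketch. The two result lists are invented for illustration, and the cause labels are hypothetical, not output from any real eval tool.

```python
from collections import Counter

# Hypothetical eval results: one label per task, either "pass" or a failure cause.
harness_a = ["pass"] * 72 + ["ignored_rule"] * 24 + ["bad_tool_choice"] * 4
harness_b = ["pass"] * 72 + ["ignored_rule"] * 3 + ["bad_tool_choice"] * 25


def pass_rate(results):
    """Fraction of tasks labelled "pass"."""
    return sum(r == "pass" for r in results) / len(results)


def failure_breakdown(results):
    """Count failures per cause, ignoring passes."""
    return Counter(r for r in results if r != "pass")


for name, results in [("A", harness_a), ("B", harness_b)]:
    print(name, f"{pass_rate(results):.0%}", failure_breakdown(results).most_common())
```

Both harnesses report the same pass rate, yet the breakdown points A toward a rule problem and B toward a tool-selection problem.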

Failures come in patterns

You might assume each failure is a one-off, so there's nothing to learn from grouping them. That assumption is wrong. UC Berkeley's MAST taxonomy went through more than 1,600 real agent execution traces and found that the failures cluster into a small set of recurring failure modes.

"Agent systems fail in patterned, recurring ways - 1,600+ traces sorted into a taxonomy of failure modes." UC Berkeley, MAST

The point isn't to memorise their exact list. It's the proof that "bucket your failures" is a real, studied method - not busywork. If a research team could sort thousands of traces into buckets, you can sort your last handful.

Where the failure signal lives

Two places carry the detail you need. First, the agent's trajectory - the step-by-step record of what it actually did, turn by turn. Second, the error text your hooks surface when a check fails. Remember from Lesson 1.3: success is silent, failures are verbose. That design means your hooks are already streaming the failure detail straight into the loop. You don't have to go digging - you just have to read it.
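As a rough sketch of what "just read it" can look like in practice, the snippet below pulls the noisy steps out of a trajectory. The trajectory format and its field names (`turn`, `action`, `hook_output`) are assumptions made up for this example. Your harness will have its own shape.

```python
# Hypothetical trajectory: one dict per turn. Field names are illustrative only.
trajectory = [
    {"turn": 1, "action": "read_file src/app.py", "hook_output": ""},
    {"turn": 2, "action": "edit_file src/app.py", "hook_output": "lint: unused import 'os'"},
    {"turn": 3, "action": "run_tests", "hook_output": ""},
    {"turn": 4, "action": "edit_file src/db.py", "hook_output": "typecheck: expected int, got str"},
]


def failure_signals(steps):
    """Return (turn, action, error text) for every step whose hook produced output.

    Relies on the "success is silent" convention: an empty hook_output means
    the check passed, so any non-empty text is treated as a failure signal.
    """
    return [
        (s["turn"], s["action"], s["hook_output"])
        for s in steps
        if s["hook_output"].strip()
    ]


for turn, action, error in failure_signals(trajectory):
    print(f"turn {turn}: {action} -> {error}")
```

Because passing checks say nothing, filtering on non-empty hook output is enough to surface every turn worth reading.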

From buckets to your next fix

Once you can see the failures, sort them. Group the last batch into a few plain buckets - wrong assumption, missing context, ignored a rule, bad tool choice. Count each bucket. The biggest one is your next ratchet fix, and once you land it, that improvement feeds the compound engineering loop. One warning: rank by frequency, not by how annoying each failure felt. The loudest failure is rarely the most common one.
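The sort-and-count step is small enough to sketch in a few lines. The run IDs and labels below are invented, and the bucket assignment is assumed to be done by hand while reading each failure.

```python
from collections import Counter

# Hypothetical hand-labelled failures: (run id, bucket) pairs.
labelled_failures = [
    ("run-101", "missing context"),
    ("run-102", "ignored a rule"),
    ("run-103", "missing context"),
    ("run-104", "wrong assumption"),
    ("run-105", "missing context"),
    ("run-106", "ignored a rule"),
    ("run-107", "bad tool choice"),
    ("run-108", "missing context"),
    ("run-109", "wrong assumption"),
    ("run-110", "ignored a rule"),
]


def rank_buckets(failures):
    """Return (bucket, count) pairs, most frequent first."""
    return Counter(bucket for _, bucket in failures).most_common()


ranking = rank_buckets(labelled_failures)
next_fix, count = ranking[0]

for bucket, n in ranking:
    print(f"{n:>2}  {bucket}")
print(f"Next fix targets: {next_fix} ({count} of {len(labelled_failures)} failures)")
```

Ranking by count rather than by memory is the whole point: the tally decides the next fix, not whichever failure was most irritating to debug.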

Check yourself

Two harnesses both at 72% -

A pass rate tells you if, not why. Two harnesses at the same score can break for opposite reasons - so the fix comes from reading the failures, not the number.

MAST shows agent failures are -

MAST sorted 1,600+ real traces into recurring failure modes. Failures cluster - which is exactly why bucketing them is a real method, not busywork.

Your next fix should target -

Rank buckets by frequency, not by how annoying each felt. The biggest bucket is the fix that removes the most failures - your next ratchet move.

Do this now (10 min)

Look at your agent's last 5 failed runs. For each, write a one-word cause (assumption, context, rule, tool...) and tally them.

The most common cause is your next fix. Decide its home: a rule (it just needs to know), a hook (it must be enforced), or a reviewer (it needs judgement).
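One way to picture the rule / hook / reviewer decision is as a tiny helper. The function name, its two yes/no questions, and the order in which they are checked are assumptions for illustration, not a prescribed procedure.

```python
def choose_home(needs_judgement: bool, must_be_enforced: bool) -> str:
    """Pick where a fix should live (illustrative decision order, an assumption).

    - reviewer: the fix needs judgement a mechanical check cannot supply
    - hook:     the fix must be enforced on every run
    - rule:     the agent just needs to know
    """
    if needs_judgement:
        return "reviewer"
    if must_be_enforced:
        return "hook"
    return "rule"


# Example: the agent keeps skipping a required formatting step.
print(choose_home(needs_judgement=False, must_be_enforced=True))
```

Answering the two questions honestly for your biggest bucket is usually enough to settle where the fix belongs.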

Go deeper

Primary source (read this): UC Berkeley - MAST: the Multi-Agent System Failure Taxonomy. The evidence that agent failures fall into a countable set of buckets you can act on.

Wisdom (test it on people): the HumanLayer community - a good place to compare which failure buckets actually show up most in real projects.