A 20-minute tour

Agentic Harness Engineering

The model is the commodity. The harness is your edge.

From "prompt and hope" to a measured, ratcheted practice

The shift

The models have converged

0.2 pt

gap between open-weight DeepSeek V4-Pro and closed Opus 4.6 on SWE-bench

~3 mo

open-weight now lags the frontier by months, not generations

11.9 → 5.4%

gap between the #1 and #10 coding model, in one year

So "which model?" stopped being the interesting question.

The core idea

Agent = Model + Harness

The harness is everything that isn't the model: the prompts, tools, memory, checks, and the loop that runs them. "If you're not the model, you're the harness."

Why it matters

Same model. Different harness. Wildly different results.

52.8 → 66.5%

rebuild only the harness, same model (LangChain)

77 vs 93%

one Opus, two harnesses, identical tasks

32×

cost swing for near-identical code

You can't buy good agentic coding with a bigger model. You engineer the harness.

Chapter 1

The Ratchet & the practice loop

The core habit, and the moves that compound it.

Definition

ratch·et

noun · as used in this course

A harness discipline that only ever tightens. Each agent failure earns one permanent fix - a rule, a hook, or a reviewer - so the same mistake can never recur. Nothing is added speculatively; nothing is removed except when a better model makes it redundant.

Origin: the mechanical ratchet - a gear that turns one way and never slips back. Applied here to the harness, which compounds every lesson and never unlearns one.

The one habit

When the agent gets something wrong, don't re-prompt it - change the harness so it can never get that wrong again. The ratchet only ever tightens. "The model doesn't get smarter. The harness does."

Where a fix lives

Three homes for a fix

Rule

the agent just needs to know it
a line in CLAUDE.md

Hook

must be enforced every time
a script that runs automatically

Reviewer

needs judgement a script can't make
a second agent checks it
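The choice between the three homes can be sketched as a small routing function. This is an illustration only: the name `home_for_fix`, the two boolean questions, and the order in which they are asked are assumptions, not part of the course material.

```python
from enum import Enum


class Home(Enum):
    RULE = "rule"          # a line in CLAUDE.md
    HOOK = "hook"          # a script that runs automatically
    REVIEWER = "reviewer"  # a second agent checks it


def home_for_fix(must_enforce_every_time: bool, needs_judgement: bool) -> Home:
    """Pick a home for one fix (hypothetical helper).

    Assumption: judgement is checked first, because a script cannot
    enforce something it cannot decide.
    """
    if needs_judgement:
        return Home.REVIEWER
    if must_enforce_every_time:
        return Home.HOOK
    # The agent just needs to know it.
    return Home.RULE
```

For example, "never commit with failing tests" would land in a hook, while "is this abstraction worth its complexity?" would go to a reviewer.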

Lesson 1.2

CLAUDE.md is a pilot's checklist

Injected into the model every single turn - precious space
Every line must be traceable to a real failure
Models reliably follow only ~150-200 instructions - keep it lean
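One way to keep the checklist honest is to audit it mechanically. The sketch below assumes a made-up convention: every rule is a `- ` bullet ending in a `(failure: <reference>)` tag. The tag format, the `audit_rules` name, and the 150-rule budget (the low end of the range above) are all assumptions for illustration.

```python
import re

MAX_RULES = 150  # assumed budget: low end of the ~150-200 range
# Hypothetical convention: each rule ends with "(failure: <reference>)".
TRACE_TAG = re.compile(r"\(failure: [^)]+\)\s*$")


def audit_rules(text: str, max_rules: int = MAX_RULES) -> list[str]:
    """Return a list of problems found in a rules file's text."""
    # Only bullet lines count as rules; headings and blanks are ignored.
    rules = [ln.strip() for ln in text.splitlines() if ln.strip().startswith("- ")]
    problems = [f"untraced rule: {r}" for r in rules if not TRACE_TAG.search(r)]
    if len(rules) > max_rules:
        problems.append(f"too many rules: {len(rules)} > {max_rules}")
    return problems
```

Run against a file with one traced and one untraced rule, it reports only the untraced one, which is the line to delete or justify.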

Lesson 1.3

Hooks, not instructions

The prompt is where you steer. The harness is where you enforce. An instruction in the prompt is followed roughly 70-90% of the time; a hook fires 100% of the time. Success is silent; failures are verbose.
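A minimal sketch of the "silent on success, verbose on failure" pattern. The function name `run_hook` is an assumption, and how a hook is registered with a particular agent tool is not shown here because that is tool-specific.

```python
import subprocess
import sys


def run_hook(cmd: list[str]) -> int:
    """Run a check command; print nothing unless it fails."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Failures are verbose: surface everything so the agent can act on it.
        sys.stderr.write(result.stdout)
        sys.stderr.write(result.stderr)
    # Success is silent: a zero exit code adds nothing to the context.
    return result.returncode


# Example wiring (assumes pytest is installed in the project):
#   sys.exit(run_hook([sys.executable, "-m", "pytest", "-q"]))
```

Keeping the success path silent matters because anything a hook prints on every edit competes for space in the context window.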

Lesson 1.4

Spec before code

Specify states, not activities - the end you want, not the steps
Turn vague asks into testable done-when criteria
Resolving decisions up front: roughly a 33× quality lever vs one-shot prompting
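A sketch of what "states, not activities" can look like once written down: each done-when criterion is a description plus a check on the end state. The `Criterion` type, the `unmet` helper, and the example state keys are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    description: str               # the end state, in words
    check: Callable[[dict], bool]  # a test of that state, not of the steps


def unmet(criteria: list[Criterion], state: dict) -> list[str]:
    """Return descriptions of criteria the current state does not satisfy."""
    return [c.description for c in criteria if not c.check(state)]


# Vague ask: "make export faster". Testable version (values are illustrative):
DONE_WHEN = [
    Criterion("export finishes in under 2 s", lambda s: s["export_seconds"] < 2.0),
    Criterion("all tests pass", lambda s: s["failed_tests"] == 0),
]
```

The task is done when `unmet(DONE_WHEN, state)` returns an empty list, regardless of which steps the agent took to get there.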

Spec frameworks

Three ways to bring SDD to your agent

Spec Kit

GitHub toolkit
heavier, phase-gated
specify CLI, 30+ agents

OpenSpec

lightweight framework
"the change" delta, brownfield
no Python, ~5-min setup

agent-skills

SDD as a drop-in skill
no separate CLI
loads into your harness

In all three, the spec is the source of truth. (The Spec Kit deep dive is in Chapter 2.)

Lessons 1.5 - 1.6

Keep it sharp, then check it

Context rot: a full window reasons worse - target 40-60% utilisation
Subagents as context firewalls: do the mess, return a summary
Cross-agent review: a second, independent agent catches what the first missed
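The first two points can be sketched in a few lines. The function names and the 60% ceiling (the top of the target band above) are assumptions; `worker` and `summarise` stand in for whatever runs the subagent and condenses its output.

```python
from typing import Callable


def needs_compaction(used_tokens: int, window_tokens: int, ceiling: float = 0.60) -> bool:
    """True when context utilisation exceeds the assumed ceiling."""
    return used_tokens / window_tokens > ceiling


def firewall(task: str,
             worker: Callable[[str], list[str]],
             summarise: Callable[[list[str]], str]) -> str:
    """Run the messy work in a subagent and return only a summary.

    The full transcript stays local to this function; only the summary
    string reaches the caller's context.
    """
    transcript = worker(task)
    return summarise(transcript)
```

The point of the firewall is that the parent agent never sees the transcript, only the short string it returns.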

Chapter 2

Spec-driven development in depth

The spec frameworks, with Spec Kit as the main event.

Chapter 2 · spec-driven development in depth

Spec Kit: the deep dive

The loop: constitution → specify → clarify → plan → tasks → implement
Each phase emits a committed artifact (spec.md, plan.md, tasks.md) - the spec is the source of truth
analyze: a read-only gate that catches gaps before implementation
Lighter pair: OpenSpec (change-based, brownfield) & agent-skills (drop-in skill)
Pick by weight - and a spec is part of your harness, not paperwork
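Treating the spec as part of the harness means it can be checked like any other precondition. The sketch below is not Spec Kit's own analyze step; it is a hypothetical pre-implementation gate that only confirms the three artifacts named above exist and are non-empty in a directory you pass in.

```python
from pathlib import Path

# Artifact names taken from the slide; the directory layout is up to you.
REQUIRED = ("spec.md", "plan.md", "tasks.md")


def missing_artifacts(feature_dir: Path) -> list[str]:
    """Return the required artifacts that are absent or empty."""
    missing = []
    for name in REQUIRED:
        path = feature_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

A hook could refuse to start implementation while this list is non-empty.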

Chapter 3

Scaling & trusting the harness

Running agents for real work, safely.

Chapter 3 in one slide

Five moves for real work

Tools: ten focused tools beat fifty overlapping ones
MCP & safety: a tool's description is trusted text - vet it, least privilege
Substrate: filesystem (memory), git (undo), sandbox (safe run)
Long-horizon: plan files, done-condition first; state on disk beats state in the context window
Cost: watch it (up to 32× swings); stop tuning a disposable harness
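The long-horizon move can be sketched as a plan file that lives on disk, with the done-condition written first. The JSON layout and the function names here are assumptions; any format the agent can re-read after a context reset would do.

```python
import json
from pathlib import Path
from typing import Optional


def write_plan(path: Path, done_when: str, steps: list[str]) -> None:
    """Persist the plan, with the done-condition as the first key."""
    plan = {"done_when": done_when,
            "steps": [{"task": s, "done": False} for s in steps]}
    path.write_text(json.dumps(plan, indent=2))


def next_step(path: Path) -> Optional[str]:
    """Re-read the plan from disk and return the first unfinished task."""
    plan = json.loads(path.read_text())
    for step in plan["steps"]:
        if not step["done"]:
            return step["task"]
    return None


def mark_done(path: Path, task: str) -> None:
    """Record progress on disk so it survives a fresh context window."""
    plan = json.loads(path.read_text())
    for step in plan["steps"]:
        if step["task"] == task:
            step["done"] = True
    path.write_text(json.dumps(plan, indent=2))
```

Because progress is read back from the file each time, a new session can resume from `next_step` without any memory of earlier turns.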

Chapter 4

Measuring & evolving

You can't improve what you don't measure.

Chapter 4 in one slide

Make the ratchet honest

Vibes aren't evidence - a score is a property of the model+harness pair
Build an eval set: a few of your own golden tasks
Read failures, not just pass rates - each bucket is a fix
Automate it (self-improving loop); keep a change only if it beats the noise
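"Beats the noise" can be made concrete with a simple threshold. The rule below (candidate mean must exceed baseline mean by `k` baseline standard deviations) is one assumed choice, not the course's prescribed test; `beats_noise` and `k=2.0` are illustrative.

```python
from statistics import mean, stdev


def beats_noise(baseline_runs: list[float],
                candidate_runs: list[float],
                k: float = 2.0) -> bool:
    """Keep a harness change only if its gain exceeds run-to-run noise.

    baseline_runs: pass rates from repeated runs of the current harness
                   (at least two, so a standard deviation exists).
    candidate_runs: pass rates from runs with the change applied.
    """
    noise = stdev(baseline_runs)
    gain = mean(candidate_runs) - mean(baseline_runs)
    return gain > k * noise
```

With baseline runs of 0.60, 0.62, 0.58, 0.60, a candidate averaging 0.61 is rejected as noise, while one averaging 0.70 is kept.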

Chapter 5

Multi-agent & team harnesses

Scaling across agents and across people.

Chapter 5 in one slide

Across agents, across people

Add a second agent only for isolation, parallelism, or diversity (else it hurts: 39-70%)
Orchestrate with named shapes: fan-out, pipeline, judge panel
Isolate parallel work: git worktrees / sandboxes
Team harness: commit rules, hooks, skills - everyone inherits it on clone
Fight drift: review rules like code, prune the stale ones
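The three named shapes reduce to small, recognisable functions. In this sketch each agent is stood in for by a plain Python callable, and the majority-vote rule in `judge_panel` is an assumption; real agents would be calls into your harness.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def fan_out(worker: Callable[[Any], Any], tasks: list, max_workers: int = 4) -> list:
    """Run independent tasks in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks))


def pipeline(stages: list[Callable[[Any], Any]], item: Any) -> Any:
    """Pass one item through each stage in sequence."""
    for stage in stages:
        item = stage(item)
    return item


def judge_panel(judges: list[Callable[[Any], bool]], candidate: Any) -> bool:
    """Accept the candidate if a strict majority of judges approve."""
    approvals = sum(1 for judge in judges if judge(candidate))
    return approvals > len(judges) / 2
```

Naming the shape up front makes it easier to ask which of isolation, parallelism, or diversity the extra agent is buying.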

Chapter 6

Capstone: build your harness

Assemble it, run it end to end, make it yours.

The throughline

Build → Spec-drive → Scale → Measure → Share → Make it yours

Keep the harness lean, earned, and disposable - and keep ratcheting. The model is the commodity; the harness is the edge.

Start Monday

Your 5-piece starter harness

A rules file - start near-empty, earn each line
One hook - run the tests after every edit
A clean git branch + a safe place to run
Three golden tasks - your eval set
A review habit - a fresh second pass

Day 1: commit a near-empty rules file. Then ratchet from real failures.

Keep ratcheting