A 20-minute tour
Agentic Harness Engineering
The model is the commodity. The harness is your edge.
From "prompt and hope" to a measured, ratcheted practice
The shift
The models have converged
0.2 pt
open DeepSeek V4-Pro vs closed Opus 4.6 on SWE-bench
~3 mo
open-weight now lags the frontier by months, not generations
11.9 → 5.4%
gap between the #1 and #10 coding model, in one year
So "which model?" stopped being the interesting question.
The core idea
Agent = Model + Harness
The harness is everything that isn't the model: the prompts, tools, memory, checks, and the loop that runs them. "If you're not the model, you're the harness."
Why it matters
Same model. Different harness. Wildly different results.
52.8 → 66.5%
rebuild only the harness, same model (LangChain)
77 vs 93%
one Opus, two harnesses, identical tasks
32×
cost swing for near-identical code
You can't buy good agentic coding with a bigger model. You engineer the harness.
Chapter 1
The Ratchet & the practice loop
The core habit, and the moves that compound it.
Definition
ratch·et
noun · as used in this course
A harness discipline that only ever tightens. Each agent failure earns one permanent fix - a rule, a hook, or a reviewer - so the same mistake can never recur. Nothing is added speculatively; nothing is removed except when a better model makes it redundant.
Origin: the mechanical ratchet - a gear that turns one way and never slips back. Applied here to the harness, which compounds every lesson and never unlearns one.
The one habit
When the agent gets something wrong, don't re-prompt it - change the harness so it can never get that wrong again.
The ratchet only ever tightens. "The model doesn't get smarter. The harness does."
Where a fix lives
Three homes for a fix
Rule
- the agent just needs to know it
- a line in CLAUDE.md
Hook
- must be enforced every time
- a script that runs automatically
Reviewer
- needs judgement a script can't make
- a second agent checks it
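The three homes can be read as a small decision procedure. The sketch below is illustrative only: the function name `choose_home`, its two boolean inputs, and the rule that judgement takes precedence over enforcement are assumptions, not something the course prescribes.

```python
def choose_home(must_be_enforced: bool, needs_judgement: bool) -> str:
    """Pick where a fix for an agent failure should live.

    Assumed precedence: if a script cannot make the call, use a reviewer;
    otherwise, if it must hold every time, use a hook; otherwise a rule.
    """
    if needs_judgement:
        return "reviewer"  # a second agent checks it
    if must_be_enforced:
        return "hook"      # a script that runs automatically
    return "rule"          # a line in CLAUDE.md
```

For example, "never commit with failing tests" is enforceable by a script, so it lands in a hook, while "is this abstraction worth its complexity?" needs a reviewer.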
Lesson 1.2
CLAUDE.md is a pilot's checklist
- Injected into the model every single turn - precious space
- Every line must be traceable to a real failure
- Models reliably follow only ~150-200 instructions - keep it lean
Lesson 1.3
Hooks, not instructions
The prompt is where you steer. The harness is where you enforce.
Prompt compliance ~70-90%. A hook fires 100%. Success is silent; failures are verbose.
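A minimal sketch of "success is silent; failures are verbose", assuming the harness can run an arbitrary command after an edit and show the agent whatever the hook returns. The function name `run_hook` and the timeout default are hypothetical, and how a real harness interprets exit codes varies by tool.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def run_hook(command: Sequence[str], timeout_s: float = 300.0) -> Tuple[int, str]:
    """Run a check command and return (exit_code, message_for_agent)."""
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 1, f"hook timed out after {timeout_s}s: {' '.join(command)}"

    if result.returncode == 0:
        # Success is silent: nothing is added to the agent's context.
        return 0, ""

    # Failure is verbose: return the full output so the agent can act on it.
    details = (result.stdout + result.stderr).strip()
    return result.returncode, f"hook failed: {' '.join(command)}\n{details}"
```

A test-after-edit hook could then call something like `run_hook([sys.executable, "-m", "pytest", "-q"])` and pass the message back to the agent only when it is non-empty.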
Lesson 1.4
Spec before code
- Specify states, not activities - the end you want, not the steps
- Turn vague asks into testable done-when criteria
- Resolving decisions up front: roughly a 33× quality lever vs one-shot prompting
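One way to make done-when criteria testable is to write each as a predicate over an observed end state, rather than as a step to perform. The sketch below assumes a hypothetical `state` dictionary gathered after a run; the criteria and key names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    description: str                         # the state we want, in words
    check: Callable[[Dict[str, object]], bool]  # True when that state holds


def unmet(criteria: List[Criterion], state: Dict[str, object]) -> List[str]:
    """Return descriptions of criteria the observed state does not satisfy."""
    return [c.description for c in criteria if not c.check(state)]


# Hypothetical spec: states, not activities.
DONE_WHEN = [
    Criterion("all tests pass", lambda s: s.get("tests_failed") == 0),
    Criterion(
        "export endpoint returns CSV",
        lambda s: s.get("export_content_type") == "text/csv",
    ),
]
```

The task is done when `unmet(DONE_WHEN, state)` is empty; anything it returns is a concrete gap the agent can be pointed at.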
Spec frameworks
Three ways to bring SDD to your agent
Spec Kit
- GitHub toolkit
- heavier, phase-gated
- specify CLI, 30+ agents
OpenSpec
- lightweight framework
- "the change" delta, brownfield
- no Python, ~5-min setup
agent-skills
- SDD as a drop-in skill
- no separate CLI
- loads into your harness
All three: the spec is the source of truth. (Spec Kit deep-dive is now Chapter 2.)
Lessons 1.5 - 1.6
Keep it sharp, then check it
- Context rot: a full window reasons worse - target 40-60% utilisation
- Subagents as context firewalls: do the mess, return a summary
- Cross-agent review: a second, independent agent catches what the first missed
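The 40-60% target can be turned into a simple check, assuming the harness exposes a token count for the current window. The function name `context_action` and the three labels are hypothetical; what the harness does on "compact" (summarise, hand off to a subagent, start a fresh session) is left open.

```python
def context_action(
    tokens_used: int,
    window_size: int,
    low: float = 0.40,
    high: float = 0.60,
) -> str:
    """Classify context utilisation against a target band."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    utilisation = tokens_used / window_size
    if utilisation > high:
        return "compact"    # above the band: summarise or delegate
    if utilisation >= low:
        return "in_target"  # inside the 40-60% band
    return "headroom"       # below the band: room to load more context
```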
Chapter 2
Spec-driven development in depth
The spec frameworks, with Spec Kit as the main event.
Chapter 2 · spec-driven development in depth
Spec Kit: the deep dive
- The loop: constitution → specify → clarify → plan → tasks → implement
- Each phase emits a committed artifact (spec.md, plan.md, tasks.md) - the spec is the source of truth
- analyze: a read-only gate that catches gaps before implementation
- Lighter pair: OpenSpec (change-based, brownfield) & agent-skills (drop-in skill)
- Pick by weight - and a spec is part of your harness, not paperwork
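To illustrate the idea of a read-only gate before implementation, here is a small sketch that only checks that the three artifacts named above exist and are non-empty. It is not Spec Kit's analyze step; the function name `missing_artifacts` and the one-directory-per-feature layout are assumptions.

```python
from pathlib import Path
from typing import List

# Artifact names taken from the slide; the directory layout is assumed.
REQUIRED_ARTIFACTS = ("spec.md", "plan.md", "tasks.md")


def missing_artifacts(feature_dir: Path) -> List[str]:
    """Return required artifacts that are absent or empty. Reads only."""
    missing = []
    for name in REQUIRED_ARTIFACTS:
        path = feature_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing
```

A harness could refuse to start the implement phase while this list is non-empty.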
Chapter 3
Scaling & trusting the harness
Running agents for real work, safely.
Chapter 3 in one slide
Five moves for real work
- Tools: ten focused beat fifty overlapping
- MCP & safety: a tool's description is trusted text - vet it, least privilege
- Substrate: filesystem (memory), git (undo), sandbox (safe run)
- Long-horizon: plan files, done-condition first; state on disk beats state in the context window
- Cost: watch it (up to 32× swings); stop tuning a disposable harness
Chapter 4
Measuring & evolving
You can't improve what you don't measure.
Chapter 4 in one slide
Make the ratchet honest
- Vibes aren't evidence - a score is a property of the model+harness pair
- Build an eval set: a few of your own golden tasks
- Read failures, not just pass rates - each bucket is a fix
- Automate it (self-improving loop); keep a change only if it beats the noise
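"Beats the noise" needs a definition before it can be automated. The sketch below assumes one: run the unchanged baseline several times on the eval set, treat the standard deviation of its pass rate as the noise, and keep a change only if the mean improvement exceeds a multiple of that. The function name `keep_change` and the default multiplier of 2 are illustrative choices, not a recommendation from the course.

```python
from statistics import mean, stdev
from typing import Sequence


def keep_change(
    baseline_runs: Sequence[float],
    candidate_runs: Sequence[float],
    noise_multiplier: float = 2.0,
) -> bool:
    """Keep a harness change only if its gain exceeds run-to-run noise.

    Each element is the pass rate (0.0-1.0) of one full eval-set run.
    """
    if len(baseline_runs) < 2:
        raise ValueError("need at least two baseline runs to estimate noise")
    if not candidate_runs:
        raise ValueError("need at least one candidate run")
    noise = stdev(baseline_runs)
    gain = mean(candidate_runs) - mean(baseline_runs)
    return gain > noise_multiplier * noise
```

With only a few golden tasks the noise estimate is rough, so repeated runs matter more than the exact threshold.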
Chapter 5
Multi-agent & team harnesses
Scaling across agents and across people.
Chapter 5 in one slide
Across agents, across people
- Add a second agent only for isolation, parallelism, or diversity (else it hurts: 39-70%)
- Orchestrate with named shapes: fan-out, pipeline, judge panel
- Isolate parallel work: git worktrees / sandboxes
- Team harness: commit rules, hooks, skills - everyone inherits it on clone
- Fight drift: review rules like code, prune the stale ones
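Two of the named shapes, fan-out and judge panel, can be sketched with plain callables standing in for agents. Everything here is a stand-in: `Agent` and `Judge` are hypothetical type aliases, real agents would be model calls running in isolated worktrees or sandboxes, and majority vote is just one possible panel rule.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Agent = Callable[[str], str]   # takes a task, returns a candidate result
Judge = Callable[[str], bool]  # takes a candidate, returns accept/reject


def fan_out(agents: Sequence[Agent], task: str) -> List[str]:
    """Run the same task on every agent in parallel; results keep agent order."""
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        return list(pool.map(lambda agent: agent(task), agents))


def judge_panel(judges: Sequence[Judge], candidate: str) -> bool:
    """Accept a candidate only if a strict majority of judges accept it."""
    votes = [judge(candidate) for judge in judges]
    return sum(votes) * 2 > len(votes)
```

A pipeline is the remaining shape: feed each stage's output into the next instead of running them side by side.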
Chapter 6
Capstone: build your harness
Assemble it, run it end to end, make it yours.
The throughline
Build → Spec-drive → Scale → Measure → Share → Make it yours
Keep the harness lean, earned, and disposable - and keep ratcheting. The model is the commodity; the harness is the edge.
Start Monday
Your 5-piece starter harness
- A rules file - start near-empty, earn each line
- One hook - run the tests after every edit
- A clean git branch + a safe place to run
- Three golden tasks - your eval set
- A review habit - a fresh second pass
Day 1: commit a near-empty rules file. Then ratchet from real failures.
Keep ratcheting
Thank you