Reference · Canonical nomenclature

Glossary

The shared vocabulary of the course. Every lesson uses these terms exactly as defined here.

Agent: Agent = Model + Harness. A raw model is not an agent; it becomes one when a harness gives it state, tools, feedback loops, and constraints. Trivedy
Harness: Everything that is not the model: prompts, tools, context policy, hooks, sandboxes, subagents, and the loop that runs them. "If you're not the model, you're the harness."
Harness engineering: Treating that scaffolding as a first-class artifact that you deliberately tighten with every failure - rather than something you set up once. Osmani
The ratchet: The core loop: every observed mistake becomes a permanent rule or hook; constraints are added only after a real failure and removed only when a better model makes them redundant. It only ever tightens.
Work backwards from behaviour: Derive every harness component from a specific behaviour the model can't deliver alone. "If you can't name the behaviour a component exists to deliver, it probably shouldn't be there."
"Skill issue" reframe: Re-attributing an agent failure to your configuration, not the model's weights. The fix lives in the harness. HumanLayer
The Loop (ReAct): Reason → Act → Observe, repeated: while (model returns tool calls) { execute → capture → append → call again }. The entire architecture of Claude Code / Cursor / Codex fits inside this.
Compound error: Per-step accuracy multiplies down a long chain: 0.8²⁰ ≈ 1%. Why long autonomous runs collapse without guardrails.
Context rot: Models reason worse as the context window fills. "Lost in the middle": moving key info from position 1 to 10 dropped accuracy 30%+. Target 40–60% window utilisation. Chroma
Compaction: Summarising or offloading older context before the window fills. Intentional and frequent, not a last resort.
Tool-call offloading: Keep the head/tail of a large tool output in context; write the rest to disk so it doesn't rot the window.
Progressive disclosure / Skills: Reveal tools and instructions only when the task needs them, tiered on demand - instead of dumping everything up front.
Subagent (context firewall): A child agent with its own fresh context window that returns only a condensed answer, keeping the parent's window clean.
AGENTS.md / CLAUDE.md ("map, not manual"): A root rulebook injected every turn. Keep it a pilot's checklist, not a style guide - under ~60–200 lines, each line earned by a real failure.
Hooks: Deterministic scripts wired into the lifecycle that fire every time (typecheck, test, block-destructive, auto-format). "The prompt is where we steer. The harness is where we enforce." Prompt compliance ~70–90%; hooks 100%.
Spec before code / sprint contract: Resolve the decisions (objective, files in scope, architecture, test plan, done-when, boundaries) before generating code. Specify states, not activities. Resolving 15 of 20 decisions upfront ≈ a 33× quality improvement.
Planner / generator / evaluator split: Separate the agent that generates from the agent that judges - "GANs for prose". Avoids a model grading its own work.
Cross-agent review: Run a second model or harness over the same code: "two agents, same codebase, different failures detected."
Compound engineering loop: Feed each failure back into a durable harness improvement. "The model does not get smarter. The harness does. That is the flywheel."
Ralph loop: A hook intercepts the agent's exit attempt and re-injects the prompt into a fresh context, forcing continuation on a long job.
Harness-as-a-Service (HaaS): Building on harness SDKs (runtimes: Claude Agent SDK, Codex SDK) instead of raw LLM APIs (completions).
MAST: UC Berkeley's Multi-Agent System failure Taxonomy - 1,600+ execution traces categorising how agent systems actually fail. Berkeley
Large language model (LLM): The "brain" that predicts text and code. On its own it only predicts - it needs a harness to act.
Frontier model: A model at the leading edge of capability (2026: Claude Opus 4.x, GPT-5.x, Gemini 3.x). Usually closed.
Closed model: A model you can only reach through a company's API; you never hold the weights. Send text, get text back.
Open-weight model: A model whose weights are published so anyone can download and run it (2026: DeepSeek, Qwen, Kimi, GLM, Mistral). In 2026 these closed most of the gap to frontier - measured in single benchmark points on coding/math, though closed models still lead on the hardest agentic tasks.
Spec-driven development (SDD): Write a structured spec (what, why, done-when) first; the agent generates code that serves the spec. The spec is the source of truth, not the code. Contrast with vibe coding.
Vibe coding: Prompting freely and accepting output without a spec. Fast and fine for throwaways; unreliable for production because the agent fills every gap with its own guesses. Karpathy
Constitution: In Spec Kit, a project's non-negotiable principles, written once and referenced by every later step. A permanent rulebook above individual specs.
The change (OpenSpec): OpenSpec's unit of work: a bounded, reviewable delta with its own folder (proposal, spec deltas, tasks), flowing propose → apply → archive. Contrast Spec Kit's heavier feature-level phase gates.
Tool: A capability the harness gives the model to call - read a file, run a command, search the web, hit an API. Ten focused tools beat fifty overlapping ones; the model must hold the whole menu in mind each turn.
MCP (Model Context Protocol): A standard way to plug external tools and data sources into an agent. Powerful, but a tool's description is injected as trusted text - so an untrusted MCP server is an attack surface (prompt injection). Vet the source; grant least privilege.
Sandbox: An isolated, safe execution environment (runtimes, test CLIs, headless browser) where the agent runs code and tests without risking your real machine - and where many runs can go in parallel.
Schema gating: A safety technique: make an unsafe tool invisible to the model (remove it from the available schema) rather than trusting the model to refuse to use it. Constrain the environment, don't rely on good behaviour.
Planning file: A plan/todo file on disk that the agent decomposes the goal into and keeps checking off. Externalising the plan lets it survive context resets - on disk beats in the window.
Sprint contract: Agreeing the done-condition before any code: generator and evaluator settle exactly what "done" means up front, so scope can't drift mid-run.
Eval set: A small, fixed set of representative real tasks you run your harness against after every change, to measure whether it actually improved - your own benchmark, not a public one.
Golden task: One task in the eval set with a known-good outcome and a clear done-when, so a pass or fail is unambiguous.
Agentic Harness Engineering (AHE) / self-improving loop: A loop where the agent edits its OWN scaffolding (prompt, tools, memory) from execution feedback, so the harness keeps pace with model releases. The ratchet, automated.
Judge panel: Several independent evaluator agents (ideally different models/lenses) that each judge the same output; a majority vote survives. Diversity catches failure modes a single judge misses.
Orchestration: Coordinating multiple agent runs deterministically - fan-out (parallel independent subtasks), pipeline (each item flows through stages), judge panel, loop-until-done.
Git worktree: A separate working directory on its own branch off one repo, so several agents can work in parallel without clobbering each other's files. Merge and verify after.
Team harness: The harness treated as a shared team asset: CLAUDE.md/AGENTS.md, skills, and hooks committed to the repo so every teammate - and every agent - inherits the same earned rules on clone.
Harness drift: The slow bloat of a shared harness as many hands add unearned rules and stale scaffolding. Fought by reviewing rules like code, assigning ownership, and pruning on model upgrades.

Course map Start: 0.1 Model landscape → Resources

Missing a term? Ask me and I'll add it - the glossary is the course's single source of truth for vocabulary.