Chapter 3 · Scaling & Trusting the Harness · Lesson 3.4

Long-Horizon Autonomy

The win: keep an agent on track across a long, many-step job without it drifting, forgetting, or quitting early.

Why long runs fall apart

Give an agent a quick, one-step task and it usually nails it. Give it a big job with a dozen steps - migrate a module, refactor across ten files, fix a whole class of bug - and something else tends to happen. It drifts off the goal, loses the thread of what it was doing, or decides it's finished long before it actually is. A lot of this traces back to context rot (Lesson 1.5): as the conversation window fills up, the model reasons worse and the original goal slips further from view. The fix isn't a smarter model - it's structure that survives a long run, so the goal outlasts any single stretch of conversation.

Write the plan to a file

The first move is simple: before it starts working, have the agent break the goal into steps and write them into a plan or to-do file on disk - a plain checklist it keeps returning to and ticking off as it goes. Because that file lives outside the conversation, it survives the window filling up, a crash, or a fresh session. The plan is no longer something the agent has to remember; it's something it can re-read. This is the planning file, and it builds directly on spec before code (Lesson 1.4): first pin down what "done" looks like, then let the agent turn that into a checklist it can work through.
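As a rough illustration of what such a planning file can look like, here is a minimal Python sketch. The file layout (a "Done when:" line followed by "- [ ]" checkboxes) and the helper names write_plan, open_steps, and tick are assumptions made for this example, not a format the lesson or any particular harness prescribes.

```python
from pathlib import Path

# Hypothetical checklist markers; any consistent convention would do.
OPEN, DONE = "- [ ] ", "- [x] "


def write_plan(path: Path, done_when: str, steps: list[str]) -> None:
    """Create the plan file: a done-condition line, then an unchecked checklist."""
    lines = [f"Done when: {done_when}", ""] + [OPEN + step for step in steps]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def open_steps(path: Path) -> list[str]:
    """Re-read the file and return the steps still unchecked, in order."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [line[len(OPEN):] for line in lines if line.startswith(OPEN)]


def tick(path: Path, step: str) -> None:
    """Mark one step as done by rewriting its checkbox in the file."""
    lines = path.read_text(encoding="utf-8").splitlines()
    updated = [DONE + step if line == OPEN + step else line for line in lines]
    path.write_text("\n".join(updated) + "\n", encoding="utf-8")
```

The point of the design is that progress is always read back from disk: a new session can call open_steps on the same file and learn where the previous one stopped, without any conversation history.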

Agree the done-condition first

A long run drifts fastest when nobody said where the finish line is. So settle it up front. Before any code is written, the agent proposing the work and whatever checks it - you, or a second agent - agree on exactly what "done" means. That agreement is the sprint contract, and it's the guardrail that stops the scope from quietly wandering halfway through.

Define what "done" means before you start - lock the finish line in writing first, so a long run can't silently drift somewhere you never asked it to go.
- Addy Osmani, Agent Harness Engineering
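One way to make a done-condition hard to argue with is to express it as named checks that either pass or fail. The sketch below is only an illustration of that idea: the SprintContract class and its fields are hypothetical names for this example, and the checks would in practice be things like running a test suite or searching the codebase.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SprintContract:
    """A done-condition agreed before work starts (hypothetical structure)."""
    goal: str
    # Each check is a name plus a function returning True once it is satisfied.
    checks: dict[str, Callable[[], bool]]

    def unmet(self) -> list[str]:
        """Names of the checks that do not pass yet, in the agreed order."""
        return [name for name, check in self.checks.items() if not check()]

    def is_done(self) -> bool:
        """The job is done only when every agreed check passes."""
        return not self.unmet()


# Illustrative usage with stand-in checks.
contract = SprintContract(
    goal="Migrate the billing module",
    checks={
        "tests pass": lambda: True,
        "no calls to the old API remain": lambda: False,
    },
)
print(contract.is_done(), contract.unmet())
```

Because the checks are fixed up front, "done" is decided by the contract rather than by whatever the agent happens to believe late in the run.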

Force it to keep going: the Ralph loop

Even with a plan and a contract, agents love to stop early - they announce "done" while half the checklist is still open. The Ralph loop handles that with a hook (one of the automatic scripts from Lesson 1.3): when the agent tries to exit, the hook catches it and re-injects the goal into a fresh context, telling it to carry on. It's a small piece of the harness that turns "I think I'm finished" into "keep going until the job actually is", pushing the run further than a single conversation would manage on its own.
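The control flow of a Ralph loop can be sketched in a few lines of Python. This is a simplified stand-in, not a real harness integration: a plain loop plays the role of the exit hook, run_session is a hypothetical callable that starts one fresh agent session with the given prompt, and the "- [ ]" checkbox convention and the max_rounds safety cap are assumptions of this example.

```python
from pathlib import Path
from typing import Callable


def has_open_steps(plan: Path) -> bool:
    """True while the plan file still contains an unchecked '- [ ] ' item."""
    lines = plan.read_text(encoding="utf-8").splitlines()
    return any(line.startswith("- [ ] ") for line in lines)


def ralph_loop(
    plan: Path,
    goal: str,
    run_session: Callable[[str], None],  # hypothetical: runs one fresh agent session
    max_rounds: int = 20,                # assumed safety cap so the loop cannot spin forever
) -> int:
    """Re-inject the goal into a new session until the plan has no open steps.

    Returns the number of sessions that were run.
    """
    rounds = 0
    while has_open_steps(plan) and rounds < max_rounds:
        prompt = (
            f"Goal: {goal}\n"
            f"Read {plan.name}, complete the next unchecked step, then tick it off."
        )
        run_session(prompt)  # each call starts from a clean context
        rounds += 1
    return rounds
```

Note that the loop never trusts the agent's own claim of being finished. It decides whether to continue by re-reading the plan file, and the cap is there so a stuck agent eventually hands control back to a human.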

When it's really long: reset the whole session

For the longest jobs, don't just trim the context - deliberately tear the whole session down and rebuild it. Start a brand-new window from a short handoff brief plus the plan file, and the agent picks up exactly where it left off: fresh window, same goal, no accumulated rot. This is compaction taken to its logical end - instead of summarising older context, you throw the conversation away entirely and rehydrate from what's on disk.
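To make the rehydration step concrete, here is a small Python sketch that builds a handoff brief from the plan file alone. The function name build_handoff, the wording of the brief, and the "- [ ]" / "- [x]" checkbox convention are all assumptions for illustration; the only idea taken from the lesson is that the new session starts from a short brief plus the file on disk.

```python
from pathlib import Path

# Hypothetical checklist markers used by the plan file in this example.
OPEN, DONE = "- [ ] ", "- [x] "


def build_handoff(plan: Path, notes: str = "") -> str:
    """Build the first prompt for a brand-new session from what is on disk."""
    lines = plan.read_text(encoding="utf-8").splitlines()
    done = [line[len(DONE):] for line in lines if line.startswith(DONE)]
    todo = [line[len(OPEN):] for line in lines if line.startswith(OPEN)]
    next_step = todo[0] if todo else "none - verify the done-condition and stop"
    brief = [
        f"You are resuming a long job. The plan lives in {plan.name}; re-read it first.",
        f"Progress: {len(done)} of {len(done) + len(todo)} steps ticked off.",
        f"Next step: {next_step}",
    ]
    if notes:
        # Optional one-line context the previous session chose to pass on.
        brief.append(f"Notes from the previous session: {notes}")
    return "\n".join(brief)
```

The brief stays short on purpose: it points at the plan file instead of copying it, so the fresh window carries the goal and the current position but none of the old conversation.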

Four ways to go long

Plan file - the goal, broken into a checklist on disk the agent keeps ticking off.
Sprint contract - the done-condition, agreed before any code so scope can't drift.
Ralph loop - a hook that re-injects the goal when the agent tries to quit early.
Full reset - tear down the session and rebuild from a short brief plus the plan file.

The one idea underneath

All four techniques are the same move in different clothing: on disk beats in the window. Conversation history is fragile - it fills, it rots, it vanishes on restart. A file doesn't. So anything the agent must not forget across a long run - the goal, the plan, the definition of done - belongs in a file, not in the chat. Get that right and a run can survive resets you didn't even plan for.

Check yourself

What do long agent runs tend to do?

On big multi-step jobs, agents drift off-goal, lose the thread, or stop early - largely because context rot degrades reasoning as the window fills. That's the whole reason you add structure that survives the run.

Why does a plan survive resets?

Externalising the plan to a file means it outlives the conversation window, a crash, or a fresh session. The agent re-reads it instead of having to remember it.

How does a Ralph loop keep the agent working?

A hook intercepts the agent's exit attempt and re-injects the goal into a fresh context, forcing it to continue a long job instead of stopping prematurely.

Do this now (5 min)

Pick a task with four or more steps. Before it touches anything, ask the agent to:

Write a plan file with a checklist and a one-line "done when…" at the top.
Work through the checklist, ticking each item off in the file as it completes it.

Halfway through, reset the session and point it back at the plan file. Notice it stays coherent - it reads where it was and carries on, instead of starting over.

Go deeper

Primary source (read this): Addy Osmani - Agent Harness Engineering, on planning files, the sprint contract, and the Ralph loop for keeping long runs on track.

Wisdom (test it on people): the HumanLayer community - a good place to compare notes on what actually holds up across very long autonomous runs.
