Spec-Driven AI Agents: Why Workflow Is the Unit of Work

Prompts start AI coding runs, but specs and workflow keep them on track. Why unattended agents need planning loops, analysis gates, verification, and git-backed handoffs.

Most teams still use AI coding agents prompt-first.

Write a big instruction. Paste it into the tool. Watch the agent edit code. Correct it when it drifts. Ask for another pass. Repeat until the result looks good enough to review. For small tasks, that works. For long-running coding work, it breaks down. The reason is not that prompts are useless or that the models cannot write code. It is that a prompt is the wrong unit of control once the task needs planning, implementation, verification, review, retry, recovery, and handoff.

A prompt can express intent. It cannot reliably enforce process over time. Once you want an AI coding agent to run unattended, the real unit of work is the workflow.

The human is usually the workflow engine

In a normal interactive AI coding session, the human does more than write the prompt. The human decides whether the plan is good enough, notices when the agent drifted, tells it to run tests, decides whether a failed test means retry, rollback, or change direction, keeps the original task in mind when the conversation gets noisy, and ultimately decides when the work is actually done.

That is fine when you are sitting at the keyboard. It is not fine if the goal is to stop babysitting the terminal. For unattended coding, those decisions have to live somewhere else. They cannot live only in the developer’s head, and they cannot live only in a chat transcript. They need to become part of the run.

That is what workflow is for.

Prompts drift because they are not durable

Prompts are good at starting work. They are bad at preserving the task contract across a long run.

A large prompt can contain the goal, the scope, the constraints, and the commands to run. But as the agent works, that original intent gets mixed with tool output, failed attempts, partial fixes, new assumptions, and whatever the model decides is locally useful. The longer the run, the easier it is for the original task to fade.

This is the common failure mode:

  1. The initial prompt describes the right task.
  2. The agent creates a plausible plan.
  3. Implementation exposes complications.
  4. The agent adapts locally.
  5. The local adaptation becomes the new direction.
  6. The final diff is coherent, but no longer quite matches the original task.

That is prompt drift. The agent did not necessarily fail dramatically. It just followed the nearest available context. Real engineering work needs something more stable.

The spec is the contract

A spec gives the run something durable to return to.

A good spec defines the goal, the scope, the non-goals, the constraints, the acceptance criteria, and the verification path.

The point is not to write a giant document. A giant spec can be just as bad as a giant prompt if it hides the important parts. The useful spec is a working contract: clear enough for the agent to act on, narrow enough for the workflow to check, and concrete enough for a human to review against later.

The spec answers a simple question: what are we actually trying to do? That sounds basic, but it is the difference between casual AI coding and unattended AI coding. A prompt starts the conversation. A spec survives the run.

The workflow is the process

A spec says what the task is. The workflow decides how the task moves. That distinction matters.

A spec can say:

Add the missing tests and make sure the suite passes.

But the workflow still has to decide whether the agent should plan before editing, whether the plan is specific enough to execute, whether weak planning should loop back before coding starts, whether development should happen in reviewable chunks, whether failed checks should send the run back to development, whether the final diff should be analyzed before commit, and whether the run should continue, recover, or stop.

This is why workflow becomes the unit of work. The model may write the code, but the workflow determines whether the run is still on track. Without workflow, the human is the process. With workflow, the process becomes explicit enough for the agent to run unattended.

Bigger prompts are not the same as checkpoints

A common response to agent drift is to make the prompt bigger: add more instructions, more rules, more warnings, more examples, more “do not” clauses.

Sometimes that helps. But bigger prompts do not solve the core problem. Long-running coding tasks need checkpoints.

A checkpoint gives the system a chance to ask whether the plan is strong enough to hand off to development, whether the diff is still aligned with the spec, whether the agent changed unrelated files, whether verification passed, whether the work is ready to commit, and whether the run should loop again.

This is the difference between asking an agent to produce one long answer and giving it a disciplined coding loop.

The checkpoint is where unattended work becomes controllable.
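
A checkpoint can be very small. The sketch below covers two of the questions above: did the agent touch files outside the spec's scope, and did verification pass? The function names, the glob-based notion of scope, and the `"commit"`/`"loop"` outcomes are illustrative assumptions, not part of any particular tool.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files, allowed_patterns):
    """Return the changed paths that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


def checkpoint(changed_files, allowed_patterns, checks_passed):
    """Decide what the run does next and say why: ('commit' | 'loop', reason)."""
    stray = out_of_scope(changed_files, allowed_patterns)
    if stray:
        return "loop", "out-of-scope files: " + ", ".join(stray)
    if not checks_passed:
        return "loop", "verification failed"
    return "commit", "diff is in scope and checks passed"
```

The useful property is that the decision is made by code the run cannot talk its way around, and the reason string ends up in the evidence a human reviews later.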

The Ralph loop needs structure

The original Ralph idea is simple: repeatedly run the agent in a fresh context and let the filesystem carry progress forward.

That idea is powerful because it avoids one of the biggest problems with long AI sessions: context degradation. Each pass starts clean. The agent reads the repo, the spec, the artifacts, and the current state of the code instead of dragging an ever-growing conversation behind it.

But the raw Ralph loop is still only a loop.
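
As a rough sketch, the raw loop looks something like this. The file names (`SPEC.md`, `PROGRESS.md`, `DONE`) and the `run_agent` callable are placeholders chosen for illustration; they are not conventions taken from Ralph or from any specific agent CLI.

```python
from pathlib import Path
from typing import Callable


def raw_ralph_loop(run_agent: Callable[[str], str], workdir: Path, max_passes: int) -> int:
    """Run up to max_passes fresh-context passes; return how many actually ran."""
    spec = (workdir / "SPEC.md").read_text()
    notes = workdir / "PROGRESS.md"
    for pass_number in range(1, max_passes + 1):
        # Each pass is rebuilt from what is on disk, never from earlier chat turns.
        progress = notes.read_text() if notes.exists() else ""
        prompt = f"{spec}\n\nProgress so far:\n{progress}"
        summary = run_agent(prompt)
        # The filesystem, not the conversation, carries progress forward.
        notes.write_text(progress + f"pass {pass_number}: {summary}\n")
        if (workdir / "DONE").exists():
            return pass_number
    return max_passes
```

Notice what is missing: nothing in this loop judges the plan, inspects the diff, or distinguishes one kind of failure from another.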

By itself, it does not know whether the plan is good. It does not know whether the diff is acceptable. It does not know whether a failure is a rate limit, an agent crash, a network problem, or a genuine “needs human input” moment.

For unattended coding, Ralph needs structure: planning loops, development loops, analysis gates, commits, recovery, and a handoff a human can review.

That is the shift from “rerun the prompt” to “run the workflow.”
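
Recovery is one place where that structure is easy to make concrete. Here is a minimal sketch, assuming the failure has already been classified into one of the kinds named above. The enum, the step names, and the retry budget are all hypothetical.

```python
from enum import Enum


class Failure(Enum):
    RATE_LIMIT = "rate_limit"
    NETWORK = "network"
    AGENT_CRASH = "agent_crash"
    NEEDS_HUMAN = "needs_human"


def next_step(failure: Failure, attempt: int, max_attempts: int = 3) -> str:
    """Map a classified failure to what the run should do next."""
    if failure is Failure.NEEDS_HUMAN:
        return "stop"  # retrying cannot answer a question only a person can
    if attempt >= max_attempts:
        return "stop"  # retry budget spent; hand off instead of spinning
    if failure is Failure.RATE_LIMIT:
        return "wait_then_retry"  # back off before the next attempt
    return "retry"  # network blips and crashes are worth another try
```

A bare loop treats all four cases the same way: it reruns. Making the distinction explicit is what lets a run stop for the right reasons.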

Verification is not optional

An AI agent saying “done” is not the same thing as the repository being in a good state. For unattended coding, the workflow has to run the checks the repo already trusts: tests, linters, type checks, formatters, build commands, and project-specific validation scripts.

Verification does not prove the work is perfect. Tests can be incomplete. Type checks do not know product intent. Linters do not know whether the feature is right. But verification creates objective pressure. It catches the basic failures that prompt-first workflows often hide until review time.

Unattended coding should not mean:

Come back later and discover the repo is broken.

It should mean coming back to a result that has already been pushed through the checks the repo knows how to run.
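
In code, "run the checks the repo already trusts" can be as plain as the sketch below. The helper names are made up for illustration, and the two demo commands are stand-ins so the example runs anywhere. A real run would list the repository's own test, lint, type-check, and build commands instead.

```python
import subprocess
import sys


def run_checks(commands):
    """Run each (name, argv) check and record (name, passed, output)."""
    results = []
    for name, argv in commands:
        proc = subprocess.run(argv, capture_output=True, text=True)
        results.append((name, proc.returncode == 0, proc.stdout + proc.stderr))
    return results


def all_passed(results):
    """True only if every recorded check exited with status 0."""
    return all(passed for _, passed, _ in results)


# Stand-in commands: one that exits 0 and one that exits 1.
demo = run_checks([
    ("always-passes", [sys.executable, "-c", "print('ok')"]),
    ("always-fails", [sys.executable, "-c", "raise SystemExit(1)"]),
])
for name, passed, _ in demo:
    print(f"{name}: {'pass' if passed else 'FAIL'}")
```

The captured output matters as much as the pass/fail flag, because it is what a later pass, or a human reviewer, reads to understand why a check failed.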

Review should compare against the spec

Passing tests is not enough. The final review still needs to ask whether the work matches the original task.

Did the agent solve the actual problem? Did it stay in scope? Did it invent requirements? Did it ignore a constraint? Did it make the diff harder to review than necessary?

This is why the spec has to stay visible until the end of the run.

The review phase should not only ask whether the code is valid. It also has to ask whether this is the change we actually asked for.

That distinction is where many AI coding sessions fail.

The code may compile. The tests may pass. The diff may look plausible. But the task contract may still be violated.

Spec-driven review closes that gap.

The handoff should be git history, not a chat transcript

A long chat transcript is a weak engineering artifact. It is hard to review, hard to reproduce, hard to compare against the final code, and it mixes decisions, mistakes, corrections, retries, and implementation details into one conversational stream.

A better handoff is repo-native: the original spec, the accepted plan, logs, artifacts, verification output, commits, and the final diff.

That is the kind of output an engineer can inspect. The point of unattended coding is not that the human disappears. It is that the human returns at a better boundary: not to every agent thought, failed attempt, and chat turn, but to a reviewable diff backed by the evidence that explains how it got there.

Where Ralph Workflow fits

[Ralph Workflow](/) is built for this kind of spec-driven unattended coding. It takes the Ralph loop (fresh-context passes where the filesystem carries progress forward) and turns it into a disciplined engineering run.

The default workflow is already there. Planning drafts a brief. Planning analysis checks whether it is strong enough. Development builds the diff. Development analysis checks whether the work holds. A successful pass commits. Then the next pass starts again from planning with fresh context and the latest repository state.

That structure matters because weak work should loop back before it lands.

A weak plan should not drift into code.

A weak diff should not cross the commit boundary.

A completed pass should leave evidence in git.
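
One way to picture that loop is as a small transition table. This is a sketch of the idea only. The phase names are hypothetical labels, the choice to send a rejected diff back to development is an assumption, and none of it is Ralph Workflow's actual configuration or code.

```python
# (phase, passed) -> next phase. Only the analysis phases can fail.
TRANSITIONS = {
    ("planning", True): "planning_analysis",
    ("planning_analysis", True): "development",
    ("planning_analysis", False): "planning",  # weak plan loops back
    ("development", True): "development_analysis",
    ("development_analysis", True): "commit",
    ("development_analysis", False): "development",  # weak diff loops back
    ("commit", True): "planning",  # next pass starts with fresh context
}


def next_phase(phase, passed=True):
    """Look up the next phase, rejecting transitions the table does not define."""
    try:
        return TRANSITIONS[(phase, passed)]
    except KeyError:
        raise ValueError(f"no transition from {phase!r} with passed={passed}") from None


def trace(verdicts, start="planning"):
    """Follow the table from start, consuming one verdict per step."""
    phases = [start]
    for passed in verdicts:
        phases.append(next_phase(phases[-1], passed))
    return phases
```

In this table the only edge into `commit` comes from a passing development analysis, which is the "should not cross the commit boundary" rule expressed as data.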

Ralph Workflow is not trying to be a hosted AI IDE. It is not trying to be a generic assistant. It is not asking you to design a workflow from scratch before the first run.

It is the operating system for autonomous coding: run agents unattended and come back to work your team can actually review.

A strong default matters

Custom workflows are useful, but they are not the first selling point.

The first selling point is that Ralph Workflow gives you a strong default operating model before you customize anything.

That matters because most teams do not want a workflow design project. They want to install the tool, point it at the agents they already use, write a spec, run the loop, and review the result.

The default loop exists so unattended coding can start out predictable instead of becoming another piece of glue code to maintain.

Customization comes later.

You can tune agent routing, fallback chains, loop transitions, and review policy when your team needs it.

But the initial promise is simpler:

Keep the default workflow. Bring your agents. Let it run.

Bring the agents you already use

Spec-driven workflow should not require locking into one model or one vendor.

Different parts of the coding loop benefit from different strengths.

Planning may need a stronger model. Development may benefit from a cheaper or faster agent. Analysis may benefit from a separate reviewer. Fallback may matter when a provider rate-limits or an agent stalls.

Ralph Workflow sits above the agents. It reuses the CLIs and authentication you already have installed, then routes phases through the configured agent chains.

The point is not abstraction for its own sake.

The point is that unattended coding should survive real-world conditions: rate limits, network failures, weak passes, model differences, and long runs that need to resume instead of restart.
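
A fallback chain is simple to sketch. The version below assumes each agent is wrapped as a plain callable that raises a hypothetical `AgentUnavailable` error when it is rate-limited or stalls. How a real agent CLI reports those conditions varies, and nothing here reflects Ralph Workflow's internals.

```python
class AgentUnavailable(Exception):
    """Raised by an agent wrapper when its provider cannot serve the request."""


def run_with_fallback(chain, task):
    """Try each (name, agent) in order; return (name, result) from the first that works."""
    errors = []
    for name, agent in chain:
        try:
            return name, agent(task)
        except AgentUnavailable as exc:
            # Record why this agent was skipped, then move down the chain.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all agents failed: " + "; ".join(errors))
```

Returning the name of the agent that finally did the work keeps the handoff honest: the reviewer can see which agent produced which pass.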

Good specs make better unattended runs

The quality of the spec still matters. A vague spec produces vague work.

Bad:

Improve the dashboard.

Better:

Add loading and empty states to the dashboard. Do not change the data model. Reuse the existing component style. Add tests for the empty state. Run the frontend test suite and type checker before review.

The second version gives the workflow something to enforce. It has scope, non-goals, verification, and a review target.

That is what unattended coding needs: not a perfect document, not a giant RFC, just a clear enough contract for the workflow to keep returning to.
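
To see why the second version is enforceable, it helps to split both examples into fields. The `Spec` class and its field names below are an illustrative assumption, not a required format.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    goal: str
    non_goals: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    acceptance: list[str] = field(default_factory=list)
    verification: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """Name the sections a workflow would have nothing to check against."""
        required = ("non_goals", "acceptance", "verification")
        return [name for name in required if not getattr(self, name)]


bad = Spec(goal="Improve the dashboard.")

better = Spec(
    goal="Add loading and empty states to the dashboard.",
    non_goals=["Do not change the data model."],
    constraints=["Reuse the existing component style."],
    acceptance=["Add tests for the empty state."],
    verification=["Run the frontend test suite and type checker before review."],
)

print(bad.gaps())     # ['non_goals', 'acceptance', 'verification']
print(better.gaps())  # []
```

The vague spec is not wrong so much as empty: every section a checkpoint could test against is missing.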

Start with the contract, then run the loop

If you want more reliable AI agent execution, do not start by making the prompt longer.

Start by writing down the contract: what should change, what should not change, how the work should be verified, and what a reviewer should expect to see.

Then run a workflow that keeps returning to that contract as the agent plans, edits, verifies, reviews, commits, and loops.

Use the docs to understand the model, then move into the getting started guide to try the default flow in a real repository.

The goal is not to eliminate prompts. The goal is to stop pretending the prompt alone is enough. Prompts express intent. Specs preserve the contract. Workflow enforces the process.

Once a coding task is large enough to require planning, retries, verification, review, and handoff, the workflow becomes the real unit of work.