Overnight Refactoring with Ralph Workflow: A Walkthrough

You have a messy 800-line Python module that mixes business logic with HTTP handling. You could spend Thursday afternoon untangling it. But Thursday afternoon is meetings.

What if you write the spec in five minutes, start the run, and close your laptop?

This is a walkthrough of doing exactly that with Ralph Workflow, from spec to merge decision.

Step 1: The spec (PROMPT.md)

The one-paragraph spec is the most important part. It is not a prompt. It is a contract.

For the refactoring task, I wrote this in PROMPT.md:

Refactor modules/order_handler.py into a clean layered structure:

1. Split into:
   - modules/orders/validation.py (input validation, schema checks)
   - modules/orders/pricing.py (discount logic, tax calculation)
   - modules/orders/persistence.py (database writes, status transitions)
   - modules/orders/api.py (HTTP handler — thin layer only)

2. Each new module must have its own unit test file following the existing
   pytest conventions in tests/orders/.

3. Preserve all existing behavior. Do not change the public API surface of
   the top-level order_handler module — it should import and re-export from
   the new modules.

4. The existing integration tests at tests/integration/test_orders.py must
   still pass without modification.

5. Format with black, lint with ruff.

That is 14 lines. It took about four minutes to write. The key is concrete scope boundaries, naming conventions, and an explicit "what must not break" constraint.

Step 2: The run

ralph --pipeline build

Then I closed the laptop.

Behind the scenes, Ralph Workflow goes through three phases automatically:

Planning phase

The planning agent reads the spec and the existing 800-line file, then produces a detailed plan breaking down exactly which functions go where, which imports need to move, the order of extraction, and the test structure. It uses Claude for planning — the expensive, thorough model.

🔍 What I saw in the morning: The plan contained a note I had not thought of: "order_handler.get_order() calls validation logic inline, so pull the schema check out first before any file-split, or the intermediate state will have duplicate validation paths." That is the kind of edge case you catch when a model reads the full file before writing anything.

Build phase

The build loop works through the plan's sub-tasks in sequence. Each extracted module is written, tested, and fixed if broken before the next module starts.

The build uses a cheaper model for the extraction work — the kind of mechanical "move this function, update that import" task that does not need Claude-level judgment. The plan from the previous phase acts as the ground truth, so the cheaper model is just executing a detailed blueprint.

🔍 What I saw in the morning: The work completed in 3 loops:

- Loop 1: validation.py extracted with tests. Passed.
- Loop 2: pricing.py extracted. One test initially failed because a constant was referenced before its import was updated. The build loop caught it, fixed the import, and re-ran the tests. Passed.
- Loop 3: persistence.py + api.py extracted together. Passed.

The key detail here: I never saw the broken state. The loop fixed it before handing back.
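To make the write, test, fix cycle concrete, here is a minimal sketch of what one build step could look like. It is not Ralph Workflow's actual implementation: `run_step`, `StepResult`, and the three callbacks are hypothetical names, and the simulated failure only mimics the stale-import problem from Loop 2.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    attempts: int = 0
    passed: bool = False
    failures: List[str] = field(default_factory=list)


def run_step(
    name: str,
    apply_change: Callable[[], None],
    run_tests: Callable[[], Tuple[bool, str]],
    fix: Callable[[str], None],
    max_attempts: int = 3,
) -> StepResult:
    """Apply one sub-task, then test and repair until green or out of attempts."""
    result = StepResult(name=name)
    apply_change()
    for attempt in range(1, max_attempts + 1):
        result.attempts = attempt
        ok, output = run_tests()
        if ok:
            result.passed = True
            break
        result.failures.append(output)  # keep the failure for the report
        if attempt < max_attempts:
            fix(output)  # hand the failure output to whatever repairs the code
    return result


# Simulated Loop 2: the first test run fails on a stale import, then the fix lands.
state = {"import_updated": False}


def extract_pricing() -> None:
    pass  # stand-in for "write modules/orders/pricing.py"


def fake_tests() -> Tuple[bool, str]:
    if state["import_updated"]:
        return True, "all tests passed"
    return False, "NameError: name 'ORDER_STATUSES' is not defined"


def fake_fix(test_output: str) -> None:
    state["import_updated"] = True  # stand-in for "update the import"


result = run_step("pricing.py", extract_pricing, fake_tests, fake_fix)
print(result.passed, result.attempts, result.failures)
```

The point of the shape is that a failed attempt is recorded rather than hidden, so the morning report can still show what was fixed even though the broken state never reaches you.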

Verification phase

The verification phase runs the full integration test suite, checks that the public API surface is unchanged (no deleted or renamed exports), and runs the linter and formatter.
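The "no deleted or renamed exports" check can be approximated in a few lines. The sketch below is an assumption about how such a check might work, not a description of the tool's internals: `public_names` and `missing_exports` are hypothetical helpers, `create_order` is an invented function name, and the two stand-in modules are built in memory so the example runs on its own.

```python
import types
from typing import Set


def public_names(module: types.ModuleType) -> Set[str]:
    """Names a module exposes: __all__ if defined, else non-underscore attributes."""
    declared = getattr(module, "__all__", None)
    if declared is not None:
        return set(declared)
    return {name for name in vars(module) if not name.startswith("_")}


def missing_exports(before: Set[str], after: Set[str]) -> Set[str]:
    """Exports that existed before the refactor but are gone (or renamed) after."""
    return before - after


# Stand-in for order_handler as it looked before the refactor.
old = types.ModuleType("order_handler_before")
old.create_order = lambda payload: payload  # hypothetical function
old.get_order = lambda order_id: order_id

# Stand-in for a refactored order_handler that forgot to re-export get_order.
new = types.ModuleType("order_handler_after")
new.create_order = old.create_order

removed = missing_exports(public_names(old), public_names(new))
print(sorted(removed))
```

In a real run the "before" set would have to be captured ahead of the refactor (for example, saved to a file), because the module at that import path is the thing being changed.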

Everything passed.

Step 3: The morning review

The run produced:

artifacts/2026-05-28/
├── plan.md                          # the detailed plan
├── modules/orders/validation.py     # 120 lines
├── modules/orders/pricing.py        # 95 lines
├── modules/orders/persistence.py    # 180 lines
├── modules/orders/api.py            # 65 lines
├── tests/orders/test_validation.py  # 45 lines
├── tests/orders/test_pricing.py     # 38 lines
├── tests/orders/test_persistence.py # 52 lines
└── run-report.md                    # what happened, what was fixed

The run-report.md includes a clean summary of every checkpoint, every test result, and the one auto-fixed import issue.
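A report like that is mostly a log of commands and exit codes. As a rough sketch of the idea, assuming a simple line-per-command layout (the real run-report.md format is not shown here, and `run_and_record` and `render_report` are hypothetical names):

```python
import shlex
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_and_record(commands: Sequence[Sequence[str]]) -> List[Tuple[str, int]]:
    """Run each command and keep its text and exit code, pass or fail."""
    records = []
    for argv in commands:
        completed = subprocess.run(list(argv), capture_output=True, text=True)
        records.append((shlex.join(argv), completed.returncode))
    return records


def render_report(records: Sequence[Tuple[str, int]]) -> str:
    """Format the records as plain lines suitable for a report file."""
    lines = []
    for command, code in records:
        status = "ok" if code == 0 else "FAILED"
        lines.append(f"{status} (exit {code}): {command}")
    return "\n".join(lines)


# In a real run the list might hold commands such as:
#   ["pytest", "tests/integration/test_orders.py"]
#   ["black", "--check", "."]
#   ["ruff", "check", "."]
# Two trivial Python commands stand in here so the sketch runs anywhere.
demo_commands = [
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(3)"],
]

records = run_and_record(demo_commands)
report = render_report(records)
print(report)
```

Recording the exit code of every command, including the failing ones, is what lets a reviewer confirm that the tests actually ran.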

I opened the diff in my editor. The structure was clean. The one auto-fix was documented — a missing import of ORDER_STATUSES that would have been a runtime error if the loop had not caught it.

Verdict: merged. The diff was specific, tested, and honest enough that I was comfortable shipping it without additional changes.

What I learned

- Spec writing is the bottleneck, not the tool. The quality of the result was proportional to how specific the PROMPT.md was. Vague specs produce code that needs rework. Tight specs produce code ready to merge.
- The auto-fix matters. The one import error the build loop caught and fixed is exactly the kind of thing you discover at 2 AM if you run a raw agent. Having the loop catch it before morning is the difference between "ready to review" and "broken, restart."
- The artifact structure removes trust issues. I did not have to wonder "did it actually run the tests?" because the run-report.md shows every command and exit code. The plan shows the reasoning before any code was written.
- This is not for every task. Quick one-file edits are faster with Claude Code directly. The workflow shines when the task has clear boundaries, production risk if done wrong, and a spec that fits in one paragraph.

Try it tonight

The honest first task is not the demo project on your desktop. It is one bounded real task from your backlog — a refactoring, a unit test migration, a well-specified feature slice — written up in PROMPT.md and run while you sleep.

Pick one task tonight. Write the spec. Run it. Decide in the morning whether you would merge it. That is the only honest evaluation.