Good vs Bad Unattended AI Coding Tasks: How to Know Before You Start

Not every backlog item works as an overnight coding run. Here is how to tell the good fits from the bad ones before you spend an evening setting it up.

The fastest way to kill trust in an AI coding pipeline is to give it the wrong kind of task and blame the tool when the output is disappointing. The second-fastest way is to avoid it entirely because you are not sure which tasks are safe to hand off.

This guide gives you a concrete framework for deciding whether a backlog item is a good fit for an unattended coding run. It is not about the tool — you can apply this thinking whether you use Ralph Workflow, a raw agent session, or any other pipeline. The principles are the same.

The four-attribute test

A task is a good fit for unattended coding when it scores well on four attributes. Each attribute is a continuum, not a binary switch, but the pattern is consistent: good tasks cluster toward the left side of each spectrum.

Attribute 1: Specification clarity

| Good fit | Bad fit |
| --- | --- |
| "Replace the SQL queries in report_store.py with SQLAlchemy ORM calls, keeping the same method signatures and return types." | "Improve the database layer." |
| The specification leaves almost nothing to interpretation. | The specification leaves everything to interpretation. |

Specification clarity is not about length. A three-sentence spec with sharp boundaries beats a three-paragraph spec that meanders. The test: if another developer on your team could run with the spec and produce something close to what you expect, the spec is clear enough.

Attribute 2: Verifiability

| Good fit | Bad fit |
| --- | --- |
| "pytest tests/unit/test_report_store.py -v must pass without modification." | "The code should be cleaner." |
| Correctness is mechanically checkable. A script says yes or no. | Correctness is subjective. Two reasonable people could disagree. |

Verifiability is the attribute that separates trust from wishful thinking. If you cannot write a concrete check that the agent can run, you cannot trust the output. The check does not need to be exhaustive — it needs to catch the failure modes that would make you reject the result.
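
A gate of this kind can be as small as a wrapper that runs one command and reduces it to yes or no. The following is a minimal sketch, not part of Ralph Workflow: the helper name `run_check` and the ten-minute timeout are hypothetical, and it assumes the check signals failure through a nonzero exit code, as pytest does.

```python
import subprocess


def run_check(command: list[str], timeout_s: int = 600) -> bool:
    """Run one verification command and reduce it to pass/fail.

    Hypothetical helper: assumes the command exits with 0 on success
    and with a nonzero code on failure.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        # A check that hangs or cannot be started counts as a failure.
        return False
    return result.returncode == 0
```

With pytest installed, `run_check(["pytest", "tests/unit/test_report_store.py", "-v"])` would answer the good-fit example above with a single boolean. Treating timeouts and missing executables as failures is a deliberate choice here: an unattended run should never pass because its check could not run.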

Attribute 3: Boundary isolation

| Good fit | Bad fit |
| --- | --- |
| "Modify dal/report_store.py and models/report.py only. Do not touch anything in api/ or handlers/." | "Build the reporting feature." |
| The work fits inside a named set of files or modules. | The work touches everything and you discover the blast radius afterward. |

Isolation matters for two reasons. First, a bounded diff is reviewable in a few minutes. Second, if the run goes sideways, you know exactly where to look. Unbounded tasks produce unbounded diffs, and unbounded diffs are the #1 reason people abandon unattended coding after one bad experience.
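
The boundary can also be checked mechanically after a run. Here is a minimal sketch, assuming you already have the list of changed paths (for example from `git diff --name-only`). The function name and the glob patterns are illustrative, and `api/routes.py` is a made-up file used only to trigger a violation.

```python
from fnmatch import fnmatch


def out_of_bounds(changed: list[str], allowed: list[str], forbidden: list[str]) -> list[str]:
    """Return the changed paths that break the boundary.

    A path is a violation if it matches any forbidden pattern or
    matches none of the allowed patterns.
    """
    violations = []
    for path in changed:
        is_forbidden = any(fnmatch(path, pattern) for pattern in forbidden)
        is_allowed = any(fnmatch(path, pattern) for pattern in allowed)
        if is_forbidden or not is_allowed:
            violations.append(path)
    return violations


# Boundary taken from the good-fit example above.
allowed = ["dal/report_store.py", "models/report.py"]
forbidden = ["api/*", "handlers/*"]

# "api/routes.py" is a hypothetical file outside the boundary.
print(out_of_bounds(["dal/report_store.py", "api/routes.py"], allowed, forbidden))
# prints ['api/routes.py']
```

An empty result means the diff stayed inside the fence. Anything else is a reason to stop before reading the diff itself.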

Attribute 4: Non-critical surface

| Good fit | Bad fit |
| --- | --- |
| A refactoring of an internal module covered by tests. If it fails, tests catch it; if tests pass but behavior drifts, you catch it in review. | A change to the payment-processing path that, if wrong, silently double-charges customers. |
| Failure is detectable and contained. | Failure is silent and expensive. |

The rule for first tasks: pick something where the worst-case failure mode is "the diff looks wrong and you do not merge it," not "something breaks in production and you find out from customers." As you build trust in the pipeline, you can raise the stakes. Start low.

The scoring check: 10-second task evaluation

Run any backlog item through this checklist before you hand it to an agent:

  1. Can I describe what "done" means in one sentence? (yes = +1)
  2. Is there a concrete, runnable check that tells me whether the result is correct? (yes = +1; bonus +1 if it already exists as a test)
  3. Is the blast radius bounded to named files or modules? (yes = +1)
  4. Is the worst-case failure "the diff is wrong and I do not merge it" rather than something that silently breaks? (yes = +1)

Score interpretation:

  • 4-5: Excellent fit. Write the spec, start the run, sleep well.
  • 3: Decent fit. Tighten the spec's verification section and run it.
  • 2: Marginally okay for a third or fourth task after you have built some trust. Not for a first task.
  • 0-1: This task needs too much judgment, too much discovery, or touches too many sensitive surfaces. Do it yourself or break it into smaller pieces until the pieces start scoring higher.
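
The checklist is simple enough to write down as a function. This is an illustrative sketch of the arithmetic only. The class and field names are invented for the example, and the bands follow the score interpretation above. It handles plain yes/no answers, not half points or penalties.

```python
from dataclasses import dataclass


@dataclass
class TaskAnswers:
    """Answers to the four checklist questions (hypothetical structure)."""
    done_in_one_sentence: bool
    has_runnable_check: bool
    check_already_exists: bool  # bonus; only counts when a runnable check exists
    bounded_blast_radius: bool
    failure_is_visible: bool    # worst case is "I do not merge the diff"


def score(answers: TaskAnswers) -> int:
    """Add one point per 'yes', plus the bonus for a pre-existing check."""
    points = int(answers.done_in_one_sentence)
    points += int(answers.has_runnable_check)
    points += int(answers.has_runnable_check and answers.check_already_exists)
    points += int(answers.bounded_blast_radius)
    points += int(answers.failure_is_visible)
    return points


def interpret(points: int) -> str:
    """Map a score to the bands in the score interpretation."""
    if points >= 4:
        return "excellent fit"
    if points == 3:
        return "decent fit"
    if points == 2:
        return "marginal; not for a first task"
    return "do it yourself or break it into smaller pieces"
```

For a task that answers yes to every question and already has a test, `score(TaskAnswers(True, True, True, True, True))` returns 5, which `interpret` maps to "excellent fit".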

Real examples, scored

Example 1: Migrate a test suite from unittest to pytest

  • Done in one sentence: "Convert all unittest.TestCase subclasses in tests/ to plain pytest functions and replace self.assertEqual with bare assert." ✓ (+1)
  • Runnable check: pytest tests/ -v must pass for all existing tests, and coverage must not drop. ✓ (+1, plus the +1 bonus because the check already exists in CI)
  • Blast radius: tests/ directory only. ✓ (+1)
  • Worst case: tests fail, you do not merge, you learn something about the test suite. ✓ (+1)
  • Score: 5/5. Excellent fit.

Example 2: Add unit tests for an untested service class

  • Done in one sentence: "Write unit tests for UserService covering all public methods, targeting 80% branch coverage." ✓ (+1)
  • Runnable check: pytest tests/unit/test_user_service.py --cov=app/services/user_service.py --cov-report=term ✓ (+1; no bonus, because the check does not exist yet and you have to define it)
  • Blast radius: new test file plus maybe minor refactors to make the service testable. Mostly additive. ✓ (+1)
  • Worst case: tests pass but are weak, you find gaps during review, you ask for a second pass. Annoying but not dangerous. ✓ (+1)
  • Score: 4/5. Good fit.

Example 3: Refactor an 800-line module into smaller files

  • Done in one sentence: "Split handlers.py (837 lines) into handlers/auth.py, handlers/events.py, and handlers/export.py with a shared handlers/base.py, keeping all public imports stable." ✓ (+1)
  • Runnable check: pytest tests/ passes, python -c "from app.handlers import EventHandler" succeeds. ✓ (+1)
  • Blast radius: handlers.py becomes a package, so import paths change. Compatibility shims (e.g., @deprecated re-exports) may be needed to keep old imports working. ⚠ (+0.5)
  • Worst case: imports break in unexpected places, CI catches it, you fix two import paths manually. Annoying but reversible. ✓ (+1)
  • Score: 3.5/5. Good fit with a caveat — write a stronger "what must not change" section to protect import paths.
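
The import half of that check can be extended from one name to a list. This is a minimal sketch, assuming you maintain the list of public names yourself. `missing_public_names` is a hypothetical helper, while `app.handlers` and `EventHandler` come from the example above.

```python
import importlib


def missing_public_names(expected: dict[str, list[str]]) -> list[str]:
    """Return 'module.name' entries that no longer import or resolve."""
    missing = []
    for module_name, names in expected.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # The whole module is gone, so every expected name is missing.
            missing.extend(f"{module_name}.{name}" for name in names)
            continue
        missing.extend(
            f"{module_name}.{name}" for name in names if not hasattr(module, name)
        )
    return missing
```

Run inside the project after the refactor, `missing_public_names({"app.handlers": ["EventHandler"]})` should return an empty list. Any entry it does return is an import path the split broke.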

Example 4: "Add notifications for new events"

  • Done in one sentence: "When a new event is created, send an email to all participants." It seems clear, but "send an email" hides dozens of decisions: which template, HTML or plain text, digest batching, unsubscribe links. ✗ (+0.5: the sentence exists but pins down nothing specific)
  • Runnable check: What exactly do you test? That an email was queued? Delivered? Rendered correctly across clients? No existing tests for the notification path. ✗ (+0)
  • Blast radius: touches event creation, user preferences, email infrastructure, job queueing. ✗ (+0)
  • Worst case: emails go to wrong recipients, HTML renders broken, unsubscribe links are missing, somebody reports you as spam. ⚠ (-1 for silent production failure)
  • Score: -0.5/5. Bad fit. Before handing this to an agent, you would need to decompose it into: "Create the NotificationJob class" (score 4), "Add the email template" (score 3), "Wire the job dispatch into EventService.create" (score 3), etc. Each sub-task scores well even though the parent task does not.

Example 5: "Improve the landing page design"

  • Done in one sentence: ✗ — "improve" is not a spec, it is a feeling. (+0)
  • Runnable check: What test verifies "better design"? None. ✗ (+0)
  • Blast radius: potentially the entire frontend. ✗ (+0)
  • Worst case: not dangerous, but you will spend more time writing feedback than you would have spent just designing it yourself. ✗ (+0)
  • Score: 0/5. Bad fit. This is a design task, not an engineering task. These need human taste.

Why most first tasks fail (and how to avoid it)

The pattern across failed first-run reports is remarkably consistent:

Problem 1: The task was too ambiguous. The agent filled in the gaps with reasonable guesses that were wrong for the project. Fix: write the spec as if you were handing the task to a capable junior developer who has never seen your codebase.

Problem 2: The diff was too large to review. A 600-line diff with no tests attached feels like a forensic investigation. Fix: scope the task to produce a diff under 300 lines and require tests as part of the verification gate.
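
The size limit can be enforced rather than eyeballed. The sketch below assumes its input is the text produced by `git diff --numstat`: one line per file with the added count, the deleted count, and the path separated by tabs, and `-` in place of counts for binary files. The function names are hypothetical, and the 300-line default mirrors the fix above.

```python
def changed_line_count(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of counts
        total += int(added) + int(deleted)
    return total


def within_budget(numstat: str, max_lines: int = 300) -> bool:
    """True when the diff is small enough to review comfortably."""
    return changed_line_count(numstat) <= max_lines


# Sample input with made-up numbers, including one binary file.
sample = "120\t40\tdal/report_store.py\n15\t3\tmodels/report.py\n-\t-\tdocs/diagram.png\n"
print(changed_line_count(sample), within_budget(sample))
# prints 178 True
```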

Problem 3: There was no verification gate. The agent said done, the human checked manually, found issues, and lost trust. Fix: every spec must include at least one concrete, runnable verification step. Zero-step specs are a red flag for the task itself.

Problem 4: The task touched a critical path without guardrails. The agent modified the auth system because "it was needed" and the spec did not forbid it. Fix: the "what must not change" section is mandatory. It is a fence, not a suggestion.
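
Taken together, the four fixes describe the minimum contents of a spec. One way to stay honest is to treat the spec as data and refuse to start a run while any section is empty. This is an illustrative sketch only: the `TaskSpec` fields are invented for the example and do not reflect Ralph Workflow's spec format.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """Hypothetical spec structure; the field names are illustrative."""
    done_means: str
    verification_commands: list[str] = field(default_factory=list)
    files_in_scope: list[str] = field(default_factory=list)
    must_not_change: list[str] = field(default_factory=list)


def spec_problems(spec: TaskSpec) -> list[str]:
    """List the sections that are missing; an empty list means ready to run."""
    problems = []
    if not spec.done_means.strip():
        problems.append("no definition of done")
    if not spec.verification_commands:
        problems.append("no runnable verification step")
    if not spec.files_in_scope:
        problems.append("no files in scope")
    if not spec.must_not_change:
        problems.append('empty "what must not change" section')
    return problems


vague = TaskSpec(done_means="Improve the database layer.")
print(spec_problems(vague))
```

The vague spec has a sentence but nothing else, so the sketch reports three missing sections: the verification step, the files in scope, and the "what must not change" fence.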

The one-minute gut check

Before you write any spec, ask yourself:

If an over-eager intern with good technical skills but no context on my project did this task, and I only had three minutes to review the result, could I confidently decide whether to merge it?

If the answer is yes, the task fits. If the answer is "I would need to read every line" or "I would need to understand the design decisions first," the task is too open-ended. Tighten the spec or pick a different task.

What to read next

Ralph Workflow is free, open source (AGPL-3.0), and runs on your machine using the coding agents you already have installed. Inspect it on Codeberg or check the mirror on GitHub. Full documentation at ralphworkflow.com/docs.

When Your Overnight AI Coding Run Fails: A Troubleshooting Guide

Your first unattended coding run returned gibberish, hit an API limit at 3 AM, or left you with a half-built PR. Before you give up on the whole idea, check the five most common failure modes — and the fixes that actually work.

