Good vs Bad Unattended AI Coding Tasks: How to Know Before You Start

The fastest way to kill trust in an AI coding pipeline is to give it the wrong kind of task and blame the tool when the output is disappointing. The second-fastest way is to avoid it entirely because you are not sure which tasks are safe to hand off.

This guide gives you a concrete framework for deciding whether a backlog item is a good fit for an unattended coding run. It is not about the tool — you can apply this thinking whether you use Ralph Workflow, a raw agent session, or any other pipeline. The principles are the same.

The four-attribute test

A task is a good fit for unattended coding when it scores well on four attributes. Each attribute is a continuum, not a binary switch, but the pattern is consistent: good tasks cluster toward the left side of each spectrum.

Attribute 1: Specification clarity

Good fit	Bad fit
"Replace the SQL queries in `report_store.py` with SQLAlchemy ORM calls, keeping the same method signatures and return types."	"Improve the database layer."
The specification leaves almost nothing to interpretation.	The specification leaves everything to interpretation.

Specification clarity is not about length. A three-sentence spec with sharp boundaries beats a three-paragraph spec that meanders. The test: if another developer on your team could run with the spec and produce something close to what you expect, the spec is clear enough.

Attribute 2: Verifiability

Good fit	Bad fit
"`pytest tests/unit/test_report_store.py -v` must pass without modification."	"The code should be cleaner."
Correctness is mechanically checkable. A script says yes or no.	Correctness is subjective. Two reasonable people could disagree.

Verifiability is the attribute that separates trust from wishful thinking. If you cannot write a concrete check that the agent can run, you cannot trust the output. The check does not need to be exhaustive — it needs to catch the failure modes that would make you reject the result.
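A "script says yes or no" check can be sketched in a few lines. This is a minimal illustration, not part of any particular pipeline: the function name `run_gate`, the `PYTEST_GATE` constant, and the ten-minute timeout are assumptions made for the example. Only the pytest command comes from the table above.

```python
import subprocess
import sys
from typing import Sequence

# Hypothetical gate: the test command quoted in the table above,
# run through the current interpreter.
PYTEST_GATE = [sys.executable, "-m", "pytest", "tests/unit/test_report_store.py", "-v"]


def run_gate(command: Sequence[str], timeout_s: float = 600.0) -> bool:
    """Run one verification command and return True only if it exits with status 0."""
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,  # keep the gate's output out of the caller's terminal
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # A check that cannot start, or that hangs, counts as a failure, not a pass.
        return False
    return result.returncode == 0
```

Calling `run_gate(PYTEST_GATE)` from the repository root would then give a plain yes or no. Treating "could not run" as a failure is a deliberate choice here: a gate that silently passes when the test runner is missing verifies nothing.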

Attribute 3: Boundary isolation

Good fit	Bad fit
"Modify `dal/report_store.py` and `models/report.py` only. Do not touch anything in `api/` or `handlers/`."	"Build the reporting feature."
The work fits inside a named set of files or modules.	The work touches everything and you discover the blast radius afterward.

Isolation matters for two reasons. First, a bounded diff is reviewable in a few minutes. Second, if the run goes sideways, you know exactly where to look. Unbounded tasks produce unbounded diffs, and unbounded diffs are the #1 reason people abandon unattended coding after one bad experience.
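The boundary can also be checked mechanically after a run. The sketch below assumes you already have the list of changed paths (for example from `git diff --name-only`). The function name `out_of_bounds` and the file `api/routes.py` are hypothetical and used only for illustration.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def out_of_bounds(
    changed: Iterable[str],
    allowed: Iterable[str],
    forbidden: Iterable[str] = (),
) -> List[str]:
    """Return changed paths that match a forbidden pattern or match no allowed pattern."""
    allowed_patterns = list(allowed)
    forbidden_patterns = list(forbidden)
    violations = []
    for path in changed:
        # fnmatchcase's "*" also matches "/", so "api/*" covers nested files.
        is_forbidden = any(fnmatchcase(path, p) for p in forbidden_patterns)
        is_allowed = any(fnmatchcase(path, p) for p in allowed_patterns)
        if is_forbidden or not is_allowed:
            violations.append(path)
    return violations


# The spec from the table: two named files, nothing in api/ or handlers/.
ALLOWED = ["dal/report_store.py", "models/report.py"]
FORBIDDEN = ["api/*", "handlers/*"]
```

An empty result means the diff stayed inside the fence. Anything else is a reason to stop before reading the diff in detail.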

Attribute 4: Non-critical surface

Good fit	Bad fit
A refactoring of an internal module covered by tests. If it fails, tests catch it; if tests pass but behavior drifts, you catch it in review.	A change to the payment-processing path that, if wrong, silently double-charges customers.
Failure is detectable and contained.	Failure is silent and expensive.

The rule for first tasks: pick something where the worst-case failure mode is "the diff looks wrong and you do not merge it," not "something breaks in production and you find out from customers." As you build trust in the pipeline, you can raise the stakes. Start low.

The scoring check: 10-second task evaluation

Run any backlog item through this checklist before you hand it to an agent:

Can I describe what "done" means in one sentence? (yes = +1)
Is there a concrete, runnable check that tells me whether the result is correct? (yes = +1; bonus +1 if it already exists as a test)
Is the blast radius bounded to named files or modules? (yes = +1)
Is the worst-case failure "the diff is wrong and I do not merge it" rather than something that silently breaks? (yes = +1)

Score interpretation:

4-5: Excellent fit. Write the spec, start the run, sleep well.
3: Decent fit. Tighten the spec's verification section and run it.
2: Marginally okay for a third or fourth task after you have built some trust. Not for a first task.
0-1: This task needs too much judgment, too much discovery, or touches too many sensitive surfaces. Do it yourself or break it into smaller pieces until the pieces start scoring higher.
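The checklist is simple enough to encode directly. The sketch below is one possible encoding, with assumed names (`TaskAnswers`, `score`, `interpret`). It uses whole points only, so the half-points and penalties that appear in the worked examples later are a manual judgment layered on top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskAnswers:
    """Yes/no answers to the four checklist questions, plus the bonus condition."""
    done_in_one_sentence: bool
    runnable_check: bool
    check_already_exists: bool  # bonus applies only if the check exists as a test
    bounded_blast_radius: bool
    worst_case_is_unmerged_diff: bool


def score(answers: TaskAnswers) -> int:
    """Sum one point per 'yes', plus one bonus point for a pre-existing check."""
    total = (
        int(answers.done_in_one_sentence)
        + int(answers.runnable_check)
        + int(answers.bounded_blast_radius)
        + int(answers.worst_case_is_unmerged_diff)
    )
    if answers.runnable_check and answers.check_already_exists:
        total += 1
    return total


def interpret(total: int) -> str:
    """Map a score to the bands described above."""
    if total >= 4:
        return "excellent fit"
    if total == 3:
        return "decent fit"
    if total == 2:
        return "marginal: not for a first task"
    return "do it yourself or decompose"
```

For instance, a task with a clear definition of done, an existing test as the check, a bounded blast radius, and a harmless worst case scores 5 and lands in the top band.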

Real examples, scored

Example 1: Migrate a test suite from unittest to pytest

Done in one sentence: "Convert all unittest.TestCase subclasses in tests/ to plain pytest functions and replace self.assertEqual with bare assert." ✓ (+1)
Runnable check: pytest tests/ -v — all existing tests must pass, coverage must not drop. ✓ (+1, exists as CI check = +1)
Blast radius: tests/ directory only. ✓ (+1)
Worst case: tests fail, you do not merge, you learn something about the test suite. ✓ (+1)
Score: 5/5. Excellent fit.
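To make the "done" sentence concrete, here is what one converted test might look like before and after. The function `slugify` and both test names are hypothetical and stand in for whatever the real suite covers.

```python
import unittest


def slugify(title: str) -> str:
    """Hypothetical function under test: lowercase the title and join words with hyphens."""
    return "-".join(title.lower().split())


# Before: a unittest.TestCase subclass using self.assertEqual.
class SlugifyTests(unittest.TestCase):
    def test_spaces_become_hyphens(self):
        self.assertEqual(slugify("Weekly Report"), "weekly-report")


# After: a plain pytest-style function using a bare assert.
def test_spaces_become_hyphens():
    assert slugify("Weekly Report") == "weekly-report"
```

Because pytest also collects `unittest.TestCase` classes, a partially migrated suite still runs under the same `pytest tests/ -v` check, which keeps the verification step usable throughout the migration.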

Example 2: Add unit tests for an untested service class

Done in one sentence: "Write unit tests for UserService covering all public methods, targeting 80% branch coverage." ✓ (+1)
Runnable check: pytest tests/unit/test_user_service.py --cov=app.services.user_service --cov-branch --cov-report=term ✓ (+1; no bonus, because you have to define the check rather than reuse an existing test)
Blast radius: new test file plus maybe minor refactors to make the service testable. Mostly additive. ✓ (+1)
Worst case: tests pass but are weak, you find gaps during review, you ask for a second pass. Annoying but not dangerous. ✓ (+1)
Score: 4/5. Good fit.

Example 3: Refactor an 800-line module into smaller files

Done in one sentence: "Split handlers.py (837 lines) into handlers/auth.py, handlers/events.py, and handlers/export.py with a shared handlers/base.py, keeping all public imports stable." ✓ (+1)
Runnable check: pytest tests/ passes, python -c "from app.handlers import EventHandler" succeeds. ✓ (+1)
Blast radius: handlers.py becomes a package, so import paths change. Compatibility aids (e.g., @deprecated shims) may be needed. ⚠ (+0.5)
Worst case: imports break in unexpected places, CI catches it, you fix two import paths manually. Annoying but reversible. ✓ (+1)
Score: 3.5/5. Good fit with a caveat — write a stronger "what must not change" section to protect import paths.
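The "keeping all public imports stable" clause can be turned into a runnable check that generalizes the one-line `python -c` import test. This is a sketch under assumptions: the helper name `missing_public_names` is invented for the example, and the list of names to protect is something you would write down before the run.

```python
import importlib
from typing import Dict, List, Sequence


def missing_public_names(expected: Dict[str, Sequence[str]]) -> List[str]:
    """Return every module or 'module.name' from `expected` that can no longer be imported."""
    missing = []
    for module_name, names in expected.items():
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # The whole module is gone or broken; report it once and move on.
            missing.append(module_name)
            continue
        for name in names:
            if not hasattr(module, name):
                missing.append(f"{module_name}.{name}")
    return missing
```

For the refactor above, the call would be `missing_public_names({"app.handlers": ["EventHandler"]})`, extended with whatever other names callers rely on. An empty list means the protected import paths survived the split.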

Example 4: "Add notifications for new events"

Done in one sentence: "When a new event is created, send an email to all participants." This seems clear, but "send an email" hides dozens of decisions: which template? HTML or text? What about digest batching and unsubscribe links? ✗ (+0.5: the sentence exists but does not pin down anything specific)
Runnable check: What exactly do you test? That an email was queued? Delivered? Rendered correctly across clients? No existing tests for the notification path. ✗ (+0)
Blast radius: touches event creation, user preferences, email infrastructure, job queueing. ✗ (+0)
Worst case: emails go to wrong recipients, HTML renders broken, unsubscribe links are missing, somebody reports you as spam. ⚠ (-1 for silent production failure)
Score: -0.5/5. Bad fit. Before handing this to an agent, you would need to decompose it into: "Create the NotificationJob class" (score 4), "Add the email template" (score 3), "Wire the job dispatch into EventService.create" (score 3), etc. Each sub-task scores well even though the parent task does not.

Example 5: "Improve the landing page design"

Done in one sentence: ✗ — "improve" is not a spec, it is a feeling. (+0)
Runnable check: What test verifies "better design"? None. ✗ (+0)
Blast radius: potentially the entire frontend. ✗ (+0)
Worst case: not dangerous, but you will spend more time writing feedback than you would have spent just designing it yourself. ✗ (+0)
Score: 0/5. Bad fit. This is a design task, not an engineering task. These need human taste.

Why most first tasks fail (and how to avoid it)

The pattern across failed first-run reports is remarkably consistent:

Problem 1: The task was too ambiguous. The agent filled in the gaps with reasonable guesses that were wrong for the project. Fix: write the spec as if you were handing the task to a capable junior developer who has never seen your codebase.

Problem 2: The diff was too large to review. A 600-line diff with no tests attached feels like a forensic investigation. Fix: scope the task to produce a diff under 300 lines and require tests as part of the verification gate.

Problem 3: There was no verification gate. The agent said done, the human checked manually, found issues, and lost trust. Fix: every spec must include at least one concrete, runnable verification step. Zero-step specs are a red flag for the task itself.

Problem 4: The task touched a critical path without guardrails. The agent modified the auth system because "it was needed" and the spec did not forbid it. Fix: the "what must not change" section is mandatory. It is a fence, not a suggestion.
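The diff-size limit from Problem 2 is easy to enforce automatically. The sketch below parses the text produced by `git diff --numstat` (one line per file: added count, deleted count, path, separated by tabs, with `-` in place of counts for binary files). The function names and the way the 300-line budget is applied are assumptions for illustration.

```python
def count_changed_lines(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output, skipping binary files."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        added, deleted, _path = parts
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of line counts
        total += int(added) + int(deleted)
    return total


def within_budget(numstat: str, max_lines: int = 300) -> bool:
    """True if the diff is strictly under the line budget."""
    return count_changed_lines(numstat) < max_lines
```

Counting additions and deletions together is a conservative choice, since both have to be read during review. A run that blows the budget is a signal to split the task, not to review harder.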

The one-minute gut check

Before you write any spec, ask yourself:

If an over-eager intern with good technical skills but no context on my project did this task, and I only had three minutes to review the result, could I confidently decide whether to merge it?

If the answer is yes, the task fits. If the answer is "I would need to read every line" or "I would need to understand the design decisions first," the task is too open-ended. Tighten the spec or pick a different task.

Try it on your own backlog tonight. Pick one task that outgrew a single AI coding session. Write a one-paragraph spec, run it through Ralph Workflow, and ask yourself tomorrow morning: would you merge the output?

Ralph Workflow is free and open source. It runs the coding agents you already have on your own machine.

Codeberg (primary repo) — ⭐ star, watch, fork
GitHub (mirror)
First-task guide — what task to pick and how to judge the result
Quick install: pipx install ralph-workflow
