cost-optimization model-routing ai-development autonomous-coding workflow

AI Cost Model Routing: Stop Paying Frontier Prices for Grunt Work

Most AI coding tools burn expensive tokens on boilerplate and planning. Cost model routing — using cheap models for analysis and strong models for implementation — cuts costs by 60% or more without sacrificing quality. Here's how it works and why your workflow tool should support it.


Developers who run AI coding agents overnight already know the pattern. Your morning invoice arrives and you realize half the spend went on tasks that any model could have done. The planning pass burned Claude Opus tokens reading your README. The fix-up loop burned frontier model credits on whitespace formatting. The verification gate used the most expensive model in your fleet to run a diff.

This is not a design accident — it is a structural defect in how most AI coding tools work. They assume every phase of a coding run deserves the same model. That assumption is wrong. It is also expensive.

The real cost structure of an unattended coding run

An unattended coding run has distinct phases with wildly different model requirements:

| Phase | What happens | Required intelligence |
|---|---|---|
| Planning | Read repo structure, parse PROMPT.md, produce a step-by-step plan | Medium: needs to understand code but not write it |
| Analysis | Read relevant source files, check conventions, grep for patterns | Low: information retrieval, not creative work |
| Implementation | Write the actual code, design module boundaries, handle edge cases | High: this is where quality lives or dies |
| Verification | Run tests, check invariants, validate output shape | Low to Medium: pass/fail decisions with small context windows |
| Fix-up | Read test failures, make targeted corrections | Medium: narrow scope, clear feedback signal |

In a typical run, the implementation phase represents about 30% of the total token spend. The other 70% — planning, analysis, verification, fix-up — burns frontier model credits on work that does not actually need frontier capability.

If you spend $10 on a coding run, roughly $7 of that goes to phases that a cheaper model could have handled. At one run per night, that is over $200/month spent at frontier prices on work that does not need them.

How cost model routing works

The idea is straightforward: assign different models to different phases of the workflow based on what each phase actually needs.

# .ralph/config.yml
models:
  plan: "openrouter/deepseek-v4-flash"    # $0.12/M — smart enough for planning
  analyze: "minimax/minimax-m2.7"         # $0.20/M — fast, cheap, reliable
  implement: "openrouter/deepseek-v4-pro" # $1.00/M — the heavy lifter
  verify: "openrouter/deepseek-v4-flash"  # $0.12/M — reviewing, not creating
  fix: "openrouter/deepseek-v4-flash"     # $0.12/M — targeted corrections
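
To make the idea concrete, here is a minimal Python sketch of what phase-level routing amounts to: a lookup from phase name to model ID, done once per phase. The phase names and model IDs are copied from the config above; `model_for_phase`, `run_pipeline`, and the `call_model` callback are hypothetical names used only for illustration, not Ralph Workflow's actual internals.

```python
# Phase names and model IDs copied from the example config above.
MODEL_MAP = {
    "plan": "openrouter/deepseek-v4-flash",
    "analyze": "minimax/minimax-m2.7",
    "implement": "openrouter/deepseek-v4-pro",
    "verify": "openrouter/deepseek-v4-flash",
    "fix": "openrouter/deepseek-v4-flash",
}


def model_for_phase(phase: str, model_map: dict[str, str]) -> str:
    """Return the model configured for a phase; fail loudly on unknown phases."""
    try:
        return model_map[phase]
    except KeyError:
        raise ValueError(f"no model configured for phase {phase!r}") from None


def run_pipeline(phases, model_map, call_model):
    """Run phases in order, handing each phase's configured model to call_model.

    call_model is a hypothetical callback (model_id, phase) -> result that
    stands in for whatever actually talks to the provider.
    """
    return [call_model(model_for_phase(p, model_map), p) for p in phases]


if __name__ == "__main__":
    # Stub callback: just report which model each phase would use.
    for line in run_pipeline(
        ["plan", "analyze", "implement", "verify", "fix"],
        MODEL_MAP,
        lambda model, phase: f"{phase} -> {model}",
    ):
        print(line)
```

Raising on an unknown phase, rather than silently falling back to a default model, is a deliberate choice in this sketch: a silent fallback to the most expensive model is exactly the kind of hidden spend routing is meant to prevent.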

A 100K-token run under this configuration costs roughly:

  • Planning: 25K tokens × $0.12/M tokens = $0.003
  • Analysis: 15K tokens × $0.20/M tokens = $0.003
  • Implementation: 30K tokens × $1.00/M tokens = $0.03
  • Verification: 15K tokens × $0.12/M tokens = $0.0018
  • Fix-up: 15K tokens × $0.12/M tokens = $0.0018

Total: ~$0.04 per run. The same 100K tokens on a single frontier model (assuming a Claude Opus-class rate of $15/M tokens) would cost roughly $1.50. That is a difference of about 38x, and the cheapest models never touched the implementation phase, which is where quality matters most.
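
The arithmetic above is easy to check yourself. The short Python sketch below uses only the token counts and per-million prices from the breakdown, plus the flat $15/M comparison rate; the variable and function names are illustrative.

```python
# (tokens, price in $ per million tokens), taken from the breakdown above.
PHASES = {
    "plan": (25_000, 0.12),
    "analyze": (15_000, 0.20),
    "implement": (30_000, 1.00),
    "verify": (15_000, 0.12),
    "fix": (15_000, 0.12),
}
FLAT_PRICE = 15.00  # $ per million tokens for the single-model comparison


def cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of a token count at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


routed = sum(cost(tokens, price) for tokens, price in PHASES.values())
total_tokens = sum(tokens for tokens, _ in PHASES.values())
single = cost(total_tokens, FLAT_PRICE)

print(f"routed: ${routed:.4f}")           # routed: $0.0396
print(f"single: ${single:.2f}")           # single: $1.50
print(f"ratio:  {single / routed:.1f}x")  # ratio:  37.9x
```

Swapping in your own token counts and your providers' current prices gives a quick estimate before you change any config. Note that this sketch, like the breakdown, ignores the split between input and output token pricing.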

Why this works in practice

The objection is obvious: "If I use cheaper models for planning, won't the plan be worse?" Only if the cheap model actually produces worse plans, and for most tasks it does not.

Planning is reading comprehension, not creative generation. A good plan for refactoring a Python module looks the same whether Claude Opus wrote it or DeepSeek Flash wrote it. The quality of the plan depends on the quality of your PROMPT.md — not the model that reads it.

The implementation phase is different. That is where the model needs to understand edge cases, design defensively, and produce code you would actually merge. That is where frontier models earn their premium.

Cost model routing is not about using cheap models everywhere. It is about using the right model for each phase, and for most phases the right model is not the most expensive one.

The vendor lock-in problem

Here is why most AI coding tools will never ship this feature: it is bad for their business.

If you run Claude Code, every token goes through Anthropic. If you run Cursor, every request goes through OpenAI (or Anthropic, depending on your plan). If you run Codex CLI, you are locked into OpenAI's models. These tools have no incentive to help you route work to cheaper competitors.

The model provider and the workflow tool are the same company. That is a conflict of interest. The workflow tool wants you to use the cheapest model that gets the job done. The model provider wants you to use the most expensive model they sell. When the same entity wears both hats, cheap-model routing will always lose.

This is the structural reason Ralph Workflow exists as a vendor-neutral, open-source orchestrator. It does not sell tokens. It does not care which model you use. The routing decision belongs in the repo config, where the team that pays the bill actually owns it.

When cost model routing pays off

Cost model routing makes sense when:

You run unattended coding tasks regularly. The cost difference compounds: at one $10 run per night, roughly $200/month goes to phases that a cheaper model could handle.

Your tasks have distinct phases. If your workflow has planning → implementation → verification → fix-up, every phase other than implementation is an opportunity to save.

You use multiple AI providers already. If you already have API keys for DeepSeek, Minimax, and Anthropic, adding routing requires no additional setup.

You care about cost predictability. Budget forecasting is easier when $7 out of every $10 in token spend is known to be non-frontier.

Cost model routing does not make sense when you run one-off, interactive coding sessions inside an IDE. The overhead of switching models mid-session is not worth it for a single request. This is for autonomous workflow runs — the kind you start, walk away from, and review in the morning.

Setting this up in Ralph Workflow

Ralph Workflow ships with model routing built in. You define the model map once in your repo config, and every phase of every run picks the right model automatically.

pipx install ralph-workflow
ralph --pipeline build

That is the entire setup. The routing happens inside the workflow runner — you do not need to manage model selection per-phase or remember which CLI uses which model. Write your PROMPT.md, start the run, and Ralph Workflow routes each phase to the model you configured.

The config file is checked into your repo, so your whole team gets the same cost structure. When DeepSeek drops prices or a new model ships, you update one YAML file and every future run picks it up. No per-engineer configuration drift.
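
Because the model map lives in one shared file, it is worth checking that every phase actually has a model before a change is merged. The Python sketch below is one way to do that, assuming the config has already been parsed into a dict with the `models` layout shown earlier. The `validate_model_map` function and the `REQUIRED_PHASES` list are hypothetical; they are not part of Ralph Workflow.

```python
# Phase names assumed from the example config earlier in this article.
REQUIRED_PHASES = ("plan", "analyze", "implement", "verify", "fix")


def validate_model_map(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the map looks complete."""
    models = config.get("models")
    if not isinstance(models, dict):
        return ["missing top-level 'models' mapping"]

    problems = []
    # Every required phase needs a non-empty model ID string.
    for phase in REQUIRED_PHASES:
        value = models.get(phase)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"phase '{phase}' has no model assigned")
    # Flag typos such as 'implment' that would otherwise be ignored.
    for phase in models:
        if phase not in REQUIRED_PHASES:
            problems.append(f"unknown phase '{phase}'")
    return problems


if __name__ == "__main__":
    parsed = {"models": {"plan": "openrouter/deepseek-v4-flash", "implment": "x"}}
    for problem in validate_model_map(parsed):
        print(problem)
```

A check like this could run in CI or a pre-commit hook, so a price-driven model swap cannot accidentally leave a phase unconfigured.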

Where this is going

Cost model routing today is about phase-level assignment: planning gets one model, implementation gets another, verification gets a third. The next step is more granular. Some future runs will split implementation across models too — route boilerplate-heavy files to cheap models and logic-heavy files to frontier models, all within a single run.
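
As a rough illustration of what file-level routing could look like, the Python sketch below sends files matching a list of boilerplate-looking path patterns to a cheap model and everything else to a strong one. This is purely an assumption about one possible design: the patterns, the `route_file` function, and the model labels are all hypothetical, and a real implementation would likely need a better signal than file paths.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for files that tend to be boilerplate-heavy.
BOILERPLATE_PATTERNS = (
    "*.json",
    "*.yml",
    "*.yaml",
    "*/migrations/*",
    "*/__init__.py",
)


def route_file(path: str, cheap_model: str, strong_model: str,
               patterns: tuple[str, ...] = BOILERPLATE_PATTERNS) -> str:
    """Pick a model for one file: cheap if the path looks like boilerplate."""
    if any(fnmatchcase(path, pattern) for pattern in patterns):
        return cheap_model
    return strong_model


if __name__ == "__main__":
    for path in ("app/migrations/0001_initial.py",
                 "config/settings.yaml",
                 "src/billing/invoice.py"):
        print(path, "->", route_file(path, "cheap", "strong"))
```

Defaulting to the strong model when no pattern matches keeps the failure mode conservative: a misclassified file costs a little more rather than getting weaker code.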

The principle does not change: tokens are not free, and not everything deserves the most expensive model on the market. The tools that help you control this will outlast the tools that profit from ignoring it.


Ralph Workflow is free, open source (AGPL-3.0), and vendor-neutral. Get started on Codeberg. A mirror is also available on GitHub.
