AI Cost Model Routing: Stop Paying Frontier Prices for Grunt Work
Most AI coding tools burn expensive tokens on boilerplate and planning. Cost model routing — using cheap models for analysis and strong models for implementation — cuts costs by 60% or more without sacrificing quality. Here's how it works and why your workflow tool should support it.
Codeberg-first
Ralph Workflow is free and open source. Inspect the primary repo on Codeberg before you install — or jump to the GitHub mirror.
Developers who run AI coding agents overnight already know the pattern. Your morning invoice arrives and you realize half the spend went on tasks that any model could have done. The planning pass burned Claude Opus tokens reading your README. The fix-up loop burned frontier model credits on whitespace formatting. The verification gate used the most expensive model in your fleet to run a diff.
This is not a design accident — it is a structural defect in how most AI coding tools work. They assume every phase of a coding run deserves the same model. That assumption is wrong. It is also expensive.
The real cost structure of an unattended coding run
An unattended coding run has distinct phases with wildly different model requirements:
| Phase | What happens | Required intelligence |
|---|---|---|
| Planning | Read repo structure, parse PROMPT.md, produce a step-by-step plan | Medium — needs to understand code but not write it |
| Analysis | Read relevant source files, check conventions, grep for patterns | Low — information retrieval, not creative work |
| Implementation | Write the actual code, design module boundaries, handle edge cases | High — this is where quality lives or dies |
| Verification | Run tests, check invariants, validate output shape | Low to Medium — pass/fail decisions with small context windows |
| Fix-up | Read test failures, make targeted corrections | Medium — narrow scope, clear feedback signal |
In a typical run, the implementation phase represents about 30% of the total token spend. The other 70% — planning, analysis, verification, fix-up — burns frontier model credits on work that does not actually need frontier capability.
If you spend $10 on a single-model coding run, roughly $7 of that goes to phases that a cheaper model could have handled. At one run per night, that is about $210/month of spend that routing could cut to a small fraction, since the cheaper models still cost something.
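The arithmetic behind that monthly figure can be sketched in a few lines. The function name is hypothetical, and the inputs are the illustrative numbers from the text ($10 per run, a 70% non-implementation share, 30 runs per month), not measurements.

```python
# Illustrative figures from the text above; the function name is hypothetical.
def avoidable_monthly_spend(cost_per_run: float,
                            non_implementation_share: float,
                            runs_per_month: int) -> float:
    """Upper bound on monthly spend that goes to non-implementation phases."""
    return cost_per_run * non_implementation_share * runs_per_month


monthly = avoidable_monthly_spend(cost_per_run=10.0,
                                  non_implementation_share=0.70,
                                  runs_per_month=30)
print(f"${monthly:.2f} per month")
```

Treat the result as an upper bound: routing does not remove this spend, it shifts it to models that charge far less per token.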
How cost model routing works
The idea is straightforward: assign different models to different phases of the workflow based on what each phase actually needs.
# .ralph/config.yml
models:
  plan: "openrouter/deepseek-v4-flash"      # $0.12/M, smart enough for planning
  analyze: "minimax/minimax-m2.7"           # $0.20/M, fast, cheap, reliable
  implement: "openrouter/deepseek-v4-pro"   # $1.00/M, the heavy lifter
  verify: "openrouter/deepseek-v4-flash"    # $0.12/M, reviewing, not creating
  fix: "openrouter/deepseek-v4-flash"       # $0.12/M, targeted corrections
A 100K-token run under this configuration costs roughly:
- Planning: 25K tokens × $0.12/M tokens = $0.003
- Analysis: 15K tokens × $0.20/M tokens = $0.003
- Implementation: 30K tokens × $1.00/M tokens = $0.03
- Verification: 15K tokens × $0.12/M tokens = $0.0018
- Fix-up: 15K tokens × $0.12/M tokens = $0.0018
Total: ~$0.04 per run. The same 100K tokens sent to a single frontier model at a flat $15/M tokens (roughly Claude Opus input-token pricing; output tokens are billed higher) would cost about $1.50. That is roughly a 38x difference, and the cheap models never touched the implementation phase, which is where quality matters most.
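The per-phase breakdown above can be reproduced with a short script. This is a minimal sketch: the token counts and prices are copied from the example, the flat $15/M single-model price is the comparison figure used in the text, and the dictionary layout and function name are illustrative rather than anything Ralph Workflow exposes.

```python
# Token counts and per-million-token prices copied from the example above.
# The dictionary layout and function name are illustrative assumptions.
PHASES = {
    "plan":      (25_000, 0.12),
    "analyze":   (15_000, 0.20),
    "implement": (30_000, 1.00),
    "verify":    (15_000, 0.12),
    "fix":       (15_000, 0.12),
}


def phase_cost(tokens: int, usd_per_million: float) -> float:
    """Cost in USD of sending `tokens` tokens at the given per-million price."""
    return tokens / 1_000_000 * usd_per_million


routed_total = sum(phase_cost(t, p) for t, p in PHASES.values())
total_tokens = sum(t for t, _ in PHASES.values())
single_model_total = phase_cost(total_tokens, 15.00)  # flat frontier price

for name, (tokens, price) in PHASES.items():
    print(f"{name:<10} ${phase_cost(tokens, price):.4f}")
print(f"routed     ${routed_total:.4f}")
print(f"single     ${single_model_total:.2f}")
print(f"ratio      {single_model_total / routed_total:.1f}x")
```

With these inputs the routed total comes to $0.0396 against $1.50 for the single-model run, a ratio just under 38x. Swapping in your own token counts and current provider prices is the quickest way to check whether routing is worth it for your runs.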
Why this works in practice
The objection is obvious: "If I use cheaper models for planning, won't the plan be worse?" For most tasks, the answer is no.
Planning is mostly reading comprehension, not creative generation. For a well-specified task such as refactoring a Python module, a good plan looks much the same whether Claude Opus wrote it or DeepSeek Flash wrote it. The quality of the plan depends more on the quality of your PROMPT.md than on the model that reads it.
The implementation phase is different. That is where the model needs to understand edge cases, design defensively, and produce code you would actually merge. That is where frontier models earn their premium.
Cost model routing is not about using cheap models everywhere. It is about using the right model for each phase, and for most phases the right model is not the most expensive one.
The vendor lock-in problem
Here is why most AI coding tools will never ship this feature: it is bad for their business.
If you run Claude Code, every token goes through Anthropic. If you run Cursor, every request goes through OpenAI (or Anthropic, depending on your plan). If you run Codex CLI, you are locked into OpenAI's models. These tools have little incentive to help you route work to cheaper competitors.
In most of these cases, the model provider and the workflow tool are the same company. That is a conflict of interest. A workflow tool should want you to use the cheapest model that gets the job done. A model provider wants you to use the most expensive model it sells. When the same entity wears both hats, cheap-model routing will always lose.
This is the structural reason Ralph Workflow exists as a vendor-neutral, open-source orchestrator. It does not sell tokens. It does not care which model you use. The routing decision belongs in the repo config, where the team that pays the bill actually owns it.
When cost model routing pays off
Cost model routing makes sense when:
- You run unattended coding tasks regularly. The cost difference compounds: at the $10-per-run figure above, one run per night puts roughly $200/month of spend within reach of cheaper models.
- Your tasks have distinct phases. If your workflow has planning → implementation → verification → fix-up, every phase except implementation is an opportunity to save.
- You use multiple AI providers already. If you already have API keys for DeepSeek, Minimax, and Anthropic, adding routing needs almost no additional setup.
- You care about cost predictability. Budget forecasting is easier when you know that roughly 70% of each run's tokens go to non-frontier models.
Cost model routing does not make sense when you run one-off, interactive coding sessions inside an IDE. The overhead of switching models mid-session is not worth it for a single request. This is for autonomous workflow runs — the kind you start, walk away from, and review in the morning.
Setting this up in Ralph Workflow
Ralph Workflow ships with model routing built in. You define the model map once in your repo config, and every phase of every run picks the right model automatically.
pipx install ralph-workflow
ralph --pipeline build
That is the entire setup. The routing happens inside the workflow runner — you do not need to manage model selection per-phase or remember which CLI uses which model. Write your PROMPT.md, start the run, and Ralph Workflow routes each phase to the model you configured.
The config file is checked into your repo, so your whole team gets the same cost structure. When DeepSeek drops prices or a new model ships, you update one YAML file and every future run picks it up. No per-engineer configuration drift.
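To make the idea of a single checked-in model map concrete, here is a minimal sketch of a phase-to-model lookup. It is an assumption about how such routing could work, not Ralph Workflow's actual internals: `MODEL_MAP` and `model_for_phase` are hypothetical names, and the map is written as a Python dict rather than parsed from the YAML file.

```python
# Hypothetical sketch: the names below are illustrative, not Ralph Workflow's API.

# Mirrors the `models:` map from the example config earlier in this post.
MODEL_MAP = {
    "plan": "openrouter/deepseek-v4-flash",
    "analyze": "minimax/minimax-m2.7",
    "implement": "openrouter/deepseek-v4-pro",
    "verify": "openrouter/deepseek-v4-flash",
    "fix": "openrouter/deepseek-v4-flash",
}


def model_for_phase(phase: str, model_map: dict[str, str] = MODEL_MAP) -> str:
    """Return the configured model for a phase, failing loudly on unknown phases."""
    try:
        return model_map[phase]
    except KeyError:
        known = ", ".join(sorted(model_map))
        raise KeyError(
            f"no model configured for phase {phase!r}; known phases: {known}"
        ) from None


for phase in ("plan", "implement", "verify"):
    print(f"{phase:<10} -> {model_for_phase(phase)}")
```

Raising on an unknown phase, instead of silently falling back to a default, is a deliberate choice in this sketch: a typo in a phase name should stop the run rather than quietly route work to the most expensive model.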
Where this is going
Cost model routing today is about phase-level assignment: planning gets one model, implementation gets another, verification gets a third. The next step is more granular. Some future runs will split implementation across models too — route boilerplate-heavy files to cheap models and logic-heavy files to frontier models, all within a single run.
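One way file-level routing might look is sketched below. Everything here is a hypothetical illustration: the control-flow-density heuristic, the 0.15 threshold, and the function names are assumptions made for the example, not a description of any shipped or planned feature.

```python
import re

# Hypothetical heuristic: treat lines that open a control-flow block as "logic".
CONTROL_FLOW = re.compile(r"^\s*(if|elif|else|for|while|try|except|match|case)\b")


def logic_density(source: str) -> float:
    """Fraction of non-blank, non-comment lines that start a control-flow block."""
    lines = [
        ln for ln in source.splitlines()
        if ln.strip() and not ln.strip().startswith("#")
    ]
    if not lines:
        return 0.0
    branching = sum(1 for ln in lines if CONTROL_FLOW.match(ln))
    return branching / len(lines)


def model_for_file(source: str, cheap_model: str, frontier_model: str,
                   threshold: float = 0.15) -> str:
    """Pick the frontier model only when the file looks logic-heavy."""
    if logic_density(source) >= threshold:
        return frontier_model
    return cheap_model


settings = "NAME = 'demo'\nPORT = 8080\nDEBUG = False\n"
print(model_for_file(settings, "cheap-model", "frontier-model"))
```

A real implementation would need a better signal than keyword counting, but the shape stays the same: score each file, compare against a threshold, and pick a model per file rather than per phase.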
The principle does not change: tokens are not free, and not everything deserves the most expensive model on the market. The tools that help you control this will outlast the tools that profit from ignoring it.
Ralph Workflow is free, open source (AGPL-3.0), and vendor-neutral. Get started on Codeberg. A mirror is also available on GitHub.