offline local-llm privacy ollama autonomous-coding

Can You Actually Run AI Coding Agents Offline? A Practical Guide to Local LLM Development

Air-gapped development with Ollama, local models, and coding agents that don't phone home. What works, what breaks, and how to build an offline pipeline that produces real software.

Codeberg-first

Ralph Workflow is free and open source. Inspect the primary repo on Codeberg before you install — or jump to the GitHub mirror.

Most AI coding tools assume you're online. With Claude, ChatGPT, or Copilot, your prompts and the code around them go through someone else's datacenter. That's fine for open-source side projects. It's not fine when you work on proprietary code, HIPAA data, defense contracts, or infrastructure that can't leave your network.

So the question comes up constantly: can you actually do serious coding with a local LLM and no internet?

The short answer: yes, and the workflow layer matters more than the model.

What works (better than you'd expect)

Ollama + Continue + VS Code is the current sweet spot. Run a 14B–32B parameter model on an RTX 4090 or M-series MacBook and you get competent autocomplete, decent function generation, and solid refactoring. The models that matter:

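| Model | Size | Strengths | RAM needed (approx.) |
| --- | --- | --- | --- |
| Codestral 22B | 22B | Fill-in-middle, function generation | ~16 GB |
| DeepSeek Coder V2 | 16B | Reasoning, architecture | ~12 GB |
| Qwen 2.5 Coder | 14B / 32B | Tool calling, long context | 10–24 GB |
| Llama 4 Scout | 17B active (109B total, mixture-of-experts) | General coding + explanation | Well above 12 GB: all 109B weights must be loaded, roughly 60 GB or more at 4-bit |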

For most tasks shorter than 200 lines, you won't notice a meaningful difference from cloud models. The gap shows up in long-context refactoring and complex multi-file orchestration — which is where the workflow layer earns its keep.

What breaks (and how to fix it)

1. Tool calling degrades

Local models are worse at tool use than Claude or GPT-4. They'll drift off-spec, call the wrong tool, or hallucinate parameters. The fix: phase gates. Break your run into analysis → plan → implement → verify. Each phase produces a concrete artifact that gate-checks before the next one starts. If the agent drifts in implementation, the verification phase catches it because it reads the plan artifact, not the agent's memory.
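The phase-gate idea can be sketched in a few lines. This is a minimal illustration, not Ralph Workflow's actual implementation: `Phase`, `run_pipeline`, and `write_plan` are hypothetical names, and the stand-in phase writes a fixed file where a real one would prompt the local model.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Phase:
    name: str
    run: Callable[[Path], None]   # writes its artifact into the workspace
    artifact: str                 # file name the phase must produce
    gate: Callable[[str], bool]   # check applied to the artifact's text


def run_pipeline(phases: list[Phase], workspace: Path) -> list[str]:
    """Run phases in order and stop at the first artifact that fails its gate."""
    completed = []
    for phase in phases:
        phase.run(workspace)
        path = workspace / phase.artifact
        # The gate reads the artifact on disk, not the agent's conversation state.
        if not path.exists() or not phase.gate(path.read_text()):
            raise RuntimeError(f"gate failed after phase '{phase.name}'")
        completed.append(phase.name)
    return completed


def write_plan(workspace: Path) -> None:
    # Stand-in for a model call (assumption): a real phase would prompt the model.
    (workspace / "plan.md").write_text("## Steps\n1. Rename helper\n")
```

A plan phase might then be declared as `Phase("plan", write_plan, "plan.md", lambda text: "## Steps" in text)`. Because every gate works from a file, a later verification phase can compare the diff against `plan.md` even if the model has lost track of the original request.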

2. Context window exhaustion

Frontier cloud models offer context windows of 200K tokens or more. Local models on consumer hardware realistically top out around 32K–64K, because the context cache competes with the model weights for the same memory. That means you can't throw your entire codebase at the agent and ask it to refactor. The fix: scope phases to single files or small modules. The Ralph Workflow phase-gate architecture was designed for exactly this constraint: each phase gets a bounded input, produces a bounded output, and hands off cleanly.
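One way to keep each phase inside the window is to batch files against a token budget before the run starts. The sketch below is illustrative only: `estimate_tokens` and `batch_files` are hypothetical helpers, and the four-characters-per-token figure is a rough heuristic, not a real tokenizer.

```python
CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers vary by model


def estimate_tokens(text: str) -> int:
    """Cheap upper-ish estimate of how many tokens a string will consume."""
    return len(text) // CHARS_PER_TOKEN + 1


def batch_files(sizes: dict[str, int], budget: int) -> list[list[str]]:
    """Greedily group files so each group's estimated tokens fit the budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    used = 0
    for name, tokens in sizes.items():
        if tokens > budget:
            raise ValueError(f"{name} alone exceeds the budget; split it first")
        # Start a new batch when adding this file would overflow the current one.
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += tokens
    if current:
        batches.append(current)
    return batches
```

Leave headroom when you pick the budget: the prompt, the plan artifact, and the model's own output all share the same window as the source files.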

3. No web search / browsing

Your online agent finds the docs, reads the API reference, checks StackOverflow. The offline agent can't. The fix: pre-load relevant docs into the workspace before starting the run. Copy API reference files, relevant documentation, and example code into the project directory. The agent can search local files as well as it can search the web — it just needs the right files.
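Once the docs are in the workspace, even a plain text search gives the agent something to call in place of a web lookup. The following is a minimal sketch under that assumption; `search_docs` is a hypothetical helper, and the list of file suffixes is only an example.

```python
from pathlib import Path


def search_docs(root: Path, term: str,
                suffixes: tuple[str, ...] = (".md", ".txt", ".rst")) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, line) for each case-insensitive match."""
    hits = []
    needle = term.lower()
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        lines = path.read_text(errors="ignore").splitlines()
        for number, line in enumerate(lines, start=1):
            if needle in line.lower():
                hits.append((str(path.relative_to(root)), number, line.strip()))
    return hits
```

Returning the file and line number lets the agent quote a small excerpt in its prompt and keep the full reference out of the context window.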

4. Model selection gets stuck on one size

The temptation is to run everything through the biggest local model you can fit. But a 32B reasoning model is overkill for "rename this variable," and a 7B speed model is underpowered for "design the database schema." The fix: cost-aware model routing, even offline, where the cost is latency and memory instead of dollars. Route simple tasks to smaller, faster models and reserve the big ones for architecture and review.
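Routing can start as a lookup table. In this sketch the task categories, the `pick_model` helper, and the mapping itself are assumptions for illustration; the model tags follow Ollama's `qwen2.5-coder` naming, and you would substitute whatever you have pulled locally.

```python
# Hypothetical routing table: task kind -> local model tag.
ROUTES = {
    "rename": "qwen2.5-coder:7b",
    "format": "qwen2.5-coder:7b",
    "implement": "qwen2.5-coder:14b",
    "design": "qwen2.5-coder:32b",
    "review": "qwen2.5-coder:32b",
}
DEFAULT_MODEL = "qwen2.5-coder:14b"  # middle tier for anything unclassified


def pick_model(task_kind: str) -> str:
    """Map a task kind to a model tag, falling back to the middle tier."""
    return ROUTES.get(task_kind.strip().lower(), DEFAULT_MODEL)
```

Keep in mind that switching models on a single GPU means unloading one and loading another, so it usually pays to group tasks by tier and not alternate on every call.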

The workflow layer is the differentiator

Here's the pattern that most people miss: the quality gap between a local 14B model and Claude isn't mostly the model. It's that Claude Code has a built-in workflow (permission system, conversation management, tool orchestration) and your local model is just an API endpoint.

When you add a phase-gated workflow on top of a local model — one that enforces analysis artifacts, planning documents, and verification checkpoints — the local model's output approaches the quality of an attended cloud session. Not because the model got smarter, but because the workflow prevented the failure modes that eat 80% of the token budget.

Try it tonight on your own hardware

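  1. Install Ollama while you still have a connection (Linux script shown; on macOS or Windows use the installer from ollama.com): curl -fsSL https://ollama.com/install.sh | sh
  2. Pull a coding model, also while online, so the weights are on disk before you disconnect: ollama pull qwen2.5-coder:14b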
  3. Write a one-paragraph spec for a task you've been putting off
  4. Run it through a phase-gated workflow and see what comes out
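Before wiring up a full workflow, it helps to confirm that the model answers over Ollama's local HTTP API with the network unplugged. The sketch below assumes Ollama is listening on its default port, 11434, and that the model from step 2 has been pulled; `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("qwen2.5-coder:14b", "Write a function that reverses a string.")` with Wi-Fi off is a quick check that nothing in the loop depends on the network.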

The first run will surprise you. Not because the model is amazing — because the structure prevents the mistakes you didn't know you were accepting from attended cloud sessions.


Build your offline pipeline: First-task guide →

Primary repo (Codeberg): RalphWorkflow/Ralph-Workflow
GitHub mirror: Ralph-Workflow

Ralph Workflow is vendor-neutral — it works with any API endpoint, including local Ollama instances. No cloud dependency required.

