
Complexity Detection & Model Routing

Pilot classifies every incoming task by complexity before execution begins. This classification drives three routing decisions: which model handles the task, how long it’s allowed to run, and how much reasoning effort the model applies. The result is lower cost for simple work and deeper reasoning for architectural changes — automatically, with no manual intervention.

The Five Complexity Levels

Every task maps to one of five levels. Detection runs against the combined issue title and description.

| Level | Description | Example Tasks |
| --- | --- | --- |
| Trivial | Minimal, mechanical changes | Fix typo, add log line, rename variable, bump version, lint fix |
| Simple | Small, focused changes | Add field, minor fix, update config, add test case |
| Medium | Standard feature work | New endpoint, component, integration, moderate changes |
| Complex | Architectural changes | Refactor, migration, redesign, multi-file restructure |
| Epic | Too large for a single PR | Multi-phase projects, roadmaps, system overhauls |

How Detection Works

Detection runs in DetectComplexity() and follows a strict priority order:

1. Epic check → Is this too large for one execution cycle?
2. Trivial match → Does it match a trivial keyword pattern?
3. Complex match → Does it match an architectural keyword pattern?
4. Simple match → Does it match a small-change keyword pattern?
5. Word count → Fall back to description length heuristics
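The priority order above can be sketched as follows. This is an illustrative sketch, not Pilot's actual source: the function name, the abbreviated pattern lists, and the simplified epic check (tag only) are all assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fencedCode matches fenced code blocks so they can be stripped before counting.
var fencedCode = regexp.MustCompile("(?s)`{3}.*?`{3}")

// detectComplexity sketches the five-step priority order. Pattern lists are
// abbreviated for illustration; the full lists appear below.
func detectComplexity(title, description string) string {
	text := strings.ToLower(title + " " + description)
	contains := func(patterns ...string) bool {
		for _, p := range patterns {
			if strings.Contains(text, p) {
				return true
			}
		}
		return false
	}

	switch {
	case strings.Contains(text, "[epic]"): // 1. epic check (simplified: tag only)
		return "epic"
	case contains("fix typo", "rename", "bump version"): // 2. trivial patterns
		return "trivial"
	case contains("refactor", "migration", "redesign"): // 3. complex patterns
		return "complex"
	case contains("add field", "minor fix", "update config"): // 4. simple patterns
		return "simple"
	}
	// 5. word-count fallback, with code blocks stripped first
	switch words := len(strings.Fields(fencedCode.ReplaceAllString(description, ""))); {
	case words < 10:
		return "simple"
	case words < 50:
		return "medium"
	default:
		return "complex"
	}
}

func main() {
	fmt.Println(detectComplexity("Fix typo in README", ""))           // trivial
	fmt.Println(detectComplexity("Refactor the auth middleware", "")) // complex
}
```

Note that the trivial check runs before the complex check, so a title matching both (e.g. containing both "rename" and "refactor") classifies as trivial by priority.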

Pattern Matching

Each level has a set of keyword patterns matched against the lowercased title + description:

Trivial patterns — mechanical changes that need minimal reasoning:

fix typo, typo, add log, add logging, update comment, fix comment, rename variable, rename function, rename, remove unused, delete unused, bump version, update version, fix import, add import, fix whitespace, formatting, lint fix

Simple patterns — small but require some thought:

add field, add property, add parameter, add argument, small fix, minor fix, quick fix, update config, change config, update constant, add constant, add test case, fix test

Complex patterns — architectural consideration required:

refactor, rewrite, redesign, migrate, migration, architecture, restructure, overhaul, system, database schema, api design, multi-file, cross-cutting

Word Count Fallback

When no keyword patterns match, the description length (with code blocks stripped) determines complexity:

| Word Count | Classification |
| --- | --- |
| < 10 words | Simple |
| 10–49 words | Medium |
| ≥ 50 words | Complex |

Code blocks are stripped before word counting to avoid inflated counts from embedded snippets. A 200-word description with a 500-line code sample is classified by the 200 prose words, not 700 total.
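The stripping step can be sketched with a fenced-block regex. This is a hedged illustration under the assumption that fenced blocks delimit code; Pilot's actual stripping logic may differ.

````go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fencedCode matches fenced code blocks, fences included.
var fencedCode = regexp.MustCompile("(?s)`{3}.*?`{3}")

// proseWordCount counts words after removing fenced code, so embedded
// snippets don't inflate the count. Illustrative, not Pilot's source.
func proseWordCount(description string) int {
	return len(strings.Fields(fencedCode.ReplaceAllString(description, "")))
}

func main() {
	desc := "Add retry logic to the HTTP client.\n```go\nfor i := 0; i < 3; i++ { retry() }\n```"
	fmt.Println(proseWordCount(desc)) // 7 — only the prose words count
}
````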

Epic Detection

Epic detection uses combination rules — no single signal (except [epic] tag) triggers epic mode alone. This prevents false positives from normal implementation plans that happen to mention phases.

Signals Collected

| Signal | How It’s Measured |
| --- | --- |
| [epic] tag | Regex match in title |
| Phase count | Headers matching Phase N, Stage N, Part N, Milestone N |
| Checkbox count | Markdown checkboxes `- [ ]` or `- [x]` |
| Word count | Prose words after stripping code blocks |
| Epic keywords | Standalone words: epic, roadmap, multi-phase, milestone |
| Structural markers | Presence of ## headers, phase/stage/step in text |

Combination Rules

- [epic] tag → always epic
- 5+ phases → epic
- epic keywords + (3+ phases OR 5+ checkboxes OR 200+ words) → epic
- 7+ checkboxes + (200+ words OR 3+ phases) → epic
- 300+ words + structural markers + (5+ checkboxes OR 2+ phases) → epic

Three phases alone is not epic — that’s a normal implementation plan. The threshold is 5+ phases as a standalone signal, or 3+ phases combined with other structural indicators.
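The combination rules reduce to a single predicate over the collected signals. A minimal sketch, with illustrative type and field names that are not Pilot's actual code:

```go
package main

import "fmt"

// epicSignals mirrors the signal table above; field names are illustrative.
type epicSignals struct {
	hasTag     bool // [epic] in title
	phases     int  // Phase/Stage/Part/Milestone headers
	checkboxes int  // markdown checkboxes
	words      int  // prose words, code stripped
	keywords   bool // standalone epic/roadmap/multi-phase/milestone
	structure  bool // ## headers or phase/stage/step markers
}

// isEpic applies the combination rules in order; only the [epic] tag
// triggers epic mode on its own.
func isEpic(s epicSignals) bool {
	switch {
	case s.hasTag:
		return true
	case s.phases >= 5:
		return true
	case s.keywords && (s.phases >= 3 || s.checkboxes >= 5 || s.words >= 200):
		return true
	case s.checkboxes >= 7 && (s.words >= 200 || s.phases >= 3):
		return true
	case s.words >= 300 && s.structure && (s.checkboxes >= 5 || s.phases >= 2):
		return true
	}
	return false
}

func main() {
	fmt.Println(isEpic(epicSignals{phases: 3}))                 // false: a normal plan
	fmt.Println(isEpic(epicSignals{phases: 3, keywords: true})) // true: keywords + 3 phases
}
```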

What Happens When Epic Triggers

  1. Pilot sends the full issue to Claude for planning
  2. The response is parsed into 3–5 sequential subtasks
  3. Child issues are created on GitHub, each labeled pilot
  4. Parent issue moves to pilot-in-progress
  5. The poller picks up subtasks sequentially

Routing Tables

Model Routing

Maps complexity to the AI model used for execution.

| Complexity | Default Model | Pricing (input/output) | Rationale |
| --- | --- | --- | --- |
| Trivial | claude-haiku-4-5 | $0.80 / $4 per 1M tokens | Speed over depth — mechanical changes don’t need reasoning |
| Simple | claude-sonnet-4-6 | $3 / $15 per 1M tokens | Near-Opus quality at 40% lower cost |
| Medium | claude-sonnet-4-6 | $3 / $15 per 1M tokens | Standard feature work, near-Opus quality |
| Complex | claude-opus-4-6 | $5 / $25 per 1M tokens | Architectural work gets the most capable model |

Sonnet 4.6 is the default for simple/medium tasks — near-Opus quality at 40% lower cost. Complex tasks use Opus 4.6 for maximum reasoning depth. Epic tasks are decomposed before execution, so each subtask routes independently.

Timeout Routing

Prevents stuck tasks from running indefinitely. Calibrated per complexity level.

| Complexity | Default Timeout | Purpose |
| --- | --- | --- |
| Trivial | 5 minutes | Typo fixes shouldn’t take longer |
| Simple | 10 minutes | Small changes with room for quality gates |
| Medium | 30 minutes | Standard feature work including research phase |
| Complex | 60 minutes | Architectural changes with full context intelligence workflow |
| Fallback | 30 minutes | Used when timeout config is missing or parse fails |

Effort Routing

Controls how many tokens Claude uses when responding — the depth of reasoning applied to the task.

| Complexity | Effort Level | Behavior |
| --- | --- | --- |
| Trivial | low | Minimal tokens, fast responses — no deep analysis needed |
| Simple | medium | Balanced — enough reasoning for focused fixes |
| Medium | high | Standard Claude behavior — full reasoning chain |
| Complex | max | Deepest reasoning available — exhaustive analysis before acting |

Effort maps to Claude API’s output_config.effort parameter or the --effort flag in Claude Code.

Cost Examples

Real-world cost estimates based on typical token consumption per complexity level:

| Complexity | Typical Tokens (in/out) | Estimated Cost | Example Task |
| --- | --- | --- | --- |
| Trivial | ~2K / ~500 | ~$0.02 | Fix typo in README |
| Simple | ~15K / ~5K | ~$0.20 | Add a config field with validation |
| Medium | ~50K / ~20K | ~$0.75 | New API endpoint with tests |
| Complex | ~100K / ~50K | ~$1.50–$3.00 | Refactor authentication system |
| Epic (per subtask) | ~50K / ~20K | ~$0.75 each | Phase of a multi-part migration |

These are estimates. Actual costs depend on codebase size (more context = more input tokens), quality gate retries, and self-review passes. Enable budget limits to cap per-task spending.
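The raw token arithmetic behind these figures is straightforward; the table's estimates run higher than the raw product because they fold in the overheads listed above (context size, retries, self-review). A minimal sketch:

```go
package main

import "fmt"

// costUSD computes raw cost from token counts and per-million-token prices.
func costUSD(inTokens, outTokens, inPrice, outPrice float64) float64 {
	return inTokens/1e6*inPrice + outTokens/1e6*outPrice
}

func main() {
	// Complex task on claude-opus-4-6: 100K input at $5/1M, 50K output at $25/1M.
	// 0.1 × $5 + 0.05 × $25 = $0.50 + $1.25 — within the $1.50–$3.00 range.
	fmt.Printf("$%.2f\n", costUSD(100_000, 50_000, 5, 25)) // $1.75
}
```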

Behavioral Routing

Beyond model and timeout, complexity drives execution behavior:

| Behavior | Trivial | Simple | Medium | Complex | Epic |
| --- | --- | --- | --- | --- | --- |
| Context injection | Skip | Full | Full | Full | N/A (decomposed) |
| Research phase | Skip | Skip | Run | Run | N/A |
| Self-review | Run | Run | Run | Run | Per subtask |
| Quality gates | Run | Run | Run | Run | Per subtask |
| Task decomposition | No | No | No | If enabled | Always |

Context injection: Trivial tasks skip context initialization to avoid ~30s overhead on a task that takes 60s total.

Research phase: Medium and complex tasks run a parallel research phase before implementation, giving Claude codebase context before writing code.

Outcome-Based Escalation

Starting in v2.56.0, Pilot tracks execution outcomes per model and automatically escalates to a more capable model when failure rates exceed a threshold.

How It Works

Every task execution records an outcome in the model_outcomes SQLite table:

| Column | Description |
| --- | --- |
| task_type | Normalized task category (e.g., endpoint, refactor, test) |
| model | Model used for execution |
| outcome | success or failure |
| tokens | Total tokens consumed |
| duration | Execution wall time |

Before each execution, ShouldEscalate() checks the last 10 outcomes for the current task type and model. If the failure rate exceeds the configured threshold (default 30%), Pilot escalates to the next model tier:

claude-haiku-4-5 → claude-sonnet-4-6 → claude-opus-4-6

Escalation is logged with the task type, original model, escalated model, and observed failure rate.

Escalation is per task type — if Haiku fails frequently on refactor tasks but succeeds on typo tasks, only refactor tasks escalate. This keeps costs low for work the cheaper model handles well.
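The lookback check can be sketched as a small predicate. Names here are illustrative; the real implementation reads outcomes from the model_outcomes table rather than a slice.

```go
package main

import "fmt"

// nextTier maps each model to its escalation target.
var nextTier = map[string]string{
	"claude-haiku-4-5":  "claude-sonnet-4-6",
	"claude-sonnet-4-6": "claude-opus-4-6",
}

// shouldEscalate reports whether the failure rate over the most recent
// `lookback` outcomes (ordered oldest→newest, true = failure) exceeds
// the threshold.
func shouldEscalate(failures []bool, lookback int, threshold float64) bool {
	if len(failures) > lookback {
		failures = failures[len(failures)-lookback:]
	}
	if len(failures) == 0 {
		return false // no history yet, no escalation
	}
	n := 0
	for _, failed := range failures {
		if failed {
			n++
		}
	}
	return float64(n)/float64(len(failures)) > threshold
}

func main() {
	// 4 failures in the last 10 outcomes: 0.4 > 0.3, so escalate one tier.
	recent := []bool{true, false, true, false, true, false, true, false, false, false}
	if shouldEscalate(recent, 10, 0.3) {
		fmt.Println(nextTier["claude-haiku-4-5"]) // claude-sonnet-4-6
	}
}
```

Note the strict comparison: exactly 3 failures in 10 is a 30% rate, which does not exceed the default 0.3 threshold, so no escalation occurs.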

Escalation Configuration

```yaml
executor:
  model_routing:
    escalation:
      enabled: true   # false by default
      threshold: 0.3  # 30% failure rate triggers escalation
      lookback: 10    # number of recent outcomes to evaluate
```

When escalation.enabled is false (default), models are selected purely by complexity level with no outcome feedback.

Configuration

All routing is configured under the executor key in ~/.pilot/config.yaml.

Model Routing

```yaml
executor:
  model_routing:
    enabled: true                  # false by default
    trivial: "claude-haiku-4-5"    # fastest, cheapest
    simple: "claude-sonnet-4-6"    # near-Opus quality, 40% cheaper
    medium: "claude-sonnet-4-6"
    complex: "claude-opus-4-6"
```

When enabled: false (default), all tasks use the backend’s default model — no routing occurs.

Timeout Configuration

```yaml
executor:
  timeout:
    default: "30m"  # fallback for unknown complexity
    trivial: "5m"
    simple: "10m"
    medium: "30m"
    complex: "60m"
```

Timeouts are always active (no enabled flag). Values use Go duration format (5m, 1h, 30s).

Effort Routing

```yaml
executor:
  effort_routing:
    enabled: true     # false by default
    trivial: "low"    # minimal token spend
    simple: "medium"  # balanced
    medium: "high"    # standard (Claude default)
    complex: "max"    # deepest reasoning
```

When enabled: false, effort is not set and Claude uses its default (high).

Task Decomposition

```yaml
executor:
  decompose:
    enabled: false             # opt-in
    min_complexity: "complex"  # minimum level to trigger
    max_subtasks: 5            # range: 2–10
    min_description_words: 50  # short descriptions don't decompose
```

Decomposition is separate from epic detection. Epics always decompose. This config controls whether complex tasks are also automatically split.

Model routing and effort routing are both disabled by default. Enable them explicitly to benefit from cost optimization. Without routing, every task uses the same model and effort level regardless of complexity.

Full Example

```yaml
executor:
  model_routing:
    enabled: true
    trivial: "claude-haiku-4-5"
    simple: "claude-sonnet-4-6"
    medium: "claude-sonnet-4-6"
    complex: "claude-opus-4-6"
    escalation:
      enabled: true
      threshold: 0.3
      lookback: 10
  timeout:
    default: "30m"
    trivial: "5m"
    simple: "10m"
    medium: "30m"
    complex: "60m"
  effort_routing:
    enabled: true
    trivial: "low"
    simple: "medium"
    medium: "high"
    complex: "max"
  decompose:
    enabled: true
    min_complexity: "complex"
    max_subtasks: 5
    min_description_words: 50
```

How It Flows Together

```
Issue arrives
      │
┌─────▼───────────────┐
│ DetectComplexity()  │  Pattern match → word count fallback
└──────────┬──────────┘
     ┌─────┴──────┐
     │   Epic?    │──── Yes ──→ PlanEpic() → CreateSubIssues() → stop
     └─────┬──────┘
           │ No
     ┌─────┴──────┐
     │ Decompose? │──── Yes ──→ Split into subtasks → execute each
     └─────┬──────┘
           │ No
┌──────────┴─────────┐
│ ModelRouter        │
│  .SelectModel()    │──→ "claude-haiku-4-5", "claude-sonnet-4-6", or "claude-opus-4-6"
│  .SelectTimeout()  │──→ 5m / 10m / 30m / 60m
│  .SelectEffort()   │──→ "low" / "medium" / "high" / "max"
└──────────┬─────────┘
┌──────────▼─────────┐
│ ShouldEscalate()?  │──→ Check failure rate for task type + model
│  Yes → bump model  │    haiku → sonnet → opus
│  No  → keep model  │
└──────────┬─────────┘
┌──────────▼─────────┐
│ Execute with       │
│ selected params    │  model, timeout, effort applied
└──────────┬─────────┘
┌──────────▼─────────┐
│ Quality Gates      │  build → test → lint → coverage
│ + Self Review      │
└──────────┬─────────┘
┌──────────▼─────────┐
│ Push + Create PR   │
└────────────────────┘
```

What’s Next

  • Architecture — Full system architecture and autopilot state machine
  • Autopilot Mode — CI monitoring, auto-merge, and feedback loops