# Complexity Detection & Model Routing
Pilot classifies every incoming task by complexity before execution begins. This classification drives three routing decisions: which model handles the task, how long it’s allowed to run, and how much reasoning effort the model applies. The result is lower cost for simple work and deeper reasoning for architectural changes — automatically, with no manual intervention.
## The Five Complexity Levels
Every task maps to one of five levels. Detection runs against the combined issue title and description.
| Level | Description | Example Tasks |
|---|---|---|
| Trivial | Minimal, mechanical changes | Fix typo, add log line, rename variable, bump version, lint fix |
| Simple | Small focused changes | Add field, minor fix, update config, add test case |
| Medium | Standard feature work | New endpoint, component, integration, moderate changes |
| Complex | Architectural changes | Refactor, migration, redesign, multi-file restructure |
| Epic | Too large for a single PR | Multi-phase projects, roadmaps, system overhauls |
## How Detection Works
Detection runs in `DetectComplexity()` and follows a strict priority order:
1. Epic check → Is this too large for one execution cycle?
2. Trivial match → Does it match a trivial keyword pattern?
3. Complex match → Does it match an architectural keyword pattern?
4. Simple match → Does it match a small-change keyword pattern?
5. Word count → Fall back to description-length heuristics

### Pattern Matching
Each level has a set of keyword patterns matched against the lowercased title + description:
**Trivial patterns** — mechanical changes that need minimal reasoning:

```text
fix typo, typo, add log, add logging, update comment, fix comment,
rename variable, rename function, rename, remove unused, delete unused,
bump version, update version, fix import, add import, fix whitespace,
formatting, lint fix
```

**Simple patterns** — small but require some thought:
```text
add field, add property, add parameter, add argument, small fix,
minor fix, quick fix, update config, change config, update constant,
add constant, add test case, fix test
```

**Complex patterns** — architectural consideration required:
```text
refactor, rewrite, redesign, migrate, migration, architecture,
restructure, overhaul, system, database schema, api design,
multi-file, cross-cutting
```

### Word Count Fallback
When no keyword patterns match, the description length (with code blocks stripped) determines complexity:
| Word Count | Classification |
|---|---|
| < 10 words | Simple |
| 10–49 words | Medium |
| ≥ 50 words | Complex |
Code blocks are stripped before word counting to avoid inflated counts from embedded snippets. A 200-word description with a 500-line code sample is classified by the 200 prose words, not 700 total.
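The priority order and word-count fallback above can be sketched as follows. This is an illustrative Python sketch, not Pilot's actual implementation, and the pattern lists are abridged from the full sets shown above (the epic check, step 1, is covered in the next section):

```python
import re

# Abridged pattern lists -- the full sets appear above
TRIVIAL = ["fix typo", "typo", "add log", "rename", "bump version", "lint fix"]
COMPLEX = ["refactor", "rewrite", "redesign", "migrate", "architecture"]
SIMPLE = ["add field", "small fix", "minor fix", "update config", "add test case"]

def strip_code_blocks(text: str) -> str:
    """Drop fenced code blocks so embedded snippets don't inflate the word count."""
    return re.sub(r"```.*?```", " ", text, flags=re.DOTALL)

def detect_complexity(title: str, description: str) -> str:
    text = f"{title} {description}".lower()
    # Steps 2-4: keyword patterns, checked in priority order
    for level, patterns in (("trivial", TRIVIAL),
                            ("complex", COMPLEX),
                            ("simple", SIMPLE)):
        if any(p in text for p in patterns):
            return level
    # Step 5: word-count fallback on prose only
    words = len(strip_code_blocks(description).split())
    if words < 10:
        return "simple"
    return "medium" if words < 50 else "complex"
```

Note that trivial patterns are checked before complex ones, so "fix typo in refactored module" still classifies as trivial.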
## Epic Detection
Epic detection uses combination rules — no single signal (except [epic] tag) triggers epic mode alone. This prevents false positives from normal implementation plans that happen to mention phases.
### Signals Collected
| Signal | How It’s Measured |
|---|---|
| `[epic]` tag | Regex match in title |
| Phase count | Headers matching Phase N, Stage N, Part N, Milestone N |
| Checkbox count | Markdown checkboxes - [ ] or - [x] |
| Word count | Prose words after stripping code blocks |
| Epic keywords | Standalone words: epic, roadmap, multi-phase, milestone |
| Structural markers | Presence of ## headers, phase/stage/step in text |
### Combination Rules

```text
[epic] tag → always epic
5+ phases → epic
epic keywords + (3+ phases OR 5+ checkboxes OR 200+ words) → epic
7+ checkboxes + (200+ words OR 3+ phases) → epic
300+ words + structural markers + (5+ checkboxes OR 2+ phases) → epic
```

Three phases alone is not epic — that's a normal implementation plan. The threshold is 5+ phases as a standalone signal, or 3+ phases combined with other structural indicators.
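The combination rules can be expressed as a single predicate. A minimal Python sketch mirroring the documented rules (illustrative only, not Pilot's source):

```python
def is_epic(has_tag: bool, phases: int, checkboxes: int,
            words: int, has_keywords: bool, has_structure: bool) -> bool:
    if has_tag:                  # [epic] tag -> always epic
        return True
    if phases >= 5:              # 5+ phases is epic on its own
        return True
    if has_keywords and (phases >= 3 or checkboxes >= 5 or words >= 200):
        return True
    if checkboxes >= 7 and (words >= 200 or phases >= 3):
        return True
    if words >= 300 and has_structure and (checkboxes >= 5 or phases >= 2):
        return True
    return False
```

A three-phase plan with no other signals returns `False`, matching the rule that phases alone only trigger epic mode at five or more.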
### What Happens When Epic Triggers
- Pilot sends the full issue to Claude for planning
- The response is parsed into 3–5 sequential subtasks
- Child issues are created on GitHub, each labeled `pilot`
- The parent issue moves to `pilot-in-progress`
- The poller picks up subtasks sequentially
## Routing Tables
### Model Routing
Maps complexity to the AI model used for execution.
| Complexity | Default Model | Pricing (input/output) | Rationale |
|---|---|---|---|
| Trivial | claude-haiku-4-5 | $0.80 / $4 per 1M tokens | Speed over depth — mechanical changes don’t need reasoning |
| Simple | claude-sonnet-4-6 | $3 / $15 per 1M tokens | Near-Opus quality at 40% lower cost |
| Medium | claude-sonnet-4-6 | $3 / $15 per 1M tokens | Standard feature work, near-Opus quality |
| Complex | claude-opus-4-6 | $5 / $25 per 1M tokens | Architectural work gets the most capable model |
Sonnet 4.6 is the default for simple/medium tasks — near-Opus quality at 40% lower cost. Complex tasks use Opus 4.6 for maximum reasoning depth. Epic tasks are decomposed before execution, so each subtask routes independently.
### Timeout Routing
Prevents stuck tasks from running indefinitely. Calibrated per complexity level.
| Complexity | Default Timeout | Purpose |
|---|---|---|
| Trivial | 5 minutes | Typo fixes shouldn’t take longer |
| Simple | 10 minutes | Small changes with room for quality gates |
| Medium | 30 minutes | Standard feature work including research phase |
| Complex | 60 minutes | Architectural changes with full context intelligence workflow |
| Fallback | 30 minutes | Used when timeout config is missing or parse fails |
### Effort Routing
Controls how many tokens Claude uses when responding — the depth of reasoning applied to the task.
| Complexity | Effort Level | Behavior |
|---|---|---|
| Trivial | low | Minimal tokens, fast responses — no deep analysis needed |
| Simple | medium | Balanced — enough reasoning for focused fixes |
| Medium | high | Standard Claude behavior — full reasoning chain |
| Complex | max | Deepest reasoning available — exhaustive analysis before acting |
Effort maps to the Claude API's `output_config.effort` parameter, or the `--effort` flag in Claude Code.
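The three routing tables can be collapsed into one lookup. An illustrative Python sketch using the default values documented above (not Pilot's actual implementation):

```python
# Defaults from the routing tables above; all overridable in config
MODEL = {"trivial": "claude-haiku-4-5", "simple": "claude-sonnet-4-6",
         "medium": "claude-sonnet-4-6", "complex": "claude-opus-4-6"}
TIMEOUT_MIN = {"trivial": 5, "simple": 10, "medium": 30, "complex": 60}
EFFORT = {"trivial": "low", "simple": "medium", "medium": "high", "complex": "max"}

def route(complexity: str):
    """Return (model, timeout_minutes, effort) for a complexity level.
    Unknown levels fall back to the 30-minute default timeout."""
    return (MODEL.get(complexity),
            TIMEOUT_MIN.get(complexity, 30),
            EFFORT.get(complexity))
```

For example, `route("trivial")` yields the Haiku model, a 5-minute timeout, and low effort.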
## Cost Examples
Real-world cost estimates based on typical token consumption per complexity level:
| Complexity | Typical Tokens (in/out) | Estimated Cost | Example Task |
|---|---|---|---|
| Trivial | ~2K / ~500 | ~$0.02 | Fix typo in README |
| Simple | ~15K / ~5K | ~$0.20 | Add a config field with validation |
| Medium | ~50K / ~20K | ~$0.75 | New API endpoint with tests |
| Complex | ~100K / ~50K | ~$1.50–$3.00 | Refactor authentication system |
| Epic (per subtask) | ~50K / ~20K | ~$0.75 each | Phase of a multi-part migration |
These are estimates. Actual costs depend on codebase size (more context = more input tokens), quality gate retries, and self-review passes. Enable budget limits to cap per-task spending.
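The underlying arithmetic is simple token pricing. A sketch (prices from the model routing table above; the estimates in the cost table run higher than raw token cost because they fold in that overhead):

```python
PRICES = {  # $ per 1M tokens (input, output), from the model routing table
    "claude-haiku-4-5": (0.80, 4.00),
    "claude-sonnet-4-6": (3.00, 15.00),
    "claude-opus-4-6": (5.00, 25.00),
}

def raw_token_cost(model: str, in_tokens: int, out_tokens: int) -> float:
    """Raw token cost only -- real runs add context injection,
    quality gate retries, and self-review passes on top."""
    in_price, out_price = PRICES[model]
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000
```

A complex task at ~100K input / ~50K output on Opus works out to $0.50 + $1.25 = $1.75, inside the $1.50–$3.00 range above.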
## Behavioral Routing
Beyond model and timeout, complexity drives execution behavior:
| Behavior | Trivial | Simple | Medium | Complex | Epic |
|---|---|---|---|---|---|
| Context injection | Skip | Full | Full | Full | N/A (decomposed) |
| Research phase | Skip | Skip | Run | Run | N/A |
| Self-review | Run | Run | Run | Run | Per subtask |
| Quality gates | Run | Run | Run | Run | Per subtask |
| Task decomposition | No | No | No | If enabled | Always |
Context injection: Trivial tasks skip context initialization to avoid ~30s overhead on a task that takes 60s total.
Research phase: Medium and complex tasks run a parallel research phase before implementation, giving Claude codebase context before writing code.
## Outcome-Based Escalation
Starting in v2.56.0, Pilot tracks execution outcomes per model and automatically escalates to a more capable model when failure rates exceed a threshold.
### How It Works
Every task execution records an outcome in the `model_outcomes` SQLite table:
| Column | Description |
|---|---|
| `task_type` | Normalized task category (e.g., endpoint, refactor, test) |
| `model` | Model used for execution |
| `outcome` | `success` or `failure` |
| `tokens` | Total tokens consumed |
| `duration` | Execution wall time |
Before each execution, `ShouldEscalate()` checks the last 10 outcomes for the current task type and model. If the failure rate exceeds the configured threshold (default 30%), Pilot escalates to the next model tier:
```text
claude-haiku-4-5 → claude-sonnet-4-6 → claude-opus-4-6
```

Escalation is logged with the task type, original model, escalated model, and observed failure rate.
Escalation is per task type — if Haiku fails frequently on refactor tasks but succeeds on typo tasks, only refactor tasks escalate. This keeps costs low for work the cheaper model handles well.
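The escalation check can be sketched as follows. An illustrative Python version (Pilot's actual `ShouldEscalate()` reads from SQLite; here the outcome history is passed in directly):

```python
NEXT_TIER = {"claude-haiku-4-5": "claude-sonnet-4-6",
             "claude-sonnet-4-6": "claude-opus-4-6"}

def should_escalate(outcomes: list, threshold: float = 0.3,
                    lookback: int = 10) -> bool:
    """outcomes: oldest-first 'success'/'failure' records for one
    (task_type, model) pair. Escalate when the failure rate over the
    last `lookback` outcomes strictly exceeds the threshold."""
    recent = outcomes[-lookback:]
    if not recent:
        return False
    return recent.count("failure") / len(recent) > threshold

def escalate(model: str) -> str:
    return NEXT_TIER.get(model, model)  # Opus is already the top tier
```

Note the strict comparison: a failure rate of exactly 30% over the lookback window does not trigger escalation.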
### Escalation Configuration
```yaml
executor:
  model_routing:
    escalation:
      enabled: true    # false by default
      threshold: 0.3   # 30% failure rate triggers escalation
      lookback: 10     # number of recent outcomes to evaluate
```

When `escalation.enabled` is false (default), models are selected purely by complexity level with no outcome feedback.
## Configuration
All routing is configured under the `executor` key in `~/.pilot/config.yaml`.
### Model Routing
```yaml
executor:
  model_routing:
    enabled: true                 # false by default
    trivial: "claude-haiku-4-5"   # fastest, cheapest
    simple: "claude-sonnet-4-6"   # near-Opus quality, 40% cheaper
    medium: "claude-sonnet-4-6"
    complex: "claude-opus-4-6"
```

When `enabled: false` (default), all tasks use the backend's default model — no routing occurs.
### Timeout Configuration
```yaml
executor:
  timeout:
    default: "30m"   # fallback for unknown complexity
    trivial: "5m"
    simple: "10m"
    medium: "30m"
    complex: "60m"
```

Timeouts are always active (no `enabled` flag). Values use Go duration format (`5m`, `1h`, `30s`).
### Effort Routing
```yaml
executor:
  effort_routing:
    enabled: true      # false by default
    trivial: "low"     # minimal token spend
    simple: "medium"   # balanced
    medium: "high"     # standard (Claude default)
    complex: "max"     # deepest reasoning
```

When `enabled: false`, effort is not set and Claude uses its default (high).
### Task Decomposition
```yaml
executor:
  decompose:
    enabled: false               # opt-in
    min_complexity: "complex"    # minimum level to trigger
    max_subtasks: 5              # range: 2-10
    min_description_words: 50    # short descriptions don't decompose
```

Decomposition is separate from epic detection. Epics always decompose; this config controls whether complex tasks are also automatically split.
Model routing and effort routing are both disabled by default. Enable them explicitly to benefit from cost optimization. Without routing, every task uses the same model and effort level regardless of complexity.
### Full Example
```yaml
executor:
  model_routing:
    enabled: true
    trivial: "claude-haiku-4-5"
    simple: "claude-sonnet-4-6"
    medium: "claude-sonnet-4-6"
    complex: "claude-opus-4-6"
    escalation:
      enabled: true
      threshold: 0.3
      lookback: 10
  timeout:
    default: "30m"
    trivial: "5m"
    simple: "10m"
    medium: "30m"
    complex: "60m"
  effort_routing:
    enabled: true
    trivial: "low"
    simple: "medium"
    medium: "high"
    complex: "max"
  decompose:
    enabled: true
    min_complexity: "complex"
    max_subtasks: 5
    min_description_words: 50
```

## How It Flows Together
```text
Issue arrives
     │
     ▼
┌─────────────────────┐
│ DetectComplexity()  │ Pattern match → word count fallback
└──────────┬──────────┘
           │
     ┌─────┴──────┐
     │   Epic?    │──── Yes ──→ PlanEpic() → CreateSubIssues() → stop
     └─────┬──────┘
           │ No
     ┌─────┴──────┐
     │ Decompose? │──── Yes ──→ Split into subtasks → execute each
     └─────┬──────┘
           │ No
┌──────────┴─────────┐
│ ModelRouter        │
│  .SelectModel()    │──→ "claude-haiku-4-5", "claude-sonnet-4-6", or "claude-opus-4-6"
│  .SelectTimeout()  │──→ 5m / 10m / 30m / 60m
│  .SelectEffort()   │──→ "low" / "medium" / "high" / "max"
└──────────┬─────────┘
           │
┌──────────▼─────────┐
│ ShouldEscalate()?  │──→ Check failure rate for task type + model
│  Yes → bump model  │    haiku → sonnet → opus
│  No → keep model   │
└──────────┬─────────┘
           │
┌──────────▼─────────┐
│ Execute with       │
│ selected params    │ model, timeout, effort applied
└──────────┬─────────┘
           │
┌──────────▼─────────┐
│ Quality Gates      │ build → test → lint → coverage
│ + Self Review      │
└──────────┬─────────┘
           │
┌──────────▼─────────┐
│ Push + Create PR   │
└────────────────────┘
```

## What's Next
- Architecture — Full system architecture and autopilot state machine
- Autopilot Mode — CI monitoring, auto-merge, and feedback loops