Failure Handling

When a gate fails, a guard triggers, or a review rejects the code, Gump needs to know what to do. The on_failure field defines the recovery strategy.

Without on_failure

If a step has no on_failure, any failure is fatal and the run stops immediately.
- name: impl
  agent: claude-sonnet
  gate: [compile, test]
  # no on_failure → gate fail = run stops

Basic on_failure

- name: impl
  agent: claude-sonnet
  gate: [compile, test]
  on_failure:
    retry: 3
    strategy: [same, "escalate: claude-opus"]
  • retry — maximum number of additional attempts
  • strategy — what to do on each attempt, in order

Strategy options

same

Retry with the same agent. The agent receives the error context ({error} and {diff}) from the failed attempt.
strategy: [same, same, same]

same: N (shorthand)

strategy: ["same: 3"]
Equivalent to [same, same, same].

escalate: agent

Switch to a more powerful (usually more expensive) agent.
strategy: ["same: 2", "escalate: claude-sonnet", "escalate: claude-opus"]
The escalated agent starts with a fresh session (different agent, different context window).

Combining strategies

strategy:
  - same
  - same
  - "escalate: claude-haiku"
  - "escalate: claude-sonnet"
  - "escalate: claude-opus"
Gump walks through the strategy list in order. When the list is exhausted, the circuit breaker triggers and the step (or run) is marked fatal.
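
To make the ordering concrete, here is a minimal Python sketch of how a strategy list could be expanded into one entry per attempt. It only models the behaviour described above and is not Gump's implementation; the function name expand_strategy and the (action, agent) tuple shape are assumptions made for illustration.

```python
def expand_strategy(entries):
    """Expand a strategy list into one (action, agent) tuple per attempt.

    Hypothetical helper: models the documented ordering, not Gump's real code.
    """
    plan = []
    for entry in entries:
        name, _, arg = entry.partition(":")
        name, arg = name.strip(), arg.strip()
        if name == "same":
            # "same" is one retry; "same: N" is shorthand for N retries in a row.
            count = int(arg) if arg else 1
            plan.extend([("same", None)] * count)
        elif name == "escalate":
            if not arg:
                raise ValueError(f"escalate needs an agent: {entry!r}")
            plan.append(("escalate", arg))
        else:
            raise ValueError(f"unknown strategy entry: {entry!r}")
    return plan


plan = expand_strategy(["same: 2", "escalate: claude-sonnet", "escalate: claude-opus"])
for attempt, (action, agent) in enumerate(plan, start=1):
    print(attempt, action, agent or "(current agent)")
```

Under these assumptions the list above yields four attempts: two with the current agent, then one each with claude-sonnet and claude-opus. Once the last entry has been used, nothing is left to try and the circuit breaker applies.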

restart_from

Restart from an earlier step in the same group instead of retrying the current step:
- name: build
  foreach: decompose
  steps:
    - name: tests
      agent: claude-haiku
      gate: [compile, tests_found]

    - name: impl
      agent: claude-haiku
      session: reuse
      gate: [compile, test]
      on_failure:
        retry: 5
        strategy: ["same: 3", "escalate: claude-sonnet"]
        restart_from: tests
If impl fails after all retries, restart_from: tests sends execution back to the tests step. The worktree is reset to its pre-tests state and the state bag is cleaned: previous outputs are moved to “prev”, not destroyed. Each restart consumes one attempt from the retry budget. The reasoning: if the agent can’t implement what the tests demand, the tests themselves may be poorly designed.

Conditional on_failure

Route failure handling differently depending on what failed:
on_failure:
  gate_fail:
    retry: 3
    strategy: [same, "escalate: claude-opus"]
    restart_from: tests
  guard_fail:
    retry: 1
    strategy: ["escalate: claude-opus"]
  review_fail:
    retry: 2
    strategy: [same]
  • gate_fail — a gate (compile, test, lint) failed
  • guard_fail — a guard (max_turns, max_budget, no_write) killed the agent
  • review_fail — a review step returned pass: false
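Each source has its own retry counter. If a source isn’t listed, it falls back to gate_fail. If gate_fail isn’t listed either, the failure is fatal. The simple form (retry and strategy directly under on_failure) and the conditional form (gate_fail, guard_fail, review_fail) are mutually exclusive: use one or the other on a given step.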
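
The fallback rule can be sketched in a few lines of Python. This is an illustrative model only: resolve_handler is a hypothetical name, the config is assumed to be already parsed into a dict, and the sketch assumes the simple form applies to every failure source.

```python
CONDITIONAL_KEYS = {"gate_fail", "guard_fail", "review_fail"}
SIMPLE_KEYS = {"retry", "strategy", "restart_from"}


def resolve_handler(on_failure, source):
    """Return the handler dict for a failure source, or None if the failure is fatal.

    Hypothetical helper modelling the documented fallback, not Gump's real code.
    """
    if on_failure is None:
        return None  # no on_failure at all: any failure is fatal
    has_conditional = bool(on_failure.keys() & CONDITIONAL_KEYS)
    has_simple = bool(on_failure.keys() & SIMPLE_KEYS)
    if has_conditional and has_simple:
        raise ValueError("simple and conditional on_failure forms are mutually exclusive")
    if has_simple:
        # Assumption: the simple form handles every failure source the same way.
        return on_failure
    # Conditional form: use the source's own entry, else fall back to gate_fail.
    return on_failure.get(source) or on_failure.get("gate_fail")


config = {
    "gate_fail": {"retry": 3, "strategy": ["same", "escalate: claude-opus"]},
    "guard_fail": {"retry": 1, "strategy": ["escalate: claude-opus"]},
}
print(resolve_handler(config, "review_fail"))  # falls back to the gate_fail entry
```

With only guard_fail listed, a review_fail would resolve to None in this model, which corresponds to the fatal case described above.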

What happens on retry

  1. The worktree is reset to the pre-step commit (git reset --hard)
  2. The session is fresh (unless session: reuse-on-retry)
  3. {error} is injected with the gate’s stderr output (truncated to 2000 chars)
  4. {diff} is injected with the failed attempt’s diff (truncated to 3000 chars)
  5. The agent launches with the full error context
  6. If this attempt also fails, the next strategy entry is used
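
Steps 3 and 4 can be pictured with a short Python sketch. The limits come from the list above; everything else is an assumption for illustration: the helper names are hypothetical, and the sketch keeps the first N characters because the text does not say which end of the output is kept.

```python
import re

ERROR_LIMIT = 2000  # characters of gate stderr kept for {error}
DIFF_LIMIT = 3000   # characters of the failed attempt's diff kept for {diff}


def build_retry_context(stderr, diff):
    """Truncate the failure output to the documented limits (hypothetical helper)."""
    # Assumption: truncation keeps the start of each string.
    return {"error": stderr[:ERROR_LIMIT], "diff": diff[:DIFF_LIMIT]}


def render_prompt(template, context):
    """Substitute {error} and {diff} in a single pass over the template."""
    return re.sub(r"\{(error|diff)\}", lambda m: context[m.group(1)], template)


context = build_retry_context("E" * 5000, "+ line\n" * 1000)
print(len(context["error"]), len(context["diff"]))
print(render_prompt("Previous attempt failed:\n{error}", {"error": "boom", "diff": ""}))
```

Substituting in one pass avoids re-expanding a literal "{diff}" that happens to appear inside the error text.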

Circuit breaker

When all strategies are exhausted, the step is marked fatal. If the step is inside a group with its own on_failure, the group’s retry kicks in. Otherwise, the run stops. The circuit breaker emits a circuit_breaker event in the ledger with the reason and the number of attempts.
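
As a rough illustration of that decision, the Python sketch below builds a ledger entry and picks the next action. The event type, the reason, and the attempt count come from the paragraph above; the dict layout, the field names, and the function name are assumptions, not Gump's actual ledger schema.

```python
def on_strategies_exhausted(step_name, attempts, reason, group_has_on_failure):
    """Model what happens once a step has no strategies left (hypothetical helper)."""
    # Field names here are illustrative; only the event type, reason and
    # attempt count are mentioned in the documentation.
    event = {
        "type": "circuit_breaker",
        "step": step_name,
        "reason": reason,
        "attempts": attempts,
    }
    # A parent group with its own on_failure takes over; otherwise the run stops.
    next_action = "group_retry" if group_has_on_failure else "stop_run"
    return event, next_action


event, action = on_strategies_exhausted("impl", 3, "gate_fail: test", True)
print(event["type"], action)
```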

Group-level on_failure

Groups (orchestration steps) can have their own on_failure:
- name: build
  foreach: decompose
  steps:
    - name: impl
      agent: claude-sonnet
      gate: [compile, test]
      on_failure:
        retry: 2
        strategy: [same]
    - name: review
      agent: claude-opus
      output: review
  gate: [compile, test]
  on_failure:
    retry: 3
    restart_from: impl
The inner on_failure handles per-step retries. If the group gate fails after all inner retries, the outer on_failure retries the entire group from impl. The worktree is reset to the pre-group state.
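
Nested budgets multiply, which is worth keeping in mind when sizing retry values. The Python sketch below works through the arithmetic for the example above under three assumptions: retry counts additional attempts (as stated under Basic on_failure), the inner retry counter starts over on each pass through the group, and every attempt fails. The function name is hypothetical.

```python
def worst_case_attempts(inner_retry, outer_retry):
    """Upper bound on runs of the inner step for one pass through a group's work."""
    attempts_per_pass = 1 + inner_retry  # first try plus inner retries
    group_passes = 1 + outer_retry       # first pass plus group-level retries
    return attempts_per_pass * group_passes


# impl has retry: 2, the build group has retry: 3
print(worst_case_attempts(inner_retry=2, outer_retry=3))  # 12
```

Because the group uses foreach, this bound would apply per item under the same assumptions.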