error-recovery

📁 jwilger/agent-skills 📅 1 day ago

总安装量

周安装量

#51371

全站排名

安装命令

npx skills add https://github.com/jwilger/agent-skills --skill error-recovery

Agent 安装分布

opencode 4

gemini-cli 4

codebuddy 4

github-copilot 4

codex 4

kimi-cli 4

Skill 文档

Error Recovery

Value: Courage — autonomous operation requires resilience. Recovering from errors without human intervention keeps the workflow moving. Knowing when to escalate prevents wasted effort on unrecoverable situations.

Purpose

Teaches agents to handle unexpected errors during autonomous operation (API failures, build tool crashes, permission issues, resource exhaustion). Provides classification, retry strategies, and escalation rules. Prevents the failure modes of infinite retry loops, silent error swallowing, and unnecessary human interruptions for recoverable issues.

Practices

Classify Before Acting

When an error occurs, classify it before attempting recovery:

Category	Examples	Recovery
Transient	Network timeout, 503, rate limit, lock contention	Retry with backoff
Environmental	Missing dependency, wrong version, port conflict	Fix environment, then retry
Permission	File permission denied, auth token expired	Escalate to user
Logic	Assertion failure, type error, schema mismatch	Do NOT retry — investigate
Resource	Out of memory, disk full, context exhaustion	Reduce scope or escalate

Do not retry logic errors. If a test fails, an assertion fires, or a type mismatch occurs, retrying will produce the same result. Switch to the debugging protocol instead.

Retry Strategy: Exponential Backoff

For transient errors, retry with exponential backoff:

First retry: Wait 2 seconds
Second retry: Wait 5 seconds
Third retry: Wait 15 seconds
After third failure: Stop retrying and escalate

Never retry more than 3 times for the same error. Never retry without waiting. Never use a fixed retry loop without backoff.

Rate limit handling: If the error includes a Retry-After header or equivalent, respect it. Do not retry before the indicated time.

Error Logging

When an error occurs, log it to a structured format before attempting recovery:

## Error Log: [timestamp]

- **Category:** transient | environmental | permission | logic | resource
- **Error:** [exact error message]
- **Context:** [what was happening when the error occurred]
- **Action taken:** [retry | escalate | investigate | fix-environment]
- **Outcome:** [resolved | escalated | investigating]

In pipeline mode, append to .factory/audit-trail/slices/<slice-id>/error-log.md. In standalone mode, write to the project’s scratch directory or memory.

Environmental Recovery

For environmental errors (missing tools, wrong versions, port conflicts):

Identify the specific environmental issue
Attempt a targeted fix (install missing dependency, kill conflicting process, clear stale lock file)
Verify the fix resolved the issue
Retry the original operation ONCE
If it fails again, escalate — the environment may need manual intervention

Port conflicts: Check for processes using the port with lsof -i :<port> or equivalent. If the process is not related to the current project, report it to the user rather than killing it.

Context Exhaustion Recovery

When approaching context limits during long operations:

Write current state to WORKING_STATE.md immediately
Complete the current atomic operation if possible
Signal that continuation is needed
Do NOT start new operations that cannot complete in remaining context

This prevents the failure mode of starting work that cannot be finished, leaving the project in an inconsistent state.

Escalation Rules

Escalate to the user when:

Permission errors (you cannot fix what you cannot access)
Logic errors after investigation (the bug needs human insight)
3 retries exhausted for a transient error (the service may be down)
Environmental fix failed (the environment may need manual repair)
Resource exhaustion (context limit, disk space, memory)
Any error you cannot classify (unknown errors are dangerous)

How to escalate: Provide the error category, the exact error message, what you tried, and what you recommend. Do not just say “an error occurred.”

Pipeline Integration

In factory pipeline mode, error recovery integrates with the rework protocol:

Transient errors during CI: auto-retry once (standard/full autonomy)
Build tool crashes: classify and apply the appropriate strategy
Gate failures: these are NOT errors — they are expected feedback from quality gates. Do not apply error recovery to gate failures.

Enforcement Note

This skill provides advisory guidance. It instructs the agent on error classification and recovery strategies but cannot mechanically enforce retry limits or prevent silent error swallowing. The agent follows these practices by convention. If you observe the agent retrying endlessly or ignoring errors, point it out.

Verification

After recovering from an error, verify:

Error was classified before any recovery attempt
Logic errors were NOT retried (investigated instead)
Retries used exponential backoff (not immediate)
No more than 3 retries for the same error
Error was logged with category, message, context, and outcome
Escalation included the error category, message, attempts, and recommendation
Environmental fixes were verified before retrying the operation
State was saved before context exhaustion recovery

If any criterion is not met, revisit the relevant practice.

Dependencies

This skill works standalone. For enhanced workflows, it integrates with:

debugging-protocol: Logic errors escalate to the debugging protocol for systematic investigation
pipeline: Pipeline gate failures are handled by the rework protocol, not error recovery
session-reflection: Recurring errors become system prompt refinements
ci-integration: CI failures classify as transient (infra) or logic (test failure) for appropriate handling

Missing a dependency? Install with:

npx skills add jwilger/agent-skills --skill debugging-protocol

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台