Install command
npx skills add https://github.com/5dlabs/cto --skill healer
Healer Skill
Healer is the observability and self-healing layer for CTO Play workflows. It monitors pod logs via Loki, detects issues, and orchestrates remediations.
When to Use
- Monitoring Play workflow execution
- Debugging agent failures (pre-flight, runtime)
- Understanding detection patterns (A10, A11, A12)
- Checking session status
Healer API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Health check |
| `/api/v1/session/start` | POST | MCP calls this on `play()` |
| `/api/v1/session/{play_id}` | GET | Get session details |
| `/api/v1/sessions` | GET | List all sessions |
| `/api/v1/sessions/active` | GET | List active sessions only |
Check Active Sessions
curl -s http://localhost:8083/api/v1/sessions/active | jq .
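The same endpoints can be reached from a script. The following is a minimal Python sketch, assuming Healer listens on `localhost:8083` as in the curl example. The helper names (`endpoint_url`, `session_url`, `get_json`) are hypothetical and not part of Healer, and the response shape is not documented here, so the JSON is returned as-is.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL taken from the curl example above.
HEALER_BASE_URL = "http://localhost:8083"


def endpoint_url(path: str, base_url: str = HEALER_BASE_URL) -> str:
    """Join the base URL and an endpoint path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def session_url(play_id: str, base_url: str = HEALER_BASE_URL) -> str:
    """URL for GET /api/v1/session/{play_id}; the ID is percent-encoded."""
    encoded = urllib.parse.quote(play_id, safe="")
    return endpoint_url("/api/v1/session/" + encoded, base_url)


def get_json(url: str, timeout: float = 5.0):
    """GET a URL and decode the body as JSON (needs a running Healer)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a Healer instance running, `get_json(endpoint_url("/api/v1/sessions/active"))` would fetch the same data as the curl command.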
Detection Patterns
Priority 1: Pre-Flight Failures (within 60s of agent start)
| Pattern | Alert Code | Meaning |
|---|---|---|
| `tool inventory mismatch` | A10 | Agent missing declared tools |
| `Tool inventory MISMATCH` | A10 | Specific tool unavailable |
| `declared tools.*missing` | A10 | Tools in config not in CLI |
| `cto-config.*(missing\|invalid)` | A11 | Config not loaded/synced |
| `mcp.*failed to initialize` | A12 | MCP server init failure |
| `tools-server.*unreachable` | A12 | Tools-server down |
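To illustrate how these patterns map log lines to alert codes, here is a small Python sketch. It is not the implementation in `crates/healer/src/scanner.rs`. It assumes case-insensitive regex search (which folds the two `tool inventory mismatch` rows into one) and treats the 60-second window as elapsed time since agent start. `classify_preflight` is a hypothetical name.

```python
import re
from typing import Optional

# Patterns and alert codes copied from the Priority 1 table.
# Assumption: case-insensitive search, so both "tool inventory mismatch"
# rows collapse into a single entry.
PREFLIGHT_PATTERNS = [
    (r"tool inventory mismatch", "A10"),
    (r"declared tools.*missing", "A10"),
    (r"cto-config.*(missing|invalid)", "A11"),
    (r"mcp.*failed to initialize", "A12"),
    (r"tools-server.*unreachable", "A12"),
]
_COMPILED = [(re.compile(p, re.IGNORECASE), code) for p, code in PREFLIGHT_PATTERNS]

PREFLIGHT_WINDOW_SECS = 60  # "within 60s of agent start"


def classify_preflight(line: str, secs_since_start: float) -> Optional[str]:
    """Return the alert code for a log line, or None if nothing matches
    or the line falls outside the pre-flight window."""
    if secs_since_start < 0 or secs_since_start > PREFLIGHT_WINDOW_SECS:
        return None
    for pattern, code in _COMPILED:
        if pattern.search(line):
            return code
    return None
```

For example, `classify_preflight("cto-config.json is invalid", 10)` yields `"A11"`, while the same line seen 120 seconds after start is ignored by this check.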
Priority 2: Runtime Failures
| Pattern | Severity | Action |
|---|---|---|
| `panicked at`, `fatal error` | Critical | Immediate escalation |
| `timeout`, `connection refused` | High | Infrastructure issue |
| `max retries exceeded` | High | Agent exhausted attempts |
| `permission denied.*filesystem` | Critical | Can’t read/write files |
| `unauthorized\|invalid token` | Critical | Auth broken |
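A line can match more than one runtime pattern, so a classifier needs a precedence rule. The sketch below is an illustration only: it assumes Critical rules are checked before High ones and that matching is case-insensitive. Neither choice is confirmed by the source, and `classify_runtime` is a hypothetical name.

```python
import re
from typing import Optional, Tuple

# (pattern, severity, action) rows copied from the Priority 2 table,
# reordered so Critical rules are tried first (an assumption).
RUNTIME_RULES = [
    (r"panicked at|fatal error", "Critical", "Immediate escalation"),
    (r"permission denied.*filesystem", "Critical", "Can't read/write files"),
    (r"unauthorized|invalid token", "Critical", "Auth broken"),
    (r"timeout|connection refused", "High", "Infrastructure issue"),
    (r"max retries exceeded", "High", "Agent exhausted attempts"),
]
_RULES = [(re.compile(p, re.IGNORECASE), sev, act) for p, sev, act in RUNTIME_RULES]


def classify_runtime(line: str) -> Optional[Tuple[str, str]]:
    """Return (severity, action) for the first matching rule, else None."""
    for pattern, severity, action in _RULES:
        if pattern.search(line):
            return severity, action
    return None
```

Because Critical rules come first, a line such as `unauthorized: request timeout` is reported as Critical rather than High.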
Priority 3: Lifecycle Issues
| Pattern | Meaning |
|---|---|
| `template not found` | Prompt template missing |
| `prompt.*missing` | Agent instructions not loaded |
| `role.*undefined` | Agent role not set |
| `task context.*empty` | Task details not injected |
Dual-Model Architecture
DUAL-MODEL HEALER ARCHITECTURE

DATA SOURCES
├─ Loki (all pod logs)
├─ Kubernetes (CodeRuns, Pods, Events)
├─ GitHub (PRs, comments, CI status)
└─ CTO Config (expected tools, agent settings)
        │
        ▼
MODEL 1: EVALUATION AGENT
├─ Parses and comprehends ALL logs
├─ Correlates events across agents
├─ Identifies root cause
└─ Creates GitHub Issue with analysis
        │
        ▼
MODEL 2: REMEDIATION AGENT
├─ Reads the GitHub issue
├─ Implements the fix
├─ Creates PR with changes
└─ Marks issue resolved
Session Notification Flow
MCP play() call
  │
  ▼
POST /api/v1/session/start
  │
  ├─ Payload: {
  │    play_id,
  │    repository,
  │    cto_config: { agents, tools },
  │    tasks: [...]
  │  }
  │
  ▼
Healer stores session with expected tools per agent
  │
  ▼
CodeRuns start with Healer already aware
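The notification step can be sketched in Python as follows. Only the four top-level payload keys come from the flow above. The value shapes (a list of agent names, a mapping from agent to tool names, a list of task objects), the JSON content type, and both function names are assumptions made for illustration.

```python
import json
import urllib.request


def build_session_payload(play_id, repository, agents, tools, tasks) -> dict:
    """Assemble the payload keys shown in the flow; value shapes are assumed."""
    return {
        "play_id": play_id,
        "repository": repository,
        "cto_config": {"agents": agents, "tools": tools},
        "tasks": tasks,
    }


def build_session_start_request(
    payload: dict, base_url: str = "http://localhost:8083"
) -> urllib.request.Request:
    """Prepare (but do not send) POST /api/v1/session/start."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/session/start",
        data=body,
        headers={"Content-Type": "application/json"},  # assumed content type
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it. Building the request separately makes it easy to inspect the payload without a running Healer.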
Watch Logs
Pod Logs
# Watch all CTO pods
kubectl logs -n cto -l app.kubernetes.io/part-of=cto -f --tail=100
# Watch specific agent CodeRun
kubectl logs -n cto -l app=coderun -f
Loki Query
{namespace="cto"} |= "error" | json
Pre-Flight Checklist (Verify within 60s)
For every agent run, Healer verifies:
Prompts
- Agent type identified
- Role matches task
- Template loaded
- Language context set
MCP Tools (from CTO Config)
- CTO config loaded
- Remote tools accessible
- Local servers initialized
- Tools-server reachable
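The tool portion of this checklist amounts to comparing the tools declared per agent with the tools the agent actually reports. Here is a minimal Python sketch of that comparison. The function name, the agent names, and the tool names are hypothetical, and the real check may differ.

```python
from typing import Dict, Set


def missing_tools(
    expected: Dict[str, Set[str]], observed: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Return, per agent, the declared tools that were not observed.

    Agents with a complete inventory are omitted from the result.
    """
    gaps: Dict[str, Set[str]] = {}
    for agent, tools in expected.items():
        absent = set(tools) - set(observed.get(agent, set()))
        if absent:
            gaps[agent] = absent
    return gaps


# Hypothetical inventories: agent-b never reported any tools.
expected = {"agent-a": {"create_pr", "read_file"}, "agent-b": {"read_file"}}
observed = {"agent-a": {"read_file"}}
gaps = missing_tools(expected, observed)
```

A non-empty result corresponds to the A10 condition in the detection table (an agent missing declared tools).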
Escalation
When issues are detected:
- Evaluation Agent creates GitHub issue with root cause
- Remediation Agent attempts fix (if automatable)
- Discord notification for P0/P1 critical issues
- Human escalation if remediation fails
Configuration
In cto-config.json:
{
  "defaults": {
    "play": {
      "healerEndpoint": "http://localhost:8083"
    },
    "remediation": {
      "maxIterations": 3,
      "syncTimeoutSecs": 300
    }
  }
}
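A script that needs these settings could read them as in the Python sketch below. The fallback values simply repeat the numbers shown above and are an assumption, not documented defaults. `healer_settings` is a hypothetical helper.

```python
import json

# Assumption: fall back to the values shown in the example config.
DEFAULT_HEALER_ENDPOINT = "http://localhost:8083"
DEFAULT_MAX_ITERATIONS = 3
DEFAULT_SYNC_TIMEOUT_SECS = 300


def healer_settings(config_text: str) -> dict:
    """Extract Healer-related settings from the text of cto-config.json."""
    cfg = json.loads(config_text)
    defaults = cfg.get("defaults", {})
    play = defaults.get("play", {})
    remediation = defaults.get("remediation", {})
    return {
        "endpoint": play.get("healerEndpoint", DEFAULT_HEALER_ENDPOINT),
        "max_iterations": remediation.get("maxIterations", DEFAULT_MAX_ITERATIONS),
        "sync_timeout_secs": remediation.get("syncTimeoutSecs", DEFAULT_SYNC_TIMEOUT_SECS),
    }
```

Typical use would be `healer_settings(open("cto-config.json").read())`, with missing keys resolved to the assumed fallbacks.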
Reference Documentation
- docs/heal-play.md – Full Healer specification
- crates/healer/ – Healer implementation
- crates/healer/src/scanner.rs – Detection patterns