langgraph-error-handling

📁 lubu-labs/langchain-agent-skills 📅 12 days ago
Install command
npx skills add https://github.com/lubu-labs/langchain-agent-skills --skill langgraph-error-handling

Skill documentation

LangGraph Error Handling

Use This Skill For

  • Adding RetryPolicy to flaky nodes (API, DB, model/tool calls)
  • Designing LLM recovery loops (Command + error state + retry counters)
  • Adding human approval/escalation with interrupt() and resume
  • Handling prebuilt ToolNode failures
  • Debugging transactional failure behavior in parallel supersteps

Strategy Selection

Use this order:

  1. Transient/infrastructure issue (429, timeout, 5xx, temporary DB lock) -> RetryPolicy
  2. Recoverable by model/tool args correction -> store error in state and route back with Command
  3. Needs user approval or missing info -> interrupt() + resume
  4. Unknown/programming bug -> let it bubble up and debug

| Error Type | Owner | Primary Mechanism |
| --- | --- | --- |
| Transient | System | RetryPolicy |
| LLM-recoverable | LLM | State update + Command(goto=...) |
| User-fixable | Human | interrupt() + Command(resume=...) |
| Unexpected | Developer | Raise/log/debug |

For full taxonomy, load references/error-types.md.
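
The selection order above can be sketched as a small classifier. This is a minimal illustration, not the logic of scripts/classify_error.py: the `Strategy` enum, the `ToolArgumentError` and `NeedsHumanInput` exception classes, and `choose_strategy` are hypothetical names introduced here, and the choice of which exception types count as transient is an assumption to tune for your own stack.

```python
from enum import Enum


class Strategy(Enum):
    RETRY_POLICY = "RetryPolicy"
    LLM_RECOVERY = "state update + Command(goto=...)"
    HUMAN = "interrupt() + Command(resume=...)"
    BUBBLE_UP = "raise/log/debug"


class ToolArgumentError(Exception):
    """Hypothetical: the model produced tool arguments that failed validation."""


class NeedsHumanInput(Exception):
    """Hypothetical: approval or missing information is required from a user."""


# Assumption: these built-in types stand in for transient infrastructure failures.
# HTTP status checks (429, 5xx) are left out to keep the sketch short.
TRANSIENT_TYPES = (TimeoutError, ConnectionError)


def choose_strategy(exc: Exception) -> Strategy:
    """Apply the selection order: transient, LLM-recoverable, user-fixable, unknown."""
    if isinstance(exc, TRANSIENT_TYPES):
        return Strategy.RETRY_POLICY
    if isinstance(exc, ToolArgumentError):
        return Strategy.LLM_RECOVERY
    if isinstance(exc, NeedsHumanInput):
        return Strategy.HUMAN
    # Anything unrecognized is treated as a bug and left to bubble up.
    return Strategy.BUBBLE_UP
```

Checking the cheapest, most mechanical category first keeps the model and the human out of the loop unless they are actually needed.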

Minimal Patterns

1) Retry Transient Failures

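Python:

from langgraph.types import RetryPolicy

# Up to 3 attempts in total, waiting 1 second before the first retry.
builder.add_node(
    "call_api",
    call_api,
    retry_policy=RetryPolicy(max_attempts=3, initial_interval=1.0),
)

JavaScript:

// initialInterval is in milliseconds in JS, so 1000 matches the 1-second Python interval.
builder.addNode("callApi", callApi, {
  retryPolicy: { maxAttempts: 3, initialInterval: 1000 },
});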

Notes:

  • Python and JS apply different default rules for which exceptions are retried.
  • Prefer an explicit retry_on/retryOn filter when the defaults do not match the errors you consider transient.
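
A targeted filter can be written as a plain predicate. The sketch below assumes the Python RetryPolicy is given a callable for retry_on (it can also take an exception type or a sequence of types). The name `is_transient` and the `status_code` attribute it inspects are hypothetical, chosen only for illustration.

```python
# Assumption: HTTP-style errors expose a numeric `status_code` attribute.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def is_transient(exc: Exception) -> bool:
    """Return True only for errors that are worth retrying."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return True
    return getattr(exc, "status_code", None) in RETRYABLE_STATUS


# Wiring into the graph (not executed here; needs langgraph and a builder):
# builder.add_node(
#     "call_api",
#     call_api,
#     retry_policy=RetryPolicy(max_attempts=3, retry_on=is_transient),
# )
```

Keeping the predicate free of LangGraph imports makes it easy to unit-test against the exact exceptions your clients raise.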

2) LLM Recovery Loop

Use MessagesState in Python for message state.

Python:

from typing import Literal
from typing_extensions import NotRequired
from langgraph.graph import MessagesState
from langgraph.types import Command

class State(MessagesState):
    error: NotRequired[str]        # last error message, if any
    retry_count: NotRequired[int]  # recovery attempts so far

def agent(state: State) -> Command[Literal["tool", "__end__"]]:
    # Terminal branch: stop once the retry budget is exhausted.
    if state.get("retry_count", 0) >= 3:
        return Command(goto="__end__")
    # Recovery branch: a previous attempt left an error in state; route back to the tool.
    if state.get("error"):
        return Command(goto="tool")
    # Normal path.
    return Command(goto="tool")

JavaScript:

import { StateGraph, Command, END } from "@langchain/langgraph";

// If a node returns Command in JS, add `ends` on addNode.
builder.addNode("agent", agentNode, { ends: ["tool", END] });
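
The counterpart of the agent node is a tool step that records failures in state instead of raising, together with a bounded retry counter. The following is a minimal sketch in plain Python so that it runs without LangGraph installed: `risky_tool`, `run_tool`, `next_step`, and `MAX_RETRIES` are hypothetical names, and the dicts returned by `run_tool` are meant to be used as node state updates.

```python
from typing import Any

MAX_RETRIES = 3  # assumption: mirrors the limit of 3 used in the agent node


def risky_tool(query: str) -> str:
    """Hypothetical tool that rejects empty input."""
    if not query:
        raise ValueError("query must not be empty")
    return f"result for {query}"


def run_tool(state: dict[str, Any]) -> dict[str, Any]:
    """Node body: return a state update instead of letting the error propagate."""
    try:
        output = risky_tool(state.get("query", ""))
    except ValueError as exc:
        # Store the error for the model and count the failed attempt.
        return {"error": str(exc), "retry_count": state.get("retry_count", 0) + 1}
    # Success: clear any earlier error so the loop can finish.
    return {"result": output, "error": ""}


def next_step(state: dict[str, Any]) -> str:
    """Routing: go back to the agent while an error remains and budget is left."""
    if state.get("error") and state.get("retry_count", 0) < MAX_RETRIES:
        return "agent"
    return "__end__"
```

With an empty query, `run_tool` returns an error update with `retry_count` set to 1 and `next_step` routes back to "agent". Once `retry_count` reaches `MAX_RETRIES`, routing falls through to "__end__", which is the terminal branch that prevents an infinite loop.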

3) Human-In-The-Loop Escalation

Python:

from langgraph.types import interrupt, Command

def human_review(state):
    # Pauses the graph here; the value passed to Command(resume=...) becomes `approved`.
    approved = interrupt({
        "question": "Proceed?",
        "payload": state["pending_action"],
    })
    return Command(goto="execute" if approved else "cancel")

# Resume the paused run on the same thread.
graph.invoke(Command(resume=True), config={"configurable": {"thread_id": "t-1"}})

JavaScript:

import { Command, interrupt } from "@langchain/langgraph";

// Inside a node function:
const approved = interrupt({ question: "Proceed?" });
// Later, resume the paused run on the same thread:
await graph.invoke(new Command({ resume: true }), {
  configurable: { thread_id: "t-1" },
});

Requirements:

  • Compile with a checkpointer for interrupt flows.
  • Reuse the same thread_id on resume.

For deep HITL patterns, load references/human-escalation.md.

ToolNode Error Handling

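from langgraph.prebuilt import ToolNode

# Pick one of the following:
# Catch every tool exception and return a default error message to the model.
tool_node = ToolNode(tools, handle_tool_errors=True)
# Catch every tool exception and return this fixed message instead.
tool_node = ToolNode(tools, handle_tool_errors="Please try again.")
# Catch only these exception types; anything else propagates.
tool_node = ToolNode(tools, handle_tool_errors=(ValueError, TypeError))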

Use custom handlers when you need deterministic error shaping for model recovery. For broader tool-recovery design, load references/llm-recovery.md.
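
One way to get deterministic error text is to pass a callable as handle_tool_errors: a function that receives the exception and returns the string sent back to the model. The sketch below is illustrative only; `format_tool_error` and its message format are hypothetical.

```python
def format_tool_error(exc: Exception) -> str:
    """Return a stable, model-readable description of a tool failure."""
    kind = type(exc).__name__
    detail = str(exc) or "no details"
    return f"Tool call failed ({kind}): {detail}. Fix the arguments and try again."


# Wiring (not executed here; needs langgraph and a `tools` list):
# tool_node = ToolNode(tools, handle_tool_errors=format_tool_error)
```

A fixed message shape gives the model the same cues on every failure, which tends to make recovery prompts easier to write and test.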

Critical Behavior (Do Not Skip)

  1. Supersteps are transactional: if one parallel branch fails, none of that superstep's state updates are applied.
  2. RetryPolicy retries failing branches, not successful siblings.
  3. interrupt() re-runs the node from its start on resume: side effects before the interrupt() call must be idempotent, or be moved after the call or into a separate node.
  4. JS Command routing requires ends metadata on addNode(...).
  5. Use explicit retry limits (max_attempts, plus state counters for recovery loops).
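
Item 3 is the easiest to get wrong. Because the node body runs again on resume, anything placed before interrupt() executes twice. A minimal sketch of one guard, an idempotency key checked before the side effect, is shown below in plain Python. The names `send_notification_once`, `sent_keys`, and `outbox` are hypothetical, and in a real graph the set of keys would live in durable storage or graph state rather than in memory.

```python
def send_notification_once(
    sent_keys: set[str], outbox: list[str], key: str, message: str
) -> bool:
    """Perform the side effect only the first time `key` is seen."""
    if key in sent_keys:
        return False  # already done on an earlier run of the node
    outbox.append(message)  # stand-in for the real side effect
    sent_keys.add(key)
    return True


# Simulate a node body that runs twice: once before the pause, once on resume.
sent_keys: set[str] = set()
outbox: list[str] = []
for _ in range(2):
    send_notification_once(sent_keys, outbox, "review-request:t-1", "Approval needed")

print(len(outbox))  # 1
```

Moving the side effect after interrupt() or into its own node avoids the need for a key entirely. The guard is useful when the effect has to happen before the pause.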

Local Assets In This Skill

Scripts

  • scripts/classify_error.py: classify exception category and recommended handling
  • scripts/wrap_with_retry.py: generate boilerplate node wrappers with retry/recovery/escalation options

Run from repo root:

uv run skills/langgraph-error-handling/scripts/classify_error.py TimeoutError --verbose
uv run skills/langgraph-error-handling/scripts/wrap_with_retry.py call_llm --with-llm-recovery

Examples

  • assets/examples/retry-example/: retry + recovery loop (Python and JS)
  • assets/examples/human-loop-example/: interrupt/resume approval flow (Python and JS)

Load References On Demand

  • references/error-types.md: error taxonomy and classification rules
  • references/retry-strategies.md: retry tuning, backoff, circuit-breaker-style patterns
  • references/llm-recovery.md: recovery-loop and ToolNode strategies
  • references/human-escalation.md: human approval, interrupts, and escalation patterns

Common Failure Modes

| Symptom | Root Cause | Fix |
| --- | --- | --- |
| interrupt() fails at runtime | No checkpointer | Compile with a checkpointer |
| Resume starts a new run | Different thread_id | Reuse the same thread_id |
| JS Command route not taken | Missing ends | Add ends to addNode |
| Infinite loop | No termination counter/condition | Add retry counter + terminal branch |
| Retry never triggers | Exception excluded by retry filter | Set explicit retry_on/retryOn |