hive-test
npx skills add https://github.com/adenhq/hive --skill hive-test
Agent Testing
Test agents iteratively: execute, analyze failures, fix, resume from checkpoint, repeat.
When to Use
- Testing a newly built agent against its goal
- Debugging a failing agent iteratively
- Verifying fixes without re-running expensive early nodes
- Running final regression tests before deployment
Prerequisites
- Agent package at exports/{agent_name}/ (built with /hive-create)
- Credentials configured (/hive-credentials)
- ANTHROPIC_API_KEY set (or the appropriate LLM provider key)
Path distinction (critical - don't confuse these):
- exports/{agent_name}/ - agent source code (edit here)
- ~/.hive/agents/{agent_name}/ - runtime data: sessions, checkpoints, logs (read here)
The Iterative Test Loop
This is the core workflow. Don't re-run the entire agent when a late node fails - analyze, fix, and resume from the last clean checkpoint.
┌────────────────────────────────────────┐
│ PHASE 1: Generate Test Scenarios       │
│ Goal → synthetic test inputs + tests   │
└────────────────┬───────────────────────┘
                 │
┌────────────────▼───────────────────────┐
│ PHASE 2: Execute                       │◄─────────────┐
│ Run agent (CLI or pytest)              │              │
└────────────────┬───────────────────────┘              │
                 │                                      │
             Pass? ──yes──► PHASE 6: Final Verification │
                 │                                      │
                no                                      │
                 │                                      │
┌────────────────▼───────────────────────┐              │
│ PHASE 3: Analyze                       │              │
│ Session + runtime logs + checkpoints   │              │
└────────────────┬───────────────────────┘              │
                 │                                      │
┌────────────────▼───────────────────────┐              │
│ PHASE 4: Fix                           │              │
│ Prompt / code / graph / goal           │              │
└────────────────┬───────────────────────┘              │
                 │                                      │
┌────────────────▼───────────────────────┐              │
│ PHASE 5: Recover & Resume              │──────────────┘
│ Checkpoint resume OR fresh re-run      │
└────────────────────────────────────────┘
Phase 1: Generate Test Scenarios
Create synthetic tests from the agent’s goal, constraints, and success criteria.
Step 1a: Read the goal
# Read goal from agent.py
Read(file_path="exports/{agent_name}/agent.py")
# Extract the Goal definition and convert to JSON string
Step 1b: Get test guidelines
# Get constraint test guidelines
generate_constraint_tests(
goal_id="your-goal-id",
goal_json='{"id": "...", "constraints": [...]}',
agent_path="exports/{agent_name}"
)
# Get success criteria test guidelines
generate_success_tests(
goal_id="your-goal-id",
goal_json='{"id": "...", "success_criteria": [...]}',
node_names="intake,research,review,report",
tool_names="web_search,web_scrape",
agent_path="exports/{agent_name}"
)
These return file_header, test_template, constraints_formatted/success_criteria_formatted, and test_guidelines. They do NOT generate test code - you write the tests.
Step 1c: Write tests
Write(
file_path=result["output_file"],
content=result["file_header"] + "\n\n" + your_test_code
)
Test writing rules
- Every test MUST be async with @pytest.mark.asyncio
- Every test MUST accept the runner, auto_responder, mock_mode fixtures
- Use await auto_responder.start() before running, await auto_responder.stop() in finally
- Use await runner.run(input_dict) - this goes through AgentRunner → AgentRuntime → ExecutionStream
- Access output via result.output.get("key") - NEVER result.output["key"]
- result.success=True means no exception, NOT goal achieved - always check output
- Write 8-15 tests total, not 30+
- Each real test costs ~3 seconds + LLM tokens
- NEVER use default_agent.run() - it bypasses the runtime (no sessions, no logs, client-facing nodes hang)
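A minimal skeleton that condenses these rules (the input key and output assertion are illustrative placeholders; fuller examples appear under Test Patterns below):
import pytest

@pytest.mark.asyncio
async def test_example(runner, auto_responder, mock_mode):
    """Minimal shape every test should follow."""
    await auto_responder.start()
    try:
        # Goes through AgentRunner -> AgentRuntime -> ExecutionStream
        result = await runner.run({"query": "example input"})
    finally:
        await auto_responder.stop()
    # success only means "no exception" - always check the output too
    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    assert output.get("report"), "No report produced"  # illustrative output key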
Step 1d: Check existing tests
Before generating, check if tests already exist:
list_tests(
goal_id="your-goal-id",
agent_path="exports/{agent_name}"
)
Phase 2: Execute
There are two execution paths; use the right one for your situation.
Iterative debugging (for complex agents)
Run the agent via CLI. This creates sessions with checkpoints at ~/.hive/agents/{agent_name}/sessions/:
uv run hive run exports/{agent_name} --input '{"query": "test topic"}'
Sessions and checkpoints are saved automatically.
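If the run fails, the new session can be located immediately with the Phase 3 tools, e.g. the same call documented below:
list_agent_sessions(
    agent_work_dir="~/.hive/agents/{agent_name}",
    status="failed",
    limit=1
)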
Client-facing nodes: Agents with client_facing=True nodes (interactive conversation) work in headless mode when run from a real terminal - the agent streams output to stdout and reads user input from stdin via a >>> prompt. In non-interactive shells (like Claude Code's Bash tool), client-facing nodes will hang because there is no stdin. For testing interactive agents from Claude Code, use run_tests with mock mode or have the user run the agent manually in their terminal.
Automated regression (for CI or final verification)
Use the run_tests MCP tool to run all pytest tests:
run_tests(
goal_id="your-goal-id",
agent_path="exports/{agent_name}"
)
Returns structured results:
{
"overall_passed": false,
"summary": {"total": 12, "passed": 10, "failed": 2, "pass_rate": "83.3%"},
"test_results": [{"test_name": "test_success_source_diversity", "status": "failed"}],
"failures": [{"test_name": "test_success_source_diversity", "details": "..."}]
}
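A small sketch of how these results might drive the loop - collect the failing test names and feed each one into Phase 3 (field names taken from the example above; the handling logic itself is illustrative):
def failing_tests(results: dict) -> list[str]:
    """Return failing test names from a run_tests result, or [] if the suite passed."""
    if results.get("overall_passed"):
        return []
    return [f.get("test_name", "<unknown>") for f in results.get("failures", [])]

# Each returned name then goes to debug_test(goal_id, test_name, agent_path) in Phase 3.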
Options:
# Run only constraint tests
run_tests(goal_id, agent_path, test_types='["constraint"]')
# Stop on first failure
run_tests(goal_id, agent_path, fail_fast=True)
# Parallel execution
run_tests(goal_id, agent_path, parallel=4)
Note: run_tests uses AgentRunner with tmp_path storage, so sessions are isolated per test run. For checkpoint-based recovery with persistent sessions, use CLI execution. Use run_tests for quick regression checks and final verification.
Phase 3: Analyze Failures
When a test fails, drill down systematically. Don't guess - use the tools.
Step 3a: Get error category
debug_test(
goal_id="your-goal-id",
test_name="test_success_source_diversity",
agent_path="exports/{agent_name}"
)
Returns error category (IMPLEMENTATION_ERROR, ASSERTION_FAILURE, TIMEOUT, IMPORT_ERROR, API_ERROR) plus full traceback and suggestions.
Step 3b: Find the failed session
list_agent_sessions(
agent_work_dir="~/.hive/agents/{agent_name}",
status="failed",
limit=5
)
Returns session list with IDs, timestamps, current_node (where it failed), execution_quality.
Step 3c: Inspect session state
get_agent_session_state(
agent_work_dir="~/.hive/agents/{agent_name}",
session_id="session_20260209_143022_abc12345"
)
Returns execution path, which node was current, step count, timestamps - but excludes memory values (to avoid context bloat). Shows memory_keys and memory_size instead.
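A hedged sketch of how this state guides the next step - find where execution stopped and which memory keys are worth inspecting (the exact key names in the returned data are assumptions based on the description above):
state = get_agent_session_state(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345"
)
print(state["current_node"])   # assumed key: node where execution stopped
print(state["memory_keys"])    # assumed key: candidates for get_agent_session_memory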
Step 3d: Examine runtime logs (L2/L3)
# L2: Per-node success/failure, retry counts
query_runtime_log_details(
agent_work_dir="~/.hive/agents/{agent_name}",
run_id="session_20260209_143022_abc12345",
needs_attention_only=True
)
# L3: Exact LLM responses, tool call inputs/outputs
query_runtime_log_raw(
agent_work_dir="~/.hive/agents/{agent_name}",
run_id="session_20260209_143022_abc12345",
node_id="research"
)
Step 3e: Inspect memory data
# See what data a node actually produced
get_agent_session_memory(
agent_work_dir="~/.hive/agents/{agent_name}",
session_id="session_20260209_143022_abc12345",
key="research_results"
)
Step 3f: Find recovery points
list_agent_checkpoints(
agent_work_dir="~/.hive/agents/{agent_name}",
session_id="session_20260209_143022_abc12345",
is_clean="true"
)
Returns checkpoint summaries with IDs, types (node_start, node_complete), which node, and is_clean flag. Clean checkpoints are safe resume points.
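A sketch of how a resume point might be picked from that list: take the latest clean node_complete checkpoint for a node that runs before the one being fixed (the per-checkpoint field names here are assumptions based on the summary fields described above):
def pick_resume_point(checkpoints: list[dict], fixed_node: str, node_order: list[str]) -> str | None:
    """Latest clean checkpoint completed strictly before the node being fixed."""
    earlier_nodes = set(node_order[: node_order.index(fixed_node)])
    candidates = [
        cp for cp in checkpoints  # assumed fields: "type", "node", "checkpoint_id"
        if cp.get("type") == "node_complete" and cp.get("node") in earlier_nodes
    ]
    return candidates[-1]["checkpoint_id"] if candidates else None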
Step 3g: Compare checkpoints (optional)
To understand what changed between two points in execution:
compare_agent_checkpoints(
agent_work_dir="~/.hive/agents/{agent_name}",
session_id="session_20260209_143022_abc12345",
checkpoint_id_before="cp_node_complete_research_143030",
checkpoint_id_after="cp_node_complete_review_143115"
)
Returns memory diff (added/removed/changed keys) and execution path diff.
Phase 4: Fix Based on Root Cause
Use the analysis from Phase 3 to determine what to fix and where.
| Root Cause | What to Fix | Where to Edit |
|---|---|---|
| Prompt issue - LLM produces wrong output format, misses instructions | Node system_prompt | exports/{agent}/nodes/__init__.py |
| Code bug - TypeError, KeyError, logic error in Python | Agent code | exports/{agent}/agent.py, nodes/__init__.py |
| Graph issue - wrong routing, missing edge, bad condition_expr | Edges, node config | exports/{agent}/agent.py |
| Tool issue - MCP tool fails, wrong config, missing credential | Tool config | exports/{agent}/mcp_servers.json, /hive-credentials |
| Goal issue - success criteria too strict/vague, wrong constraints | Goal definition | exports/{agent}/agent.py (goal section) |
| Test issue - test expectations don't match actual agent behavior | Test code | exports/{agent}/tests/test_*.py |
Fix strategies by error category
IMPLEMENTATION_ERROR (TypeError, AttributeError, KeyError):
# Read the failing code
Read(file_path="exports/{agent_name}/nodes/__init__.py")
# Fix the bug
Edit(
file_path="exports/{agent_name}/nodes/__init__.py",
old_string="results.get('videos')",
new_string="(results or {}).get('videos', [])"
)
ASSERTION_FAILURE (test assertions fail but agent ran successfully):
- Check if the agent's output is actually wrong - fix the prompt
- Check if the test's expectations are unrealistic - fix the test
- Use get_agent_session_memory to see what the agent actually produced
TIMEOUT / STALL (agent runs too long):
- Check node_visit_counts for feedback loops hitting max_node_visits
- Check L3 logs for tool calls that hang
- Reduce max_iterations in loop_config or fix the prompt to converge faster
API_ERROR (connection, rate limit, auth):
- Verify credentials with /hive-credentials
- Check MCP server configuration
Phase 5: Recover & Resume
After fixing the agent, decide whether to resume or re-run.
When to resume from checkpoint
Resume when ALL of these are true:
- The fix is to a node that comes AFTER existing clean checkpoints
- Clean checkpoints exist (from a CLI execution with checkpointing)
- The early nodes are expensive (web scraping, API calls, long LLM chains)
# Resume from the last clean checkpoint before the failing node
uv run hive run exports/{agent_name} \
--resume-session session_20260209_143022_abc12345 \
--checkpoint cp_node_complete_research_143030
This skips all nodes before the checkpoint and only re-runs the fixed node onward.
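After the resumed run finishes, the same L2 query from Phase 3 confirms whether the fixed node now completes cleanly (the IDs below are the illustrative ones used above):
query_runtime_log_details(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    needs_attention_only=True
)
# An empty "needs attention" result suggests the fix held; otherwise return to Phase 3.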
When to re-run from scratch
Re-run when ANY of these are true:
- The fix is to the entry node or an early node
- No checkpoints exist (e.g., the agent was run via run_tests)
- The agent is fast (2-3 nodes, completes in seconds)
- You changed the graph structure (added/removed nodes/edges)
uv run hive run exports/{agent_name} --input '{"query": "test topic"}'
Inspecting a checkpoint before resuming
get_agent_checkpoint(
agent_work_dir="~/.hive/agents/{agent_name}",
session_id="session_20260209_143022_abc12345",
checkpoint_id="cp_node_complete_research_143030"
)
Returns the full checkpoint: shared_memory snapshot, execution_path, current_node, next_node, is_clean.
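A brief pre-resume sanity check based on the fields listed above (the key names are assumed to match that list):
cp = get_agent_checkpoint(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id="cp_node_complete_research_143030"
)
assert cp["is_clean"], "Never resume from a dirty checkpoint"
print(cp["next_node"])              # node that will execute first after resuming
print(sorted(cp["shared_memory"]))  # memory keys the resumed run starts with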
Loop back to Phase 2
After resuming or re-running, check if the fix worked. If not, go back to Phase 3.
Phase 6: Final Verification
Once the iterative fix loop converges (the agent produces correct output), run the full automated test suite:
run_tests(
goal_id="your-goal-id",
agent_path="exports/{agent_name}"
)
All tests should pass. If not, repeat the loop for remaining failures.
Credential Requirements
CRITICAL: Testing requires ALL credentials the agent depends on. This includes both the LLM API key AND any tool-specific credentials (HubSpot, Brave Search, etc.).
Prerequisites
Before running agent tests, you MUST collect ALL required credentials from the user.
Step 1: LLM API Key (always required)
export ANTHROPIC_API_KEY="your-key-here"
Step 2: Tool-specific credentials (depends on agent’s tools)
Inspect the agent’s mcp_servers.json and tool configuration to determine which tools the agent uses, then check for all required credentials:
from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS
creds = CredentialManager()
# Determine which tools the agent uses (from agent.json or mcp_servers.json)
agent_tools = [...] # e.g., ["hubspot_search_contacts", "web_search", ...]
# Find all missing credentials for those tools
missing = creds.get_missing_for_tools(agent_tools)
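A hedged continuation showing how the missing credentials could be reported to the user in one message, reusing the specs lookup that appears in the credential-check fixture later in this document:
# Build a single message listing every missing credential - don't ask one at a time
if missing:
    lines = ["Missing credentials for this agent's tools:"]
    for name in missing:
        spec = creds.specs.get(name)
        if spec:
            lines.append(f"  {spec.env_var} - {spec.description}")
    print("\n".join(lines))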
Common tool credentials:
| Tool | Env Var | Help URL |
|---|---|---|
| HubSpot CRM | HUBSPOT_ACCESS_TOKEN | https://developers.hubspot.com/docs/api/private-apps |
| Brave Search | BRAVE_SEARCH_API_KEY | https://brave.com/search/api/ |
| Google Search | GOOGLE_SEARCH_API_KEY + GOOGLE_SEARCH_CX | https://developers.google.com/custom-search |
Why ALL credentials are required:
- Tests need to execute the agent’s LLM nodes to validate behavior
- Tools with missing credentials will return error dicts instead of real data
- Mock mode bypasses everything, providing no confidence in real-world performance
Mock Mode Limitations
Mock mode (--mock flag or MOCK_MODE=1) is ONLY for structure validation:
- Validates graph structure (nodes, edges, connections)
- Validates that AgentRunner.load() succeeds and the agent is importable
- Does NOT execute event_loop agents - MockLLMProvider never calls set_output, so event_loop nodes loop forever
- Does NOT test LLM reasoning, content quality, or constraint validation
- Does NOT test real API integrations or tool use
Bottom line: If you’re testing whether an agent achieves its goal, you MUST use real credentials.
Enforcing Credentials in Tests
When writing tests, ALWAYS include credential checks:
import os
import pytest
from aden_tools.credentials import CredentialManager
pytestmark = pytest.mark.skipif(
not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)
@pytest.fixture(scope="session", autouse=True)
def check_credentials():
"""Ensure ALL required credentials are set for real testing."""
creds = CredentialManager()
mock_mode = os.environ.get("MOCK_MODE")
if not creds.is_available("anthropic"):
if mock_mode:
print("\nRunning in MOCK MODE - structure validation only")
else:
pytest.fail(
"\nANTHROPIC_API_KEY not set!\n"
"Set API key: export ANTHROPIC_API_KEY='your-key-here'\n"
"Or run structure validation: MOCK_MODE=1 pytest exports/{agent}/tests/"
)
if not mock_mode:
agent_tools = [] # Update per agent
missing = creds.get_missing_for_tools(agent_tools)
if missing:
lines = ["\nMissing tool credentials!"]
for name in missing:
spec = creds.specs.get(name)
if spec:
lines.append(f" {spec.env_var} - {spec.description}")
pytest.fail("\n".join(lines))
User Communication
When the user asks to test an agent, ALWAYS check for ALL credentials first:
- Identify the agent's tools from mcp_servers.json
- Check ALL required credentials using CredentialManager
- Ask the user to provide any missing credentials before proceeding
- Collect ALL missing credentials in a single prompt - not one at a time
Safe Test Patterns
OutputCleaner
The framework automatically validates and cleans node outputs using a fast LLM at edge traversal time. Tests should still use safe patterns because OutputCleaner may not catch all issues.
Safe Access (REQUIRED)
# UNSAFE - will crash on missing keys
approval = result.output["approval_decision"]
category = result.output["analysis"]["category"]
# SAFE - use .get() with defaults
output = result.output or {}
approval = output.get("approval_decision", "UNKNOWN")
# SAFE - type check before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
category = analysis.get("category", "unknown")
# SAFE - handle JSON parsing trap (LLM response as string)
import json
recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
try:
parsed = json.loads(recommendation)
if isinstance(parsed, dict):
approval = parsed.get("approval_decision", "UNKNOWN")
except json.JSONDecodeError:
approval = "UNKNOWN"
elif isinstance(recommendation, dict):
approval = recommendation.get("approval_decision", "UNKNOWN")
# SAFE - type check before iteration
items = output.get("items", [])
if isinstance(items, list):
for item in items:
...
Helper Functions for conftest.py
import json
import re
def _parse_json_from_output(result, key):
"""Parse JSON from agent output (framework may store full LLM response as string)."""
response_text = result.output.get(key, "")
json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()
try:
return json.loads(json_text)
except (json.JSONDecodeError, AttributeError, TypeError):
return result.output.get(key)
def safe_get_nested(result, key_path, default=None):
"""Safely get nested value from result.output."""
output = result.output or {}
current = output
for key in key_path:
if isinstance(current, dict):
current = current.get(key)
elif isinstance(current, str):
try:
json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
parsed = json.loads(json_text)
if isinstance(parsed, dict):
current = parsed.get(key)
else:
return default
except json.JSONDecodeError:
return default
else:
return default
return current if current is not None else default
# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested
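Example of how a test might use these helpers once conftest.py registers them (assumes the usual test-file imports; the key path is illustrative):
@pytest.mark.asyncio
async def test_nested_field(runner, auto_responder, mock_mode):
    """Read a possibly JSON-encoded nested value via the conftest helper."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "example"})
    finally:
        await auto_responder.stop()
    # safe_get_nested handles dict values and JSON-in-string values alike
    category = pytest.safe_get_nested(result, ["analysis", "category"], default="unknown")
    assert category != "unknown", "analysis.category missing from output"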
ExecutionResult Fields
result.success=True means NO exception, NOT goal achieved
# WRONG
assert result.success
# RIGHT
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"
All fields:
- success: bool - Completed without exception (NOT goal achieved!)
- output: dict - Complete memory snapshot (may contain raw strings)
- error: str | None - Error message if failed
- steps_executed: int - Number of nodes executed
- total_tokens: int - Cumulative token usage
- total_latency_ms: int - Total execution time
- path: list[str] - Node IDs traversed (may repeat in feedback loops)
- paused_at: str | None - Node ID if paused
- session_state: dict - State for resuming
- node_visit_counts: dict[str, int] - Visit counts per node (feedback loop testing)
- execution_quality: str - "clean", "degraded", or "failed"
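A short example of assertions over these fields (the node name and visit threshold are illustrative):
assert result.success, f"Agent failed: {result.error}"
assert result.execution_quality == "clean", f"Run degraded: {result.execution_quality}"
assert result.path[0] == "intake", f"Unexpected entry node: {result.path}"
# Feedback-loop guard using node_visit_counts
for node_id, count in (result.node_visit_counts or {}).items():
    assert count <= 5, f"{node_id} visited {count} times"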
Test Count Guidance
Write 8-15 tests, not 30+
- 2-3 tests per success criterion
- 1 happy path test
- 1 boundary/edge case test
- 1 error handling test (optional)
Each real test costs ~3 seconds + LLM tokens. 12 tests = ~36 seconds, $0.12.
Test Patterns
Happy Path
@pytest.mark.asyncio
async def test_happy_path(runner, auto_responder, mock_mode):
"""Test normal successful execution."""
await auto_responder.start()
try:
result = await runner.run({"query": "python tutorials"})
finally:
await auto_responder.stop()
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
assert output.get("report"), "No report produced"
Boundary Condition
@pytest.mark.asyncio
async def test_minimum_sources(runner, auto_responder, mock_mode):
"""Test at minimum source threshold."""
await auto_responder.start()
try:
result = await runner.run({"query": "niche topic"})
finally:
await auto_responder.stop()
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
sources = output.get("sources", [])
if isinstance(sources, list):
assert len(sources) >= 3, f"Expected >= 3 sources, got {len(sources)}"
Error Handling
@pytest.mark.asyncio
async def test_empty_input(runner, auto_responder, mock_mode):
"""Test graceful handling of empty input."""
await auto_responder.start()
try:
result = await runner.run({"query": ""})
finally:
await auto_responder.stop()
# Agent should either fail gracefully or produce an error message
output = result.output or {}
assert not result.success or output.get("error"), "Should handle empty input"
Feedback Loop
@pytest.mark.asyncio
async def test_feedback_loop_terminates(runner, auto_responder, mock_mode):
    """Test that feedback loops don't run forever."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "test"})
    finally:
        await auto_responder.stop()
    visits = result.node_visit_counts or {}
    for node_id, count in visits.items():
        assert count <= 5, f"Node {node_id} visited {count} times - possible infinite loop"
MCP Tool Reference
Phase 1: Test Generation
# Check existing tests
list_tests(goal_id, agent_path)
# Get constraint test guidelines (returns templates, NOT generated tests)
generate_constraint_tests(goal_id, goal_json, agent_path)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines
# Get success criteria test guidelines
generate_success_tests(goal_id, goal_json, node_names, tool_names, agent_path)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines
Phase 2: Execution
# Automated regression (no checkpoints, fresh runs)
run_tests(goal_id, agent_path, test_types='["all"]', parallel=-1, fail_fast=False)
# Run only specific test types
run_tests(goal_id, agent_path, test_types='["constraint"]')
run_tests(goal_id, agent_path, test_types='["success"]')
# Iterative debugging with checkpoints (via CLI)
uv run hive run exports/{agent_name} --input '{"query": "test"}'
Phase 3: Analysis
# Debug a specific failed test
debug_test(goal_id, test_name, agent_path)
# Find failed sessions
list_agent_sessions(agent_work_dir, status="failed", limit=5)
# Inspect session state (excludes memory values)
get_agent_session_state(agent_work_dir, session_id)
# Inspect memory data
get_agent_session_memory(agent_work_dir, session_id, key="research_results")
# Runtime logs: L1 summaries
query_runtime_logs(agent_work_dir, status="needs_attention")
# Runtime logs: L2 per-node details
query_runtime_log_details(agent_work_dir, run_id, needs_attention_only=True)
# Runtime logs: L3 tool/LLM raw data
query_runtime_log_raw(agent_work_dir, run_id, node_id="research")
# Find clean checkpoints
list_agent_checkpoints(agent_work_dir, session_id, is_clean="true")
# Compare checkpoints (memory diff)
compare_agent_checkpoints(agent_work_dir, session_id, cp_before, cp_after)
Phase 5: Recovery
# Inspect checkpoint before resuming
get_agent_checkpoint(agent_work_dir, session_id, checkpoint_id)
# Empty checkpoint_id = latest checkpoint
# Resume from checkpoint via CLI (headless)
uv run hive run exports/{agent_name} \
--resume-session {session_id} --checkpoint {checkpoint_id}
Anti-Patterns
| Don’t | Do Instead |
|---|---|
| Use default_agent.run() in tests | Use runner.run() with auto_responder fixtures (goes through AgentRuntime) |
| Re-run entire agent when a late node fails | Resume from last clean checkpoint |
| Treat result.success as goal achieved | Check result.output for actual criteria |
| Access result.output["key"] directly | Use result.output.get("key") |
| Fix random things hoping tests pass | Analyze L2/L3 logs to find root cause first |
| Write 30+ tests | Write 8-15 focused tests |
| Skip credential check | Use /hive-credentials before testing |
| Confuse exports/ with ~/.hive/agents/ | Code in exports/, runtime data in ~/.hive/ |
| Use run_tests for iterative debugging | Use headless CLI with checkpoints for iterative debugging |
| Use headless CLI for final regression | Use run_tests for automated regression |
| Use --tui from Claude Code | Use the headless run command - TUI hangs in non-interactive shells |
| Test client-facing nodes from Claude Code | Use mock mode, or have the user run the agent in their terminal |
| Run tests without reading goal first | Always understand the goal before writing tests |
| Skip Phase 3 analysis and guess | Use session + log tools to identify root cause |
Example Walkthrough: Deep Research Agent
A complete iteration showing the test loop for an agent with nodes: intake → research → review → report.
Phase 1: Generate tests
# Read the goal
Read(file_path="exports/deep_research_agent/agent.py")
# Get success criteria test guidelines
result = generate_success_tests(
goal_id="rigorous-interactive-research",
goal_json='{"id": "rigorous-interactive-research", "success_criteria": [{"id": "source-diversity", "target": ">=5"}, {"id": "citation-coverage", "target": "100%"}, {"id": "report-completeness", "target": "90%"}]}',
node_names="intake,research,review,report",
tool_names="web_search,web_scrape",
agent_path="exports/deep_research_agent"
)
# Write tests
Write(
file_path=result["output_file"],
content=result["file_header"] + "\n\n" + test_code
)
Phase 2: First execution
run_tests(
goal_id="rigorous-interactive-research",
agent_path="exports/deep_research_agent",
fail_fast=True
)
Result: test_success_source_diversity fails - the agent only found 2 sources instead of 5.
Phase 3: Analyze
# Debug the failing test
debug_test(
goal_id="rigorous-interactive-research",
test_name="test_success_source_diversity",
agent_path="exports/deep_research_agent"
)
# → ASSERTION_FAILURE: Expected >= 5 sources, got 2
# Find the session
list_agent_sessions(
agent_work_dir="~/.hive/agents/deep_research_agent",
status="completed",
limit=1
)
# → session_20260209_150000_abc12345
# See what the research node produced
get_agent_session_memory(
agent_work_dir="~/.hive/agents/deep_research_agent",
session_id="session_20260209_150000_abc12345",
key="research_results"
)
# → Only 2 web_search calls made, each returned 1 source
# Check the LLM's behavior in the research node
query_runtime_log_raw(
agent_work_dir="~/.hive/agents/deep_research_agent",
run_id="session_20260209_150000_abc12345",
node_id="research"
)
# → LLM called web_search only twice, then called set_output
Root cause: The research node’s prompt doesn’t tell the LLM to search for at least 5 diverse sources. It stops after the first couple of searches.
Phase 4: Fix the prompt
Read(file_path="exports/deep_research_agent/nodes/__init__.py")
Edit(
file_path="exports/deep_research_agent/nodes/__init__.py",
old_string='system_prompt="Search for information on the user\'s topic."',
new_string='system_prompt="Search for information on the user\'s topic. You MUST find at least 5 diverse, authoritative sources. Use multiple different search queries to ensure source diversity. Do not stop searching until you have at least 5 distinct sources."'
)
Phase 5: Resume from checkpoint
For this example, the fix is to the research node. If we had run via CLI with checkpointing, we could resume from the checkpoint after intake to skip re-running intake:
# Check if clean checkpoint exists after intake
list_agent_checkpoints(
agent_work_dir="~/.hive/agents/deep_research_agent",
session_id="session_20260209_150000_abc12345",
is_clean="true"
)
# → cp_node_complete_intake_150005
# Resume from after intake, re-run research with fixed prompt
uv run hive run exports/deep_research_agent \
--resume-session session_20260209_150000_abc12345 \
--checkpoint cp_node_complete_intake_150005
Or for this simple case (intake is fast), just re-run:
uv run hive run exports/deep_research_agent --input '{"topic": "test"}'
Phase 6: Final verification
run_tests(
goal_id="rigorous-interactive-research",
agent_path="exports/deep_research_agent"
)
# → All 12 tests pass
Test File Structure
exports/{agent_name}/
├── agent.py              ← Agent to test (goal, nodes, edges)
├── nodes/__init__.py     ← Node implementations (prompts, config)
├── config.py             ← Agent configuration
├── mcp_servers.json      ← Tool server config
└── tests/
    ├── conftest.py               ← Shared fixtures + safe access helpers
    ├── test_constraints.py       ← Constraint tests
    ├── test_success_criteria.py  ← Success criteria tests
    └── test_edge_cases.py        ← Edge case tests
Integration with Other Skills
| Scenario | From | To | Action |
|---|---|---|---|
| Agent built, ready to test | /hive-create | /hive-test | Generate tests, start loop |
| Prompt fix needed | /hive-test Phase 4 | Direct edit | Edit nodes/__init__.py, resume |
| Goal definition wrong | /hive-test Phase 4 | /hive-create | Update goal, may need rebuild |
| Missing credentials | /hive-test Phase 3 | /hive-credentials | Set up credentials |
| Complex runtime failure | /hive-test Phase 3 | /hive-debugger | Deep L1/L2/L3 analysis |
| All tests pass | /hive-test Phase 6 | Done | Agent validated |