qa-agent-testing
27
总安装量
27
周安装量
#7441
全站排名
安装命令
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill qa-agent-testing
Agent 安装分布
claude-code
17
antigravity
14
opencode
14
gemini-cli
13
codex
12
Skill 文档
QA Agent Testing (Jan 2026)
Design and run reliable evaluation suites for LLM agents/personas, including tool-using and multi-agent systems.
Default QA Workflow
- Define the Persona Under Test (PUT): scope, out-of-scope, and safety boundaries.
- Define 10 representative tasks (Must Ace).
- Define 5 refusal edge cases (Must Decline + redirect).
- Define an output contract (format, tone, structure, citations).
- Run the suite with determinism controls and tool tracing.
- Score with the 6-dimension rubric; track variance across reruns.
- Log baselines and regressions; gate merges/deploys on thresholds.
Use the copy-paste templates in assets/ for day-0 setup.
Determinism and Flake Control
- Control inputs: pin prompts/config, fixtures, stable tool responses, frozen time/timezone where possible.
- Control sampling: fixed seeds/temperatures where supported; log model/config versions.
- Record tool traces: tool name, args, outputs, latency, errors, retries, and side effects.
Two-Layer Evaluation (2026)
Evaluate reasoning and action layers separately:
| Layer | What to Test | Key Metrics |
|---|---|---|
| Reasoning | Planning, decision-making, intent | Intent resolution, task adhesion, context retention |
| Action | Tool calls, execution, side effects | Tool call accuracy, completion rate, error recovery |
Evaluation Dimensions (Score What Matters)
| Dimension | What to Measure | Level |
|---|---|---|
| Task success | Correct outcome and constraints met | Agent |
| Safety/policy | Correct refusals and safe alternatives | Agent |
| Reliability | Stability across reruns and small prompt changes | Agent |
| Latency/cost | Budgets per task and per suite | Business |
| Debuggability | Failures produce evidence (logs, traces) | Agent |
| Factual grounding | Hallucination rate, citation accuracy | Model |
| Bias detection | Fairness across demographic inputs | Model |
CI Economics
- PR gate: small, high-signal smoke eval suite.
- Scheduled: full scenario suites, adversarial inputs, and cost/latency regression checks (track separately from quality scoring).
Robustness and Security Tests (Recommended)
- Metamorphic tests: run small, meaning-preserving prompt/input rewrites; enforce invariants on outputs.
- Prompt injection tests: treat tool outputs, retrieved text, and user-provided documents as untrusted; verify the agent does not follow embedded instructions that conflict with system/developer constraints.
- Tool fault injection: simulate timeouts, retries, partial data, and tool errors; verify graceful recovery.
- Differential testing: compare behavior across model/config versions for regressions and unexpected shifts.
Do / Avoid
Do:
- Use objective oracles (schema validation, golden traces, deterministic tool mocks) in addition to human review.
- Quarantine flaky evals with owners and expiry, just like flaky tests in CI.
Avoid:
- Evaluating only “happy prompts” with no tool failures and no adversarial inputs.
- Letting self-evaluations substitute for ground-truth checks.
Quick Reference
| Need | Use | Location |
|---|---|---|
| Build the 10 tasks | Task patterns + examples | references/test-case-design.md |
| Design refusals | Refusal categories + templates | references/refusal-patterns.md |
| Score runs | Detailed rubric + thresholds | references/scoring-rubric.md |
| Compute suite math quickly | CLI utility script | scripts/score_suite.py |
| Manage regressions | Re-run workflow + baseline policy | references/regression-protocol.md |
| Sandbox tools | Isolation tiers + hardening | references/tool-sandboxing.md |
| Test multi-agent systems | Coordination patterns + suite template | references/multi-agent-testing.md |
| Use LLM-as-judge safely | Biases + mitigations | references/llm-judge-limitations.md |
| Start from templates | Harness + scoring sheet + log | assets/ |
Decision Tree
Testing an agent?
- New agent?
- Create QA harness -> Define 10 tasks + 5 refusals -> Run baseline
- Prompt changed?
- Re-run full 15-check suite -> Compare to baseline
- Tool/knowledge changed?
- Re-run affected tests -> Log in regression log
- Quality review?
- Score against rubric -> Identify weak areas -> Fix prompt
Scoring and Gates
- Score each run with the 6-dimension rubric (0-3 each; max 18 per task).
- Prefer suite-level gating that accounts for variance; avoid treating non-determinism as a free pass.
- Use
scripts/score_suite.pyto compute averages, normalized scores, and basic PASS/CONDITIONAL/FAIL classification. - For detailed methodology (including judge calibration and variance metrics), see
references/scoring-rubric.md.
Navigation
Resources
references/test-case-design.md– 10-task patterns + validation + metamorphic add-onsreferences/refusal-patterns.md– refusal categories + response templates + test tacticsreferences/scoring-rubric.md– scoring guide, thresholds, variance metrics, judge calibrationreferences/regression-protocol.md– re-run scope, baseline policy, recovery proceduresreferences/tool-sandboxing.md– sandbox tiers, tool hardening, injection/exfil test ideasreferences/multi-agent-testing.md– coordination testing patterns + suite templatereferences/llm-judge-limitations.md– LLM-as-judge biases, limits, mitigations
Templates
assets/qa-harness-template.md– copy-paste harnessassets/scoring-sheet.md– scoring trackerassets/regression-log.md– version tracking
External Resources
See data/sources.json for:
- LLM evaluation research
- Red-teaming methodologies
- Prompt testing frameworks
Related Skills
- qa-testing-strategy: ../qa-testing-strategy/SKILL.md – General testing strategies
- ai-prompt-engineering: ../ai-prompt-engineering/SKILL.md – Prompt design patterns
Quick Start
- Copy assets/qa-harness-template.md
- Fill in PUT (Persona Under Test) section
- Define 10 representative tasks for your agent
- Add 5 refusal edge cases
- Specify output contracts
- Run baseline test
- Log results in regression log
Success Criteria: Each of the 10 tasks scores >= 12/18 and each refusal scores >= 2/3 (or PASS by your policy oracle), with stable results across reruns and no new hard failures.