qa-agent-testing

📁 vasilyu1983/ai-agents-public 📅 Jan 23, 2026

总安装量

周安装量

#9272

全站排名

安装命令

npx skills add https://github.com/vasilyu1983/ai-agents-public --skill qa-agent-testing

Agent 安装分布

claude-code 17

antigravity 14

opencode 14

gemini-cli 13

codex 12

Skill 文档

QA Agent Testing (Jan 2026)

Design and run reliable evaluation suites for LLM agents/personas, including tool-using and multi-agent systems.

Default QA Workflow

Define the Persona Under Test (PUT): scope, out-of-scope, and safety boundaries.
Define 10 representative tasks (Must Ace).
Define 5 refusal edge cases (Must Decline + redirect).
Define an output contract (format, tone, structure, citations).
Run the suite with determinism controls and tool tracing.
Score with the 6-dimension rubric; track variance across reruns.
Log baselines and regressions; gate merges/deploys on thresholds.

Use the copy-paste templates in assets/ for day-0 setup.

Determinism and Flake Control

Control inputs: pin prompts/config, fixtures, stable tool responses, frozen time/timezone where possible.
Control sampling: fixed seeds/temperatures where supported; log model/config versions.
Record tool traces: tool name, args, outputs, latency, errors, retries, and side effects.

Two-Layer Evaluation (2026)

Evaluate reasoning and action layers separately:

Layer	What to Test	Key Metrics
Reasoning	Planning, decision-making, intent	Intent resolution, task adhesion, context retention
Action	Tool calls, execution, side effects	Tool call accuracy, completion rate, error recovery

Evaluation Dimensions (Score What Matters)

Dimension	What to Measure	Level
Task success	Correct outcome and constraints met	Agent
Safety/policy	Correct refusals and safe alternatives	Agent
Reliability	Stability across reruns and small prompt changes	Agent
Latency/cost	Budgets per task and per suite	Business
Debuggability	Failures produce evidence (logs, traces)	Agent
Factual grounding	Hallucination rate, citation accuracy	Model
Bias detection	Fairness across demographic inputs	Model

CI Economics

PR gate: small, high-signal smoke eval suite.
Scheduled: full scenario suites, adversarial inputs, and cost/latency regression checks (track separately from quality scoring).

Robustness and Security Tests (Recommended)

Metamorphic tests: run small, meaning-preserving prompt/input rewrites; enforce invariants on outputs.
Prompt injection tests: treat tool outputs, retrieved text, and user-provided documents as untrusted; verify the agent does not follow embedded instructions that conflict with system/developer constraints.
Tool fault injection: simulate timeouts, retries, partial data, and tool errors; verify graceful recovery.
Differential testing: compare behavior across model/config versions for regressions and unexpected shifts.

Do / Avoid

Do:

Use objective oracles (schema validation, golden traces, deterministic tool mocks) in addition to human review.
Quarantine flaky evals with owners and expiry, just like flaky tests in CI.

Avoid:

Evaluating only “happy prompts” with no tool failures and no adversarial inputs.
Letting self-evaluations substitute for ground-truth checks.

Quick Reference

Need	Use	Location
Build the 10 tasks	Task patterns + examples	`references/test-case-design.md`
Design refusals	Refusal categories + templates	`references/refusal-patterns.md`
Score runs	Detailed rubric + thresholds	`references/scoring-rubric.md`
Compute suite math quickly	CLI utility script	`scripts/score_suite.py`
Manage regressions	Re-run workflow + baseline policy	`references/regression-protocol.md`
Sandbox tools	Isolation tiers + hardening	`references/tool-sandboxing.md`
Test multi-agent systems	Coordination patterns + suite template	`references/multi-agent-testing.md`
Use LLM-as-judge safely	Biases + mitigations	`references/llm-judge-limitations.md`
Start from templates	Harness + scoring sheet + log	`assets/`

Decision Tree

Testing an agent?
  - New agent?
    - Create QA harness -> Define 10 tasks + 5 refusals -> Run baseline
  - Prompt changed?
    - Re-run full 15-check suite -> Compare to baseline
  - Tool/knowledge changed?
    - Re-run affected tests -> Log in regression log
  - Quality review?
    - Score against rubric -> Identify weak areas -> Fix prompt

Scoring and Gates

Score each run with the 6-dimension rubric (0-3 each; max 18 per task).
Prefer suite-level gating that accounts for variance; avoid treating non-determinism as a free pass.
Use scripts/score_suite.py to compute averages, normalized scores, and basic PASS/CONDITIONAL/FAIL classification.
For detailed methodology (including judge calibration and variance metrics), see references/scoring-rubric.md.

Navigation

Resources

references/test-case-design.md – 10-task patterns + validation + metamorphic add-ons
references/refusal-patterns.md – refusal categories + response templates + test tactics
references/scoring-rubric.md – scoring guide, thresholds, variance metrics, judge calibration
references/regression-protocol.md – re-run scope, baseline policy, recovery procedures
references/tool-sandboxing.md – sandbox tiers, tool hardening, injection/exfil test ideas
references/multi-agent-testing.md – coordination testing patterns + suite template
references/llm-judge-limitations.md – LLM-as-judge biases, limits, mitigations

Templates

assets/qa-harness-template.md – copy-paste harness
assets/scoring-sheet.md – scoring tracker
assets/regression-log.md – version tracking

External Resources

See data/sources.json for:

LLM evaluation research
Red-teaming methodologies
Prompt testing frameworks

Related Skills

qa-testing-strategy: ../qa-testing-strategy/SKILL.md – General testing strategies
ai-prompt-engineering: ../ai-prompt-engineering/SKILL.md – Prompt design patterns

Quick Start

Copy assets/qa-harness-template.md
Fill in PUT (Persona Under Test) section
Define 10 representative tasks for your agent
Add 5 refusal edge cases
Specify output contracts
Run baseline test
Log results in regression log

Success Criteria: Each of the 10 tasks scores >= 12/18 and each refusal scores >= 2/3 (or PASS by your policy oracle), with stable results across reruns and no new hard failures.

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台