ai-evals
npx skills add https://github.com/liqiongyu/lenny_skills_plus --skill ai-evals
AI Evals
Scope
Covers
- Designing evaluations ("evals") for LLM/AI features as an execution contract: what "good" means and how it's measured
- Converting failures into a golden test set + error taxonomy + rubric
- Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook
- Producing decision-ready results and an iteration loop (every bug becomes a new test)
When to use
- "Design evals for this LLM feature so we can ship with confidence."
- "Create a rubric + golden set + benchmark for our AI assistant/copilot."
- "We're seeing flaky quality; do error analysis and turn it into a repeatable eval."
- "Compare prompts/models safely with a clear acceptance threshold."
When NOT to use
- You need to decide what to build (use problem-definition, building-with-llms, or ai-product-strategy).
- You're primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).
- You want model training research or infra design (this skill assumes API/model usage; delegate to ML/infra).
- You only want vendor/model selection with no defined task + data (use evaluating-new-technology first, then come back with a concrete use case).
Inputs
Minimum required
- System under test (SUT): what the AI does, for whom, in what workflow (inputs → outputs)
- The decision the eval must support (ship/no-ship, compare options, regression gate)
- What "good" means: 3–10 target behaviors + top failure modes
- Constraints: privacy/compliance, safety policy, languages, cost/latency budgets, timeline
Missing-info strategy
- Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
- If details remain missing, proceed with explicit assumptions and provide 2–3 viable options (judge type, scoring scheme, dataset size).
- If asked to run code or generate datasets from sensitive sources, request confirmation and apply least privilege (no secrets; redact/anonymize).
Outputs (deliverables)
Produce an AI Evals Pack (in chat; or as files if requested), in this order:
- Eval PRD (evaluation requirements): decision, scope, target behaviors, success metrics, acceptance thresholds
- Test set spec + initial golden set: schema, coverage plan, and a starter set of cases (tagged by scenario/risk)
- Error taxonomy (from error analysis + open coding): failure modes, severity, examples
- Rubric + judging guide: dimensions, scoring scale, definitions, examples, tie-breakers
- Judge + harness plan: human vs LLM-as-judge vs automated checks, prompts/instructions, calibration, runbook, cost/time estimate
- Reporting + iteration loop: baseline results format, regression policy, how new bugs become new tests
- Risks / Open questions / Next steps (always included)
Templates: references/TEMPLATES.md
Workflow (7 steps)
1) Define the decision and write the Eval PRD
- Inputs: SUT description, stakeholders, decision to support.
- Actions: Define the decision (ship/no-ship, compare A vs B), scope/non-goals, target behaviors, acceptance thresholds, and what must never happen.
- Outputs: Draft Eval PRD (template in references/TEMPLATES.md; see the sketch after this step).
- Checks: A stakeholder can restate what is being measured, why, and what "pass" means.
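A minimal sketch of how the PRD's measurable parts could be captured alongside the prose template; the field names and threshold values here are illustrative assumptions, not the references/TEMPLATES.md format:

```python
# Illustrative only: field names and thresholds are assumptions, not the
# references/TEMPLATES.md structure.
from dataclasses import dataclass, field

@dataclass
class EvalPRD:
    decision: str                                                # e.g. ship/no-ship, compare A vs B
    target_behaviors: list[str] = field(default_factory=list)   # aim for 3-10
    never_events: list[str] = field(default_factory=list)       # what must never happen
    acceptance_thresholds: dict[str, float] = field(default_factory=dict)
    non_goals: list[str] = field(default_factory=list)

prd = EvalPRD(
    decision="ship/no-ship: enable AI reply drafts for tier-1 support",
    target_behaviors=["cites a KB article", "matches requested tone", "refuses unsafe asks"],
    never_events=["leaks PII", "invents refund amounts"],
    acceptance_thresholds={"overall_pass_rate": 0.90, "safety_pass_rate": 1.00},
)
```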
2) Draft the golden set structure + coverage plan
- Inputs: User workflows, edge cases, safety risks, data availability.
- Actions: Specify the test case schema, tagging, and coverage targets (happy paths, tricky paths, adversarial/safety, long-tail). Create an initial starter set (small but high-signal).
- Outputs: Test set spec + initial golden set (see the schema sketch after this step).
- Checks: Every target behavior has at least 2 test cases; high-severity risks are explicitly represented.
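One way the per-case schema could look, stored as JSONL so tags and expected behavior travel with each case; the field names below are assumptions to adapt to your spec:

```python
# Illustrative golden-set record, one JSON object per line (JSONL).
# Field names are assumptions; align them with the test set spec.
import json

case = {
    "id": "support-0042",
    "tags": ["tricky-path", "refund", "high-severity"],   # scenario/risk tags for coverage
    "input": {"customer_message": "I was double charged, fix it or I cancel."},
    "target_behaviors": ["acknowledges the issue", "cites the refund-policy KB article"],
    "must_not": ["promise a specific refund amount"],
    "reference_output": None,          # optional gold answer when one exists
    "source": "prod-log-2024-05-14",   # provenance, useful when cases need review
}

with open("golden_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(case, ensure_ascii=False) + "\n")
```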
3) Run error analysis and open coding to build a taxonomy
- Inputs: Known failures, logs, stakeholder anecdotes, initial golden set.
- Actions: Review failures, label them with open coding, consolidate into a taxonomy, and assign severity/impact. Identify likely root causes (prompting, missing context, tool misuse, formatting, policy).
- Outputs: Error taxonomy + "top failure modes" list (see the sketch after this step).
- Checks: PM and eng read the taxonomy the same way; each category has 1–2 concrete examples.
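A sketch of how open-coded labels might consolidate into taxonomy entries with severity, root cause, and examples; the categories and cases shown are assumptions for illustration:

```python
# Illustrative taxonomy entries consolidated from open coding.
# Failure modes, severities, and example cases are assumptions.
taxonomy = [
    {
        "failure_mode": "unsupported_claim",
        "definition": "Reply states a fact not present in the provided KB context.",
        "severity": "high",
        "likely_root_cause": "missing context / prompting",
        "examples": ["case support-0042: invented a 48-hour refund guarantee"],
    },
    {
        "failure_mode": "format_violation",
        "definition": "Output misses the required structure (e.g. no citation block).",
        "severity": "medium",
        "likely_root_cause": "formatting instructions",
        "examples": ["case support-0108: reply sent without the KB citation footer"],
    },
]

# "Top failure modes" = the high-severity (or most frequent) entries, surfaced first.
top_failure_modes = [t for t in taxonomy if t["severity"] == "high"]
```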
4) Convert taxonomy â rubric + scoring rules
- Inputs: Taxonomy, target behaviors, output formats.
- Actions: Define scoring dimensions and scales; write clear judge instructions and tie-breakers; add examples and disallowed behaviors. Decide absolute scoring vs pairwise comparisons.
- Outputs: Rubric + judging guide (see the scoring sketch after this step).
- Checks: Two independent judges would likely score the same case similarly (instructions are specific, not vibes).
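One way to keep the rubric specific enough that two judges converge: per-dimension anchors plus an explicit tie-breaker, kept next to the judge instructions. Dimension names and anchor wording below are assumptions:

```python
# Illustrative rubric: anchored 1-5 scales and an explicit tie-breaker so that
# independent judges score the same case similarly. Names and anchors are assumptions.
rubric = {
    "dimensions": {
        "faithfulness": {
            "scale": (1, 5),
            "anchors": {
                1: "Contradicts or invents facts beyond the provided context.",
                3: "Mostly grounded; one minor unsupported detail.",
                5: "Every claim traceable to the context or a cited KB article.",
            },
        },
        "format_compliance": {
            "scale": (1, 5),
            "anchors": {1: "Required structure missing.", 5: "Exact required structure."},
        },
    },
    "tie_breaker": "If two outputs tie overall, prefer the one with higher faithfulness.",
    "disallowed": ["PII in the reply", "fabricated policy numbers"],
}
```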
5) Choose the judging approach + harness/runbook
- Inputs: Constraints (time/cost), required reliability, privacy/safety constraints.
- Actions: Pick judge type(s): human, LLM-as-judge, automated checks. Define calibration (gold examples, inter-rater checks), sampling, and how results are stored. Write a runbook with estimated runtime/cost.
- Outputs: Judge + harness plan (see the harness sketch after this step).
- Checks: The plan is repeatable (versioned prompts/models, deterministic settings where possible, clear data handling).
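If LLM-as-judge is chosen, the harness loop might look like the sketch below. `call_judge_model` is a placeholder for whatever model client the team actually uses, and the versioning/pinning shown is one assumption about how to keep runs repeatable:

```python
# Illustrative harness step: versioned judge prompt, pinned model id, and raw
# verdicts kept per case so runs are repeatable and auditable.
# `call_judge_model` is a placeholder, not a real library call.
import json

JUDGE_PROMPT_VERSION = "judge-prompt-v3"
JUDGE_MODEL = "pinned-model-id"   # assumption: pin the exact model and settings

def call_judge_model(prompt: str) -> str:
    """Placeholder: swap in the real client; keep temperature at 0 where supported."""
    raise NotImplementedError

def judge_case(case: dict, sut_output: str) -> dict:
    prompt = (
        f"[{JUDGE_PROMPT_VERSION}] Score the reply against the rubric.\n"
        f"Input: {case['input']}\nReply: {sut_output}\n"
        'Return JSON: {"faithfulness": 1-5, "format_compliance": 1-5, "notes": "..."}'
    )
    verdict = json.loads(call_judge_model(prompt))
    return {"case_id": case["id"], "judge_model": JUDGE_MODEL,
            "prompt_version": JUDGE_PROMPT_VERSION, **verdict}
```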
6) Define reporting, thresholds, and the iteration loop
- Inputs: Stakeholder needs, release cadence.
- Actions: Specify report format (overall + per-tag metrics), regression rules, and what changes require re-running evals. Define the iteration loop: every discovered failure becomes a new test + taxonomy update.
- Outputs: Reporting + iteration loop (see the reporting sketch after this step).
- Checks: A reader can make a decision from the report without additional meetings; regressions are detectable.
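A sketch of the report shape described above: overall plus per-tag pass rates, with a simple regression rule against a stored baseline. The tolerance value and field names are assumptions:

```python
# Illustrative reporting: overall and per-tag pass rates plus a regression check
# against the previous baseline. Thresholds and field names are assumptions.
from collections import defaultdict

def summarize(results: list[dict]) -> dict:
    """Each result is expected to carry `passed` (bool) and `tags` (list of str)."""
    per_tag = defaultdict(lambda: [0, 0])          # tag -> [passed, total]
    passed_total = 0
    for r in results:
        passed_total += r["passed"]
        for tag in r["tags"]:
            per_tag[tag][0] += r["passed"]
            per_tag[tag][1] += 1
    return {
        "overall_pass_rate": passed_total / len(results),
        "per_tag": {t: p / n for t, (p, n) in per_tag.items()},
    }

def regressed(current: dict, baseline: dict, tolerance: float = 0.02) -> bool:
    """Flag a regression if overall or any per-tag rate drops by more than `tolerance`."""
    if current["overall_pass_rate"] < baseline["overall_pass_rate"] - tolerance:
        return True
    return any(current["per_tag"].get(t, 0.0) < v - tolerance
               for t, v in baseline["per_tag"].items())
```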
7) Quality gate + finalize
- Inputs: Full draft pack.
- Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Fix missing coverage, vague rubric language, or non-repeatable harness steps. Always include Risks / Open questions / Next steps.
- Outputs: Final AI Evals Pack.
- Checks: The eval definition functions as a product requirement: clear, testable, and actionable.
Quality gate (required)
- Use references/CHECKLISTS.md and references/RUBRIC.md.
- Always include: Risks, Open questions, Next steps.
Examples
Example 1 (answer quality + safety): "Use ai-evals to design evals for a customer-support reply drafting assistant. Constraints: no PII leakage, must cite KB articles, and must refuse unsafe requests. Output: AI Evals Pack."
Example 2 (structured extraction): "Use ai-evals to create a rubric + golden set for an LLM that extracts invoice fields to JSON. Constraints: must always return valid JSON; prioritize recall for amount and due_date. Output: AI Evals Pack."
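For a case like Example 2, some checks can run as plain automated code before any judge is involved; a minimal sketch, assuming the extraction schema uses the field names from the example:

```python
# Illustrative automated check for Example 2: the output must parse as JSON and
# contain the high-priority fields. Field names follow the example; everything
# else about the schema is an assumption.
import json

REQUIRED_FIELDS = ["amount", "due_date"]

def check_invoice_extraction(raw_output: str) -> dict:
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"valid_json": False, "missing_fields": list(REQUIRED_FIELDS)}
    missing = [f for f in REQUIRED_FIELDS if f not in data or data[f] in (None, "")]
    return {"valid_json": True, "missing_fields": missing}
```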
Boundary example: "We don't know what the AI feature should do yet; just 'add AI' and pick a model."
Response: out of scope; first define the job/spec and success metrics (use problem-definition or building-with-llms), then return to ai-evals with a concrete SUT.