ai-evals
npx skills add https://github.com/liqiongyu/lenny_skills_plus --skill ai-evals
AI Evals
Scope
Covers
- Designing evaluations ("evals") for LLM/AI features as an execution contract: what "good" means and how it's measured
- Converting failures into a golden test set + error taxonomy + rubric
- Choosing a judging approach (human, LLM-as-judge, automated checks) and a repeatable harness/runbook
- Producing decision-ready results and an iteration loop (every bug becomes a new test)
When to use
- "Design evals for this LLM feature so we can ship with confidence."
- "Create a rubric + golden set + benchmark for our AI assistant/copilot."
- "We're seeing flaky quality; do error analysis and turn it into a repeatable eval."
- "Compare prompts/models safely with a clear acceptance threshold."
When NOT to use
- You need to decide what to build (use problem-definition, building-with-llms, or ai-product-strategy).
- You're primarily doing traditional non-LLM software testing (use your standard eng QA/unit/integration tests).
- You want model training research or infra design (this skill assumes API/model usage; delegate to ML/infra).
- You only want vendor/model selection with no defined task + data (use evaluating-new-technology first, then come back with a concrete use case).
Inputs
Minimum required
- System under test (SUT): what the AI does, for whom, in what workflow (inputs → outputs)
- The decision the eval must support (ship/no-ship, compare options, regression gate)
- What "good" means: 3–10 target behaviors + top failure modes
- Constraints: privacy/compliance, safety policy, languages, cost/latency budgets, timeline
Missing-info strategy
- Ask up to 5 questions from references/INTAKE.md (3–5 at a time).
- If details remain missing, proceed with explicit assumptions and provide 2–3 viable options (judge type, scoring scheme, dataset size).
- If asked to run code or generate datasets from sensitive sources, request confirmation and apply least privilege (no secrets; redact/anonymize).
Outputs (deliverables)
Produce an AI Evals Pack (in chat; or as files if requested), in this order:
- Eval PRD (evaluation requirements): decision, scope, target behaviors, success metrics, acceptance thresholds
- Test set spec + initial golden set: schema, coverage plan, and a starter set of cases (tagged by scenario/risk)
- Error taxonomy (from error analysis + open coding): failure modes, severity, examples
- Rubric + judging guide: dimensions, scoring scale, definitions, examples, tie-breakers
- Judge + harness plan: human vs LLM-as-judge vs automated checks, prompts/instructions, calibration, runbook, cost/time estimate
- Reporting + iteration loop: baseline results format, regression policy, how new bugs become new tests
- Risks / Open questions / Next steps (always included)
Templates: references/TEMPLATES.md
Workflow (7 steps)
1) Define the decision and write the Eval PRD
- Inputs: SUT description, stakeholders, decision to support.
- Actions: Define the decision (ship/no-ship, compare A vs B), scope/non-goals, target behaviors, acceptance thresholds, and what must never happen.
- Outputs: Draft Eval PRD (template in references/TEMPLATES.md; see the sketch after this step).
- Checks: A stakeholder can restate what is being measured, why, and what "pass" means.
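A minimal sketch of how the PRD's measurable parts could be captured alongside the prose template; the field names and threshold values here are illustrative assumptions, not the references/TEMPLATES.md format:

```python
# Illustrative only: field names and thresholds are assumptions, not the
# references/TEMPLATES.md structure.
from dataclasses import dataclass, field

@dataclass
class EvalPRD:
    decision: str                                                # e.g. ship/no-ship, compare A vs B
    target_behaviors: list[str] = field(default_factory=list)   # aim for 3-10
    never_events: list[str] = field(default_factory=list)       # what must never happen
    acceptance_thresholds: dict[str, float] = field(default_factory=dict)
    non_goals: list[str] = field(default_factory=list)

prd = EvalPRD(
    decision="ship/no-ship: enable AI reply drafts for tier-1 support",
    target_behaviors=["cites a KB article", "matches requested tone", "refuses unsafe asks"],
    never_events=["leaks PII", "invents refund amounts"],
    acceptance_thresholds={"overall_pass_rate": 0.90, "safety_pass_rate": 1.00},
)
```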
2) Draft the golden set structure + coverage plan
- Inputs: User workflows, edge cases, safety risks, data availability.
- Actions: Specify the test case schema, tagging, and coverage targets (happy paths, tricky paths, adversarial/safety, long-tail). Create an initial starter set (small but high-signal).
- Outputs: Test set spec + initial golden set (see the schema sketch after this step).
- Checks: Every target behavior has at least 2 test cases; high-severity risks are explicitly represented.
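One way the per-case schema could look, stored as JSONL so tags and expected behavior travel with each case; the field names below are assumptions to adapt to your spec:

```python
# Illustrative golden-set record, one JSON object per line (JSONL).
# Field names are assumptions; align them with the test set spec.
import json

case = {
    "id": "support-0042",
    "tags": ["tricky-path", "refund", "high-severity"],   # scenario/risk tags for coverage
    "input": {"customer_message": "I was double charged, fix it or I cancel."},
    "target_behaviors": ["acknowledges the issue", "cites the refund-policy KB article"],
    "must_not": ["promise a specific refund amount"],
    "reference_output": None,          # optional gold answer when one exists
    "source": "prod-log-2024-05-14",   # provenance, useful when cases need review
}

with open("golden_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(case, ensure_ascii=False) + "\n")
```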
3) Run error analysis and open coding to build a taxonomy
- Inputs: Known failures, logs, stakeholder anecdotes, initial golden set.
- Actions: Review failures, label them with open coding, consolidate into a taxonomy, and assign severity/impact. Identify likely root causes (prompting, missing context, tool misuse, formatting, policy).
- Outputs: Error taxonomy + "top failure modes" list (see the sketch after this step).
- Checks: PM and eng read the taxonomy the same way; each category has 1–2 concrete examples.
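A sketch of how open-coded labels might consolidate into taxonomy entries with severity, root cause, and examples; the categories and cases shown are assumptions for illustration:

```python
# Illustrative taxonomy entries consolidated from open coding.
# Failure modes, severities, and example cases are assumptions.
taxonomy = [
    {
        "failure_mode": "unsupported_claim",
        "definition": "Reply states a fact not present in the provided KB context.",
        "severity": "high",
        "likely_root_cause": "missing context / prompting",
        "examples": ["case support-0042: invented a 48-hour refund guarantee"],
    },
    {
        "failure_mode": "format_violation",
        "definition": "Output misses the required structure (e.g. no citation block).",
        "severity": "medium",
        "likely_root_cause": "formatting instructions",
        "examples": ["case support-0108: reply sent without the KB citation footer"],
    },
]

# "Top failure modes" = the high-severity (or most frequent) entries, surfaced first.
top_failure_modes = [t for t in taxonomy if t["severity"] == "high"]
```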
4) Convert taxonomy â rubric + scoring rules
- Inputs: Taxonomy, target behaviors, output formats.
- Actions: Define scoring dimensions and scales; write clear judge instructions and tie-breakers; add examples and disallowed behaviors. Decide absolute scoring vs pairwise comparisons.
- Outputs: Rubric + judging guide (see the scoring sketch after this step).
- Checks: Two independent judges would likely score the same case similarly (instructions are specific, not vibes).
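One way to keep the rubric specific enough that two judges converge: per-dimension anchors plus an explicit tie-breaker, kept next to the judge instructions. Dimension names and anchor wording below are assumptions:

```python
# Illustrative rubric: anchored 1-5 scales and an explicit tie-breaker so that
# independent judges score the same case similarly. Names and anchors are assumptions.
rubric = {
    "dimensions": {
        "faithfulness": {
            "scale": (1, 5),
            "anchors": {
                1: "Contradicts or invents facts beyond the provided context.",
                3: "Mostly grounded; one minor unsupported detail.",
                5: "Every claim traceable to the context or a cited KB article.",
            },
        },
        "format_compliance": {
            "scale": (1, 5),
            "anchors": {1: "Required structure missing.", 5: "Exact required structure."},
        },
    },
    "tie_breaker": "If two outputs tie overall, prefer the one with higher faithfulness.",
    "disallowed": ["PII in the reply", "fabricated policy numbers"],
}
```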
5) Choose the judging approach + harness/runbook
- Inputs: Constraints (time/cost), required reliability, privacy/safety constraints.
- Actions: Pick judge type(s): human, LLM-as-judge, automated checks. Define calibration (gold examples, inter-rater checks), sampling, and how results are stored. Write a runbook with estimated runtime/cost.
- Outputs: Judge + harness plan (see the harness sketch after this step).
- Checks: The plan is repeatable (versioned prompts/models, deterministic settings where possible, clear data handling).
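If LLM-as-judge is chosen, the harness loop might look like the sketch below. `call_judge_model` is a placeholder for whatever model client the team actually uses, and the versioning/pinning shown is one assumption about how to keep runs repeatable:

```python
# Illustrative harness step: versioned judge prompt, pinned model id, and raw
# verdicts kept per case so runs are repeatable and auditable.
# `call_judge_model` is a placeholder, not a real library call.
import json

JUDGE_PROMPT_VERSION = "judge-prompt-v3"
JUDGE_MODEL = "pinned-model-id"   # assumption: pin the exact model and settings

def call_judge_model(prompt: str) -> str:
    """Placeholder: swap in the real client; keep temperature at 0 where supported."""
    raise NotImplementedError

def judge_case(case: dict, sut_output: str) -> dict:
    prompt = (
        f"[{JUDGE_PROMPT_VERSION}] Score the reply against the rubric.\n"
        f"Input: {case['input']}\nReply: {sut_output}\n"
        'Return JSON: {"faithfulness": 1-5, "format_compliance": 1-5, "notes": "..."}'
    )
    verdict = json.loads(call_judge_model(prompt))
    return {"case_id": case["id"], "judge_model": JUDGE_MODEL,
            "prompt_version": JUDGE_PROMPT_VERSION, **verdict}
```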
6) Define reporting, thresholds, and the iteration loop
- Inputs: Stakeholder needs, release cadence.
- Actions: Specify report format (overall + per-tag metrics), regression rules, and what changes require re-running evals. Define the iteration loop: every discovered failure becomes a new test + taxonomy update.
- Outputs: Reporting + iteration loop (see the reporting sketch after this step).
- Checks: A reader can make a decision from the report without additional meetings; regressions are detectable.
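A sketch of the report shape described above: overall plus per-tag pass rates, with a simple regression rule against a stored baseline. The tolerance value and field names are assumptions:

```python
# Illustrative reporting: overall and per-tag pass rates plus a regression check
# against the previous baseline. Thresholds and field names are assumptions.
from collections import defaultdict

def summarize(results: list[dict]) -> dict:
    """Each result is expected to carry `passed` (bool) and `tags` (list of str)."""
    per_tag = defaultdict(lambda: [0, 0])          # tag -> [passed, total]
    passed_total = 0
    for r in results:
        passed_total += r["passed"]
        for tag in r["tags"]:
            per_tag[tag][0] += r["passed"]
            per_tag[tag][1] += 1
    return {
        "overall_pass_rate": passed_total / len(results),
        "per_tag": {t: p / n for t, (p, n) in per_tag.items()},
    }

def regressed(current: dict, baseline: dict, tolerance: float = 0.02) -> bool:
    """Flag a regression if overall or any per-tag rate drops by more than `tolerance`."""
    if current["overall_pass_rate"] < baseline["overall_pass_rate"] - tolerance:
        return True
    return any(current["per_tag"].get(t, 0.0) < v - tolerance
               for t, v in baseline["per_tag"].items())
```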
7) Quality gate + finalize
- Inputs: Full draft pack.
- Actions: Run references/CHECKLISTS.md and score with references/RUBRIC.md. Fix missing coverage, vague rubric language, or non-repeatable harness steps. Always include Risks / Open questions / Next steps.
- Outputs: Final AI Evals Pack.
- Checks: The eval definition functions as a product requirement: clear, testable, and actionable.
Quality gate (required)
- Use references/CHECKLISTS.md and references/RUBRIC.md.
- Always include: Risks, Open questions, Next steps.
Examples
Example 1 (answer quality + safety): "Use ai-evals to design evals for a customer-support reply drafting assistant. Constraints: no PII leakage, must cite KB articles, and must refuse unsafe requests. Output: AI Evals Pack."
Example 2 (structured extraction): "Use ai-evals to create a rubric + golden set for an LLM that extracts invoice fields to JSON. Constraints: must always return valid JSON; prioritize recall for amount and due_date. Output: AI Evals Pack."
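For a case like Example 2, some checks can run as plain automated code before any judge is involved; a minimal sketch, assuming the extraction schema uses the field names from the example:

```python
# Illustrative automated check for Example 2: the output must parse as JSON and
# contain the high-priority fields. Field names follow the example; everything
# else about the schema is an assumption.
import json

REQUIRED_FIELDS = ["amount", "due_date"]

def check_invoice_extraction(raw_output: str) -> dict:
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"valid_json": False, "missing_fields": list(REQUIRED_FIELDS)}
    missing = [f for f in REQUIRED_FIELDS if f not in data or data[f] in (None, "")]
    return {"valid_json": True, "missing_fields": missing}
```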
Boundary example: "We don't know what the AI feature should do yet; just 'add AI' and pick a model."
Response: out of scope; first define the job/spec and success metrics (use problem-definition or building-with-llms), then return to ai-evals with a concrete SUT.