skill-evaluation
npx skills add https://github.com/williamhallatt/cogworks --skill skill-evaluation
Skill Documentation
Skill Evaluation Expert
When invoked, you operate with specialized knowledge in evaluating Claude Code skills systematically.
This expertise synthesizes evaluation methodologies from Anthropic and OpenAI into a unified framework. Where the sources disagree, Anthropic guidance takes precedence for Claude-specific concerns.
Knowledge Base Summary
- Define before building: Write SMART success criteria (Specific, Measurable, Achievable, Relevant) across multiple dimensions before touching any skill code — the eval is the specification
- Four-category test datasets: Explicit triggers, implicit triggers, contextual triggers, and negative controls (~25%) prevent both missed activations and false activations
- Layer grading by cost: Deterministic checks first (fast, cheap, unambiguous), LLM-as-judge second (moderate cost, high nuance), human evaluation only for calibration
- Observable behavior over text quality: Grade what the skill makes Claude do (commands, tools, files, sequence) not what it makes Claude say
- Volume beats perfection: 100 automated tests with 80% grading accuracy catch more failures than 10 hand-graded perfect tests
- Expand from reality: Start with 10-20 test cases, grow from real production failures, not speculative edge cases
Core Philosophy
Observable behavior is ground truth. A skill that produces eloquent text while suggesting dangerous commands is failing. Grade execution traces — commands run, tools invoked, files modified, step sequence — before assessing text quality. Text quality is secondary and should only be evaluated after behavior passes.
Negative controls are non-negotiable. False activations (skill triggers when it should not) erode user trust faster than missed activations (skill does not trigger when it should). Every test dataset must include ~25% negative controls.
Calibrate your judges. LLM-as-judge achieves 80%+ human agreement but has systematic biases (verbosity preference, position bias, self-preference). Validate against human judgments before trusting LLM-based grading at scale.
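The calibration step above can be sketched as a simple agreement check: grade the same calibration sample with both the LLM judge and a human, then require agreement before trusting the judge at scale. This is a minimal sketch; the function name, label format, and sample data are illustrative, not part of the skill.

```python
# Sketch: validating an LLM judge against human calibration labels.
# Labels here are simple pass/fail strings; real rubrics may be richer.

def judge_agreement(llm_labels, human_labels):
    """Fraction of cases where the LLM judge matches the human grade."""
    if len(llm_labels) != len(human_labels) or not llm_labels:
        raise ValueError("label lists must be non-empty and equal length")
    matches = sum(l == h for l, h in zip(llm_labels, human_labels))
    return matches / len(llm_labels)

# Calibration sample: pass/fail grades on the same 10 transcripts.
human = ["pass", "pass", "fail", "pass", "fail",
         "pass", "pass", "fail", "pass", "pass"]
llm   = ["pass", "pass", "fail", "fail", "fail",
         "pass", "pass", "fail", "pass", "pass"]

agreement = judge_agreement(llm, human)
# Gate on the 80% human-agreement threshold before scaling up.
assert agreement >= 0.8, f"judge below 80% human agreement: {agreement:.0%}"
```

A plain agreement rate is the simplest gate; for skewed label distributions, a chance-corrected statistic such as Cohen's kappa is a stricter alternative.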
Quick Decision Framework
Which grader should I use?
- Deterministic: Binary facts (string presence, command executed, file exists, JSON valid)
- LLM-as-judge: Qualitative assessment (style, clarity, convention adherence, approach quality)
- Human: Calibration samples (20-50 cases), disputed cases, safety-critical final validation
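The layering can be expressed directly in a grading harness: deterministic checks gate the more expensive LLM-as-judge call. This is a hedged sketch; the trace fields, the forbidden command, and `llm_judge` are placeholder assumptions you would back with your own harness and a real model call.

```python
# Sketch of layered grading: cheap deterministic checks run first and
# gate the expensive LLM-as-judge layer.

def deterministic_checks(trace):
    """Binary facts: fast, cheap, unambiguous."""
    return (
        "git push --force" not in trace["commands"]  # forbidden command absent
        and bool(trace["files_modified"])            # did observable work
        and trace["json_valid"]                      # output parses
    )

def grade(trace, llm_judge):
    if not deterministic_checks(trace):
        return {"verdict": "fail", "layer": "deterministic"}
    # Only spend judge tokens on traces that pass the cheap layer.
    return {"verdict": llm_judge(trace), "layer": "llm"}

trace = {"commands": ["git status", "git commit -m 'fix'"],
         "files_modified": ["src/app.py"], "json_valid": True}
result = grade(trace, llm_judge=lambda t: "pass")  # stub judge for the sketch
```

The human layer sits outside this function: sample the judge's verdicts periodically and re-grade them by hand to keep the judge calibrated.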
How many test cases do I need?
- Initial: 10-20 cases (core scenarios + negative controls)
- Per production failure: +3-5 cases (the failure + variations)
- Mature production skill: 100+ cases
What makes success criteria good?
- Specific metrics with thresholds (“F1 >= 0.85”, “false positive rate <= 5%”)
- Multiple dimensions (task fidelity, safety, latency, cost)
- Based on current Claude capabilities (achievable)
- NOT vague goals (“works well”, “good performance”)
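Criteria like these are easiest to enforce when written as data rather than prose, so the harness can check every dimension on each run. A minimal sketch, where the metric names and thresholds are illustrative examples, not prescribed values:

```python
# Sketch: SMART criteria expressed as data so an eval harness can
# enforce them mechanically across all dimensions.

CRITERIA = {
    "f1":                  {"op": ">=", "threshold": 0.85},  # task fidelity
    "false_positive_rate": {"op": "<=", "threshold": 0.05},  # safety
    "p95_latency_s":       {"op": "<=", "threshold": 30.0},  # latency
    "cost_per_run_usd":    {"op": "<=", "threshold": 0.10},  # cost
}

OPS = {">=": lambda v, t: v >= t, "<=": lambda v, t: v <= t}

def evaluate(measured):
    """Return the list of criteria that failed (empty means success)."""
    return [name for name, c in CRITERIA.items()
            if not OPS[c["op"]](measured[name], c["threshold"])]

run = {"f1": 0.88, "false_positive_rate": 0.04,
       "p95_latency_s": 21.5, "cost_per_run_usd": 0.08}
failures = evaluate(run)  # empty list: all four dimensions pass
```

A vague goal like "works well" cannot be written in this form, which is exactly the test: if a criterion cannot be reduced to a metric and a threshold, it is not yet SMART.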
Full Knowledge Base
Core knowledge in reference.md:
- Core Concepts – 8 definitions with cross-source synthesis
- Concept Map – 15 explicit relationships
- Deep Dives – Negative controls, LLM judge calibration, execution traces, volume vs perfection
- Quick Reference – Checklists, thresholds, sizing guidance
Patterns and examples in separate files (loaded on-demand):
- patterns.md – 7 reusable patterns + 5 anti-patterns with when/why/how
- examples.md – 6 practical examples with code and citations
Writing Evaluation Plans
When helping users create evals, follow this structure:
1. Define Success Criteria (SMART)
Specific: What exact behavior/output is expected?
Measurable: What metric with what threshold?
Achievable: Based on Claude's current capabilities?
Relevant: Aligned with skill's purpose?
Multidimensional: Covers accuracy + safety + latency + cost?
2. Design Test Dataset
Explicit triggers: [N] direct skill invocations (~50-60%)
Implicit triggers: [N] indirect invocations (~15-20%)
Contextual triggers: [N] environment-dependent cases (~10-15%)
Negative controls: [N] skill should NOT activate (~25%)
Edge cases: [N] per relevant taxonomy category (2-3 each)
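A test dataset built to this mix can carry its category as a tag, with a guard that enforces the ~25% negative-control ratio before any grading runs. The prompts below are illustrative cases, not part of the skill:

```python
# Sketch: test cases tagged by trigger category, plus a guard on the
# negative-control ratio.

CASES = [
    {"prompt": "Use the skill-evaluation skill on my eval plan",      "category": "explicit"},
    {"prompt": "Evaluate this Claude Code skill for me",              "category": "explicit"},
    {"prompt": "How do I know my skill activates reliably?",          "category": "implicit"},
    {"prompt": "Review the grading setup in this repo",               "category": "contextual"},
    {"prompt": "What's the weather in Lisbon?",                       "category": "negative"},
    {"prompt": "Write a haiku about autumn",                          "category": "negative"},
]

def negative_ratio(cases):
    return sum(c["category"] == "negative" for c in cases) / len(cases)

ratio = negative_ratio(CASES)
# Fail fast if the dataset drifts away from the ~25% target.
assert 0.20 <= ratio <= 0.35, f"negative controls off target: {ratio:.0%}"
```

Keeping the guard in the test suite means the ratio is re-checked every time cases are added from production failures.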
3. Choose Graders (Layered)
Layer 1 - Deterministic: What binary facts can be checked? (always run)
Layer 2 - LLM-as-judge: What needs qualitative rubric? (only if Layer 1 passes)
Layer 3 - Human: What sample size for calibration? (20-50 cases)
4. Observable Behavior Checklist
- Which tools should be invoked?
- Which commands should be suggested (and in what order)?
- Which files should be created/modified/read?
- What should NOT happen (forbidden commands, unsafe operations)?
- What is an acceptable token/step budget?
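The checklist above maps directly onto deterministic checks over an execution trace. In this sketch the trace field names (`tools`, `commands`, `steps`), the required tools, and the expected first command are assumptions about a hypothetical harness, not a fixed schema:

```python
# Sketch: deterministic checks over an execution trace, one per
# checklist item above.

FORBIDDEN = {"rm -rf /", "git push --force"}

def check_trace(trace, max_steps=20):
    errors = []
    if not set(trace["tools"]) >= {"Read", "Edit"}:          # required tools invoked
        errors.append("missing required tool invocations")
    if any(cmd in FORBIDDEN for cmd in trace["commands"]):   # forbidden operations
        errors.append("forbidden command suggested")
    if trace["commands"][:1] != ["git status"]:              # expected ordering
        errors.append("did not start with git status")
    if len(trace["steps"]) > max_steps:                      # step budget
        errors.append("exceeded step budget")
    return errors

trace = {"tools": ["Read", "Edit", "Bash"],
         "commands": ["git status", "git diff"],
         "steps": list(range(7))}
```

Because every check is a binary fact, this layer is cheap enough to run on all test cases before any text quality is assessed.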
Quality Checklist
Before confirming an eval design is complete:
- Success criteria are SMART, not vague
- Success criteria cover multiple dimensions
- Test dataset includes all 4 trigger categories
- ~25% of test cases are negative controls
- Edge cases from the taxonomy are represented
- Graders are layered (deterministic first, LLM second, human for calibration)
- Observable behavior is graded, not just text output
- LLM-as-judge includes calibration plan against human judgments
- Test dataset reflects production data distribution
- Initial test set is 10-20 cases with expansion plan from real failures
Common Pitfalls to Flag
When reviewing eval designs, actively check for:
- Vague criteria: “good performance” or “works well” — demand specific metrics
- Missing negative controls: All test cases are positive triggers — insist on ~25% negatives
- Output-only grading: Only checks final text — push for observable behavior checks
- Clean-only test data: All well-formed input — suggest edge cases (typos, ambiguity, multilingual)
- Uncalibrated LLM judges: No human validation — require calibration plan
- Speculation-driven expansion: Hypothetical edges — redirect to expanding from actual failures