Install: npx skills add https://github.com/shipshitdev/library --skill advanced-evaluation
Advanced Evaluation
LLM-as-a-Judge techniques for evaluating AI outputs. Not a single technique but a family of approaches – choosing the right one and mitigating biases is the core competency.
When to Activate
- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards
- Debugging inconsistent evaluation results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation
Core Concepts
Evaluation Taxonomy
Direct Scoring: Single LLM rates one response on a defined scale.
- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria
Pairwise Comparison: LLM compares two responses and selects the better one.
- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences
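A minimal sketch of the two call shapes in Python. `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompt wording is illustrative, not part of this skill:

```python
from typing import Literal

def call_llm(prompt: str) -> str:
    """Hypothetical client stub -- swap in your provider's SDK call."""
    raise NotImplementedError

def direct_score(criteria: str, response: str) -> int:
    """Direct scoring: one judge, one response, a defined 1-5 scale."""
    reply = call_llm(
        f"Rate the response on a 1-5 scale for: {criteria}\n"
        f"Response:\n{response}\n"
        "Answer with a single digit."
    )
    return int(reply.strip())

def pairwise_pick(criteria: str, a: str, b: str) -> Literal["A", "B"]:
    """Pairwise comparison: one judge picks the better of two responses."""
    reply = call_llm(
        f"Which response better satisfies: {criteria}?\n"
        f"Response A:\n{a}\n\nResponse B:\n{b}\n"
        "Answer with exactly 'A' or 'B'."
    )
    return "A" if reply.strip().upper().startswith("A") else "B"
```

Note that this naive pairwise version is position-biased; the swap protocol later in this document addresses that.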
Known Biases
| Bias | Description | Mitigation |
|---|---|---|
| Position | First-position preference | Swap positions, check consistency |
| Length | Longer = higher scores | Explicit prompting, length-normalized scoring |
| Self-Enhancement | Models rate own outputs higher | Use different model for evaluation |
| Verbosity | Unnecessary detail rated higher | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
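One cheap diagnostic for the length row above, assuming you already have a batch of judge scores: correlate score with response length. This is a sketch, and the 0.5 threshold is an illustrative assumption, not a calibrated value from this skill:

```python
# Length-bias probe: if judge scores track response length on responses of
# comparable quality, the judge is rewarding verbosity, not quality.
from statistics import correlation  # Python 3.10+

def length_bias(responses: list[str], scores: list[float]) -> float:
    """Pearson correlation between word count and judge score."""
    lengths = [float(len(r.split())) for r in responses]
    return correlation(lengths, scores)

# Rough reading (illustrative): |r| above ~0.5 warrants explicit
# "do not reward length" prompting or length-normalized scoring.
```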
Decision Framework
Is there an objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Pairwise Comparison (tone, style, creativity)
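The tree is mechanical enough to encode directly. A sketch that reuses the hypothetical `direct_score` and `pairwise_pick` stubs from the taxonomy sketch above:

```python
# Dispatch per the decision tree: ground truth -> direct scoring,
# otherwise -> pairwise comparison. Judge functions are the earlier stubs.

def evaluate(criteria: str, a: str, b: str, has_ground_truth: bool):
    if has_ground_truth:
        # Objective criteria: score each response independently.
        return {"A": direct_score(criteria, a), "B": direct_score(criteria, b)}
    # Subjective criteria: head-to-head preference.
    return pairwise_pick(criteria, a, b)
```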
Quick Reference
Direct Scoring Requirements
- Clear criteria definitions
- Calibrated scale (1-5 recommended)
- Chain-of-thought: justification BEFORE score (improves reliability 15-25%)
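A direct-scoring sketch that enforces justification-before-score. The JSON shape and prompt wording are assumptions, and `call_llm` is the same hypothetical stub as above:

```python
import json

def call_llm(prompt: str) -> str:  # hypothetical client stub, as above
    raise NotImplementedError

def direct_score_cot(criteria: str, response: str) -> dict:
    """1-5 direct score with the justification emitted before the score."""
    reply = call_llm(
        "You are an evaluator.\n"
        f"Criteria: {criteria}\n"
        f"Response to evaluate:\n{response}\n\n"
        "Write a brief justification FIRST, then the score, as JSON:\n"
        '{"justification": "<reasoning>", "score": <1-5>}'
    )
    result = json.loads(reply)
    if not 1 <= result["score"] <= 5:
        raise ValueError("score outside calibrated 1-5 scale")
    return result
```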
Pairwise Comparison Protocol
- First pass: A in first position
- Second pass: B in first position (swap)
- Consistency check: If passes disagree → TIE
- Final verdict: Consistent winner with averaged confidence
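The protocol as code, with the same hypothetical `call_llm` stub; confidence averaging is omitted for brevity:

```python
def call_llm(prompt: str) -> str:  # hypothetical client stub, as above
    raise NotImplementedError

def judge_once(criteria: str, first: str, second: str) -> str:
    """One pass: returns 'FIRST' or 'SECOND' for the given ordering."""
    reply = call_llm(
        f"Criteria: {criteria}\n"
        f"Response 1:\n{first}\n\nResponse 2:\n{second}\n\n"
        "Which is better? Answer '1' or '2' only."
    )
    return "FIRST" if reply.strip().startswith("1") else "SECOND"

def pairwise_verdict(criteria: str, a: str, b: str) -> str:
    """Two passes with positions swapped; disagreement means 'TIE'."""
    pass1 = judge_once(criteria, a, b)          # A in first position
    pass2 = judge_once(criteria, b, a)          # B in first position (swap)
    winner1 = "A" if pass1 == "FIRST" else "B"
    winner2 = "B" if pass2 == "FIRST" else "A"  # first slot now holds B
    return winner1 if winner1 == winner2 else "TIE"
```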
Rubric Components
- Level descriptions with clear boundaries
- Observable characteristics per level
- Edge case guidance
- Strictness calibration (lenient/balanced/strict)
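One way to encode these components as data. The criterion, level texts, and strictness setting here are illustrative assumptions, not a rubric shipped with this skill:

```python
from dataclasses import dataclass, field

@dataclass
class Rubric:
    criterion: str
    levels: dict[int, str]            # score -> level description with clear boundaries
    edge_cases: list[str] = field(default_factory=list)
    strictness: str = "balanced"      # "lenient" | "balanced" | "strict"

factual_accuracy = Rubric(
    criterion="factual accuracy",
    levels={
        5: "All claims verifiable; no errors.",
        4: "Minor imprecision; no material errors.",
        3: "One material error or unsupported claim.",
        2: "Multiple material errors.",
        1: "Mostly incorrect or fabricated.",
    },
    edge_cases=["Hedged claims ('may', 'likely') still count as claims."],
    strictness="strict",
)
```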
Integration
Works with:
- context-fundamentals – Effective context structure
- tool-design – Evaluation tool schemas
- evaluation (foundational) – Core evaluation concepts
For detailed implementation patterns, prompt templates, examples, and metrics: references/full-guide.md
See also: references/implementation-patterns.md, references/bias-mitigation.md, references/metrics-guide.md