skill-evaluator

📁 mathews-tom/praxis-skills 📅 5 days ago
Install command
npx skills add https://github.com/mathews-tom/praxis-skills --skill skill-evaluator

Skill Documentation

Skill Evaluator

Skills that do not activate on relevant queries waste the entire investment in writing them. A skill can have deep, well-structured content and still deliver zero value if its frontmatter description lacks the trigger phrases users actually type. Quality evaluation catches trigger gaps, missing sections, and shallow content before deployment — turning a skill from a static document into a reliable tool.

Reference Files

| File | Contents |
|------|----------|
| references/evaluation-rubric.md | Detailed 1-5 scoring criteria per dimension, weight justifications, worked examples for calibration |

Audit Modes

Two modes, selected by input:

  • Quick Audit: Evaluate a single skill. Produces a full per-dimension scored report with findings, severity classifications, and recommendations.
  • Full Audit: Evaluate all skills in the repository. Produces a comparative ranking table sorted by overall score, plus condensed per-skill summaries.

Mode Selection

| Input | Mode |
|-------|------|
| Path to a specific skill directory or SKILL.md | Quick Audit |
| “all”, “every skill”, no path specified, or repo-level request | Full Audit |
| Multiple specific paths | Quick Audit for each, then comparative summary |
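
The table above can be sketched as a small dispatcher. This is a minimal illustration, not the skill's actual implementation: the function name `select_mode`, the `REPO_LEVEL` set, and the returned mode labels are all hypothetical.

```python
# Hypothetical phrases that count as a repo-level request.
REPO_LEVEL = {"all", "every skill"}


def select_mode(targets):
    """Map the user's targets to an audit mode, following the Mode Selection table."""
    cleaned = [t.strip() for t in targets if t.strip()]
    # No path at all, or an explicit repo-level phrase, means Full Audit.
    if not cleaned or any(t.lower() in REPO_LEVEL for t in cleaned):
        return "full"
    # One path is a Quick Audit; several paths get a Quick Audit each plus a comparison.
    return "quick" if len(cleaned) == 1 else "quick-then-compare"
```

Detecting a free-form "repo-level request" would need more than string matching; the sketch only covers the literal phrases listed in the table.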

Evaluation Dimensions

Six dimensions, each scored 1-5. The weighted sum of the six scores determines the overall percentage.

D1: Frontmatter Quality (20%)

Evaluates the YAML frontmatter block for completeness and discoverability.

Signals:

  • name field present and non-empty
  • description field present and non-empty
  • Description length between 200 and 800 characters (sweet spot for keyword density without bloat)
  • Description contains explicit trigger phrases users would type
  • Description includes a “Use this skill when…” clause or equivalent
  • Description is keyword-dense, not generic filler

Scoring constraints: A description under 100 characters caps this dimension at 2/5. A missing name or description field caps at 1/5.
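
The two D1 caps can be expressed as a ceiling applied after rubric scoring. The following is a sketch under assumptions: `d1_cap` is a hypothetical helper, and measuring length on the whitespace-stripped description is one plausible reading of "under 100 characters".

```python
def d1_cap(name, description):
    """Return the highest D1 score the stated constraints allow (1-5)."""
    name = (name or "").strip()
    description = (description or "").strip()
    # A missing or empty name or description caps the dimension at 1/5.
    if not name or not description:
        return 1
    # A description under 100 characters caps the dimension at 2/5.
    if len(description) < 100:
        return 2
    return 5
```

The rubric score would then be combined as `min(rubric_score, d1_cap(name, description))`.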

D2: Trigger Coverage (18%)

Evaluates whether the skill activates on the queries users actually type.

Signals:

  • Synonym breadth — multiple phrasings for the same intent (e.g., “review”, “audit”, “critique”, “evaluate”, “assess”, “check”)
  • Implied contexts — situations where the skill applies even without explicit keywords (e.g., “user provides a design doc and asks for feedback”)
  • Domain-specific terms relevant to the skill’s function
  • Explicit trigger phrase list in the description frontmatter
  • Coverage of both imperative (“review this”) and interrogative (“is this good?”) forms

Scoring constraints: Fewer than 3 distinct trigger phrases caps at 2/5. Zero trigger phrases in the description caps at 1/5.
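
One rough way to count distinct trigger phrases is to test a candidate list against the description. This is an illustrative sketch only: `matched_triggers` and `d2_cap` are hypothetical names, the candidate list is supplied by the evaluator, and plain substring matching will over-count (for example, "check" matches "checklist").

```python
def matched_triggers(description, candidates):
    """Return the distinct candidate phrases found in the description, in input order."""
    text = description.casefold()
    seen = []
    for phrase in candidates:
        key = phrase.casefold().strip()
        # Case-insensitive substring match; duplicates are counted once.
        if key and key in text and key not in seen:
            seen.append(key)
    return seen


def d2_cap(matches):
    """Return the highest D2 score allowed for the given list of matched phrases."""
    if len(matches) == 0:
        return 1  # zero trigger phrases caps at 1/5
    if len(matches) < 3:
        return 2  # fewer than 3 distinct phrases caps at 2/5
    return 5
```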

D3: Structural Completeness (20%)

Evaluates whether the skill contains the sections needed to function reliably.

Signals:

  • Prerequisites or setup instructions (if applicable)
  • Multi-phase workflow or step-by-step procedure
  • Error handling guidance or edge case documentation
  • Output format specification (template, example, or schema)
  • Limitations or scope boundaries stated
  • Reference file table (if references/ directory exists)
  • Calibration rules or quality gates

Scoring constraints: A skill with no workflow section caps at 2/5. A skill with a workflow but no error handling or output format caps at 3/5.

D4: Content Depth (22%)

Evaluates the substantive quality of the skill’s guidance — whether it provides enough detail for an agent to execute well without human intervention.

Signals:

  • Multi-step workflows with decision points, not bare command lists
  • Error cases documented with recovery actions
  • Decision frameworks (when to do X vs Y, mode selection tables)
  • Verbatim output examples or templates
  • Severity classifications or scoring rubrics (where applicable)
  • Cross-cutting analysis or synthesis steps beyond simple checklists

Scoring constraints: A skill consisting only of bare commands with no explanatory context caps at 2/5. Reference files count toward this dimension only if they contain substantive guidance (checklists, rubrics, criteria), not just link collections.

D5: Consistency and Integrity (12%)

Evaluates internal consistency and structural integrity.

Signals:

  • Directory name matches the name field in frontmatter exactly
  • All files referenced in SKILL.md exist on disk (reference files, scripts, assets)
  • Description content aligns with body content (description does not promise features the body does not deliver)
  • Consistent terminology throughout (same concept uses same term)
  • No broken internal links or dangling references

Scoring constraints: A name mismatch between directory and frontmatter is a CRITICAL finding and caps at 1/5. Missing referenced files cap at 2/5.
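
The name and file-existence checks lend themselves to automation. The sketch below makes several assumptions: `d5_cap` and `REF_PATTERN` are hypothetical, and referenced files are detected only as relative paths under `references/`, `scripts/`, or `assets/`, which may miss other link styles.

```python
import re
from pathlib import Path

# Assumed convention: referenced files appear as relative paths under these folders.
REF_PATTERN = re.compile(r"\b((?:references|scripts|assets)/[\w./-]+)")


def d5_cap(skill_dir, frontmatter_name, body):
    """Return (highest allowed D5 score, sorted list of missing referenced files)."""
    skill_dir = Path(skill_dir)
    # Strip a trailing period picked up from sentence punctuation.
    referenced = {m.rstrip(".") for m in REF_PATTERN.findall(body)}
    missing = sorted(rel for rel in referenced if not (skill_dir / rel).is_file())
    # A directory/frontmatter name mismatch caps the dimension at 1/5.
    if skill_dir.name != frontmatter_name:
        return 1, missing
    # Any missing referenced file caps the dimension at 2/5.
    if missing:
        return 2, missing
    return 5, missing
```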

D6: CONTRIBUTING.md Compliance (8%)

Evaluates adherence to the repository’s contribution guidelines.

Signals:

  • Skill name is kebab-case
  • Skill name is 64 characters or fewer
  • Description is 1024 characters or fewer
  • No angle brackets in description
  • No pushy trigger language in description (“always use”, “you must”, “never do”)
  • Valid YAML frontmatter syntax

Scoring constraints: Any single violation caps at 3/5. Multiple violations cap at 2/5. Invalid YAML that prevents parsing caps at 1/5.
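
Most D6 signals are mechanical and could be checked along these lines. The helper names, the kebab-case regular expression, and the violation messages are assumptions made for illustration; the pushy phrases are the three examples quoted above, not an exhaustive list.

```python
import re

# Assumed definition of kebab-case: lowercase alphanumeric words joined by single hyphens.
KEBAB = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
PUSHY = ("always use", "you must", "never do")


def d6_violations(name, description):
    """List the D6 violations found in already-parsed frontmatter fields."""
    violations = []
    if not KEBAB.fullmatch(name):
        violations.append("name is not kebab-case")
    if len(name) > 64:
        violations.append("name exceeds 64 characters")
    if len(description) > 1024:
        violations.append("description exceeds 1024 characters")
    if "<" in description or ">" in description:
        violations.append("description contains angle brackets")
    if any(phrase in description.lower() for phrase in PUSHY):
        violations.append("description uses pushy trigger language")
    return violations


def d6_cap(violations, yaml_valid=True):
    """Return the highest D6 score allowed by the scoring constraints."""
    if not yaml_valid:
        return 1  # unparseable YAML
    if len(violations) >= 2:
        return 2  # multiple violations
    if len(violations) == 1:
        return 3  # any single violation
    return 5
```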


Severity Classification

| Severity | Criteria | Score Impact |
|----------|----------|--------------|
| CRITICAL | Skill cannot activate or breaks on load: missing frontmatter, name mismatch, invalid YAML | Caps overall score at 40% |
| HIGH | Significant trigger gap or missing core section: no workflow, no error handling, zero trigger phrases | Caps affected dimension at 3/5 (stricter dimension-specific caps still apply) |
| MEDIUM | Weak coverage, shallow content, few trigger synonyms | Dimension needs improvement but functions |
| LOW | Minor polish: formatting inconsistencies, slightly short description, missing calibration rules | Fix when convenient |

Workflow

Phase 1: Input

  1. Determine audit mode from user input (see Mode Selection table above).
  2. For Quick Audit: validate the skill directory exists and contains a SKILL.md file. If the path points to a SKILL.md directly, use its parent directory.
  3. For Full Audit: enumerate all directories under skills/ that contain a SKILL.md.
  4. For each skill to evaluate, note the directory name for D5 consistency checks.
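
The input phase can be sketched with `pathlib`. This is an illustration under assumptions: `resolve_skill_dir` and `enumerate_skills` are hypothetical names, and only direct children of `skills/` are scanned, since the steps do not say whether nested directories count.

```python
from pathlib import Path


def resolve_skill_dir(path):
    """Quick Audit: accept a skill directory or a SKILL.md path and return the directory."""
    p = Path(path)
    # A path pointing at SKILL.md itself resolves to its parent directory.
    if p.name == "SKILL.md":
        p = p.parent
    if not (p / "SKILL.md").is_file():
        raise FileNotFoundError(f"SKILL.md not found in {p}")
    return p


def enumerate_skills(repo_root):
    """Full Audit: list the directories directly under skills/ that contain a SKILL.md."""
    skills_root = Path(repo_root) / "skills"
    if not skills_root.is_dir():
        return []
    return sorted(
        d for d in skills_root.iterdir()
        if d.is_dir() and (d / "SKILL.md").is_file()
    )
```

The returned directory's `.name` is the value to keep for the D5 name comparison.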

Phase 2: Analysis

For each skill under evaluation:

  1. Read SKILL.md in full.
  2. Parse YAML frontmatter — extract name and description fields. If YAML parsing fails, record a CRITICAL finding and score D1 and D6 as 1/5.
  3. Check the references/ directory for existence and contents. Verify every file referenced in the SKILL.md body exists on disk.
  4. Evaluate each of the 6 dimensions using the criteria above and the detailed rubric in references/evaluation-rubric.md.
  5. Record findings with severity, dimension tag, description, and recommendation.
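
Step 2 depends on separating the frontmatter from the body. The sketch below is a standard-library stand-in and not a YAML parser: it assumes the frontmatter is delimited by `---` lines and handles only flat `key: value` pairs, so multi-line or folded descriptions would need a real YAML library. Both function names are hypothetical.

```python
def split_frontmatter(text):
    """Split SKILL.md text into (frontmatter_block, body); raise ValueError if malformed."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("no opening '---' delimiter")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    raise ValueError("no closing '---' delimiter")


def parse_flat_fields(block):
    """Parse flat 'key: value' lines; anything else is treated as a parse failure."""
    fields = {}
    for line in block.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        # Drop surrounding quotes so quoted and unquoted values compare equal.
        fields[key.strip()] = value.strip().strip("\"'")
    return fields
```

A `ValueError` from either function corresponds to the parse-failure branch: record a CRITICAL finding and score D1 and D6 as 1/5.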

Phase 3: Scoring

  1. Score each dimension 1-5 using references/evaluation-rubric.md.
  2. Compute weighted score: Overall% = (sum of dimension_score x weight) / 5 x 100.
  3. Apply severity caps: if any CRITICAL finding exists, cap the overall score at 40% regardless of dimension scores.
  4. Determine verdict from the scale below.

| Range | Verdict |
|-------|---------|
| 90-100% | Exemplary |
| 80-89% | Strong |
| 70-79% | Adequate |
| 60-69% | Needs Work |
| Below 60% | Deficient |
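
The scoring arithmetic can be sketched as follows. The weights and verdict bands come from this document; the function names, the rounding to one decimal place, and the treatment of fractional values such as 89.6% (placed in the lower band) are assumptions.

```python
WEIGHTS = {"D1": 0.20, "D2": 0.18, "D3": 0.20, "D4": 0.22, "D5": 0.12, "D6": 0.08}
VERDICTS = [(90, "Exemplary"), (80, "Strong"), (70, "Adequate"), (60, "Needs Work")]


def overall_score(scores, has_critical=False):
    """Compute Overall% = (sum of dimension_score x weight) / 5 x 100, with the CRITICAL cap."""
    weighted_sum = sum(scores[dim] * weight for dim, weight in WEIGHTS.items())
    pct = weighted_sum / 5 * 100
    # Any CRITICAL finding caps the overall score at 40%.
    if has_critical:
        pct = min(pct, 40.0)
    return round(pct, 1)


def verdict(pct):
    """Map an overall percentage to its verdict label."""
    for floor, label in VERDICTS:
        if pct >= floor:
            return label
    return "Deficient"
```

For example, scores of D1=4, D2=3, D3=4, D4=5, D5=5, D6=3 give a weighted sum of 4.08, which is 81.6% and a verdict of Strong; the same scores with a CRITICAL finding are capped at 40% and rated Deficient.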

Phase 4: Report

Generate the structured output using the appropriate template below.


Output Format

Quick Audit Template

## Skill Audit: {skill-name}

| Dimension | Score | Weight | Weighted | Key Finding |
|-----------|-------|--------|----------|-------------|
| D1: Frontmatter Quality | X/5 | 20% | X.XXX | ... |
| D2: Trigger Coverage | X/5 | 18% | X.XXX | ... |
| D3: Structural Completeness | X/5 | 20% | X.XXX | ... |
| D4: Content Depth | X/5 | 22% | X.XXX | ... |
| D5: Consistency & Integrity | X/5 | 12% | X.XXX | ... |
| D6: CONTRIBUTING Compliance | X/5 | 8% | X.XXX | ... |

**Overall: XX% — {Verdict}**

### Findings

[Severity-sorted list. Each entry includes dimension tag, severity, description,
evidence, and recommendation.]

- **[CRITICAL] D5:** ...
- **[HIGH] D2:** ...
- **[MEDIUM] D4:** ...
- **[LOW] D3:** ...

### Score Calculation

D1: {score} x 0.20 = {result}
D2: {score} x 0.18 = {result}
D3: {score} x 0.20 = {result}
D4: {score} x 0.22 = {result}
D5: {score} x 0.12 = {result}
D6: {score} x 0.08 = {result}
Sum = {weighted_sum}
Overall = {weighted_sum} / 5 x 100 = {percentage}% — {Verdict}

Full Audit Template

## Skill Repository Audit

| Skill | Overall | Verdict | Worst Dimension | Top Issue |
|-------|---------|---------|-----------------|-----------|
| {name} | XX% | {verdict} | {dimension} | {issue} |
| ... | ... | ... | ... | ... |

### Per-Skill Summaries

[Condensed Quick Audit for each skill: scorecard table, overall score, top 3 findings.
Omit the full Score Calculation section in condensed mode.]

Error Handling

| Problem | Cause | Fix |
|---------|-------|-----|
| SKILL.md not found in directory | Path incorrect or file missing | Report as CRITICAL; do not attempt evaluation; surface the path and stop |
| YAML frontmatter parse failure | Invalid YAML syntax (unclosed quotes, bad indentation) | Report as CRITICAL finding; score D1 and D6 as 1/5; continue evaluating the body content where parseable |
| references/ directory missing | Skill has no reference files | Not an error; score D5 normally; check only that any files referenced in the SKILL.md body actually exist on disk |
| references/ exists but referenced file is absent | File path in SKILL.md body doesn’t resolve | Record as a CRITICAL D5 finding; missing referenced files cap D5 at 2/5 |
| Empty SKILL.md (zero bytes or whitespace only) | File created but never populated | Treat as CRITICAL; score all dimensions 1/5; overall verdict: Deficient |
| references/evaluation-rubric.md not found | Skill’s own reference file missing | Note the irony; evaluate using the criteria inline in this SKILL.md; flag D5 as a CRITICAL finding |

Calibration Rules

  1. Score what exists, not what could exist — evaluate the skill as-is, not its potential.
  2. Weight trigger coverage heavily for skills targeting broad domains (e.g., a GitHub skill covers issues, PRs, CI, releases, and API — it needs proportionally more trigger synonyms).
  3. A skill with strong triggers but shallow content scores higher than deep content with poor triggers — activation is prerequisite to utility.
  4. Reference files count toward Content Depth only if they contain substantive guidance (checklists, rubrics, criteria), not link lists or stub files.
  5. When evaluating the skill-evaluator itself, apply identical standards — no self-inflation.
  6. Frontmatter description quality is the single highest-leverage improvement for any skill.