skill-evaluator

📁 mathews-tom/praxis-skills 📅 5 days ago

总安装量

周安装量

#36061

全站排名

安装命令

npx skills add https://github.com/mathews-tom/praxis-skills --skill skill-evaluator

Agent 安装分布

opencode 7

gemini-cli 7

antigravity 7

claude-code 7

github-copilot 7

codex 7

Skill 文档

Skill Evaluator

Skills that do not activate on relevant queries waste the entire investment in writing them. A skill can have deep, well-structured content and still deliver zero value if its frontmatter description lacks the trigger phrases users actually type. Quality evaluation catches trigger gaps, missing sections, and shallow content before deployment â turning a skill from a static document into a reliable tool.

Reference Files

File	Contents
`references/evaluation-rubric.md`	Detailed 1-5 scoring criteria per dimension, weight justifications, worked examples for calibration

Audit Modes

Two modes, selected by input:

Quick Audit: Evaluate a single skill. Produces a full per-dimension scored report with findings, severity classifications, and recommendations.
Full Audit: Evaluate all skills in the repository. Produces a comparative ranking table sorted by overall score, plus condensed per-skill summaries.

Mode Selection

Input	Mode
Path to a specific skill directory or SKILL.md	Quick Audit
“all”, “every skill”, no path specified, or repo-level request	Full Audit
Multiple specific paths	Quick Audit for each, then comparative summary

Evaluation Dimensions

Six dimensions, each scored 1-5. Weighted sum determines overall percentage.

D1: Frontmatter Quality (20%)

Evaluates the YAML frontmatter block for completeness and discoverability.

Signals:

name field present and non-empty
description field present and non-empty
Description length between 200-800 characters (sweet spot for keyword density without bloat)
Description contains explicit trigger phrases users would type
Description includes a “Use this skill when…” clause or equivalent
Description is keyword-dense, not generic filler

Scoring constraints: A description under 100 characters caps this dimension at 2/5. A missing name or description field caps at 1/5.

D2: Trigger Coverage (18%)

Evaluates whether the skill activates on the queries users actually type.

Signals:

Synonym breadth â multiple phrasings for the same intent (e.g., “review”, “audit”, “critique”, “evaluate”, “assess”, “check”)
Implied contexts â situations where the skill applies even without explicit keywords (e.g., “user provides a design doc and asks for feedback”)
Domain-specific terms relevant to the skill’s function
Explicit trigger phrase list in the description frontmatter
Coverage of both imperative (“review this”) and interrogative (“is this good?”) forms

Scoring constraints: Fewer than 3 distinct trigger phrases caps at 2/5. Zero trigger phrases in the description caps at 1/5.

D3: Structural Completeness (20%)

Evaluates whether the skill contains the sections needed to function reliably.

Signals:

Prerequisites or setup instructions (if applicable)
Multi-phase workflow or step-by-step procedure
Error handling guidance or edge case documentation
Output format specification (template, example, or schema)
Limitations or scope boundaries stated
Reference file table (if references/ directory exists)
Calibration rules or quality gates

Scoring constraints: A skill with no workflow section caps at 2/5. A skill with a workflow but no error handling or output format caps at 3/5.

D4: Content Depth (22%)

Evaluates the substantive quality of the skill’s guidance â whether it provides enough detail for an agent to execute well without human intervention.

Signals:

Multi-step workflows with decision points, not bare command lists
Error cases documented with recovery actions
Decision frameworks (when to do X vs Y, mode selection tables)
Verbatim output examples or templates
Severity classifications or scoring rubrics (where applicable)
Cross-cutting analysis or synthesis steps beyond simple checklists

Scoring constraints: A skill consisting only of bare commands with no explanatory context caps at 2/5. Reference files count toward this dimension only if they contain substantive guidance (checklists, rubrics, criteria), not just link collections.

D5: Consistency and Integrity (12%)

Evaluates internal consistency and structural integrity.

Signals:

Directory name matches the name field in frontmatter exactly
All files referenced in SKILL.md exist on disk (reference files, scripts, assets)
Description content aligns with body content (description does not promise features the body does not deliver)
Consistent terminology throughout (same concept uses same term)
No broken internal links or dangling references

Scoring constraints: A name mismatch between directory and frontmatter is a CRITICAL finding and caps at 1/5. Missing referenced files cap at 2/5.

D6: CONTRIBUTING.md Compliance (8%)

Evaluates adherence to the repository’s contribution guidelines.

Signals:

Skill name is kebab-case
Skill name is 64 characters or fewer
Description is 1024 characters or fewer
No angle brackets in description
No pushy trigger language in description (“always use”, “you must”, “never do”)
Valid YAML frontmatter syntax

Scoring constraints: Any single violation caps at 3/5. Multiple violations cap at 2/5. Invalid YAML that prevents parsing caps at 1/5.

Severity Classification

Severity	Criteria	Score Impact
CRITICAL	Skill cannot activate or breaks on load â missing frontmatter, name mismatch, invalid YAML	Caps overall score at 40%
HIGH	Significant trigger gap or missing core section â no workflow, no error handling, zero trigger phrases	Caps affected dimension at 3/5
MEDIUM	Weak coverage, shallow content, few trigger synonyms	Dimension needs improvement but functions
LOW	Minor polish â formatting inconsistencies, slightly short description, missing calibration rules	Fix when convenient

Workflow

Phase 1: Input

Determine audit mode from user input (see Mode Selection table above).
For Quick Audit: validate the skill directory exists and contains a SKILL.md file. If the path points to a SKILL.md directly, use its parent directory.
For Full Audit: enumerate all directories under skills/ that contain a SKILL.md.
For each skill to evaluate, note the directory name for D5 consistency checks.

Phase 2: Analysis

For each skill under evaluation:

Read SKILL.md in full.
Parse YAML frontmatter â extract name and description fields. If YAML parsing fails, record a CRITICAL finding and score D1 and D6 as 1/5.
Check the references/ directory for existence and contents. Verify every file referenced in the SKILL.md body exists on disk.
Evaluate each of the 6 dimensions using the criteria above and the detailed rubric in references/evaluation-rubric.md.
Record findings with severity, dimension tag, description, and recommendation.

Phase 3: Scoring

Score each dimension 1-5 using references/evaluation-rubric.md.
Apply severity caps: if any CRITICAL finding exists, cap overall at 40% regardless of dimension scores.
Compute weighted score: Overall% = (sum of dimension_score x weight) / 5 x 100.
Determine verdict from the scale below.

Range	Verdict
90-100%	Exemplary
80-89%	Strong
70-79%	Adequate
60-69%	Needs Work
Below 60%	Deficient

Phase 4: Report

Generate the structured output using the appropriate template below.

Output Format

Quick Audit Template

## Skill Audit: {skill-name}

| Dimension | Score | Weight | Weighted | Key Finding |
|-----------|-------|--------|----------|-------------|
| D1: Frontmatter Quality | X/5 | 20% | X.XXX | ... |
| D2: Trigger Coverage | X/5 | 18% | X.XXX | ... |
| D3: Structural Completeness | X/5 | 20% | X.XXX | ... |
| D4: Content Depth | X/5 | 22% | X.XXX | ... |
| D5: Consistency & Integrity | X/5 | 12% | X.XXX | ... |
| D6: CONTRIBUTING Compliance | X/5 | 8% | X.XXX | ... |

**Overall: XX% â {Verdict}**

### Findings

[Severity-sorted list. Each entry includes dimension tag, severity, description,
evidence, and recommendation.]

- **[CRITICAL] D5:** ...
- **[HIGH] D2:** ...
- **[MEDIUM] D4:** ...
- **[LOW] D3:** ...

### Score Calculation

D1: {score} x 0.20 = {result}
D2: {score} x 0.18 = {result}
D3: {score} x 0.20 = {result}
D4: {score} x 0.22 = {result}
D5: {score} x 0.12 = {result}
D6: {score} x 0.08 = {result}
Sum = {weighted_sum}
Overall = {weighted_sum} / 5 x 100 = {percentage}% â {Verdict}

Full Audit Template

## Skill Repository Audit

| Skill | Overall | Verdict | Worst Dimension | Top Issue |
|-------|---------|---------|-----------------|-----------|
| {name} | XX% | {verdict} | {dimension} | {issue} |
| ... | ... | ... | ... | ... |

### Per-Skill Summaries

[Condensed Quick Audit for each skill: scorecard table, overall score, top 3 findings.
Omit the full Score Calculation section in condensed mode.]

Error Handling

Problem	Cause	Fix
`SKILL.md` not found in directory	Path incorrect or file missing	Report as CRITICAL; do not attempt evaluation; surface the path and stop
YAML frontmatter parse failure	Invalid YAML syntax (unclosed quotes, bad indentation)	Report as CRITICAL finding; score D1 and D6 as 1/5; continue evaluating the body content where parseable
`references/` directory missing	Skill has no reference files	Not an error â score D5 normally; check only that any files referenced in the SKILL.md body actually exist on disk
`references/` exists but referenced file is absent	File path in SKILL.md body doesn’t resolve	Record as a CRITICAL D5 finding; missing referenced files cap D5 at 2/5
Empty `SKILL.md` (zero bytes or whitespace only)	File created but never populated	Treat as CRITICAL; score all dimensions 1/5; overall verdict: Deficient
`references/evaluation-rubric.md` not found	Skill’s own reference file missing	Note the irony; evaluate using the criteria inline in this SKILL.md; flag D5 as a CRITICAL finding

Calibration Rules

Score what exists, not what could exist â evaluate the skill as-is, not its potential.
Weight trigger coverage heavily for skills targeting broad domains (e.g., a GitHub skill covers issues, PRs, CI, releases, and API â it needs proportionally more trigger synonyms).
A skill with strong triggers but shallow content scores higher than deep content with poor triggers â activation is prerequisite to utility.
Reference files count toward Content Depth only if they contain substantive guidance (checklists, rubrics, criteria), not link lists or stub files.
When evaluating the skill-evaluator itself, apply identical standards â no self-inflation.
Frontmatter description quality is the single highest-leverage improvement for any skill.

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台