ontos-skill-evaluator

Total installs: 1
Weekly installs: 1
Site-wide rank: #51002

Install command:
npx skills add https://github.com/ontos-ai/skills-evaluator --skill ontos-skill-evaluator

Agent install distribution:
- cursor: 1
- antigravity: 1
- gemini-cli: 1

Skill Documentation
Ontos Skill Evaluator
A meta-skill by Ontos AI that evaluates other Claude Skills through systematic quality assessment.
Installation
npx skills add ontos-ai/skills-evaluator
Quick Start
Node.js (Recommended for skills.sh users)
node scripts/quick_eval.js <path-to-skill>
node scripts/quick_eval.js <path-to-skill> --format html
Python (For local development)
python scripts/quick_eval.py <path-to-skill>
Example:
node scripts/quick_eval.js ../output/skills/ai-agent-trend-analysis --format html
Evaluation Dimensions
1. Structure (20%)
| Check | Description |
|---|---|
| Valid YAML frontmatter | Parseable, no duplicates |
| Required fields | name and description present |
| No illegal fields | Only name, description, optional license |
| Directory structure | SKILL.md at root, proper subdirs |
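For illustration, a minimal sketch of the frontmatter side of these checks, assuming PyYAML and `---`-delimited frontmatter; the function name and exact rules are illustrative, not the evaluator's actual code:

```python
import yaml  # PyYAML

ALLOWED_FIELDS = {"name", "description", "license"}

def check_structure(skill_md_text: str) -> list[str]:
    """Return structure issues for a SKILL.md file (sketch, not the real checker)."""
    issues = []
    # Naive split: a well-formed file starts with "---\n<frontmatter>\n---\n<body>".
    parts = skill_md_text.split("---", 2)
    if len(parts) < 3 or parts[0].strip():
        return ["missing YAML frontmatter"]
    try:
        meta = yaml.safe_load(parts[1]) or {}
    except yaml.YAMLError:
        return ["frontmatter is not parseable YAML"]
    for field in ("name", "description"):
        if field not in meta:
            issues.append(f"missing required field: {field}")
    for field in sorted(set(meta) - ALLOWED_FIELDS):
        issues.append(f"illegal frontmatter field: {field}")
    return issues
```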
2. Trigger Quality (15%)
| Check | Description |
|---|---|
| Description triggers | Clear usage contexts in description |
| Trigger phrases | Explicit trigger examples in body |
| Diversity | Multiple trigger variations |
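A rough sketch of how such trigger signals might be detected; the regexes and the diversity threshold are assumptions for illustration:

```python
import re

def check_triggers(description: str, body: str) -> dict:
    """Heuristic trigger-quality signals (sketch; patterns are illustrative)."""
    # Usage contexts in the description, e.g. "use when ...", "when the user ...".
    has_context = bool(re.search(r"\b(use when|when you|when the user)\b",
                                 description, re.IGNORECASE))
    # Quoted phrases in the body, treated as explicit trigger examples.
    phrases = re.findall(r'"([^"]{5,80})"', body)
    return {
        "description_triggers": has_context,
        "trigger_phrases": bool(phrases),
        "diversity": len({p.lower() for p in phrases}) >= 3,
    }
```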
3. Actionability (25%)
| Check | Description |
|---|---|
| Concrete steps | Numbered or bulleted procedures |
| Tool references | Mentions scripts, APIs, or MCP tools |
| No vague language | Avoids “as needed”, “if necessary” without context |
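The vague-language check could be as simple as a pattern scan over the body; a sketch (the pattern list is an assumption), which also shows where the line numbers in the report's VAGUE_INSTRUCTION issues could come from:

```python
import re

VAGUE_PATTERNS = [r"\bas needed\b", r"\bif necessary\b", r"\bappropriately\b"]

def find_vague_instructions(body: str) -> list[tuple[int, str]]:
    """Return (line, phrase) pairs for vague wording in the skill body."""
    hits = []
    for lineno, text in enumerate(body.splitlines(), start=1):
        for pattern in VAGUE_PATTERNS:
            match = re.search(pattern, text, re.IGNORECASE)
            if match:
                hits.append((lineno, match.group(0)))
    return hits
```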
4. Tool Integration (20%)
| Check | Description |
|---|---|
| Script references | Links to scripts/ files |
| Reference links | Links to references/ docs |
| Asset usage | Proper paths to assets/ |
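These checks presumably resolve referenced paths against the skill directory; a minimal sketch (the path regex is an assumption):

```python
import re
from pathlib import Path

def check_tool_refs(body: str, skill_dir: Path) -> list[str]:
    """Flag scripts/, references/, or assets/ paths that don't exist on disk."""
    issues = []
    for ref in set(re.findall(r"\b(?:scripts|references|assets)/[\w./-]+", body)):
        if not (skill_dir / ref).exists():
            issues.append(f"broken reference: {ref}")
    return issues
```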
5. Example Quality (20%)
| Check | Description |
|---|---|
| Non-placeholder | Uses realistic data, not [PLACEHOLDER] |
| Relevance | Examples match skill purpose |
| Output format | Clear expected output shown |
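The percentages above suggest the overall score is a weighted sum of the five dimension scores. The aggregation isn't stated explicitly, but a plain weighted sum reproduces the sample report below (0.60·0.20 + 0.80·0.15 + 0.75·0.25 + 0.70·0.20 + 0.75·0.20 ≈ 0.72):

```python
WEIGHTS = {
    "structure": 0.20,
    "triggers": 0.15,
    "actionability": 0.25,
    "tool_refs": 0.20,
    "examples": 0.20,
}

def overall_score(scores: dict[str, float]) -> float:
    """Weighted sum of dimension scores; weights total 1.0 (assumed aggregation)."""
    return round(sum(scores[dim] * weight for dim, weight in WEIGHTS.items()), 2)
```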
Output
Evaluation generates a JSON report:
```json
{
  "skill_id": "ai-agent-trend-analysis",
  "evaluated_at": "2026-01-28T21:00:00Z",
  "tier": "quick",
  "scores": {
    "overall": 0.72,
    "structure": 0.60,
    "triggers": 0.80,
    "actionability": 0.75,
    "tool_refs": 0.70,
    "examples": 0.75
  },
  "issues": [
    {"severity": "error", "code": "DUPLICATE_FRONTMATTER", "message": "..."},
    {"severity": "warning", "code": "VAGUE_INSTRUCTION", "line": 45, "message": "..."}
  ],
  "recommendations": ["Fix duplicate frontmatter", "Add concrete examples"],
  "badge": "silver"
}
```
Badge Levels
| Badge | Score Range | Meaning |
|---|---|---|
| 🥇 Gold | ≥0.85 | Production ready |
| 🥈 Silver | 0.70-0.84 | Good with minor issues |
| 🥉 Bronze | 0.50-0.69 | Needs improvement |
| ❌ Fail | <0.50 | Critical issues |
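The badge field in the report follows directly from this table; a one-to-one mapping:

```python
def badge_for(score: float) -> str:
    """Map an overall score to a badge tier per the table above."""
    if score >= 0.85:
        return "gold"
    if score >= 0.70:
        return "silver"
    if score >= 0.50:
        return "bronze"
    return "fail"
```

For the sample report above, badge_for(0.72) returns "silver", matching its badge field.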
Advanced Usage
Evaluate All Skills in Directory
python scripts/quick_eval.py ../output/skills --batch
Output as Markdown Report
python scripts/quick_eval.py <path> --format md
Verbose Mode (Show All Checks)
python scripts/quick_eval.py <path> --verbose
Integration with Skill Generation
When used after skill-creator, this skill validates quality before distribution:
User Request → skill-creator → [New SKILL.md] → skill-evaluator → [Quality Report]
                                                        ↓
                                         Fix issues if score < 0.70
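A sketch of that gate as a script. This assumes quick_eval.js writes the JSON report to stdout, which the docs don't confirm; adjust to however the report is actually emitted:

```python
import json
import subprocess
import sys

# Run the evaluator on the skill path given as the first CLI argument.
result = subprocess.run(
    ["node", "scripts/quick_eval.js", sys.argv[1]],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)  # assumption: report is printed to stdout

# Gate distribution on the 0.70 (Silver) threshold from the flow above.
if report["scores"]["overall"] < 0.70:
    for issue in report["issues"]:
        print(f"{issue['severity']}: {issue['message']}")
    sys.exit(1)
print(f"Badge: {report['badge']}")
```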
Future: Tier 2 Deep Benchmark
Phase 2 will add optional deep testing:
- Semantic search for matching benchmark tasks
- Integration with OSWorld, SWE-Bench, AgentBench
- LLM-as-a-Judge evaluation
Invoke it with the --deep flag once available.