skill-eval
npx skills add https://github.com/stanah/dotagents --skill skill-eval
skill-eval: Skill Evaluation Skill
Verifies the quality of the doc-code-sync skill in two layers and generates an evaluation report.
Evaluation Targets
| Layer | Target | Method |
|---|---|---|
| Layer 1 | Extractor scripts | Unit tests via run-tests.sh (deterministic) |
| Layer 2 | Entire skill workflow | Detection-rate and precision evaluation against fixtures |
Available Fixtures
| Name | Language | Purpose | Expectation |
|---|---|---|---|
| mismatch-project | TypeScript + Solidity | Detection capability test | High detection rate |
| clean-project | TypeScript | FP = 0 test | Zero detections |
| tricky-project | TypeScript | FP-inducing test | Low FP |
| python-project | Python (FastAPI) | Multi-language support test | High detection rate |
| go-project | Go (Gin) | Multi-language support test | High detection rate |
| monorepo-project | TypeScript | Barrel export support test | High detection rate |
Workflow
Step 1: Identify the test target
- Determine the target skill and fixture from the arguments:
  - Default skill: `doc-code-sync`
  - Default fixture: `mismatch-project`
  - Option: `--fixture <name>` selects a specific fixture
  - Option: `--all` tests all fixtures
  - Option: `--runs N` runs the detection multiple times (statistical evaluation)
- Confirm that the test fixtures exist:
  - Layer 1: `.claude/skills/doc-code-sync/tests/run-tests.sh`
  - Layer 2: `.claude/skills/skill-eval/references/fixtures/<fixture-name>/`
- Record the path of the Ground Truth file (do not read it at this point):
  `.claude/skills/skill-eval/references/ground-truths/<fixture-name>.json`
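The option handling in Step 1 could be sketched as follows. This is only an illustration: the skill itself is driven by an agent, and the function names `parse_eval_args` and `resolve_targets` are hypothetical. The fixture names, defaults, and paths are the ones this document lists. Note that the ground-truth path is only computed here and never opened.

```python
import argparse
from pathlib import Path

# The six fixtures named in this document.
FIXTURES = [
    "mismatch-project", "clean-project", "tricky-project",
    "python-project", "go-project", "monorepo-project",
]
REFERENCES = Path(".claude/skills/skill-eval/references")


def parse_eval_args(argv):
    """Parse the documented options (hypothetical helper, not part of the skill)."""
    parser = argparse.ArgumentParser(prog="skill-eval")
    parser.add_argument("--fixture", choices=FIXTURES, default="mismatch-project")
    parser.add_argument("--all", action="store_true", help="test every fixture")
    parser.add_argument("--runs", type=int, default=1, help="number of detection runs")
    args = parser.parse_args(argv)
    if args.runs < 1:
        parser.error("--runs must be at least 1")
    return args


def resolve_targets(args):
    """Return (fixture_dir, ground_truth_path) pairs without reading any file."""
    names = FIXTURES if args.all else [args.fixture]
    return [
        (REFERENCES / "fixtures" / name,
         REFERENCES / "ground-truths" / f"{name}.json")
        for name in names
    ]
```

For example, `resolve_targets(parse_eval_args(["--fixture", "go-project"]))` yields a single pair whose second element points at `go-project.json`.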
Step 2: Run Layer 1 (extractor unit tests)
- Run the test script:
  `bash .claude/skills/doc-code-sync/tests/run-tests.sh`
- Record the PASS/FAIL counts from the output.
- If any test fails, record the name and details of each failed test.
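Recording the counts from Step 2 could look like the sketch below. The output format of run-tests.sh is not specified in this document, so the line shape `PASS: <name>` / `FAIL: <name> - <details>` is an assumption, and `summarize_test_output` is a hypothetical helper.

```python
import re

# Assumed line shape: "PASS: <test name>" or "FAIL: <test name> - <details>".
RESULT_LINE = re.compile(r"^(PASS|FAIL)\b[:\s]*(.*)$")


def summarize_test_output(output):
    """Count PASS/FAIL lines and keep the text of each failing line."""
    passed = 0
    failures = []
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match is None:
            continue  # ignore log lines that are not test results
        if match.group(1) == "PASS":
            passed += 1
        else:
            failures.append(match.group(2))
    return {"pass": passed, "fail": len(failures), "failures": failures}
```

The input string would typically be the captured standard output of the test script, for instance from `subprocess.run([...], capture_output=True, text=True).stdout`.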
Step 3: Run Layer 2 (workflow evaluation)
Important: To eliminate evaluation bias, detection and evaluation are separated into two phases.
Phase 1: Detection phase (subagent)
Use the Task tool to launch a subagent and achieve context isolation:
Task tool call:
- subagent_type: "general-purpose"
- description: "Run doc-code-sync detection"
- prompt: |
    Run the `/doc-code-sync` skill against the following directory:
    Target directory: .claude/skills/skill-eval/references/fixtures/<fixture-name>/
    Steps:
    1. Invoke `doc-code-sync` with the Skill tool
    2. Set the target directory to `.claude/skills/skill-eval/references/fixtures/<fixture-name>/`
    3. Return the contents of the generated `.docstore/sync-report.md` as the output
    **Prohibited**:
    - Do not access the `ground-truths/` directory
    - Do not read any file whose name contains `ground-truth`
    When the task is complete, return the full contents of sync-report.md.
Phase 2: Evaluation phase (parent agent)
After the subagent finishes, the parent agent performs the evaluation:
- Obtain the subagent's output (the contents of sync-report.md).
- Only at this point, read the ground truth for the first time:
  `cat .claude/skills/skill-eval/references/ground-truths/<fixture-name>.json`
- Run the Ground Truth comparison in Step 4.
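The ordering constraint in Phase 2 (ground truth is read only after the detection report has been obtained) can be made explicit in code. The class below is a hypothetical sketch, not part of the skill; it simply refuses to open the ground-truth file until a report has been recorded.

```python
import json
from pathlib import Path


class GroundTruthGate:
    """Refuses to read the ground truth until a detection report is recorded."""

    def __init__(self, ground_truth_path):
        # Only the path is stored here; the file is not opened yet.
        self._path = Path(ground_truth_path)
        self._report = None

    def record_report(self, report_text):
        # Phase 1 result: the sync-report.md content returned by the subagent.
        self._report = report_text

    def load_ground_truth(self):
        # Phase 2: reading is allowed only once the report exists.
        if self._report is None:
            raise RuntimeError("detection report not recorded yet")
        return json.loads(self._path.read_text(encoding="utf-8"))
```

Calling `load_ground_truth()` before `record_report()` raises `RuntimeError`, which mirrors the "do not read it at this point" rule from Step 1.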
Step 3b: Multi-run mode (optional)
When the `--runs N` option is specified:
- Run Phase 1 N times (launching a subagent each time)
- Save each run's sync-report.md as `sync-report-{i}.md`
- In Phase 2, aggregate all reports and compute statistics:
  - Mean detection rate (Mean Recall)
  - Standard deviation (Std Dev)
  - Minimum/maximum detection rate
  - Stability score (1 - CV, where CV = Std Dev / Mean)
- PASS if the stability score is 0.8 or higher
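The multi-run statistics can be sketched as below. The document does not say whether the standard deviation is the population or the sample form; this sketch assumes the population form (`statistics.pstdev`), and `stability_summary` is a hypothetical name. Recall values are passed as fractions between 0 and 1.

```python
from statistics import mean, pstdev


def stability_summary(recalls, threshold=0.8):
    """Aggregate per-run recall values into the multi-run statistics."""
    if not recalls:
        raise ValueError("at least one run is required")
    avg = mean(recalls)
    std = pstdev(recalls)  # assumption: population standard deviation
    # CV = Std Dev / Mean; treat a zero mean as maximally unstable.
    cv = std / avg if avg > 0 else float("inf")
    score = 1 - cv
    return {
        "mean_recall": avg,
        "std_dev": std,
        "min_recall": min(recalls),
        "max_recall": max(recalls),
        "stability_score": score,
        "passed": score >= threshold,
    }
```

With recalls of 1.0 and 0.5, the mean is 0.75 and the population standard deviation is 0.25, so CV is one third and the stability score is about 0.67, which fails the 0.8 threshold.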
Step 4: Ground Truth comparison
Standard fixtures (mismatch-project, python-project, go-project, monorepo-project)
For each item in the `expected_issues` array of `ground-truths/<fixture-name>.json`:
- True Positive (TP): a corresponding detected item exists in sync-report.md.
  - Count it as a TP if the category matches and the gist of the description agrees.
  - If a file path is specified, confirm that the report mentions that file.
- False Negative (FN): expected, but not detected in sync-report.md.
- False Positive (FP): detected in sync-report.md, but not expected in the ground truth.
  - FP judgment is fuzzy (approximate matching); count only clearly incorrect detections.
- Compute the detection rate per category:
  - Detection rate (Recall) = TP / (TP + FN)
  - Precision = TP / (TP + FP)
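Once each expected and detected item has been classified as TP, FN, or FP (a fuzzy judgment that this sketch does not attempt to automate), the two formulas above reduce to simple counting. The function names are hypothetical, and the inputs are assumed to be lists of category names, one entry per classified item.

```python
from collections import Counter


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def score_detection(tp_categories, fn_categories, fp_categories):
    """Compute overall recall and precision plus recall per category."""
    tp, fn, fp = len(tp_categories), len(fn_categories), len(fp_categories)
    tp_by_cat = Counter(tp_categories)
    fn_by_cat = Counter(fn_categories)
    recall_by_category = {
        cat: ratio(tp_by_cat[cat], tp_by_cat[cat] + fn_by_cat[cat])
        for cat in sorted(set(tp_by_cat) | set(fn_by_cat))
    }
    return {
        "recall": ratio(tp, tp + fn),        # TP / (TP + FN)
        "precision": ratio(tp, tp + fp),     # TP / (TP + FP)
        "recall_by_category": recall_by_category,
    }
```

Returning `None` for an empty denominator avoids reporting a misleading 0% or 100% when a fixture has no expected items or no detections.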
Negative test: clean-project
- Confirm that sync-report.md reports zero detected issues.
- If any issue is detected, count every one as a False Positive.
- Pass criterion: FP = 0
Negative test: tricky-project
Check `expected_non_issues` in `ground-truths/tricky-project.json`.
- For each pattern:
  - Confirm that the pattern has not been wrongly detected in sync-report.md.
  - If it has been detected, count it as a False Positive.
- Pass criterion: FP ≤ 1 (max_false_positives)
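The tricky-project check can be sketched as a per-pattern tally. The field names `expected_non_issues` and `max_false_positives` come from this document, but the shape assumed here (a list of plain strings and an integer) is a guess; the real ground-truth file may store richer objects. Deciding whether a pattern was wrongly detected remains a reviewer judgment passed in as `flagged_patterns`.

```python
def judge_negative_fixture(ground_truth, flagged_patterns):
    """Tally false positives against the expected non-issues.

    ground_truth: parsed JSON dict (assumed shape, see the lead-in).
    flagged_patterns: set of pattern names judged as wrongly detected.
    """
    expected_non_issues = ground_truth.get("expected_non_issues", [])
    allowed = ground_truth.get("max_false_positives", 0)
    # One row per pattern: (pattern, was it wrongly detected?)
    rows = [(pattern, pattern in flagged_patterns) for pattern in expected_non_issues]
    fp = sum(1 for _, flagged in rows if flagged)
    return {"rows": rows, "fp": fp, "allowed": allowed, "passed": fp <= allowed}
```

The `rows` list maps directly onto a per-pattern results table, and `passed` applies the FP limit taken from the ground-truth file.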
Step 5: Generate the evaluation report
Write the report to .docstore/eval-report.md in the following format:
# Skill Evaluation Report
**Target skill**: doc-code-sync
**Evaluation date**: YYYY-MM-DD
**Fixture**: <fixture-name>
## Layer 1: Extractor tests
| Test | Result |
|--------|------|
| TypeScript JSON schema | PASS/FAIL |
| TypeScript symbol detection | PASS/FAIL |
| TypeScript route detection | PASS/FAIL |
| TypeScript config key detection | PASS/FAIL |
| Solidity contract detection | PASS/FAIL |
| Solidity NatSpec extraction | PASS/FAIL |
| Total | X/Y PASS |
## Layer 2: Workflow evaluation
### Detection rate summary
| Category | Expected | Detected | Detection rate |
|---------|------|------|--------|
| BROKEN_REF | N | N | X% |
| STALE_EXAMPLE | N | N | X% |
| UNDOCUMENTED | N | N | X% |
| CONFIG_DRIFT | N | N | X% |
| VERSION_DRIFT | N | N | X% |
| API_DRIFT | N | N | X% |
| NATSPEC_DRIFT | N | N | X% |
| MISSING_NATSPEC | N | N | X% |
### True Positives (successful detections)
- ✅ CATEGORY: description
### False Negatives (missed detections)
- ❌ CATEGORY: description
### False Positives (incorrect detections)
- ⚠️ CATEGORY: description
- (if there are none, state so)
## Overall evaluation
**Detection rate (Recall)**: X% (TP / (TP + FN))
**Precision**: X% (TP / (TP + FP))
**Pass criterion**: detection rate of 75% or higher
**Verdict**: PASS / FAIL
Report extension for multi-run mode
## Layer 2: Workflow evaluation (N=5 runs)
### Statistics summary
| Metric | Value |
|------|-----|
| Number of runs | 5 |
| Mean detection rate | 87.5% |
| Standard deviation | 12.5% |
| Minimum detection rate | 75% |
| Maximum detection rate | 100% |
| Stability score | 0.86 |
### Per-run results
| Run | TP | FN | FP | Recall | Precision |
|-----|----|----|----|--------|-----------|
| 1 | 8 | 0 | 0 | 100% | 100% |
| 2 | 7 | 1 | 0 | 87.5% | 100% |
| ... |
### Verdict
**Stability criterion**: CV ≤ 0.2 (stability score ≥ 0.8)
**Verdict**: PASS / FAIL
Report format for negative tests
## Layer 2: Negative test evaluation
### Fixture: clean-project
**Purpose**: Confirm False Positive = 0
| Metric | Result |
|------|------|
| Detections | 0 |
| Expected | 0 |
| FP count | 0 |
| Verdict | PASS |
### Fixture: tricky-project
**Purpose**: Confirm that FP-inducing patterns are correctly excluded
| Pattern | False detection | Verdict |
|---------|--------|------|
| False detection of same-name functions | None | ✅ |
| False detection of code inside comments | None | ✅ |
| False detection of test files | None | ✅ |
| False detection of deprecated functions | None | ✅ |
| Metric | Result |
|------|------|
| FP count | 0 |
| Allowed FP | 1 |
| Verdict | PASS |
Step 6: Display the terminal summary
## Skill Evaluation complete
### Layer 1: Extractor tests
X/Y PASS
### Layer 2: Workflow evaluation
Fixture: <name>
Detection rate: X% (TP / expected count)
Precision: X% (TP/(TP+FP))
### Verdict
(PASS: detection rate of 75% or higher / FAIL: detection rate below 75%)
Detailed report: .docstore/eval-report.md
Pass Criteria
| Fixture | Metric | Criterion |
|---|---|---|
| mismatch-project | Recall | ≥ 75% |
| python-project | Recall | ≥ 75% |
| go-project | Recall | ≥ 75% |
| monorepo-project | Recall | ≥ 75% |
| clean-project | FP | = 0 |
| tricky-project | FP | ≤ 1 |
| Multi-run mode | Stability score | ≥ 0.8 (CV ≤ 0.2) |
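The per-fixture criteria in the table above can be held as a small lookup so that the verdict is computed the same way every time. This is a sketch only; `CRITERIA` and `judge` are hypothetical names, and the thresholds are copied from the table.

```python
# Fixture -> (metric, threshold). Recall is a lower bound, FP an upper bound.
CRITERIA = {
    "mismatch-project": ("recall", 0.75),
    "python-project": ("recall", 0.75),
    "go-project": ("recall", 0.75),
    "monorepo-project": ("recall", 0.75),
    "clean-project": ("fp", 0),
    "tricky-project": ("fp", 1),
}


def judge(fixture, recall=None, fp=None):
    """Return True when the fixture meets its pass criterion."""
    metric, limit = CRITERIA[fixture]
    if metric == "recall":
        # Standard fixtures: recall must reach the threshold.
        return recall is not None and recall >= limit
    # Negative fixtures: the false-positive count must not exceed the limit.
    return fp is not None and fp <= limit
```

For instance, `judge("go-project", recall=0.75)` passes, while `judge("clean-project", fp=1)` fails because clean-project allows no false positives.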
Notes
- Layer 1 is a deterministic test, so every test is expected to PASS. A FAIL indicates a bug in an extractor.
- Layer 2 detection is LLM-based, so results may differ from run to run.
- Ground Truth comparison uses semantic matching (fuzzy matching); exact matches are not required.
- When you change a fixture, also update the corresponding ground-truth file.
- Negative tests primarily measure FP; Recall is not evaluated.
Design rationale for context isolation
Reasons for using the Task tool in Layer 2:
- Eliminating evaluation bias: the subagent does not inherit the parent's context, so it can run detection without knowing the ground truth.
- Measuring true detection capability: this prevents "taking the test after seeing the answers" and lets the skill's unaided detection ability be evaluated.
- Better reproducibility: detection and evaluation are clearly separated, which makes the results easier to verify.
Ground Truth files are placed in references/ground-truths/, separate from the fixture directories. As a result, even when the subagent explores a fixture, it does not reach the ground truth.