evaluation-anchor-checker

📁 willoscar/research-units-pipeline-skills 📅 Jan 25, 2026

Total installs

Weekly installs

#29860

Site-wide rank

Install command

npx skills add https://github.com/willoscar/research-units-pipeline-skills --skill evaluation-anchor-checker

Installs by agent

gemini-cli 9

opencode 8

codex 8

claude-code 7

cursor 7

antigravity 6

Skill documentation

Evaluation Anchor Checker (make numbers reviewer-safe)

Purpose: fix a reviewer-magnet failure mode in agent surveys:

- strong numeric/performance statements appear
- but the minimal evaluation context is missing

This skill treats numeric claims as contracts:

- if a number stays, the same sentence must contain enough protocol context to interpret it
- if that context is not in evidence, the claim must be downgraded (no guessing)

Inputs

Preferred (pre-merge, keeps anchoring intact):

- the affected sections/*.md files

Optional context (read-only; helps you avoid guessing):

- outline/writer_context_packs.jsonl (look for evaluation_anchor_minimal, evaluation_protocol, anchor_facts)
- outline/evidence_drafts.jsonl and outline/anchor_sheet.jsonl
- citations/ref.bib
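
As a rough illustration of reading the optional context, the sketch below pulls only the evaluation-related fields out of a JSONL file. The three field names come from the list above; the function name `load_anchor_context` and the assumption that each line is a flat JSON object are hypothetical, since the actual record schema is not described here.

```python
import json

# Field names as listed in the skill text; the record layout is an assumption.
ANCHOR_FIELDS = ("evaluation_anchor_minimal", "evaluation_protocol", "anchor_facts")


def load_anchor_context(jsonl_text: str) -> list[dict]:
    """Return, per JSONL record, only the evaluation-related fields that are present.

    Records that carry none of the fields are skipped. Assumes one JSON
    object per non-empty line (hypothetical schema).
    """
    records = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        found = {key: record[key] for key in ANCHOR_FIELDS if key in record}
        if found:
            records.append(found)
    return records
```

In practice the text would be read from outline/writer_context_packs.jsonl; the files are treated as read-only, so the sketch never writes anything back.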

Outputs

- Updated sections/*.md (or output/DRAFT.md if you are post-merge), with safer evaluation anchoring
- Optional completion marker: output/eval_anchors_checked.refined.ok

Role prompt: Reviewer-minded Editor (evaluation hygiene)

You are a reviewer-minded editor for evaluation claims in a technical survey.

Goal:
- make every numeric/performance claim interpretable and reviewer-safe

Hard constraints:
- do not invent numbers
- do not add/remove/move citation keys
- if protocol context is missing, weaken or remove the numeric claim

Minimum context to include when keeping a number:
- task / setting (what kind of task)
- metric (what is being measured)
- constraint (budget/cost/tool access/horizon/seed/logging) when relevant

Avoid:
- ambiguous model naming that looks hallucinated (e.g., “GPT-5”) unless the cited paper uses it verbatim

Workflow (explicit inputs)

- Use outline/writer_context_packs.jsonl to locate the subsection’s allowed citations and any extracted evaluation_protocol/anchor_facts.
- Cross-check outline/evidence_drafts.jsonl and outline/anchor_sheet.jsonl for task/metric/constraint context before touching numbers.
- Validate every cited key against citations/ref.bib (do not introduce new keys).
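
The key-validation step can be approximated with a small script. This is a minimal sketch, assuming Pandoc-style `[@Key]` citations (as in the mini examples in this document) and ordinary BibTeX entries of the form `@type{Key, ...}`; the function names are hypothetical and the regexes are heuristics, not a full BibTeX parser.

```python
import re


def cited_keys(markdown: str) -> set[str]:
    """Collect citation keys from bracketed groups such as [@A; @B]."""
    keys = set()
    for group in re.findall(r"\[([^\]]*@[^\]]*)\]", markdown):
        # A key must end in a word character, so trailing punctuation is dropped.
        keys.update(re.findall(r"@([\w:.-]*\w)", group))
    return keys


def bib_keys(bib_text: str) -> set[str]:
    """Collect entry keys from BibTeX text (heuristic: '@type{key,')."""
    return set(re.findall(r"@\w+\s*\{\s*([^,\s]+)\s*,", bib_text))


def unknown_keys(markdown: str, bib_text: str) -> list[str]:
    """Return cited keys that have no matching BibTeX entry, sorted."""
    return sorted(cited_keys(markdown) - bib_keys(bib_text))
```

An empty result from `unknown_keys` means every key cited in the section exists in citations/ref.bib. A non-empty result should be resolved by fixing the draft, not by adding entries.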

What to enforce (the “minimum protocol trio”)

When a sentence contains digits (percentages, multipliers such as 2x, or plain numbers):

Keep the number only if you can attach at least 2 of the following in the same sentence without guessing:
- task family / benchmark name
- metric definition
- constraint (budget, tool access, cost model, retries, horizon)

If you cannot, downgrade:

- remove the number and rewrite as qualitative (“often”, “can”, “may”) with the same citation
- or move the specificity into a verification target (“evaluations need to report …”) without adding new facts
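
Finding the sentences that need this check can be partly automated. The sketch below only lists candidate sentences that contain digits; deciding whether task, metric, and constraint context is actually present remains an editorial judgment. The function name, the naive sentence splitter, and the `[@Key]` citation pattern are assumptions made for illustration.

```python
import re

# Pandoc-style citation groups, e.g. [@SomeBench] or [@A; @B] (assumed syntax).
CITATION = re.compile(r"\[[^\]]*@[^\]]*\]")
# Naive splitter: break after ., ! or ? when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def numeric_sentences(paragraph: str) -> list[str]:
    """Return sentences that contain digits outside of citation brackets."""
    flagged = []
    for sentence in SENTENCE_END.split(paragraph.strip()):
        # Ignore digits inside citation keys such as [@Other2024].
        without_citations = CITATION.sub("", sentence)
        if re.search(r"\d", without_citations):
            flagged.append(sentence)
    return flagged
```

Citation brackets are stripped before the digit test so that years inside keys do not trigger false positives. Decimals such as 75.3% are not split, because the splitter requires whitespace after the period.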

Mini examples (paraphrase; do not copy)

Bad (underspecified):

Model X achieves 75% exact performance [@SomeBench].

Better (minimal context):

On <task/benchmark>, Model X reaches ~75% <metric>, under <constraint/budget/tool access> [@SomeBench].

Better (downgrade when context is missing):

Reported gains vary, but comparisons remain fragile when budgets and retry policies are not reported [@SomeBench].

Done checklist

- No numeric claim remains without minimal protocol context.
- No ambiguous model naming remains unless explicitly supported by citations.
- Citation keys are unchanged.
- If you removed/downgraded numbers, the paragraph still makes a defensible, evidence-bounded point.
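
The "citation keys are unchanged" item lends itself to a mechanical check. As a hedged sketch, the code below compares the ordered sequence of keys before and after editing, assuming Pandoc-style `[@Key]` citations; the function names are hypothetical. It catches added, removed, and reordered keys, but not a key that moved to a different sentence while keeping its relative order.

```python
import re


def citation_sequence(markdown: str) -> list[str]:
    """Return citation keys in document order, duplicates included."""
    keys = []
    for group in re.findall(r"\[([^\]]*@[^\]]*)\]", markdown):
        keys.extend(re.findall(r"@([\w:.-]*\w)", group))
    return keys


def citations_unchanged(before: str, after: str) -> bool:
    """True when both versions cite the same keys in the same order."""
    return citation_sequence(before) == citation_sequence(after)
```

For example, downgrading a numeric sentence to a qualitative one while keeping its `[@SomeBench]` citation passes the check, whereas dropping the citation along with the number fails it.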
