dedupe-rank

📁 willoscar/research-units-pipeline-skills 📅 Jan 23, 2026

总安装量

周安装量

#20110

全站排名

安装命令

npx skills add https://github.com/willoscar/research-units-pipeline-skills --skill dedupe-rank

Agent 安装分布

claude-code 12

gemini-cli 11

codex 10

cursor 9

opencode 9

antigravity 7

Skill 文档

Dedupe + Rank

Turn a broad retrieved set into a smaller core set for taxonomy/outline building.

This is a deterministic âcurationâ step: it should be stable and repeatable.

Input

papers/papers_raw.jsonl

Outputs

papers/papers_dedup.jsonl
papers/core_set.csv

Workflow (high level)

Dedupe by normalized (title, year) and keep the richest metadata per duplicate cluster.
Rank by relevance/recency signals (and optionally pin known classics for certain topics). For LLM-agent topics, also ensure a small quota of prior surveys/reviews is present to support a paper-like Related Work section.
Write papers/core_set.csv with stable paper_id values and useful metadata columns (arxiv_id, pdf_url, categories).

Quality checklist

papers/papers_dedup.jsonl exists and is valid JSONL.
papers/core_set.csv exists and has a header row.

Script

Quick Start

python .codex/skills/dedupe-rank/scripts/run.py --help
python .codex/skills/dedupe-rank/scripts/run.py --workspace <workspace_dir> --core-size 300

All Options

--core-size <n>: target size for papers/core_set.csv
queries.md also supports core_size / core_set_size / dedupe_core_size (overrides default when present)

Examples

Smaller core set for fast iteration (non-A150++):
- python .codex/skills/dedupe-rank/scripts/run.py --workspace <ws> --core-size 25

Notes

This step may annotate papers/core_set.csv:reason with tags such as pinned_classic and prior_survey (deterministic, topic-aware guards for survey writing).
Systematic-review default: if the active pipeline is systematic-review and core_size is not specified, the script keeps the full deduped pool in papers/core_set.csv (so screening does not silently drop candidates).
This step is deterministic; reruns should be stable for the same inputs.

Troubleshooting

Common Issues

Issue: `papers/core_set.csv` is too small / empty

Symptom:

Core set has very few rows.

Causes:

Input papers/papers_raw.jsonl is small, or many rows are missing required fields.

Solutions:

Broaden retrieval (or provide a richer offline export) and rerun.
Lower --core-size only if you intentionally want a small core set.

Issue: Duplicates still appear after dedupe