github-research
npx skills add https://github.com/lingzhi227/agent-research-skills --skill github-research
GitHub Research Skill
Trigger
Activate this skill when the user wants to:
- “Find repos for [topic]”, “GitHub research on [topic]”
- “Analyze open-source code for [topic]”
- “Find implementations of [paper/technique]”
- “Which repos implement [algorithm]?”
- Uses the `/github-research <deep-research-output-dir>` slash command
Overview
This skill systematically discovers, evaluates, and deeply analyzes GitHub repositories related to a research topic. It reads deep-research output (paper database, phase reports, code references) and produces an actionable integration blueprint for reusing open-source code.
Installation: `~/.claude/skills/github-research/` → scripts, references, and this skill definition.
Output: ./github-research-output/{slug}/ relative to the current working directory.
Input: A deep-research output directory (containing paper_db.jsonl, phase reports, code_repos.md, etc.)
6-Phase Pipeline
Phase 1: Intake → Extract refs, URLs, keywords from deep-research output
Phase 2: Discovery → Multi-source broad GitHub search (50-200 repos)
Phase 3: Filtering → Score & rank → select top 15-30 repos
Phase 4: Deep Dive → Clone & deeply analyze top 8-15 repos (code reading)
Phase 5: Analysis → Per-repo reports + cross-repo comparison
Phase 6: Blueprint → Integration/reuse plan for research topic
Output Directory Structure
```
github-research-output/{slug}/
├── repo_db.jsonl                  # Master repo database
├── phase1_intake/
│   ├── extracted_refs.jsonl       # URLs, keywords, paper-repo links
│   └── intake_summary.md
├── phase2_discovery/
│   ├── search_results/            # Raw JSONL from each search
│   └── discovery_log.md
├── phase3_filtering/
│   ├── ranked_repos.jsonl         # Scored & ranked subset
│   └── filtering_report.md
├── phase4_deep_dive/
│   ├── repos/                     # Cloned repos (shallow)
│   ├── analyses/                  # Per-repo analysis .md files
│   └── deep_dive_summary.md
├── phase5_analysis/
│   ├── comparison_matrix.md       # Cross-repo comparison
│   ├── technique_map.md           # Paper concept → code mapping
│   └── analysis_report.md
└── phase6_blueprint/
    ├── integration_plan.md        # How to combine repos
    ├── reuse_catalog.md           # Reusable components catalog
    ├── final_report.md            # Complete compiled report
    └── blueprint_summary.md
```
Scripts Reference
All scripts are Python 3, stdlib-only, located in ~/.claude/skills/github-research/scripts/.
| Script | Purpose | Key Flags |
|---|---|---|
| extract_research_refs.py | Parse deep-research output for GitHub URLs, paper refs, keywords | --research-dir, --output |
| search_github.py | Search GitHub repos via `gh api` | --query, --language, --min-stars, --sort, --max-results, --topic, --output |
| search_github_code.py | Search GitHub code for implementations | --query, --language, --filename, --max-results, --output |
| search_paperswithcode.py | Search Papers With Code for paper→repo mappings | --paper-title, --arxiv-id, --query, --output |
| repo_db.py | JSONL repo database management | subcommands: merge, filter, score, search, tag, stats, export, rank |
| repo_metadata.py | Fetch detailed metadata via `gh api` | --repos, --input, --output, --delay |
| clone_repo.py | Shallow-clone repos for analysis | --repo, --output-dir, --depth, --branch |
| analyze_repo_structure.py | Map file tree, key files, LOC stats | --repo-dir, --output |
| extract_dependencies.py | Extract and parse dependency files | --repo-dir, --output |
| find_implementations.py | Search cloned repo for specific code patterns | --repo-dir, --patterns, --output |
| repo_readme_fetch.py | Fetch README without cloning | --repos, --input, --output, --max-chars |
| compare_repos.py | Generate comparison matrix across repos | --input, --output |
| compile_github_report.py | Assemble final report from all phases | --topic-dir |
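`repo_db.py` manages the master database as plain JSONL, one JSON object per line, which stdlib Python can read and write directly. A minimal round-trip sketch; the helper names and the `repo_id`/`stars` fields here are illustrative assumptions, not the script's guaranteed schema:

```python
import json
from pathlib import Path

def load_repo_db(path):
    """Load a JSONL repo database into a list of dicts, skipping blank lines."""
    records = []
    for line in Path(path).read_text().splitlines():
        if line.strip():
            records.append(json.loads(line))
    return records

def save_repo_db(records, path):
    """Write records back out as one JSON object per line."""
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```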
Phase 1: Intake
Goal: Extract all relevant references, URLs, and keywords from the deep-research output.
Steps
- Create the output directory structure:

  ```bash
  SLUG=$(echo "$TOPIC" | tr '[:upper:]' '[:lower:]' | tr ' ' '-' | tr -cd 'a-z0-9-')
  mkdir -p github-research-output/$SLUG/{phase1_intake,phase2_discovery/search_results,phase3_filtering,phase4_deep_dive/{repos,analyses},phase5_analysis,phase6_blueprint}
  ```

- Extract references from the deep-research output:

  ```bash
  python ~/.claude/skills/github-research/scripts/extract_research_refs.py \
    --research-dir <deep-research-output-dir> \
    --output github-research-output/$SLUG/phase1_intake/extracted_refs.jsonl
  ```

- Review extracted refs: read the generated JSONL. Note:
- GitHub URLs found directly in reports
- Paper titles and arxiv IDs (for Papers With Code lookup)
- Research keywords and themes (for GitHub search queries)
- Write intake summary: create `phase1_intake/intake_summary.md` with:
- Number of direct GitHub URLs found
- Number of papers with potential code links
- Key research themes extracted
- Planned search queries for Phase 2
Checkpoint
- `extracted_refs.jsonl` exists with entries
- `intake_summary.md` written
- Search strategy documented
Phase 2: Discovery
Goal: Cast a wide net to find 50-200 candidate repos from multiple sources.
Steps
- Search by direct URLs: for any GitHub URLs from Phase 1, fetch metadata:

  ```bash
  python ~/.claude/skills/github-research/scripts/repo_metadata.py \
    --repos owner1/name1 owner2/name2 ... \
    --output github-research-output/$SLUG/phase2_discovery/search_results/direct_urls.jsonl
  ```

- Search Papers With Code: for each paper with an arxiv ID:

  ```bash
  python ~/.claude/skills/github-research/scripts/search_paperswithcode.py \
    --arxiv-id 2401.12345 \
    --output github-research-output/$SLUG/phase2_discovery/search_results/pwc_2401.12345.jsonl
  ```

- Search GitHub by keywords (3-8 queries based on research themes):

  ```bash
  python ~/.claude/skills/github-research/scripts/search_github.py \
    --query "multi-agent LLM coordination" \
    --min-stars 10 --sort stars --max-results 50 \
    --output github-research-output/$SLUG/phase2_discovery/search_results/gh_query1.jsonl
  ```

- Search GitHub code (for specific implementations):

  ```bash
  python ~/.claude/skills/github-research/scripts/search_github_code.py \
    --query "class MultiAgentOrchestrator" \
    --language python --max-results 30 \
    --output github-research-output/$SLUG/phase2_discovery/search_results/code_query1.jsonl
  ```

- Fetch READMEs for repos that lack descriptions:

  ```bash
  python ~/.claude/skills/github-research/scripts/repo_readme_fetch.py \
    --input <repos.jsonl> \
    --output github-research-output/$SLUG/phase2_discovery/search_results/readmes.jsonl
  ```

- Merge all results into the master database:

  ```bash
  python ~/.claude/skills/github-research/scripts/repo_db.py merge \
    --inputs github-research-output/$SLUG/phase2_discovery/search_results/*.jsonl \
    --output github-research-output/$SLUG/repo_db.jsonl
  ```

- Write discovery log: create `phase2_discovery/discovery_log.md` with the search queries used, results per source, and total unique repos found.
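The merge step combines per-source JSONL files into one database while deduplicating repos. A stdlib sketch of that idea; the `repo_id` field name and the first-record-wins policy are assumptions for illustration, not necessarily what `repo_db.py merge` does:

```python
import json
from pathlib import Path

def merge_search_results(input_paths, output_path):
    """Merge JSONL search results, keeping the first record seen per repo_id."""
    seen = set()
    merged = []
    for path in input_paths:
        for line in Path(path).read_text().splitlines():
            if not line.strip():
                continue
            rec = json.loads(line)
            rid = rec.get("repo_id")  # assumed "owner/name" identifier field
            if rid in seen:
                continue
            seen.add(rid)
            merged.append(rec)
    with open(output_path, "w") as f:
        for rec in merged:
            f.write(json.dumps(rec) + "\n")
    return len(merged)
```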
Rate Limits
- GitHub search API: 30 requests/minute (authenticated)
- Papers With Code API: No strict limit but be respectful (1 req/sec)
- Add `--delay 1.0` to batch operations when needed
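A batch loop that respects these limits only needs a fixed sleep between calls; a minimal sketch (this helper is hypothetical, not one of the skill's scripts):

```python
import time

def throttled(items, calls_per_minute=30):
    """Yield items one at a time, sleeping between them to stay under the limit."""
    delay = 60.0 / calls_per_minute  # e.g. 2.0s spacing for 30 req/min
    for i, item in enumerate(items):
        if i:  # no sleep before the first call
            time.sleep(delay)
        yield item
```

Usage would look like `for repo in throttled(repo_list): fetch_metadata(repo)`, where `fetch_metadata` stands in for whatever per-repo API call the phase makes.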
Checkpoint
- `repo_db.jsonl` populated with 50-200 repos
- `discovery_log.md` with search details
Phase 3: Filtering
Goal: Score and rank repos, select top 15-30 for deeper analysis.
Steps
- Enrich metadata for all repos:

  ```bash
  python ~/.claude/skills/github-research/scripts/repo_metadata.py \
    --input github-research-output/$SLUG/repo_db.jsonl \
    --output github-research-output/$SLUG/repo_db.jsonl \
    --delay 0.5
  ```

- Score repos (quality + activity scores):

  ```bash
  python ~/.claude/skills/github-research/scripts/repo_db.py score \
    --input github-research-output/$SLUG/repo_db.jsonl \
    --output github-research-output/$SLUG/repo_db.jsonl
  ```

- LLM relevance scoring: read through the top ~50 repos (by quality_score) and assign `relevance_score` (0.0-1.0) based on:
  - Direct relevance to the research topic
  - Implementation completeness
  - Code quality signals (from README, description)

  Update the relevance scores:

  ```bash
  python ~/.claude/skills/github-research/scripts/repo_db.py tag \
    --input github-research-output/$SLUG/repo_db.jsonl \
    --ids owner/name --tags "relevance:0.85"
  ```

- Compute composite scores and rank:

  ```bash
  python ~/.claude/skills/github-research/scripts/repo_db.py score \
    --input github-research-output/$SLUG/repo_db.jsonl \
    --output github-research-output/$SLUG/repo_db.jsonl
  python ~/.claude/skills/github-research/scripts/repo_db.py rank \
    --input github-research-output/$SLUG/repo_db.jsonl \
    --output github-research-output/$SLUG/phase3_filtering/ranked_repos.jsonl \
    --by composite_score
  ```

- Select top repos: filter to the top 15-30:

  ```bash
  python ~/.claude/skills/github-research/scripts/repo_db.py filter \
    --input github-research-output/$SLUG/phase3_filtering/ranked_repos.jsonl \
    --output github-research-output/$SLUG/phase3_filtering/ranked_repos.jsonl \
    --max-repos 30 --not-archived
  ```

- Write filtering report: create `phase3_filtering/filtering_report.md` with:
  - Stats before/after filtering
  - Score distributions
  - Top 30 repos with scores and rationale
Scoring Formula

```text
activity_score  = sigmoid((days_since_push < 90) * 0.4 + has_recent_commits * 0.3 + open_issues_ratio * 0.3)
quality_score   = normalize(log(stars+1) * 0.3 + log(forks+1) * 0.2 + has_license * 0.15 + has_readme * 0.15 + not_archived * 0.2)
composite_score = relevance * 0.4 + quality * 0.35 + activity * 0.25
```
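Read literally, the formula can be sketched in Python. `sigmoid` is the standard logistic function, but `normalize` depends on the whole batch of repos, so only the per-repo pieces are reproduced here; this is an illustrative reading of the formula, not `repo_db.py score`'s exact implementation:

```python
import math

def sigmoid(x):
    """Standard logistic squashing function, mapping any real to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def quality_raw(stars, forks, has_license, has_readme, not_archived):
    """Pre-normalization quality signal, term-by-term from the formula above."""
    return (math.log(stars + 1) * 0.3 + math.log(forks + 1) * 0.2
            + has_license * 0.15 + has_readme * 0.15 + not_archived * 0.2)

def composite_score(relevance, quality, activity):
    """Final ranking blend: relevance 40%, quality 35%, activity 25%."""
    return relevance * 0.4 + quality * 0.35 + activity * 0.25
```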
Checkpoint
- `ranked_repos.jsonl` with 15-30 repos
- `filtering_report.md` with scoring details
Phase 4: Deep Dive
Goal: Clone and deeply analyze the top 8-15 repos.
Steps
- Select repos for deep dive: take the top 8-15 from the ranked list.

- Clone each repo (shallow):

  ```bash
  python ~/.claude/skills/github-research/scripts/clone_repo.py \
    --repo owner/name \
    --output-dir github-research-output/$SLUG/phase4_deep_dive/repos/
  ```

- Analyze structure for each cloned repo:

  ```bash
  python ~/.claude/skills/github-research/scripts/analyze_repo_structure.py \
    --repo-dir github-research-output/$SLUG/phase4_deep_dive/repos/name/ \
    --output github-research-output/$SLUG/phase4_deep_dive/analyses/name_structure.json
  ```

- Extract dependencies:

  ```bash
  python ~/.claude/skills/github-research/scripts/extract_dependencies.py \
    --repo-dir github-research-output/$SLUG/phase4_deep_dive/repos/name/ \
    --output github-research-output/$SLUG/phase4_deep_dive/analyses/name_deps.json
  ```

- Find implementations: search for key algorithms/concepts from the research:

  ```bash
  python ~/.claude/skills/github-research/scripts/find_implementations.py \
    --repo-dir github-research-output/$SLUG/phase4_deep_dive/repos/name/ \
    --patterns "class Transformer" "def forward" "attention" \
    --output github-research-output/$SLUG/phase4_deep_dive/analyses/name_impls.jsonl
  ```

- Deep code reading: for each repo, READ the key source files identified by structure analysis. Write a per-repo analysis in `phase4_deep_dive/analyses/{name}_analysis.md`:
  - Architecture overview
  - Key algorithms implemented
  - Code quality assessment
  - API / interface design
  - Dependencies and requirements
  - Strengths and limitations
  - Reusability assessment (how easy it is to extract components)

- Write deep dive summary: `phase4_deep_dive/deep_dive_summary.md`
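The find-implementations step boils down to scanning a cloned repo's source files for literal pattern hits and recording where they occur; a stdlib sketch of that idea (illustrative, not the actual `find_implementations.py` logic):

```python
import re
from pathlib import Path

def find_patterns(repo_dir, patterns, suffixes=(".py",)):
    """Return (file, line_no, line) hits for each literal pattern in source files."""
    compiled = [re.compile(re.escape(p)) for p in patterns]
    hits = []
    for path in sorted(Path(repo_dir).rglob("*")):
        if path.suffix not in suffixes or not path.is_file():
            continue
        # Read leniently: cloned repos may contain oddly encoded files.
        for n, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            for pat in compiled:
                if pat.search(line):
                    hits.append((str(path), n, line.strip()))
    return hits
```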
IMPORTANT: Actually Read Code
Do NOT just summarize READMEs. You must:
- Read the main source files (entry points, core modules)
- Understand the actual implementation approach
- Identify specific functions/classes that implement research concepts
- Note code patterns, design decisions, and trade-offs
Checkpoint
- Repos cloned in `repos/`
- Per-repo analysis files in `analyses/`
- `deep_dive_summary.md` written
Phase 5: Analysis
Goal: Cross-repo comparison and technique-to-code mapping.
Steps
- Generate comparison matrix:

  ```bash
  python ~/.claude/skills/github-research/scripts/compare_repos.py \
    --input github-research-output/$SLUG/phase4_deep_dive/analyses/ \
    --output github-research-output/$SLUG/phase5_analysis/comparison.json
  ```

- Write comparison matrix: create `phase5_analysis/comparison_matrix.md` with:
  - A table comparing repos across dimensions (language, LOC, stars, framework, license, tests)
  - Dependency overlap analysis
  - Strengths/weaknesses per repo

- Write technique map: create `phase5_analysis/technique_map.md` with:
  - A mapping from each paper concept / research technique → specific repo + file + function
  - Gaps identified (techniques with no implementation found)
  - Alternative implementations of the same concept

- Write analysis report: `phase5_analysis/analysis_report.md` with:
  - Executive summary of findings
  - Key insights from code analysis
  - Recommendations for which repos to use for which purposes
Checkpoint
- `comparison_matrix.md` with repo comparison table
- `technique_map.md` mapping concepts to code
- `analysis_report.md` with findings
Phase 6: Blueprint
Goal: Produce an actionable integration and reuse plan.
Steps
- Write integration plan: `phase6_blueprint/integration_plan.md` with:
  - Recommended architecture for combining repos
  - Step-by-step integration approach
  - Dependency resolution strategy
  - Potential conflicts and how to resolve them

- Write reuse catalog: `phase6_blueprint/reuse_catalog.md` with:
  - For each reusable component: source repo, file path, function/class, what it does, how to extract it
  - License compatibility matrix
  - Effort estimates (easy/medium/hard to integrate)

- Compile final report:

  ```bash
  python ~/.claude/skills/github-research/scripts/compile_github_report.py \
    --topic-dir github-research-output/$SLUG/
  ```

- Write blueprint summary: `phase6_blueprint/blueprint_summary.md` with:
  - One-page executive summary
  - Top 5 repos and why
  - Recommended next steps
Checkpoint
- `integration_plan.md` complete
- `reuse_catalog.md` with component catalog
- `final_report.md` compiled
- `blueprint_summary.md` as executive summary
Quality Conventions
- Repos are ranked by composite score: `relevance × 0.4 + quality × 0.35 + activity × 0.25`
- Deep dive requires reading actual code, not just READMEs
- Integration blueprint must map paper concepts → specific code files/functions
- Incremental saves: Each phase writes to disk immediately
- Checkpoint recovery: Can resume from any phase by checking what outputs exist
- All scripts are stdlib-only Python → no pip installs needed
- `gh` CLI is required for GitHub API access (must be authenticated)
- Deduplication by `repo_id` (owner/name) across all searches
- Rate limit awareness: respect GitHub search API limits (30 req/min)
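The checkpoint-recovery convention can be implemented by probing one marker file per phase and resuming at the first gap; a sketch using filenames from the output structure above (the choice of marker files is an assumption about which output best represents phase completion):

```python
from pathlib import Path

# One representative output per phase, in pipeline order.
PHASE_MARKERS = [
    ("phase1_intake", "phase1_intake/intake_summary.md"),
    ("phase2_discovery", "phase2_discovery/discovery_log.md"),
    ("phase3_filtering", "phase3_filtering/filtering_report.md"),
    ("phase4_deep_dive", "phase4_deep_dive/deep_dive_summary.md"),
    ("phase5_analysis", "phase5_analysis/analysis_report.md"),
    ("phase6_blueprint", "phase6_blueprint/blueprint_summary.md"),
]

def next_phase(topic_dir):
    """Return the first phase whose marker output is missing, or None if all done."""
    root = Path(topic_dir)
    for phase, marker in PHASE_MARKERS:
        if not (root / marker).exists():
            return phase
    return None
```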
Error Handling
- If `gh` is not installed: warn the user and provide installation instructions
- If a repo is archived/deleted: skip gracefully, note in log
- If clone fails: skip, note in log, continue with remaining repos
- If Papers With Code API is down: skip, rely on GitHub search only
- Always write partial progress to disk so work is not lost
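The skip-gracefully-and-continue rules above fit a small driver loop; an illustrative sketch, where `steps` is any list of `(name, callable)` pairs standing in for per-repo operations like cloning or analysis:

```python
def run_steps(steps, log):
    """Run each (name, fn) step; on failure, record the error and keep going."""
    completed, skipped = [], []
    for name, fn in steps:
        try:
            fn()
            completed.append(name)
        except Exception as exc:  # skip gracefully, note in log, continue
            skipped.append(name)
            log.append(f"{name} failed: {exc}")
    return completed, skipped
```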
References
- See `references/phase-guide.md` for detailed phase execution guidance
- Deep-research skill: `~/.claude/skills/deep-research/SKILL.md`
- Paper database pattern: `~/.claude/skills/deep-research/scripts/paper_db.py`