research-extract

katyella/research-extract · 8 days ago
3
Total installs
2
Weekly installs
#60054
Site-wide rank
Install command
npx skills add https://github.com/katyella/research-extract --skill research-extract

Agent install distribution

amp 2
gemini-cli 2
github-copilot 2
codex 2
kimi-cli 2
opencode 2

Skill documentation

Research Extract

Ingest content from various sources and extract structured insights using parallel agent team processing.

First-Time Setup

Run once to install Python dependencies:

bash .claude/skills/research-extract/scripts/setup.sh

Besides installing Python packages, the script checks for required system dependencies (yt-dlp, pdftotext, whisper).

Commands

When the user invokes this skill, determine their intent:

Ingest Commands

  • “ingest [url]” or “add [url]” → Ingest a new source
  • “ingest [url] as [slug]” → Ingest with custom slug name
  • “ingest [filepath]” → Ingest a local file (text, PDF, audio)
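
For example (URL and slug illustrative):

ingest https://www.youtube.com/watch?v=abc123 as lex-altman-interview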

Analysis Commands

  • “analyze [slug]” or “extract from [slug]” → Run full extraction with agent team
  • “analyze [url] as [slug]” → Ingest AND run extraction in one step
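
For example, to ingest and extract in one step (URL and slug illustrative):

analyze https://www.paulgraham.com/foundermode.html as paul-graham-founder-mode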

Variant Commands

  • “variants [slug]” → Generate Show Notes + Cheat Sheet HTML from consolidated JSON

Query Commands

  • “list sources” → Show all ingested sources
  • “show [slug]” → Show source details and extractions
  • “progress” → Check extraction progress

Source Naming

Always use descriptive slugs, not numeric IDs.

When ingesting, derive or ask for a slug:

  • YouTube: mfm-6-ideas, lex-altman-interview
  • Blog: paul-graham-founder-mode
  • File: use the filename without its extension (e.g., founder-interview.pdf → founder-interview)

The slug is used for all file paths and database lookups.


Parallel Extraction Workflow

This is the core workflow for extracting insights from content.

Step 1: Ingest the source

python3 .claude/skills/research-extract/scripts/ingest.py "[URL_OR_PATH]" --slug [SLUG]

This will:

  • Auto-detect source type (YouTube, blog, PDF, text, audio)
  • Download captions or transcribe with Whisper
  • Store metadata and transcript in .research-extract/sources/
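
A concrete invocation for a local file might look like this (path and slug are illustrative):

python3 .claude/skills/research-extract/scripts/ingest.py "papers/founder-interview.pdf" --slug founder-interview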

Step 2: Chunk the transcript

python3 .claude/skills/research-extract/scripts/extract.py chunk --slug [SLUG]

This will:

  • Split the transcript into ~15k-character chunks
  • Save chunks to .research-extract/chunks/[slug]_chunk_N.json
  • Initialize progress tracking
  • Report number of chunks created
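
For example, a ~90k-character transcript yields six chunks, saved as .research-extract/chunks/lex-altman-interview_chunk_0.json through .research-extract/chunks/lex-altman-interview_chunk_5.json (illustrative slug).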

Step 3: Create team and spawn teammates

Create an Agent Team to coordinate parallel extraction:

3a. Create the team:

TeamCreate with team_name: "extract-[SLUG]"

3b. Create tasks – one task per chunk. This lets teammates self-balance workload:

For a transcript with 6 chunks, create 6 tasks:

TaskCreate: "Extract chunk 0 for [SLUG]"
TaskCreate: "Extract chunk 1 for [SLUG]"
...
TaskCreate: "Extract chunk 5 for [SLUG]"

Each task description should include the full processing instructions (see teammate prompt below).

3c. Spawn 3 teammates in a SINGLE message (parallel tool calls):

Task tool (x3) with:
  team_name: "extract-[SLUG]"
  name: "extractor-1" / "extractor-2" / "extractor-3"
  subagent_type: "general-purpose"

Teammate prompt template:

You are an extraction teammate on team "extract-[SLUG]".

Your workflow:
1. Check TaskList for pending tasks with no owner
2. Claim a task using TaskUpdate (set owner to your name, status to in_progress)
3. Process the chunk:
   a. Read chunk: python3 .claude/skills/research-extract/scripts/process_chunk.py show --slug [SLUG] --chunk-id [CHUNK_ID]
   b. Analyze the content for: key insights, notable quotes, themes, challenges, solutions/approaches, action items, frameworks/models, and external resources
   c. Save result: python3 .claude/skills/research-extract/scripts/process_chunk.py save --slug [SLUG] --chunk-id [CHUNK_ID] --result '<json>'
4. Mark task completed via TaskUpdate
5. Check TaskList again - claim the next pending task
6. Repeat until no pending tasks remain

JSON format for save:
{
  "chunk_id": X,
  "key_insights": [
    {"title": "...", "description": "...", "speaker": "Name", "quote": "...", "significance": "..."}
  ],
  "quotes": [
    {"quote": "...", "speaker": "Name", "context": "..."}
  ],
  "themes": ["theme1", "theme2"],
  "challenges": [
    {"title": "...", "description": "...", "speaker": "Name", "quote": "..."}
  ],
  "solutions_approaches": [
    {"title": "...", "description": "...", "speaker": "Name", "implementation": "...", "quote": "..."}
  ],
  "action_items": [
    {"action": "...", "context": "...", "speaker": "Name"}
  ],
  "frameworks_models": [
    {"name": "...", "description": "...", "speaker": "Name", "quote": "..."}
  ],
  "external_resources": [
    {"type": "book|podcast|tool|person|course|website|other", "name": "...", "author": "...", "speaker": "who mentioned", "context": "...", "quote": "..."}
  ]
}

IMPORTANT: Capture ALL external resources - books, people referenced/quoted, tools, platforms, podcasts, courses, frameworks.
Work autonomously. Claim tasks, process them, move to next. Stop when no tasks remain.

IMPORTANT:

  • Create 1 task per chunk (teammates self-balance by claiming work)
  • Spawn ALL teammates in a SINGLE message (parallel tool calls)
  • Use subagent_type="general-purpose" and team_name="extract-[SLUG]"
  • 3 teammates is the default; adjust only for very small (1-2 chunks = 1 teammate) or very large (15+ chunks = 5 teammates) jobs

Step 4: Wait for completion

Teammates send automatic idle notifications as they finish work. Check progress via:

TaskList - shows task completion status across all teammates

Fallback:

python3 .claude/skills/research-extract/scripts/extract.py progress

Step 5: Merge results and clean up team

Once all tasks show completed in TaskList:

5a. Merge results:

python3 .claude/skills/research-extract/scripts/extract.py merge --slug [SLUG]

This combines all chunk extractions into .research-extract/exports/[slug]_merged.json

5b. Shut down teammates:

SendMessage type: "shutdown_request" to extractor-1
SendMessage type: "shutdown_request" to extractor-2
SendMessage type: "shutdown_request" to extractor-3

5c. Clean up team:

TeamDelete

Step 6: Consolidate and rank

After merging, create a consolidated analysis that:

  • Deduplicates similar insights, challenges, and solutions
  • Ranks by frequency/importance
  • Identifies top quotes
  • Groups external resources by type
  • Counts theme frequency

Save to .research-extract/exports/[slug]_consolidated.json with this structure:

{
  "source": "Source title",
  "speakers": ["Speaker 1", "Speaker 2"],
  "url": "https://...",

  "key_insights": [
    {
      "rank": 1,
      "title": "Insight title",
      "description": "Synthesized description",
      "evidence": ["quote 1", "quote 2"],
      "speakers": ["Speaker 1"]
    }
  ],

  "themes": [
    {"theme": "Theme name", "frequency": 5}
  ],

  "challenges": [
    {
      "rank": 1,
      "title": "Challenge title",
      "description": "What the challenge is",
      "evidence": ["quote"],
      "speakers": ["Speaker 1"]
    }
  ],

  "solutions_approaches": [
    {
      "rank": 1,
      "title": "Solution title",
      "description": "What it is",
      "implementation": "How to do it",
      "evidence": ["quote"],
      "speakers": ["Speaker 1"]
    }
  ],

  "action_items": [
    {"action": "What to do", "context": "Why", "speaker": "Who"}
  ],

  "frameworks_models": [
    {"name": "Framework name", "description": "How it works", "speaker": "Who"}
  ],

  "top_quotes": [
    {
      "quote": "...",
      "speaker": "...",
      "context": "Why this quote matters"
    }
  ],

  "external_resources": {
    "books": [
      {"title": "Book Title", "author": "Author", "mentioned_by": "Speaker", "context": "Why referenced"}
    ],
    "people": [
      {"name": "Person Name", "mentioned_by": "Speaker", "context": "Why referenced"}
    ],
    "tools": [
      {"name": "Tool Name", "mentioned_by": "Speaker", "context": "How used"}
    ],
    "other": [
      {"type": "podcast|course|website|framework", "name": "...", "mentioned_by": "Speaker", "context": "..."}
    ]
  },

  "metadata": {
    "extraction_date": "2025-01-01T00:00:00",
    "total_chunks": 6,
    "source_type": "youtube"
  }
}

Step 7: Generate writeup files

After consolidation, generate two markdown files:

File 1: .research-extract/writeups/[slug]-notes.md (Quick Reference)

Structure:

  • Header with source, speakers, date
  • Sections matching main topics from the content
  • Bullet points with key facts
  • Timestamps as clickable links where available: [(4:57)](url#t=4m57s)
  • Key quotes in blockquotes with speaker attribution
  • Table of best quotes at bottom
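
A minimal skeleton for the notes file, with illustrative title, speakers, timestamps, and quotes:

# [Source Title]

Source: [url] | Speakers: Speaker 1, Speaker 2 | Date: 2025-01-01

## Major Topic One

- Key fact from this section [(4:57)](url#t=4m57s)

> "A notable quote on this topic." (Speaker 1)

## Best Quotes

| Quote | Speaker | Timestamp |
|-------|---------|-----------|
| "..." | Speaker 1 | [(4:57)](url#t=4m57s) |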

File 2: .research-extract/writeups/[slug]-writeup.md (Essay)

Structure:

  • Introduction framing the topic’s significance
  • Body sections exploring each major theme with developed paragraphs
  • Quotes woven into prose with timestamp citations
  • Conclusion synthesizing implications

Writing style: See STYLE_GUIDE.md for detailed guidance on paragraph style, quote integration, voice, and punctuation rules.

Step 8: Generate variant pages (on demand)

When the user runs variants [slug]:

  1. Read consolidated JSON from .research-extract/exports/[slug]_consolidated.json
  2. Read raw transcript from .research-extract/sources/[slug].txt (for timestamps)
  3. Read template specs from VARIANTS.md
  4. Create output directory: .research-extract/variants/[slug]/
  5. Generate two self-contained HTML files:
    • show-notes.html — podcast companion page with timestamped sections, quote callouts, resource grids
    • cheat-sheet.html — print-optimized landscape reference card with color-coded sections
  6. Open both files in the browser (the open command is macOS-specific; on Linux, run xdg-open on each file): open .research-extract/variants/[slug]/show-notes.html .research-extract/variants/[slug]/cheat-sheet.html

Follow all layout, CSS, and generation rules in VARIANTS.md exactly. Every insight, quote, tool, framework, and action item from the consolidated JSON should appear in at least one template.

Step 9: Present summary to user

Show:

  • Source slug and title
  • Key insights found (ranked)
  • Challenges and solutions found (ranked)
  • Key themes with frequency
  • Top 3-5 quotes
  • External resources mentioned (books, people, tools, etc.)
  • Links to generated files

Storage Locations

All data stored in {project_root}/.research-extract/:

  • Source metadata: sources/[slug].json
  • Transcripts: sources/[slug].txt
  • Chunks: chunks/[slug]_chunk_N.json
  • Chunk results: exports/[slug]_chunk_N_result.json
  • Merged results: exports/[slug]_merged.json
  • Consolidated: exports/[slug]_consolidated.json
  • Progress: extraction_progress_[slug].json
  • Notes writeup: writeups/[slug]-notes.md
  • Essay writeup: writeups/[slug]-writeup.md
  • Variant pages: variants/[slug]/show-notes.html, variants/[slug]/cheat-sheet.html

For Queries

List sources (reads sources/*.json flat files):

python3 -c "
import sys; sys.path.insert(0, '.claude/skills/research-extract/scripts')
from db import list_sources
for s in list_sources():
    print(f'{s[\"slug\"]}: [{s[\"source_type\"]}] {s[\"title\"]}')"
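
With two sources ingested, the output might look like this (titles illustrative):

lex-altman-interview: [youtube] Sam Altman Interview
paul-graham-founder-mode: [blog] Founder Mode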

Get source details (reads sources/{slug}.json + sources/{slug}.txt):

python3 -c "
import sys; sys.path.insert(0, '.claude/skills/research-extract/scripts')
from db import get_source_by_slug
source = get_source_by_slug('[SLUG]')
print(f'Title: {source[\"title\"]}')
print(f'Type: {source[\"source_type\"]}')"

Tips

  • Always use Agent Teams for large transcripts (>20k chars)
  • 3 teammates is the default; scale to 1 for tiny jobs (1-2 chunks) or 5 for large (15+ chunks)
  • Create 1 task per chunk; teammates self-balance by claiming work from TaskList
  • Merge deduplicates overlapping insights automatically
  • The consolidation step is where ranking and synthesis happen
  • Writeups and variants require the consolidation step to complete first
  • Run variants [slug] to generate Show Notes + Cheat Sheet HTML pages
  • Always shut down teammates and call TeamDelete after merging
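
Putting It Together

A typical end-to-end run, assuming an illustrative URL and slug (team coordination happens through the Agent Team tools, not the shell):

python3 .claude/skills/research-extract/scripts/ingest.py "https://www.youtube.com/watch?v=abc123" --slug lex-altman-interview
python3 .claude/skills/research-extract/scripts/extract.py chunk --slug lex-altman-interview
# TeamCreate, TaskCreate per chunk, spawn teammates, wait for all tasks to complete
python3 .claude/skills/research-extract/scripts/extract.py merge --slug lex-altman-interview
# consolidate, generate writeups, then on demand: variants lex-altman-interview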