content-claw
npx skills add https://github.com/sene1337/content-claw --skill content-claw
Agent 安装分布
Skill 文档
Content Claw ð¦
Analyze external content against the user’s goals, frameworks, and time value. Extract the insights so they don’t have to consume the full content unless it’s truly worth their time.
Folder structure:
docs/content-claw/caught/â ð Read and ð¬ Watch verdicts. The keepers.docs/content-claw/released/skim/â ð Skim verdicts. Monthly batch review, then purged.docs/content-claw/released/skip/â âï¸ Skip verdicts. Auto-purge after 14 days.
File naming scheme: {category}--{source}--{title-slug}.md
- Category:
ai-agents,bitcoin,health,media,business,tech,finance,culture,ops(extend as needed) - Source type:
article,video,podcast,thread,paper - Title slug: Short descriptive slug using hyphens
- Separator: Double-dash
--(doesn’t conflict with hyphens in slugs) - Example:
ai-agents--article--clawvault-memory-architecture.md
At a glance you can see category, format, and topic without opening the file.
Workflow
Step -1: Duplicate Check ââ
Before ANY extraction work, check if this URL has already been reviewed:
grep -rl "<cleaned-URL>" docs/content-claw/(search all caught + released folders)- Also check partial URL matches (e.g., tweet ID, YouTube video ID) in case the URL format differs
- If found: Tell the user it’s already been reviewed, show the file path and verdict. Ask if they want a re-review or update.
- If not found: Proceed to Step 0.
This prevents duplicate files and wasted extraction work. No exceptions.
Step 0: Pre-Flight Checks â
Verify tools and set extraction strategy before starting. See references/environment-setup.md for full checklist.
Step 0.5: Sanitize URLs ð§¹
Before fetching or searching ANY shared URL, strip tracking parameters:
- Twitter/X: Remove
?s=,&s=,?t=,&t=,?ref_src=,&ref_src= - YouTube: Remove
&si=,?si=,&feature=,?feature=,&pp= - Universal: Remove
utm_source,utm_medium,utm_campaign,utm_term,utm_content,utm_id,fbclid,gclid,igshid,ref,mc_cid,mc_eid - Rule: Strip everything after
?or&that matches these patterns. Keep essential params (e.g., YouTubev=,t=for timestamps,list=for playlists).
Example: https://youtube.com/watch?v=abc123&si=tracking_garbage&t=120 â https://youtube.com/watch?v=abc123&t=120
Step 1: Detect Content Type
From the URL, determine:
- YouTube video â needs transcription
- Article/blog â needs extraction
- Tweet/thread â needs thread unwinding
- Podcast â needs transcription (if audio available)
Duration check (YouTube/podcasts): Before spawning extraction, check video length via web_search or yt-dlp metadata (yt-dlp --print duration [URL]). If >60 minutes, warn the user:
“This is a [X]-minute video. Full transcription will take a while. Want me to: (a) do full extraction, (b) extract intro + key chapters only, or (c) search for existing summaries/highlights?”
For videos <60 minutes, proceed automatically.
Step 2: Extract Content
â ï¸ Context Window Protection Rule: The purpose of this skill is to keep the main agent’s context clean. Heavy extraction work MUST happen in a sub-agent, not in the main session.
- Tier 1 (web-based, <3k tokens): OK to run in main agent â lightweight metadata and search results
- Tier 2 (transcript/content extraction, 5k+ tokens): MUST spawn a sub-agent. No exceptions. The main agent’s job is to orchestrate and review, not to do heavy extraction.
- Model routing: Sub-agents should use a cheaper model (e.g., Sonnet) for extraction work. The main agent handles judgment (verdict, relevance, actions) on its primary model.
If Tier 1 produces enough content for a solid review â proceed to Step 3. If Tier 1 is insufficient â spawn a sub-agent for Tier 2. Do NOT do Tier 2 work yourself.
Use the fallback hierarchy â stop at the first tier that works:
Tier 1: Lightweight web extraction (~500-2.5k tokens) â Try first, main agent
- YouTube:
web_search "[video title] transcript"orweb_search "[video-id] transcript" - YouTube:
web_fetchon known transcript services (e.g.,kome.ai/api/transcript?url=[URL]) - YouTube: oEmbed for metadata (
https://www.youtube.com/oembed?url=[URL]&format=json) - Articles:
web_fetchwith markdown extraction - Tweets:
api.fxtwitter.com/[user]/status/[id] - If sufficient content extracted â skip to Step 3
Tier 2: Sub-agent extraction (~5k tokens) â ALWAYS a sub-agent, NEVER main agent
- Spawn sub-agent (use cheaper model like Sonnet) with explicit instructions (see references/sub-agent-prompt.md)
- Critical: After spawning, WAIT for the sub-agent to complete. Do NOT attempt parallel extraction.
- Verify output file exists and has substantive content before proceeding
- Sub-agent session is disposable â its context gets thrown away after extraction, keeping your main context clean
Tier 3: Ask user for help (~1k tokens) â Last resort
- “I couldn’t extract this content automatically. Could you paste the transcript/key points, or should I work from the title and description only?”
Retry Limits (hard caps):
web_search: max 2 attempts per URLweb_fetch: max 1 attempt per specific URLyt-dlp: max 1 attempt (if blocked, it’s blocked â don’t retry)- Total retries across all methods: abort if >5
- If a method returns an error, log it and move to next tier â don’t repeat
Output Verification (after any extraction):
- Check that the output file exists at the expected path
- Check that it has substantive content (not empty, not just a plan/outline)
- If verification fails after Tier 2, fall through to Tier 3 â don’t retry the same approach
Step 3: Review Against Frameworks
After verified extraction, review the content against:
- Relevance Filter: Does this relate to the user’s active goals and priorities? Check USER.md if it exists.
- Novelty Check: Does the user (or their agent) already have this knowledge? Check existing docs for overlap.
- Action Density: How many actionable insights per minute of content? High action density = worth consuming. Low = extract and move on.
- Time Value Test: Is consuming this content the highest-value use of the user’s time, or can the insight be captured faster from the extraction?
Step 4: Deliver Verdict
Format the response as:
Verdict emoji + label:
- âï¸ Skip â Already known or not relevant. Here’s the 1-2 things worth noting.
- ð Skim â Some useful bits but not worth full attention. Here are the highlights.
- ð Read â Worth reading the summary. Key frameworks extracted below.
- ð¬ Watch â Visual/demo content that loses value in text. Worth the time investment.
Then provide:
ð Actions (things to do based on this content):
- [Immediate actions, research tasks, things to add to systems]
ð¡ Insights (worth knowing, no action needed):
- [Key frameworks, mental models, interesting data points]
Also note:
- What’s new vs what’s already known
- Source quality assessment (credible? experienced? selling something?)
Step 5: Save Review
Source block (required at top of every review file):
## Source
- **Title:** [Content title]
- **Author:** [Creator name] (@handle if applicable)
- **URL:** [Clean URL â tracking params stripped]
- **Date:** [Publication date]
- **Length:** [Duration or word count]
- **Shared by:** [Who shared it] via [channel], [date]
- **Context:** [1-2 sentences: what was happening when the link was shared â what project, conversation, or train of thought prompted it. This is for future memory recall.]
Filing rules by verdict:
| Verdict | Destination | Retention |
|---|---|---|
| ð Read | docs/content-claw/caught/{cat}--{src}--{slug}.md |
Permanent |
| ð¬ Watch | docs/content-claw/caught/{cat}--{src}--{slug}.md |
Permanent |
| ð Skim | docs/content-claw/released/skim/{cat}--{src}--{slug}.md |
Monthly batch review |
| âï¸ Skip | docs/content-claw/released/skip/{cat}--{src}--{slug}.md |
Auto-purge after 14 days |
- Naming:
{category}--{source-type}--{title-slug}.md(see top of file for categories/sources) - Create directories as needed
- Never overwrite existing files â append a number if slug already exists
Review file contents:
- Source block (as above)
- Verdict and reasoning
- Full extracted content or detailed summary
- Actions and insights (separated)
Step 6: Released Folder Maintenance ð§¹
Auto-purge (agent-driven, no human input needed):
- Files in
released/skip/older than 14 days â delete automatically during heartbeats or periodic maintenance - No confirmation needed. Skips were definitively not worth it.
Monthly batch review (low-friction human decision):
- Once per month, scan
released/skim/for accumulated reviews - Surface a numbered summary to the user:
ð¦ Content Claw â Monthly Purge [N] skims from [month]. Promote or release?
- “Article title” â one-line context reminder
- “Article title” â one-line context reminder …
Reply with numbers to promote to
caught/, or “clear all”
- Promoted files move to
docs/content-claw/caught/ - Remaining files get deleted
- Track last purge date to avoid double-prompting
Known Limitations
- YouTube bot detection: YouTube blocks yt-dlp from datacenter/cloud IPs. If you’re running on a VPS or in a sandbox, yt-dlp will likely fail with 403 errors. Use web-based transcript extraction (Tier 1) instead.
- Rate limiting: web_search providers may rate-limit after repeated queries. Space out searches or reduce query count.
- Long videos (>2hr): Whisper transcription is CPU-intensive. For very long content, prefer searching for existing transcripts.
Notes
- Keep verdicts concise â mobile-friendly formatting (no tables, use bullet lists)
- The user’s time is the scarcest resource. Default to extracting value, not recommending consumption.
- When in doubt, extract the insights and skip. The bar for “watch the whole thing” should be high.
- This skill works for any agent â it adapts to whatever knowledge base and doc structure already exists.