content-extraction
```sh
npx skills add https://github.com/saccoai/agent-skills --skill content-extraction
```
This skill extracts ALL content from an existing website and outputs it as structured, reusable data files. It crawls every page, downloads every asset, and produces a complete content inventory.
The user provides: the URL of the site to extract from, and optionally the target format (TypeScript, JSON, Markdown).
What Gets Extracted
For each page on the site:
| Content type | Output |
|---|---|
| Text | Headings, paragraphs, lists, quotes, with hierarchy preserved |
| Metadata | `<title>`, meta description, OG tags, canonical URL, `lang` |
| Images | Downloaded to public/images/ with original filenames. Alt text cataloged |
| Links | Internal + external, with anchor text and destination URL |
| PDFs & assets | Downloaded to public/assets/. Filenames and original URLs cataloged |
| Forms | Field names, types, labels, validation rules, action URLs |
| Navigation | Menu structure, link hierarchy, active states |
| Structured data | JSON-LD, microdata, schema.org markup |
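For the structured-data row, extraction typically means collecting JSON-LD blocks from each page. A minimal sketch, meant to run inside the page context (e.g. via Playwright's `page.evaluate`); it is illustrative, not the skill's exact implementation:

```ts
// Collect and parse every JSON-LD block on the page, skipping malformed ones.
const structuredData = Array.from(
  document.querySelectorAll('script[type="application/ld+json"]')
).flatMap((script) => {
  try {
    return [JSON.parse(script.textContent ?? "")];
  } catch {
    return []; // malformed JSON-LD: note it, don't crash the crawl
  }
});
```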
Extraction Process
Step 1: Discover all pages
Use browser automation (agent-browser or Playwright) to:
1. Start at the site root
2. Extract all internal links from navigation, footer, sitemap.xml
3. Follow every internal link recursively
4. Build a complete page list with URLs
5. Detect and note any client-side routing (SPA)
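A minimal discovery sketch with Playwright (the skill may use agent-browser instead; `discoverPages` and the breadth-first approach are illustrative, and sitemap.xml parsing is omitted):

```ts
import { chromium } from "playwright";

// Breadth-first crawl of same-origin links, starting at the site root.
async function discoverPages(rootUrl: string): Promise<string[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const origin = new URL(rootUrl).origin;
  const seen = new Set<string>([rootUrl]);
  const queue = [rootUrl];

  while (queue.length > 0) {
    const url = queue.shift()!;
    await page.goto(url, { waitUntil: "networkidle" });
    const hrefs = await page.$$eval("a[href]", (anchors) =>
      anchors.map((a) => (a as HTMLAnchorElement).href)
    );
    for (const href of hrefs) {
      const abs = new URL(href, url); // resolve relative URLs
      abs.hash = ""; // ignore in-page anchors
      // A real crawl would also skip asset links (.pdf, .zip) and handle goto failures.
      if (abs.origin === origin && !seen.has(abs.href)) {
        seen.add(abs.href);
        queue.push(abs.href);
      }
    }
  }

  await browser.close();
  return [...seen];
}
```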
Step 2: Extract content per page
For each discovered page:
1. Navigate to the page
2. Wait for full load (networkidle)
3. Extract the DOM structure:
- All headings (h1-h6) with hierarchy
- All paragraphs and text blocks
- All lists (ordered/unordered)
- All images (src, alt, dimensions)
- All links (href, text, target)
- All forms (fields, labels, actions)
4. Extract `<head>` metadata
5. Take a full-page screenshot for reference
6. Save structured data to output files
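Roughly, steps 1-5 look like this in Playwright (a sketch; the field names mirror the interfaces under `src/data/` below but are not the skill's exact output):

```ts
import type { Page } from "playwright";

async function extractPage(page: Page, url: string) {
  await page.goto(url, { waitUntil: "networkidle" });

  // Everything inside evaluate() runs in the browser, against the live DOM.
  const data = await page.evaluate(() => ({
    title: document.title,
    description:
      document.querySelector('meta[name="description"]')?.getAttribute("content") ?? "",
    headings: Array.from(document.querySelectorAll("h1, h2, h3, h4, h5, h6")).map((h) => ({
      level: Number(h.tagName[1]),
      text: h.textContent?.trim() ?? "",
    })),
    images: Array.from(document.images).map((img) => ({
      src: img.currentSrc || img.src,
      alt: img.alt,
      width: img.naturalWidth,
      height: img.naturalHeight,
    })),
    links: Array.from(document.querySelectorAll("a[href]")).map((a) => ({
      href: (a as HTMLAnchorElement).href,
      text: a.textContent?.trim() ?? "",
    })),
  }));

  const slug = new URL(url).pathname.replaceAll("/", "_") || "home"; // naive slug
  await page.screenshot({ path: `screenshots/${slug}.png`, fullPage: true });
  return data;
}
```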
Step 3: Download assets
1. Download all images to public/images/{page-slug}/
2. Download all PDFs to public/assets/
3. Download favicons, OG images, other static assets
4. Preserve original filenames where possible
5. Note any broken/404 asset URLs
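A sketch of the download step (Node 18+, global `fetch`; function names are illustrative, and the directory layout follows the convention above):

```ts
import { mkdir, writeFile } from "node:fs/promises";
import path from "node:path";

// Download one image, keeping the original filename where possible.
async function downloadImage(url: string, pageSlug: string): Promise<string | null> {
  const res = await fetch(url);
  if (!res.ok) {
    console.warn(`broken asset (HTTP ${res.status}): ${url}`); // record in the inventory
    return null;
  }
  const filename = path.basename(new URL(url).pathname) || "asset"; // preserve original name
  const dir = path.join("public", "images", pageSlug);
  await mkdir(dir, { recursive: true });
  const dest = path.join(dir, filename);
  await writeFile(dest, Buffer.from(await res.arrayBuffer()));
  return dest;
}
```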
Step 4: Generate output files
content-inventory.md
Human-readable summary of everything extracted:
```md
# Content Inventory - {site-name}

## Pages ({count})

### / (Homepage)
- Title: "Site Name - Tagline"
- Description: "Meta description here"
- H1: "Main Heading"
- Sections: Hero, Features (3), CTA, Testimonials (4)
- Images: 8 (hero.jpg, feature-1.png, ...)
- Links: 12 internal, 3 external
- Forms: none

### /about
...
```
src/data/*.ts (TypeScript format)
Structured data files ready for import:
```ts
// src/data/pages.ts
export interface Page {
  slug: string;
  url: string;
  title: string;
  description: string;
  headings: { level: number; text: string }[];
  sections: Section[];
}

// src/data/navigation.ts
export interface NavItem {
  label: string;
  href: string;
  children?: NavItem[];
}

// src/data/images.ts
export interface ImageAsset {
  originalUrl: string;
  localPath: string;
  alt: string;
  width?: number;
  height?: number;
  page: string;
}
```
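The `Page` interface references a `Section` type that is not shown here; one plausible shape, purely as an assumption:

```ts
// Assumed shape only; the real Section type is not part of this excerpt.
export interface Section {
  heading?: string;
  paragraphs: string[];
  lists: { ordered: boolean; items: string[] }[];
  images: string[]; // local paths under public/images/
}
```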
screenshots/
Full-page screenshots of every page for visual reference.
Agent Team Integration
When used as a teammate in website-refactor, this skill runs as the content-extractor agent:
- Owns: `src/data/`, `scripts/extract/`, `public/images/`, `public/assets/`, `content-inventory.md`, `screenshots/`
- Outputs: Structured data files that the `designer` and `content-verifier` teammates consume
- Signals completion: By marking its task complete and confirming `content-inventory.md` is written
Standalone Usage
Can be invoked independently for:
- Content audits: “How much content does this site have?”
- Migration planning: “Extract everything from the old site before we rebuild”
- Competitive analysis: “What content does competitor X have?”
- Archival: “Save a complete copy of this site’s content”
Output Formats
The default output is TypeScript (.ts files). Pass format preference as an argument:
- `typescript` (default) → `src/data/*.ts` with interfaces and typed exports
- `json` → `content/*.json` files
- `markdown` → `content/*.md` files with frontmatter
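For instance, the `markdown` format might be emitted like this (a sketch; the frontmatter keys mirror the `Page` interface above and are an assumption, not a documented schema):

```ts
import { mkdir, writeFile } from "node:fs/promises";
import type { Page } from "../src/data/pages"; // the interface shown above; path illustrative

// Write one page as content/{slug}.md with YAML frontmatter.
async function writeMarkdownPage(page: Page): Promise<void> {
  const frontmatter = [
    "---",
    `title: ${JSON.stringify(page.title)}`,
    `description: ${JSON.stringify(page.description)}`,
    `url: ${JSON.stringify(page.url)}`,
    "---",
  ].join("\n");
  const body = page.headings
    .map((h) => `${"#".repeat(h.level)} ${h.text}`)
    .join("\n\n");
  await mkdir("content", { recursive: true });
  await writeFile(`content/${page.slug || "index"}.md`, `${frontmatter}\n\n${body}\n`);
}
```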
Common Pitfalls
- SPAs and client-side routing: Some sites don’t render content without JavaScript. Always use a real browser (Playwright/agent-browser), not HTTP fetches.
- Lazy-loaded content: Scroll the full page before extracting to trigger lazy-loaded images and infinite-scroll sections (see the sketch after this list).
- Authentication walls: Some content may be behind login. Note these pages as “requires auth” in the inventory.
- Rate limiting: Add delays between page fetches to avoid being blocked. 1-2 seconds between requests is safe.
- Duplicate content: The same content may appear on multiple pages (shared sections, footers). Deduplicate in the data files.
- Relative URLs: Always resolve to absolute URLs before cataloging or downloading.
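The lazy-loading and rate-limiting points combine naturally into one settle step per page (a Playwright sketch; the step size and delays are illustrative defaults, not tuned values):

```ts
import type { Page } from "playwright";

// Scroll to the bottom in steps to trigger lazy-loaded images,
// then pause so successive page loads stay under rate limits.
async function settlePage(page: Page): Promise<void> {
  await page.evaluate(async () => {
    for (let y = 0; y < document.body.scrollHeight; y += 600) {
      window.scrollTo(0, y);
      await new Promise((r) => setTimeout(r, 150)); // let images start loading
    }
    window.scrollTo(0, 0); // reset for the full-page screenshot
  });
  await page.waitForLoadState("networkidle");
  await page.waitForTimeout(1500); // ~1-2 s between requests, per the note above
}
```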