scrape-webpage

📁 adobe/helix-website 📅 7 days ago
Total installs: 1
Weekly installs: 1
Site-wide rank: #47378
Install command
npx skills add https://github.com/adobe/helix-website --skill scrape-webpage

Agent install distribution

mcpjam 1
claude-code 1
replit 1
junie 1
zencoder 1

Skill Documentation

Scrape Webpage

Extract content, metadata, and images from a webpage for import/migration.

When to Use This Skill

Use this skill when:

  • You are starting a page import and need to extract content from the source URL
  • You need a webpage analysis with local image downloads
  • You want metadata extraction (Open Graph, JSON-LD, etc.)

Invoked by: page-import skill (Step 1)

Prerequisites

Before using this skill, ensure:

  • ✅ Node.js is available
  • ✅ Playwright is installed (npm install playwright)
  • ✅ Chromium browser is installed (npx playwright install chromium)
  • ✅ Sharp image library is installed (cd .claude/skills/scrape-webpage/scripts && npm install)

Related Skills

  • page-import – Orchestrator that invokes this skill
  • identify-page-structure – Uses this skill’s output (screenshot, HTML, metadata)
  • generate-import-html – Uses image mapping and paths from this skill

Scraping Workflow

Step 1: Run Analysis Script

Command:

node .claude/skills/scrape-webpage/scripts/analyze-webpage.js "https://example.com/page" --output ./import-work

What the script does:

  1. Sets up network interception to capture all images
  2. Loads page in headless Chromium
  3. Scrolls through entire page to trigger lazy-loaded images
  4. Downloads all images locally (converts WebP/AVIF/SVG to PNG)
  5. Captures full-page screenshot for visual reference
  6. Extracts metadata (title, description, Open Graph, JSON-LD, canonical)
  7. Fixes images in DOM (background-image→img, picture elements, srcset→src, relative→absolute, inline SVG→img)
  8. Extracts cleaned HTML (removes scripts/styles)
  9. Replaces image URLs in HTML with local paths (./images/…)
  10. Generates document paths (sanitized, lowercase, no .html extension)
  11. Saves complete analysis with image mapping to metadata.json

For detailed explanation: See resources/web-page-analysis.md
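Steps 1–5 above can be sketched roughly as follows. This is a minimal illustration, not the actual analyze-webpage.js: the function names, the autoscroll step size, and the `networkidle` wait are assumptions.

```javascript
// Sketch of the capture workflow: intercept image responses, then
// scroll the page so lazy-loaded images are requested as well.
// NOTE: illustrative only -- the real analyze-webpage.js differs.

// Pure helper: decide whether a network response looks like an image.
function isImageResponse(contentType) {
  return typeof contentType === 'string' && contentType.startsWith('image/');
}

async function capturePage(url) {
  // Lazy require so the helper above works even without Playwright installed.
  const { chromium } = require('playwright');
  const browser = await chromium.launch();
  const page = await browser.newPage();

  const images = new Map(); // original URL -> response body (or null on error)
  page.on('response', async (res) => {
    if (isImageResponse(res.headers()['content-type'])) {
      images.set(res.url(), await res.body().catch(() => null));
    }
  });

  await page.goto(url, { waitUntil: 'networkidle' });

  // Scroll to the bottom in steps to trigger lazy loading.
  await page.evaluate(async () => {
    for (let y = 0; y < document.body.scrollHeight; y += 500) {
      window.scrollTo(0, y);
      await new Promise((r) => setTimeout(r, 100));
    }
  });

  const screenshot = await page.screenshot({ fullPage: true });
  await browser.close();
  return { images, screenshot };
}
```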


Step 2: Verify Output

Output files:

  • ./import-work/metadata.json – Complete analysis with paths and image mapping
  • ./import-work/screenshot.png – Visual reference for layout comparison
  • ./import-work/cleaned.html – Main content HTML with local image paths
  • ./import-work/images/ – All downloaded images (WebP/AVIF/SVG converted to PNG)

Verify files exist:

ls -lh ./import-work/metadata.json ./import-work/screenshot.png ./import-work/cleaned.html
ls -lh ./import-work/images/ | head -5

Step 3: Review Metadata JSON

Output JSON structure:

{
  "url": "https://example.com/page",
  "timestamp": "2025-01-12T10:30:00.000Z",
  "paths": {
    "documentPath": "/us/en/about",
    "htmlFilePath": "us/en/about.plain.html",
    "mdFilePath": "us/en/about.md",
    "dirPath": "us/en",
    "filename": "about"
  },
  "screenshot": "./import-work/screenshot.png",
  "html": {
    "filePath": "./import-work/cleaned.html",
    "size": 45230
  },
  "metadata": {
    "title": "Page Title",
    "description": "Page description",
    "og:image": "https://example.com/image.jpg",
    "canonical": "https://example.com/page"
  },
  "images": {
    "count": 15,
    "mapping": {
      "https://example.com/hero.jpg": "./images/a1b2c3d4e5f6.jpg",
      "https://example.com/logo.webp": "./images/f6e5d4c3b2a1.png"
    },
    "stats": {
      "total": 15,
      "converted": 3,
      "skipped": 12,
      "failed": 0
    }
  }
}
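The paths object above is derived from the source URL (sanitized, lowercase, no .html extension, per Step 1 item 10). A rough sketch of that derivation — assumed logic for illustration; the real script may normalize differently:

```javascript
// Sketch: derive document paths from a source URL.
// Assumed sanitization rules, for illustration only.
function urlToPaths(pageUrl) {
  const { pathname } = new URL(pageUrl);
  const cleaned = pathname
    .toLowerCase()
    .replace(/\.html?$/, '')        // drop .html / .htm extension
    .replace(/[^a-z0-9/_-]+/g, '-') // sanitize odd characters
    .replace(/\/+$/, '');           // drop trailing slash
  const documentPath = cleaned === '' ? '/index' : cleaned;
  const parts = documentPath.split('/').filter(Boolean);
  return {
    documentPath,                                    // "/us/en/about"
    htmlFilePath: `${documentPath.slice(1)}.plain.html`,
    mdFilePath: `${documentPath.slice(1)}.md`,
    dirPath: parts.slice(0, -1).join('/'),           // "us/en"
    filename: parts[parts.length - 1],               // "about"
  };
}
```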

Key fields:

  • paths.documentPath – Used for browser preview URL
  • paths.htmlFilePath – Where to save final HTML file
  • images.mapping – Original URLs → local paths
  • metadata – Extracted page metadata
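The images.mapping table is what turns remote URLs into local references. The script applies it internally before writing cleaned.html, but a downstream consumer can reuse it the same way — a minimal sketch (hypothetical helper name):

```javascript
// Sketch: rewrite remote image URLs in extracted HTML using the
// images.mapping object from metadata.json.
function applyImageMapping(html, mapping) {
  let out = html;
  for (const [remoteUrl, localPath] of Object.entries(mapping)) {
    out = out.split(remoteUrl).join(localPath); // global literal replace
  }
  return out;
}

// Example usage (path assumed to match the --output directory):
// const metadata = require('./import-work/metadata.json');
// const html = applyImageMapping(rawHtml, metadata.images.mapping);
```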

Output

This skill provides:

  • ✅ metadata.json with paths, metadata, image mapping
  • ✅ screenshot.png for visual reference
  • ✅ cleaned.html with local image references
  • ✅ images/ folder with all downloaded images

Next step: Pass these outputs to the identify-page-structure skill


Troubleshooting

Browser not installed:

npx playwright install chromium

Sharp not installed:

cd .claude/skills/scrape-webpage/scripts && npm install

Image download failures:

  • Check images.stats.failed count in metadata.json
  • Some images may require authentication or be blocked by CORS
  • Failed images will be noted but won’t stop the scraping process
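The failure count can also be checked programmatically. A hypothetical helper (the metadata.json location is assumed to match the --output directory):

```javascript
// Sketch: summarize failed image downloads from a parsed metadata.json.
function reportFailedImages(metadata) {
  const { failed = 0, total = 0 } = (metadata.images && metadata.images.stats) || {};
  return failed > 0
    ? `${failed} of ${total} images failed (auth/CORS?)`
    : 'all images downloaded';
}

// Example usage:
// const metadata = require('./import-work/metadata.json');
// console.log(reportFailedImages(metadata));
```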

Lazy-loaded images not captured:

  • Script scrolls through page to trigger lazy loading
  • Some advanced lazy-loading may need customization in scripts/analyze-webpage.js