firecrawl-scraper

📁 olino3/forge 📅 Feb 13, 2026

Site-wide rank: #52657

Install command

npx skills add https://github.com/olino3/forge --skill firecrawl-scraper

Installs by agent

cursor 4

claude-code 4

replit 4

mcpjam 3

openhands 3

zencoder 3

Skill documentation

skill:firecrawl-scraper – Convert Websites into LLM-Ready Data

Version: 1.0.0

Purpose

Convert websites into clean, LLM-ready data using the Firecrawl API. Supports six operation modes: scrape single pages, crawl entire sites, map site structure, search content, extract structured data with schemas, and agent-based autonomous scraping. Includes batch operations for high-volume scraping and change tracking to monitor content updates over time. Use when you need to ingest web content for AI processing, build training datasets, monitor competitor pages, or extract structured information from websites.

File Structure

skills/firecrawl-scraper/
├── SKILL.md (this file)
└── examples.md

Interface References

Context: Loaded via ContextProvider Interface
Memory: Accessed via MemoryStore Interface
Shared Patterns: Shared Loading Patterns
Schemas: Validated against context_metadata.schema.json and memory_entry.schema.json

Mandatory Workflow

IMPORTANT: Execute ALL steps in order. Do not skip any step.

Step 1: Initial Analysis

Determine operation type: scrape (single page), crawl (entire site), map (site structure), search (content search), extract (structured data), or batch (bulk operations)
Identify target URLs and validate they are reachable
Determine desired output format: markdown, HTML, links, screenshot, or raw HTML
Assess scope: single page, specific paths, or full domain crawl
Check if change tracking is needed (compare with previous scrapes)

Step 2: Load Memory

Follow Standard Memory Loading with skill="firecrawl-scraper" and domain="engineering".

Step 3: Load Context

Follow Standard Context Loading for the engineering domain. Stay within the file budget declared in frontmatter.

Step 4: Configure Firecrawl Operation

Verify Firecrawl API key is available (environment variable FIRECRAWL_API_KEY)
Select the appropriate endpoint based on operation type:
- /scrape: single page extraction
- /crawl: recursive site crawling
- /map: site structure discovery
- /search: content search across a domain
- /extract: structured data extraction with schema
- /batch/scrape: bulk page scraping
Configure output format(s): markdown, html, links, screenshot, rawHtml
Set include/exclude URL patterns (glob syntax: blog/*, !*/admin/*)
Configure wait strategies for JavaScript-heavy sites:
- waitFor: CSS selector or milliseconds to wait before capture
- timeout: maximum page load time
Set crawl depth, page limits, and rate limiting parameters
Define extraction schema (JSON Schema format) if using /extract
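
The configuration steps above can be sketched as a small request builder. This is a minimal sketch, not Firecrawl's own client: the helper names `get_api_key` and `build_request`, the `ENDPOINTS` table, and the payload layout (including the `formats` and `schema` keys) are assumptions made for illustration. The endpoint paths and the parameter names `waitFor`, `timeout`, `maxDepth`, and `limit` are the ones named in this document; check the exact request shape against the Firecrawl API reference before relying on it.

```python
import os

# Endpoint paths as listed in Step 4; the operation keys are this skill's own names.
ENDPOINTS = {
    "scrape": "/scrape",
    "crawl": "/crawl",
    "map": "/map",
    "search": "/search",
    "extract": "/extract",
    "batch": "/batch/scrape",
}


def get_api_key(env=None):
    """Fail fast with a clear message if FIRECRAWL_API_KEY is missing."""
    env = os.environ if env is None else env
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set; export it before scraping.")
    return key


def build_request(operation, url, formats=("markdown",), wait_for=None,
                  timeout=None, max_depth=None, limit=None, schema=None):
    """Return (endpoint, payload) for one operation. The payload layout is assumed."""
    if operation not in ENDPOINTS:
        raise ValueError(f"unknown operation: {operation!r}")
    # Refuse unbounded crawls up front instead of discovering them on the bill.
    if operation == "crawl" and (max_depth is None or limit is None):
        raise ValueError("crawl requires both max_depth and limit")
    if operation == "extract" and schema is None:
        raise ValueError("extract requires a JSON Schema")

    payload = {"url": url, "formats": list(formats)}
    optional = {"waitFor": wait_for, "timeout": timeout,
                "maxDepth": max_depth, "limit": limit, "schema": schema}
    # Only send the options the caller actually set.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return ENDPOINTS[operation], payload
```

Requiring `max_depth` and `limit` for crawls at build time keeps an unbounded job from ever reaching the API.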

Step 5: Execute Scraping Operation

Call the appropriate Firecrawl endpoint with configured parameters
For /crawl and /batch/scrape: poll the async job status until completion
Handle rate limiting with exponential backoff (respect 429 responses)
Capture response metadata: status codes, page count, execution time
For /map: collect all discovered URLs and their hierarchy
For /search: rank and filter results by relevance
For /extract: validate extracted data against the provided schema
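
The retry and polling behavior described above might look like the following. It is a sketch under stated assumptions: `send` and `check` are hypothetical callables that wrap whatever HTTP client is in use, and the job status values `"completed"` and `"failed"` are assumed rather than taken from the Firecrawl documentation.

```python
import random
import time


def call_with_backoff(send, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() until it returns a non-429 status or retries run out.

    `send` is a hypothetical zero-argument callable returning
    (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt == max_retries:
            break
        # Double the delay on each attempt, cap it, and add up to 10% jitter
        # so parallel workers do not retry in lockstep.
        delay = min(max_delay, base_delay * (2 ** attempt))
        sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError(f"still rate limited after {max_retries} retries")


def poll_job(check, interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll an async job until it finishes or the polling budget is spent.

    `check` is a hypothetical callable returning a dict with a "status" key.
    """
    for _ in range(max_polls):
        job = check()
        if job["status"] == "completed":
            return job
        if job["status"] == "failed":
            raise RuntimeError(f"job failed: {job}")
        sleep(interval)
    raise TimeoutError("job did not finish within the polling budget")
```

Passing `sleep` as a parameter keeps both helpers testable without real waiting.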

Step 6: Process and Transform Results

Clean extracted content: remove navigation chrome, ads, cookie banners
Apply extraction schemas to structure raw content into typed fields
Format output for LLM consumption:
- Markdown with preserved heading hierarchy
- Code blocks with language annotations
- Tables converted to markdown format
- Images as alt-text descriptions or URLs
Handle pagination: aggregate multi-page crawl results into unified output
Deduplicate content across crawled pages
Add source metadata (URL, scrape timestamp, content hash)
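
Deduplication and source metadata from the list above can be combined in one pass. The sketch below assumes each crawled page is a dict with `url` and `markdown` keys; that shape, and the helper names `content_hash` and `dedupe_pages`, are illustrative rather than part of the skill.

```python
import hashlib
from datetime import datetime, timezone


def content_hash(text):
    """SHA-256 of whitespace-normalized text, so spacing-only changes are ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_pages(pages, now=None):
    """Keep the first page per distinct content hash and attach source metadata.

    `pages` is an assumed shape: a list of dicts with "url" and "markdown" keys.
    """
    scraped_at = (now or datetime.now(timezone.utc)).isoformat()
    seen = set()
    unique = []
    for page in pages:
        digest = content_hash(page["markdown"])
        if digest in seen:
            continue  # same content already kept under another URL
        seen.add(digest)
        unique.append({**page, "content_hash": digest, "scraped_at": scraped_at})
    return unique
```

Normalizing whitespace before hashing is a design choice: it treats two renderings of the same page as duplicates, at the cost of ignoring whitespace-only edits.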

Step 7: Implement Change Tracking

Generate content hash (MD5/SHA-256) for each scraped page
Compare current hashes against stored hashes in crawl_history.md
For changed pages, generate a content diff showing additions and removals
Classify changes: content update, structural change, new page, removed page
Record change frequency to inform future monitoring schedules
Flag significant changes (pricing updates, new features, policy changes)
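
A minimal sketch of the hash comparison and diff steps follows. It assumes the stored hashes from crawl_history.md have already been loaded into a `{url: hash}` mapping (the storage format is not specified here), and it does not attempt to separate structural changes from content updates; both are reported as "content update".

```python
import difflib


def classify_changes(previous, current):
    """Compare two {url: content_hash} mappings and label each changed URL."""
    report = {}
    for url in previous.keys() | current.keys():
        if url not in previous:
            report[url] = "new page"
        elif url not in current:
            report[url] = "removed page"
        elif previous[url] != current[url]:
            report[url] = "content update"
    return report  # unchanged URLs are omitted


def content_diff(old_text, new_text):
    """Unified diff lines showing additions (+) and removals (-)."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="previous", tofile="current", lineterm=""))
```

For a changed page, `content_diff(old_markdown, new_markdown)` yields lines that can be pasted directly into the change tracking report.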

Step 8: Generate Output

Save output to /claudedocs/firecrawl-scraper_{project}_{YYYY-MM-DD}.md
Follow naming conventions in ../OUTPUT_CONVENTIONS.md
Output includes:
- Scraped content in requested format (markdown/HTML/structured)
- Source URLs with timestamps
- Extraction results (if schema was used)
- Change tracking report (if comparing with previous scrapes)
- Error log for any failed pages
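
The output filename pattern above can be generated with a small helper. This is a sketch: the slug rule (lowercase, runs of other characters collapsed to a hyphen) is an assumption, and ../OUTPUT_CONVENTIONS.md remains the authority on naming.

```python
import re
from datetime import date
from pathlib import PurePosixPath


def output_path(project, day=None, root="/claudedocs"):
    """Build /claudedocs/firecrawl-scraper_{project}_{YYYY-MM-DD}.md."""
    day = day or date.today()
    # Assumed slug rule: lowercase, non-alphanumeric runs collapsed to "-".
    slug = re.sub(r"[^a-z0-9]+", "-", project.lower()).strip("-")
    if not slug:
        raise ValueError("project name must contain a letter or digit")
    return PurePosixPath(root) / f"firecrawl-scraper_{slug}_{day.isoformat()}.md"
```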

Step 9: Update Memory

Follow Standard Memory Update for skill="firecrawl-scraper". Store scrape configurations, extraction schemas, and crawl history for future sessions.

10 Common Errors to Prevent

Missing API key: Always verify FIRECRAWL_API_KEY is set before making requests. Fail fast with a clear message if missing.
Ignoring rate limits: Firecrawl enforces rate limits. Implement exponential backoff on 429 responses instead of hammering the API.
Unbounded crawls: Always set maxDepth and limit parameters on /crawl to prevent runaway jobs that scrape thousands of pages.
No wait strategy for SPAs: JavaScript-rendered sites return empty content without waitFor. Use CSS selectors or delay to ensure content loads.
Scraping login-protected pages without auth: Pages behind authentication return login forms. Use headers to pass session cookies or tokens.
Overly broad include patterns: Patterns like * crawl everything including admin panels, CDN assets, and API endpoints. Be specific.
Ignoring robots.txt: Respect robots.txt directives. Firecrawl handles this by default, but custom configurations can override it.
Not validating extraction schemas: Invalid JSON Schemas cause silent failures. Validate schemas before passing them to /extract.
Storing raw HTML when markdown suffices: Raw HTML is noisy and wastes tokens. Default to markdown unless HTML structure is explicitly needed.
No deduplication on crawl results: Sites with multiple URL paths to the same content produce duplicates. Hash content to detect and remove duplicates.
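
To make the include-pattern item concrete, a pattern set can be dry-run locally against sample URLs before a crawl is submitted. The sketch below uses Python's `fnmatch` as a stand-in for glob matching and treats a leading `!` as an exclude, mirroring the `blog/*` and `!*/admin/*` examples in Step 4. It is an assumption that this approximates how Firecrawl evaluates patterns; the helper `is_allowed` is illustrative only.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def is_allowed(url, patterns):
    """Apply include patterns and "!"-prefixed exclude patterns to a URL path."""
    path = urlparse(url).path.lstrip("/")
    includes = [p for p in patterns if not p.startswith("!")]
    excludes = [p[1:] for p in patterns if p.startswith("!")]
    # Excludes win over includes.
    if any(fnmatchcase(path, p) for p in excludes):
        return False
    # With no include patterns, everything not excluded is allowed.
    return not includes or any(fnmatchcase(path, p) for p in includes)
```

Running a handful of known URLs through `is_allowed` shows quickly whether a pattern like `*` would pull in admin or asset paths.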

Compliance Checklist

Before completing, verify:

All mandatory workflow steps executed in order
Standard Memory Loading pattern followed (Step 2)
Standard Context Loading pattern followed (Step 3)
Firecrawl API key verified before first request
Rate limiting and backoff strategy implemented
Crawl boundaries set (maxDepth, limit, include/exclude patterns)
Output saved with standard naming convention
Change tracking hashes recorded (if applicable)
Standard Memory Update pattern followed (Step 9)

Version History

Version	Date	Changes
1.0.0	2025-07-15	Initial release: scrape, crawl, map, search, extract, batch, change tracking
