firecrawl-scraper
npx skills add https://github.com/olino3/forge --skill firecrawl-scraper
skill:firecrawl-scraper – Convert Websites into LLM-Ready Data
Version: 1.0.0
Purpose
Convert websites into clean, LLM-ready data using the Firecrawl API. Supports six operation modes: scrape single pages, crawl entire sites, map site structure, search content, extract structured data with schemas, and agent-based autonomous scraping. Includes batch operations for high-volume scraping and change tracking to monitor content updates over time. Use when you need to ingest web content for AI processing, build training datasets, monitor competitor pages, or extract structured information from websites.
File Structure
skills/firecrawl-scraper/
├── SKILL.md (this file)
└── examples.md
Interface References
- Context: Loaded via ContextProvider Interface
- Memory: Accessed via MemoryStore Interface
- Shared Patterns: Shared Loading Patterns
- Schemas: Validated against context_metadata.schema.json and memory_entry.schema.json
Mandatory Workflow
IMPORTANT: Execute ALL steps in order. Do not skip any step.
Step 1: Initial Analysis
- Determine operation type: `scrape` (single page), `crawl` (entire site), `map` (site structure), `search` (content search), `extract` (structured data), or `batch` (bulk operations)
- Identify target URLs and validate they are reachable
- Determine desired output format: markdown, HTML, links, screenshot, or raw HTML
- Assess scope: single page, specific paths, or full domain crawl
- Check if change tracking is needed (compare with previous scrapes)
Step 2: Load Memory
Follow Standard Memory Loading with `skill="firecrawl-scraper"` and `domain="engineering"`.
Step 3: Load Context
Follow Standard Context Loading for the `engineering` domain. Stay within the file budget declared in frontmatter.
Step 4: Configure Firecrawl Operation
- Verify the Firecrawl API key is available (environment variable `FIRECRAWL_API_KEY`)
- Select the appropriate endpoint based on operation type:
  - `/scrape` → single page extraction
  - `/crawl` → recursive site crawling
  - `/map` → site structure discovery
  - `/search` → content search across a domain
  - `/extract` → structured data extraction with schema
  - `/batch/scrape` → bulk page scraping
- Configure output format(s): `markdown`, `html`, `links`, `screenshot`, `rawHtml`
- Set include/exclude URL patterns (glob syntax: `blog/*`, `!*/admin/*`)
- Configure wait strategies for JavaScript-heavy sites:
  - `waitFor`: CSS selector or milliseconds to wait before capture
  - `timeout`: maximum page load time
- Set crawl depth, page limits, and rate limiting parameters
- Define extraction schema (JSON Schema format) if using `/extract`
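The configuration steps above can be sketched as a small payload builder. The field names (`formats`, `waitFor`, `timeout`, `includePaths`, `excludePaths`) follow the options listed here and are assumptions about the request shape, not a verified Firecrawl API contract:

```python
import os

def require_api_key():
    """Fail fast with a clear message if the Firecrawl key is missing."""
    key = os.environ.get("FIRECRAWL_API_KEY")
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    return key

def build_scrape_payload(url, formats=("markdown",), wait_for=None,
                         timeout_ms=30000, include=None, exclude=None):
    """Assemble a request body for a single-page scrape (field names assumed)."""
    payload = {"url": url, "formats": list(formats), "timeout": timeout_ms}
    if wait_for is not None:
        payload["waitFor"] = wait_for      # CSS selector or milliseconds
    if include:
        payload["includePaths"] = include  # glob patterns, e.g. ["blog/*"]
    if exclude:
        payload["excludePaths"] = exclude  # e.g. ["*/admin/*"]
    return payload

payload = build_scrape_payload("https://example.com/blog",
                               formats=("markdown", "links"),
                               wait_for=2000)
```

Keeping the key check separate from payload construction lets the skill fail fast (common error #1) before any network call is attempted.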
Step 5: Execute Scraping Operation
- Call the appropriate Firecrawl endpoint with configured parameters
- For `/crawl` and `/batch/scrape`: poll the async job status until completion
- Handle rate limiting with exponential backoff (respect `429` responses)
- Capture response metadata: status codes, page count, execution time
- For `/map`: collect all discovered URLs and their hierarchy
- For `/search`: rank and filter results by relevance
- For `/extract`: validate extracted data against the provided schema
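The polling-with-backoff loop for async jobs can be sketched as follows. The status-URL shape, the bearer-token header, and the `status` field values are assumptions about the job-status response, not a verified contract:

```python
import itertools
import json
import time
import urllib.error
import urllib.request

def backoff_delays(base=2.0, cap=60.0):
    """Yield an exponential backoff schedule: base, 2*base, ..., capped."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)

def poll_job(status_url, api_key, max_wait_s=600):
    """Poll an async /crawl or /batch/scrape job until it finishes."""
    delays = backoff_delays()
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        req = urllib.request.Request(
            status_url, headers={"Authorization": f"Bearer {api_key}"})
        try:
            with urllib.request.urlopen(req) as resp:
                job = json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code == 429:  # rate limited: back off and retry
                time.sleep(next(delays))
                continue
            raise
        if job.get("status") in ("completed", "failed"):
            return job
        time.sleep(next(delays))
    raise TimeoutError(f"job did not finish within {max_wait_s}s")
```

Capping the delay keeps long jobs responsive while still respecting `429` responses instead of hammering the API.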
Step 6: Process and Transform Results
- Clean extracted content: remove navigation chrome, ads, cookie banners
- Apply extraction schemas to structure raw content into typed fields
- Format output for LLM consumption:
- Markdown with preserved heading hierarchy
- Code blocks with language annotations
- Tables converted to markdown format
- Images as alt-text descriptions or URLs
- Handle pagination: aggregate multi-page crawl results into unified output
- Deduplicate content across crawled pages
- Add source metadata (URL, scrape timestamp, content hash)
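The deduplication and metadata steps above can be sketched together. The page shape (`url` and `markdown` keys) is illustrative, not the exact Firecrawl response format:

```python
import hashlib
from datetime import datetime, timezone

def dedupe_pages(pages):
    """Drop pages whose content hashes to an already-seen value, and stamp
    each kept page with a content hash and scrape timestamp."""
    seen, unique = set(), []
    for page in pages:
        digest = hashlib.sha256(page["markdown"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same content reached via a different URL path
        seen.add(digest)
        unique.append({**page,
                       "content_hash": digest,
                       "scraped_at": datetime.now(timezone.utc).isoformat()})
    return unique

# Two URL paths, identical content: only one page survives.
pages = [{"url": "https://example.com/a", "markdown": "# Hello"},
         {"url": "https://example.com/a/", "markdown": "# Hello"}]
result = dedupe_pages(pages)
```

The same hash doubles as the change-tracking fingerprint recorded in Step 7, so it is computed once here and reused.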
Step 7: Implement Change Tracking
- Generate content hash (MD5/SHA-256) for each scraped page
- Compare current hashes against stored hashes in `crawl_history.md`
- For changed pages, generate a content diff showing additions and removals
- Classify changes: content update, structural change, new page, removed page
- Record change frequency to inform future monitoring schedules
- Flag significant changes (pricing updates, new features, policy changes)
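The classification step can be sketched as a comparison of two `{url: content_hash}` maps, one loaded from crawl history and one from the current run. This is a hedged sketch; a real implementation would also diff the content of changed pages:

```python
def classify_changes(previous, current):
    """Classify each URL by comparing stored and current content hashes."""
    changes = {}
    for url, digest in current.items():
        if url not in previous:
            changes[url] = "new page"
        elif previous[url] != digest:
            changes[url] = "content update"
    for url in previous:
        if url not in current:
            changes[url] = "removed page"
    return changes

# Hypothetical history: /pricing changed, /blog is new, /about was removed.
prev = {"/pricing": "aaa", "/about": "bbb"}
curr = {"/pricing": "ccc", "/blog": "ddd"}
changes = classify_changes(prev, curr)
```

Structural changes (same text, different layout) need a deeper comparison than hashes alone, which is why Step 7 also calls for a content diff.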
Step 8: Generate Output
- Save output to `/claudedocs/firecrawl-scraper_{project}_{YYYY-MM-DD}.md`
- Follow naming conventions in `../OUTPUT_CONVENTIONS.md`
- Output includes:
- Scraped content in requested format (markdown/HTML/structured)
- Source URLs with timestamps
- Extraction results (if schema was used)
- Change tracking report (if comparing with previous scrapes)
- Error log for any failed pages
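The output path template above expands straightforwardly; the project slug here is a hypothetical example:

```python
from datetime import date

project = "acme-site"  # hypothetical project slug
path = f"/claudedocs/firecrawl-scraper_{project}_{date.today():%Y-%m-%d}.md"
```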
Step 9: Update Memory
Follow Standard Memory Update for `skill="firecrawl-scraper"`. Store scrape configurations, extraction schemas, and crawl history for future sessions.
10 Common Errors to Prevent
- Missing API key → Always verify `FIRECRAWL_API_KEY` is set before making requests. Fail fast with a clear message if missing.
- Ignoring rate limits → Firecrawl enforces rate limits. Implement exponential backoff on `429` responses instead of hammering the API.
- Unbounded crawls → Always set `maxDepth` and `limit` parameters on `/crawl` to prevent runaway jobs that scrape thousands of pages.
- No wait strategy for SPAs → JavaScript-rendered sites return empty content without `waitFor`. Use CSS selectors or a delay to ensure content loads.
- Scraping login-protected pages without auth → Pages behind authentication return login forms. Use `headers` to pass session cookies or tokens.
- Overly broad include patterns → Patterns like `*` crawl everything, including admin panels, CDN assets, and API endpoints. Be specific.
- Ignoring robots.txt → Respect `robots.txt` directives. Firecrawl handles this by default, but custom configurations can override it.
- Not validating extraction schemas → Invalid JSON Schemas cause silent failures. Validate schemas before passing them to `/extract`.
- Storing raw HTML when markdown suffices → Raw HTML is noisy and wastes tokens. Default to markdown unless HTML structure is explicitly needed.
- No deduplication on crawl results → Sites with multiple URL paths to the same content produce duplicates. Hash content to detect and remove duplicates.
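For the schema-validation pitfall, a minimal pre-flight check can catch the most common failures before a schema is sent to `/extract`. This is not a full JSON Schema validator; a real setup would use a library such as `jsonschema`:

```python
import json

def sanity_check_schema(schema_text):
    """Confirm the schema parses as JSON and describes an object with a
    non-empty map of typed properties. A shallow sanity check only."""
    schema = json.loads(schema_text)  # raises ValueError on invalid JSON
    if schema.get("type") != "object":
        raise ValueError("extraction schema should describe an object")
    props = schema.get("properties", {})
    if not props or not all(isinstance(v, dict) for v in props.values()):
        raise ValueError("schema needs a non-empty 'properties' map")
    return schema

schema = sanity_check_schema(
    '{"type": "object", "properties": {"price": {"type": "number"}}}')
```

Catching a malformed schema locally turns a silent extraction failure into an immediate, actionable error.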
Compliance Checklist
Before completing, verify:
- All mandatory workflow steps executed in order
- Standard Memory Loading pattern followed (Step 2)
- Standard Context Loading pattern followed (Step 3)
- Firecrawl API key verified before first request
- Rate limiting and backoff strategy implemented
- Crawl boundaries set (maxDepth, limit, include/exclude patterns)
- Output saved with standard naming convention
- Change tracking hashes recorded (if applicable)
- Standard Memory Update pattern followed (Step 9)
Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2025-07-15 | Initial release: scrape, crawl, map, search, extract, batch, change tracking |