scrapegraphai

📁 scrapegraphai/skill 📅 4 days ago
Install command
npx skills add https://github.com/scrapegraphai/skill --skill scrapegraphai

Skill documentation

ScrapeGraphAI

AI-powered web scraping and data extraction API.

  • API Base URL: https://api.scrapegraphai.com/v1
  • Dashboard: https://dashboard.scrapegraphai.com
  • Docs: https://docs.scrapegraphai.com
  • Every API call costs credits. Always check credit balance before large operations.
  • For full parameter tables, SDK examples, and advanced features, see references/ files.

Authentication

All requests require the SGAI-APIKEY header. The API key should be stored in the SGAI_API_KEY environment variable.

# Check if API key is set
echo $SGAI_API_KEY

# Validate API key
curl -s https://api.scrapegraphai.com/v1/validate \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

# Check remaining credits
curl -s https://api.scrapegraphai.com/v1/credits \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

If the key is not set, tell the user to get one from https://dashboard.scrapegraphai.com and set it:

export SGAI_API_KEY="sgai-..."

Quick Decision Guide

Need                                    Endpoint                    Credits
Extract structured data from a URL      POST /v1/smartscraper       ~10/page + extras
Search the web and extract              POST /v1/searchscraper      10/site (AI) or 2/site (md)
Convert a page to markdown              POST /v1/markdownify        ~2 + extras
Crawl multiple pages from a site        POST /v1/crawl              10/page (AI) or 2/page (md)
Get all sitemap URLs                    POST /v1/sitemap            1
Browser automation (login, navigate)    POST /v1/agentic-scrapper   10 + 1/step
Get raw HTML from a URL                 POST /v1/scrape             2 + extras
Generate a JSON schema from a prompt    POST /v1/generate_schema    varies

Credit extras: stealth +4, render_heavy_js +1, number_of_scrolls +1/scroll, branding +2 (scrape only).
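With these rates, the cost of a multi-page job can be estimated up front and checked against the balance. A minimal sketch (the `has_credits` helper is hypothetical; the `remaining_credits` field comes from the GET /v1/credits response):

```shell
# has_credits CREDITS_JSON NEEDED -> succeeds when the balance covers NEEDED.
# (Hypothetical helper; remaining_credits is the field returned by /v1/credits.)
has_credits() {
  remaining=$(printf '%s' "$1" | jq -r '.remaining_credits')
  [ "$remaining" -ge "$2" ]
}

# Example estimate: a 5-page AI crawl costs roughly 5 * 10 = 50 credits.
# Live usage (requires SGAI_API_KEY):
#   CREDITS=$(curl -s https://api.scrapegraphai.com/v1/credits \
#     -H "SGAI-APIKEY: $SGAI_API_KEY")
#   has_credits "$CREDITS" 50 || echo "top up before crawling"
```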

API Reference

SmartScraper — Extract structured data from a URL

POST /v1/smartscraper — Start extraction job (async). GET /v1/smartscraper/{request_id} — Poll for results.

Key parameters:

  • website_url (string) — URL to scrape. Required unless providing website_html or website_markdown.
  • user_prompt (string, required) — What to extract (e.g., “Extract all product names and prices”).
  • output_schema (object) — JSON Schema for structured output.
  • number_of_scrolls (int, 0-100, default 0) — Infinite scroll count. Values 1-9 auto-set to 10.
  • total_pages (int, 1-100, default 1) — Pages to scrape if pagination detected.
  • render_heavy_js (bool, default false) — Enable full JS rendering (+1 credit).
  • stealth (bool, default false) — Bypass bot detection (+4 credits).
  • cookies (object) — Cookies as key-value pairs.
  • headers (object) — Custom HTTP headers.
  • plain_text (bool, default false) — Return plain text instead of JSON.
  • webhook_url (string) — URL for async result delivery.

# Start a SmartScraper job
curl -s -X POST https://api.scrapegraphai.com/v1/smartscraper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://example.com/products",
    "user_prompt": "Extract all product names, prices, and descriptions"
  }'
# Response: { "request_id": "uuid-here", "status": "queued", ... }

# Poll for results
curl -s https://api.scrapegraphai.com/v1/smartscraper/REQUEST_ID \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

With output schema:

curl -s -X POST https://api.scrapegraphai.com/v1/smartscraper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://example.com/products",
    "user_prompt": "Extract all products",
    "output_schema": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "number" }
            }
          }
        }
      }
    }
  }'

SearchScraper — Search the web and extract data

POST /v1/searchscraper — Start search+extract job (async). GET /v1/searchscraper/{request_id} — Poll for results.

Key parameters:

  • user_prompt (string, required) — Search query and extraction instructions.
  • num_results (int, 3-20, default 3) — Number of websites to scrape.
  • extraction_mode (bool, default true) — true = AI extraction (10 credits/site), false = markdown only (2 credits/site).
  • output_schema (object) — JSON Schema for structured output.
  • stealth (bool, default false) — Bypass bot detection (+4 credits).
  • headers (object) — Custom HTTP headers.
  • webhook_url (string) — URL for async result delivery.

# Search and extract with AI
curl -s -X POST https://api.scrapegraphai.com/v1/searchscraper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_prompt": "What are the top 3 Python web frameworks in 2025?",
    "num_results": 5
  }'

# Search and get markdown only (cheaper)
curl -s -X POST https://api.scrapegraphai.com/v1/searchscraper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_prompt": "Python web frameworks comparison",
    "num_results": 3,
    "extraction_mode": false
  }'

Markdownify — Convert a webpage to markdown

POST /v1/markdownify — Start conversion job (async). GET /v1/markdownify/{request_id} — Poll for results.

Key parameters:

  • website_url (string, required) — URL to convert.
  • render_heavy_js (bool, default false) — Enable JS rendering (+1 credit).
  • stealth (bool, default false) — Bypass bot detection (+4 credits).
  • headers (object) — Custom HTTP headers.
  • webhook_url (string) — URL for async result delivery.

curl -s -X POST https://api.scrapegraphai.com/v1/markdownify \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "website_url": "https://example.com/article" }'

Crawler — Crawl and extract from multiple pages

POST /v1/crawl — Start crawl job (async). GET /v1/crawl/{task_id} — Poll for results.

Key parameters:

  • url (string, required) — Starting URL.
  • prompt (string) — Extraction prompt. Required when extraction_mode is true.
  • extraction_mode (bool, default true) — true = AI extraction (10 credits/page), false = markdown (2 credits/page).
  • max_pages (int, default 10) — Maximum pages to crawl.
  • depth (int, default 1) — Crawl depth.
  • schema (object) — JSON Schema for structured output.
  • rules (object) — Crawl rules:
    • same_domain (bool, default true) — Stay on same domain.
    • include_paths (array of strings) — Path patterns to include (e.g., ["/blog/*"]).
    • exclude_paths (array of strings) — Path patterns to exclude.
  • sitemap (bool, default true) — Use sitemap for URL discovery.
  • render_heavy_js (bool, default false) — Enable JS rendering (+1 credit/page).
  • stealth (bool, default false) — Bypass bot detection (+4 credits).
  • webhook_url (string) — URL for async result delivery.

# Crawl with AI extraction
curl -s -X POST https://api.scrapegraphai.com/v1/crawl \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "prompt": "Extract article title and summary from each page",
    "max_pages": 5,
    "depth": 2,
    "rules": {
      "include_paths": ["/blog/*"],
      "same_domain": true
    }
  }'

# Crawl as markdown (cheaper)
curl -s -X POST https://api.scrapegraphai.com/v1/crawl \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "extraction_mode": false,
    "max_pages": 5,
    "depth": 1
  }'

Sitemap — Get all URLs from a website’s sitemap

POST /v1/sitemap — Returns sitemap URLs synchronously.

Key parameters:

  • website_url (string, required) — URL of the website.

curl -s -X POST https://api.scrapegraphai.com/v1/sitemap \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "website_url": "https://example.com" }'

Scrape — Get raw HTML content from a URL

POST /v1/scrape — Returns scraped HTML synchronously.

Key parameters:

  • website_url (string, required) — URL to scrape.
  • render_heavy_js (bool, default false) — Enable JS rendering (+1 credit).
  • stealth (bool, default false) — Bypass bot detection (+4 credits).
  • branding (bool, default false) — Extract branding info (+2 credits).
  • country_code (string) — ISO country code for geo-targeting.

curl -s -X POST https://api.scrapegraphai.com/v1/scrape \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "website_url": "https://example.com", "stealth": true }'

Agentic Scraper — Browser automation with AI

POST /v1/agentic-scrapper — Run browser automation steps.

Key parameters:

  • url (string, required) — Starting URL.
  • steps (array of strings) — Browser action instructions (e.g., ["Click the login button", "Fill email with user@test.com"]).
  • user_prompt (string) — Extraction prompt (used with ai_extraction).
  • output_schema (object) — JSON Schema for structured output.
  • ai_extraction (bool, default false) — Enable AI extraction after steps.
  • use_session (bool, default false) — Persist browser session across requests.

Credits: 10 base + 1 per step.

curl -s -X POST https://api.scrapegraphai.com/v1/agentic-scrapper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/login",
    "steps": [
      "Fill the email field with user@example.com",
      "Fill the password field with password123",
      "Click the Sign In button"
    ],
    "ai_extraction": true,
    "user_prompt": "Extract the dashboard data after login"
  }'

Generate Schema — Generate JSON Schema from a prompt

POST /v1/generate_schema — Start schema generation (async). GET /v1/generate_schema/{request_id} — Poll for results.

Key parameters:

  • user_prompt (string, required) — Describe the schema you need.
  • existing_schema (object) — Existing schema to modify.

curl -s -X POST https://api.scrapegraphai.com/v1/generate_schema \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "user_prompt": "Schema for an e-commerce product with name, price, rating, and reviews" }'

Credits — Check credit balance

GET /v1/credits

curl -s https://api.scrapegraphai.com/v1/credits \
  -H "SGAI-APIKEY: $SGAI_API_KEY"
# Response: { "remaining_credits": 1000, "total_credits_used": 500 }

Async Polling Pattern

Most endpoints are async: POST to start a job, then poll GET with the request_id until status is "completed" or "failed".

# 1. Start the job
RESPONSE=$(curl -s -X POST https://api.scrapegraphai.com/v1/smartscraper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "website_url": "https://example.com", "user_prompt": "Extract the main heading" }')

REQUEST_ID=$(echo "$RESPONSE" | jq -r '.request_id')

# 2. Poll until completed or failed
while true; do
  RESULT=$(curl -s https://api.scrapegraphai.com/v1/smartscraper/$REQUEST_ID \
    -H "SGAI-APIKEY: $SGAI_API_KEY")
  STATUS=$(echo "$RESULT" | jq -r '.status')
  if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
    echo "$RESULT" | jq .
    break
  fi
  sleep 3
done

Status values: queued → processing → completed | failed

This pattern applies to: SmartScraper, SearchScraper, Markdownify, Crawler, Generate Schema.

Synchronous endpoints (no polling needed): Sitemap, Scrape, Credits, Validate.
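
Since the same loop serves all five async endpoints, it can be factored into one small helper. A sketch (`poll_until_done` and the fetch wrapper are hypothetical; note that the Crawler polls by task_id rather than request_id):

```shell
# poll_until_done FETCH_CMD [ARGS...] -> runs FETCH_CMD repeatedly until the
# JSON it prints reports a terminal status, then prints that JSON.
# (Hypothetical helper; "completed"/"failed" are the terminal statuses above.)
poll_until_done() {
  while true; do
    result=$("$@")
    status=$(printf '%s' "$result" | jq -r '.status')
    case "$status" in
      completed|failed) printf '%s\n' "$result"; return ;;
    esac
    sleep 3
  done
}

# Usage against the Crawler (which returns a task_id):
#   fetch_crawl() {
#     curl -s "https://api.scrapegraphai.com/v1/crawl/$TASK_ID" \
#       -H "SGAI-APIKEY: $SGAI_API_KEY"
#   }
#   poll_until_done fetch_crawl
```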

Key Guidelines

  1. Always check credits first before large operations (crawls, multi-page scrapes):

    curl -s https://api.scrapegraphai.com/v1/credits -H "SGAI-APIKEY: $SGAI_API_KEY"
    
  2. Never hardcode API keys — always use $SGAI_API_KEY.

  3. Use output_schema for structured data extraction — it ensures consistent JSON output.

  4. Prefer markdown mode when raw content suffices — set extraction_mode: false to use 2 credits/page instead of 10 for SearchScraper and Crawler.

  5. Use stealth only when needed — it adds 4 credits per request. Try without it first; enable if you get blocked.

  6. Start crawls small — use low max_pages and depth first, then scale up.

  7. Save results to files — write scraping results to a .scrapegraph/ directory:

    mkdir -p .scrapegraph
    # Save result to file
    echo "$RESULT" | jq . > .scrapegraph/products.json
    
  8. Add .scrapegraph/ to .gitignore to avoid committing scrape results.

  9. Use generate_schema endpoint when users need help defining output schemas.

  10. For SDK usage (Python/JS projects), see references/sdk-examples.md.

  11. For advanced features (scrolling, pagination, geo-targeting, cookies), see references/advanced-features.md.

  12. For full parameter tables, see references/api-endpoints.md.
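
Guidelines 7 and 8 can be combined into one idempotent setup snippet (a sketch; safe to re-run in the same project):

```shell
# Create the results directory and ignore it in git (no duplicate entries).
mkdir -p .scrapegraph
touch .gitignore
grep -qxF '.scrapegraph/' .gitignore || echo '.scrapegraph/' >> .gitignore
```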

Error Reference

Status  Meaning                            Action
401     Invalid or missing API key         Check that $SGAI_API_KEY is set correctly
402     Insufficient credits               Purchase more credits at the dashboard
422     Invalid parameters                 Check the request body against the parameter docs
429     Rate limited (10 req/min default)  Wait and retry with backoff
500     Server error                       Retry after a short delay
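
For 429 and 500, the retry decision can key off curl's --write-out status code. A sketch (`should_retry` is a hypothetical helper):

```shell
# should_retry HTTP_CODE -> succeeds when the code warrants a retry (429 or 5xx).
should_retry() {
  case "$1" in
    429|5??) return 0 ;;
    *) return 1 ;;
  esac
}

# Live usage (requires SGAI_API_KEY):
#   code=$(curl -s -o /tmp/out.json -w '%{http_code}' \
#     https://api.scrapegraphai.com/v1/credits -H "SGAI-APIKEY: $SGAI_API_KEY")
#   should_retry "$code" && sleep 5  # then re-issue the request
```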