scrapegraphai

📁 scrapegraphai/skill 📅 4 days ago
Install command
npx skills add https://github.com/scrapegraphai/skill --skill scrapegraphai

Skill documentation

ScrapeGraphAI

AI-powered web scraping and data extraction API.

  • API Base URL: https://api.scrapegraphai.com/v1
  • Dashboard: https://dashboard.scrapegraphai.com
  • Docs: https://docs.scrapegraphai.com
  • Every API call costs credits. Always check credit balance before large operations.
  • For full parameter tables, SDK examples, and advanced features, see references/ files.

Authentication

All requests require the SGAI-APIKEY header. The API key should be stored in the SGAI_API_KEY environment variable.

# Check if API key is set
echo $SGAI_API_KEY

# Validate API key
curl -s https://api.scrapegraphai.com/v1/validate \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

# Check remaining credits
curl -s https://api.scrapegraphai.com/v1/credits \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

If the key is not set, tell the user to get one from https://dashboard.scrapegraphai.com and set it:

export SGAI_API_KEY="sgai-..."

Quick Decision Guide

Need                                    Endpoint                    Credits
Extract structured data from a URL      POST /v1/smartscraper       ~10/page + extras
Search the web and extract              POST /v1/searchscraper      10/site (AI) or 2/site (md)
Convert a page to markdown              POST /v1/markdownify        ~2 + extras
Crawl multiple pages from a site        POST /v1/crawl              10/page (AI) or 2/page (md)
Get all sitemap URLs                    POST /v1/sitemap            1
Browser automation (login, navigate)    POST /v1/agentic-scrapper   10 + 1/step
Get raw HTML from a URL                 POST /v1/scrape             2 + extras
Generate a JSON schema from a prompt    POST /v1/generate_schema    varies

Credit extras: stealth +4, render_heavy_js +1, number_of_scrolls +1/scroll, branding +2 (scrape only).
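With these rates, the cost of a multi-page job can be estimated up front and checked against the balance. A minimal sketch (the `has_credits` helper is hypothetical; the `remaining_credits` field comes from the GET /v1/credits response):

```shell
# has_credits CREDITS_JSON NEEDED -> succeeds when the balance covers NEEDED.
# (Hypothetical helper; remaining_credits is the field returned by /v1/credits.)
has_credits() {
  remaining=$(printf '%s' "$1" | jq -r '.remaining_credits')
  [ "$remaining" -ge "$2" ]
}

# Example estimate: a 5-page AI crawl costs roughly 5 * 10 = 50 credits.
# Live usage (requires SGAI_API_KEY):
#   CREDITS=$(curl -s https://api.scrapegraphai.com/v1/credits \
#     -H "SGAI-APIKEY: $SGAI_API_KEY")
#   has_credits "$CREDITS" 50 || echo "top up before crawling"
```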

API Reference

SmartScraper — Extract structured data from a URL

POST /v1/smartscraper — Start extraction job (async). GET /v1/smartscraper/{request_id} — Poll for results.

Key parameters:

  • website_url (string) — URL to scrape. Required unless providing website_html or website_markdown.
  • user_prompt (string, required) — What to extract (e.g., “Extract all product names and prices”).
  • output_schema (object) — JSON Schema for structured output.
  • number_of_scrolls (int, 0-100, default 0) — Infinite scroll count. Values 1-9 auto-set to 10.
  • total_pages (int, 1-100, default 1) — Pages to scrape if pagination detected.
  • render_heavy_js (bool, default false) — Enable full JS rendering (+1 credit).
  • stealth (bool, default false) — Bypass bot detection (+4 credits).
  • cookies (object) — Cookies as key-value pairs.
  • headers (object) — Custom HTTP headers.
  • plain_text (bool, default false) — Return plain text instead of JSON.
  • webhook_url (string) — URL for async result delivery.

# Start a SmartScraper job
curl -s -X POST https://api.scrapegraphai.com/v1/smartscraper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://example.com/products",
    "user_prompt": "Extract all product names, prices, and descriptions"
  }'
# Response: { "request_id": "uuid-here", "status": "queued", ... }

# Poll for results
curl -s https://api.scrapegraphai.com/v1/smartscraper/REQUEST_ID \
  -H "SGAI-APIKEY: $SGAI_API_KEY"

With output schema:

curl -s -X POST https://api.scrapegraphai.com/v1/smartscraper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "website_url": "https://example.com/products",
    "user_prompt": "Extract all products",
    "output_schema": {
      "type": "object",
      "properties": {
        "products": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": { "type": "string" },
              "price": { "type": "number" }
            }
          }
        }
      }
    }
  }'

SearchScraper — Search the web and extract data

POST /v1/searchscraper — Start search+extract job (async). GET /v1/searchscraper/{request_id} — Poll for results.

Key parameters:

  • user_prompt (string, required) — Search query and extraction instructions.
  • num_results (int, 3-20, default 3) — Number of websites to scrape.
  • extraction_mode (bool, default true) — true = AI extraction (10 credits/site), false = markdown only (2 credits/site).
  • output_schema (object) — JSON Schema for structured output.
  • stealth (bool, default false) — Bypass bot detection (+4 credits).
  • headers (object) — Custom HTTP headers.
  • webhook_url (string) — URL for async result delivery.

# Search and extract with AI
curl -s -X POST https://api.scrapegraphai.com/v1/searchscraper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_prompt": "What are the top 3 Python web frameworks in 2025?",
    "num_results": 5
  }'

# Search and get markdown only (cheaper)
curl -s -X POST https://api.scrapegraphai.com/v1/searchscraper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_prompt": "Python web frameworks comparison",
    "num_results": 3,
    "extraction_mode": false
  }'

Markdownify — Convert a webpage to markdown

POST /v1/markdownify — Start conversion job (async). GET /v1/markdownify/{request_id} — Poll for results.

Key parameters:

  • website_url (string, required) — URL to convert.
  • render_heavy_js (bool, default false) — Enable JS rendering (+1 credit).
  • stealth (bool, default false) — Bypass bot detection (+4 credits).
  • headers (object) — Custom HTTP headers.
  • webhook_url (string) — URL for async result delivery.

curl -s -X POST https://api.scrapegraphai.com/v1/markdownify \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "website_url": "https://example.com/article" }'

Crawler — Crawl and extract from multiple pages

POST /v1/crawl — Start crawl job (async). GET /v1/crawl/{task_id} — Poll for results.

Key parameters:

  • url (string, required) — Starting URL.
  • prompt (string) — Extraction prompt. Required when extraction_mode is true.
  • extraction_mode (bool, default true) — true = AI extraction (10 credits/page), false = markdown (2 credits/page).
  • max_pages (int, default 10) — Maximum pages to crawl.
  • depth (int, default 1) — Crawl depth.
  • schema (object) — JSON Schema for structured output.
  • rules (object) — Crawl rules:
    • same_domain (bool, default true) — Stay on same domain.
    • include_paths (array of strings) — Path patterns to include (e.g., ["/blog/*"]).
    • exclude_paths (array of strings) — Path patterns to exclude.
  • sitemap (bool, default true) — Use sitemap for URL discovery.
  • render_heavy_js (bool, default false) — Enable JS rendering (+1 credit/page).
  • stealth (bool, default false) — Bypass bot detection (+4 credits).
  • webhook_url (string) — URL for async result delivery.

# Crawl with AI extraction
curl -s -X POST https://api.scrapegraphai.com/v1/crawl \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "prompt": "Extract article title and summary from each page",
    "max_pages": 5,
    "depth": 2,
    "rules": {
      "include_paths": ["/blog/*"],
      "same_domain": true
    }
  }'

# Crawl as markdown (cheaper)
curl -s -X POST https://api.scrapegraphai.com/v1/crawl \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "extraction_mode": false,
    "max_pages": 5,
    "depth": 1
  }'

Sitemap — Get all URLs from a website’s sitemap

POST /v1/sitemap — Returns sitemap URLs synchronously.

Key parameters:

  • website_url (string, required) — URL of the website.

curl -s -X POST https://api.scrapegraphai.com/v1/sitemap \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "website_url": "https://example.com" }'

Scrape — Get raw HTML content from a URL

POST /v1/scrape — Returns scraped HTML synchronously.

Key parameters:

  • website_url (string, required) — URL to scrape.
  • render_heavy_js (bool, default false) — Enable JS rendering (+1 credit).
  • stealth (bool, default false) — Bypass bot detection (+4 credits).
  • branding (bool, default false) — Extract branding info (+2 credits).
  • country_code (string) — ISO country code for geo-targeting.

curl -s -X POST https://api.scrapegraphai.com/v1/scrape \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "website_url": "https://example.com", "stealth": true }'

Agentic Scraper — Browser automation with AI

POST /v1/agentic-scrapper — Run browser automation steps.

Key parameters:

  • url (string, required) — Starting URL.
  • steps (array of strings) — Browser action instructions (e.g., ["Click the login button", "Fill email with user@test.com"]).
  • user_prompt (string) — Extraction prompt (used with ai_extraction).
  • output_schema (object) — JSON Schema for structured output.
  • ai_extraction (bool, default false) — Enable AI extraction after steps.
  • use_session (bool, default false) — Persist browser session across requests.

Credits: 10 base + 1 per step.

curl -s -X POST https://api.scrapegraphai.com/v1/agentic-scrapper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/login",
    "steps": [
      "Fill the email field with user@example.com",
      "Fill the password field with password123",
      "Click the Sign In button"
    ],
    "ai_extraction": true,
    "user_prompt": "Extract the dashboard data after login"
  }'

Generate Schema — Generate JSON Schema from a prompt

POST /v1/generate_schema — Start schema generation (async). GET /v1/generate_schema/{request_id} — Poll for results.

Key parameters:

  • user_prompt (string, required) — Describe the schema you need.
  • existing_schema (object) — Existing schema to modify.

curl -s -X POST https://api.scrapegraphai.com/v1/generate_schema \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "user_prompt": "Schema for an e-commerce product with name, price, rating, and reviews" }'

Credits — Check credit balance

GET /v1/credits

curl -s https://api.scrapegraphai.com/v1/credits \
  -H "SGAI-APIKEY: $SGAI_API_KEY"
# Response: { "remaining_credits": 1000, "total_credits_used": 500 }

Async Polling Pattern

Most endpoints are async: POST to start a job, then poll GET with the request_id until status is "completed" or "failed".

# 1. Start the job
RESPONSE=$(curl -s -X POST https://api.scrapegraphai.com/v1/smartscraper \
  -H "SGAI-APIKEY: $SGAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "website_url": "https://example.com", "user_prompt": "Extract the main heading" }')

REQUEST_ID=$(echo "$RESPONSE" | jq -r '.request_id')

# 2. Poll until completed or failed
while true; do
  RESULT=$(curl -s https://api.scrapegraphai.com/v1/smartscraper/$REQUEST_ID \
    -H "SGAI-APIKEY: $SGAI_API_KEY")
  STATUS=$(echo "$RESULT" | jq -r '.status')
  if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
    echo "$RESULT" | jq .
    break
  fi
  sleep 3
done

Status values: queued → processing → completed | failed

This pattern applies to: SmartScraper, SearchScraper, Markdownify, Crawler, Generate Schema.

Synchronous endpoints (no polling needed): Sitemap, Scrape, Credits, Validate.
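
Since the same loop serves all five async endpoints, it can be factored into one small helper. A sketch (`poll_until_done` and the fetch wrapper are hypothetical; note that the Crawler polls by task_id rather than request_id):

```shell
# poll_until_done FETCH_CMD [ARGS...] -> runs FETCH_CMD repeatedly until the
# JSON it prints reports a terminal status, then prints that JSON.
# (Hypothetical helper; "completed"/"failed" are the terminal statuses above.)
poll_until_done() {
  while true; do
    result=$("$@")
    status=$(printf '%s' "$result" | jq -r '.status')
    case "$status" in
      completed|failed) printf '%s\n' "$result"; return ;;
    esac
    sleep 3
  done
}

# Usage against the Crawler (which returns a task_id):
#   fetch_crawl() {
#     curl -s "https://api.scrapegraphai.com/v1/crawl/$TASK_ID" \
#       -H "SGAI-APIKEY: $SGAI_API_KEY"
#   }
#   poll_until_done fetch_crawl
```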

Key Guidelines

  1. Always check credits first before large operations (crawls, multi-page scrapes):

    curl -s https://api.scrapegraphai.com/v1/credits -H "SGAI-APIKEY: $SGAI_API_KEY"
    
  2. Never hardcode API keys — always use $SGAI_API_KEY.

  3. Use output_schema for structured data extraction — it ensures consistent JSON output.

  4. Prefer markdown mode when raw content suffices — set extraction_mode: false to use 2 credits/page instead of 10 for SearchScraper and Crawler.

  5. Use stealth only when needed — it adds 4 credits per request. Try without it first; enable if you get blocked.

  6. Start crawls small — use low max_pages and depth first, then scale up.

  7. Save results to files — write scraping results to a .scrapegraph/ directory:

    mkdir -p .scrapegraph
    # Save result to file
    echo "$RESULT" | jq . > .scrapegraph/products.json
    
  8. Add .scrapegraph/ to .gitignore to avoid committing scrape results.

  9. Use generate_schema endpoint when users need help defining output schemas.

  10. For SDK usage (Python/JS projects), see references/sdk-examples.md.

  11. For advanced features (scrolling, pagination, geo-targeting, cookies), see references/advanced-features.md.

  12. For full parameter tables, see references/api-endpoints.md.
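
Guidelines 7 and 8 can be combined into one idempotent setup snippet (a sketch; safe to re-run in the same project):

```shell
# Create the results directory and ignore it in git (no duplicate entries).
mkdir -p .scrapegraph
touch .gitignore
grep -qxF '.scrapegraph/' .gitignore || echo '.scrapegraph/' >> .gitignore
```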

Error Reference

Status  Meaning                            Action
401     Invalid or missing API key         Check that $SGAI_API_KEY is set correctly
402     Insufficient credits               Purchase more credits at the dashboard
422     Invalid parameters                 Check the request body against the parameter docs
429     Rate limited (10 req/min default)  Wait and retry with backoff
500     Server error                       Retry after a short delay
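
For 429 and 500, the retry decision can key off curl's --write-out status code. A sketch (`should_retry` is a hypothetical helper):

```shell
# should_retry HTTP_CODE -> succeeds when the code warrants a retry (429 or 5xx).
should_retry() {
  case "$1" in
    429|5??) return 0 ;;
    *) return 1 ;;
  esac
}

# Live usage (requires SGAI_API_KEY):
#   code=$(curl -s -o /tmp/out.json -w '%{http_code}' \
#     https://api.scrapegraphai.com/v1/credits -H "SGAI-APIKEY: $SGAI_API_KEY")
#   should_retry "$code" && sleep 5  # then re-issue the request
```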