scrapegraphai
npx skills add https://github.com/scrapegraphai/skill --skill scrapegraphai
ScrapeGraphAI
AI-powered web scraping and data extraction API.
- API Base URL: https://api.scrapegraphai.com/v1
- Dashboard: https://dashboard.scrapegraphai.com
- Docs: https://docs.scrapegraphai.com
- Every API call costs credits. Always check credit balance before large operations.
- For full parameter tables, SDK examples, and advanced features, see the references/ files.
Authentication
All requests require the SGAI-APIKEY header. The API key should be stored in the SGAI_API_KEY environment variable.
# Check if API key is set
echo $SGAI_API_KEY
# Validate API key
curl -s https://api.scrapegraphai.com/v1/validate \
-H "SGAI-APIKEY: $SGAI_API_KEY"
# Check remaining credits
curl -s https://api.scrapegraphai.com/v1/credits \
-H "SGAI-APIKEY: $SGAI_API_KEY"
If the key is not set, tell the user to get one from https://dashboard.scrapegraphai.com and set it:
export SGAI_API_KEY="sgai-..."
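That check can be made automatic. The guard below is a sketch of our own (the function name is not part of any official CLI); it simply refuses to proceed when the variable is empty or unset.

```shell
# Sketch of a pre-flight guard (function name is our own, not an official
# helper): refuse to proceed when SGAI_API_KEY is empty or unset.
require_api_key() {
  if [ -z "${SGAI_API_KEY:-}" ]; then
    echo "SGAI_API_KEY is not set; get a key at https://dashboard.scrapegraphai.com" >&2
    return 1
  fi
}

# Example: run the guard before any curl call.
SGAI_API_KEY="sgai-example" require_api_key && echo "key present"
```

Calling it at the top of every script keeps curl from silently sending requests with an empty header.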
Quick Decision Guide
| Need | Endpoint | Credits |
|---|---|---|
| Extract structured data from a URL | POST /v1/smartscraper | ~10/page + extras |
| Search the web and extract | POST /v1/searchscraper | 10/site (AI) or 2/site (md) |
| Convert a page to markdown | POST /v1/markdownify | ~2 + extras |
| Crawl multiple pages from a site | POST /v1/crawl | 10/page (AI) or 2/page (md) |
| Get all sitemap URLs | POST /v1/sitemap | 1 |
| Browser automation (login, navigate) | POST /v1/agentic-scrapper | 10 + 1/step |
| Get raw HTML from a URL | POST /v1/scrape | 2 + extras |
| Generate a JSON schema from a prompt | POST /v1/generate_schema | varies |
Credit extras: stealth +4, render_heavy_js +1, number_of_scrolls +1/scroll, branding +2 (scrape only).
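These numbers allow a rough pre-flight cost estimate. The helper below is our own sketch built from the table above; it assumes the extras are charged once per request, and actual billing is authoritative.

```shell
# Rough SmartScraper cost estimate (our own sketch; assumes ~10 credits/page,
# stealth +4, render_heavy_js +1, and +1 per scroll, per the table above).
estimate_smartscraper_credits() {
  local pages=$1 stealth=$2 render_heavy_js=$3 scrolls=$4
  local cost=$((pages * 10))
  if [ "$stealth" = "true" ]; then cost=$((cost + 4)); fi
  if [ "$render_heavy_js" = "true" ]; then cost=$((cost + 1)); fi
  cost=$((cost + scrolls))
  echo "$cost"
}

# Example: 3 pages with stealth enabled.
estimate_smartscraper_credits 3 true false 0
```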
API Reference
SmartScraper – Extract structured data from a URL
POST /v1/smartscraper – Start extraction job (async).
GET /v1/smartscraper/{request_id} – Poll for results.
Key parameters:
- `website_url` (string) – URL to scrape. Required unless providing `website_html` or `website_markdown`.
- `user_prompt` (string, required) – What to extract (e.g., "Extract all product names and prices").
- `output_schema` (object) – JSON Schema for structured output.
- `number_of_scrolls` (int, 0-100, default 0) – Infinite scroll count. Values 1-9 auto-set to 10.
- `total_pages` (int, 1-100, default 1) – Pages to scrape if pagination detected.
- `render_heavy_js` (bool, default false) – Enable full JS rendering (+1 credit).
- `stealth` (bool, default false) – Bypass bot detection (+4 credits).
- `cookies` (object) – Cookies as key-value pairs.
- `headers` (object) – Custom HTTP headers.
- `plain_text` (bool, default false) – Return plain text instead of JSON.
- `webhook_url` (string) – URL for async result delivery.
# Start a SmartScraper job
curl -s -X POST https://api.scrapegraphai.com/v1/smartscraper \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"website_url": "https://example.com/products",
"user_prompt": "Extract all product names, prices, and descriptions"
}'
# Response: { "request_id": "uuid-here", "status": "queued", ... }
# Poll for results
curl -s https://api.scrapegraphai.com/v1/smartscraper/REQUEST_ID \
-H "SGAI-APIKEY: $SGAI_API_KEY"
With output schema:
curl -s -X POST https://api.scrapegraphai.com/v1/smartscraper \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"website_url": "https://example.com/products",
"user_prompt": "Extract all products",
"output_schema": {
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": { "type": "string" },
"price": { "type": "number" }
}
}
}
}
}
}'
SearchScraper – Search the web and extract data
POST /v1/searchscraper – Start search+extract job (async).
GET /v1/searchscraper/{request_id} – Poll for results.
Key parameters:
- `user_prompt` (string, required) – Search query and extraction instructions.
- `num_results` (int, 3-20, default 3) – Number of websites to scrape.
- `extraction_mode` (bool, default true) – true = AI extraction (10 credits/site), false = markdown only (2 credits/site).
- `output_schema` (object) – JSON Schema for structured output.
- `stealth` (bool, default false) – Bypass bot detection (+4 credits).
- `headers` (object) – Custom HTTP headers.
- `webhook_url` (string) – URL for async result delivery.
# Search and extract with AI
curl -s -X POST https://api.scrapegraphai.com/v1/searchscraper \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"user_prompt": "What are the top 3 Python web frameworks in 2025?",
"num_results": 5
}'
# Search and get markdown only (cheaper)
curl -s -X POST https://api.scrapegraphai.com/v1/searchscraper \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"user_prompt": "Python web frameworks comparison",
"num_results": 3,
"extraction_mode": false
}'
Markdownify – Convert a webpage to markdown
POST /v1/markdownify – Start conversion job (async).
GET /v1/markdownify/{request_id} – Poll for results.
Key parameters:
- `website_url` (string, required) – URL to convert.
- `render_heavy_js` (bool, default false) – Enable JS rendering (+1 credit).
- `stealth` (bool, default false) – Bypass bot detection (+4 credits).
- `headers` (object) – Custom HTTP headers.
- `webhook_url` (string) – URL for async result delivery.
curl -s -X POST https://api.scrapegraphai.com/v1/markdownify \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{ "website_url": "https://example.com/article" }'
Crawler – Crawl and extract from multiple pages
POST /v1/crawl – Start crawl job (async).
GET /v1/crawl/{task_id} – Poll for results.
Key parameters:
- `url` (string, required) – Starting URL.
- `prompt` (string) – Extraction prompt. Required when `extraction_mode` is true.
- `extraction_mode` (bool, default true) – true = AI extraction (10 credits/page), false = markdown (2 credits/page).
- `max_pages` (int, default 10) – Maximum pages to crawl.
- `depth` (int, default 1) – Crawl depth.
- `schema` (object) – JSON Schema for structured output.
- `rules` (object) – Crawl rules:
  - `same_domain` (bool, default true) – Stay on the same domain.
  - `include_paths` (array of strings) – Path patterns to include (e.g., `["/blog/*"]`).
  - `exclude_paths` (array of strings) – Path patterns to exclude.
- `sitemap` (bool, default true) – Use the sitemap for URL discovery.
- `render_heavy_js` (bool, default false) – Enable JS rendering (+1 credit/page).
- `stealth` (bool, default false) – Bypass bot detection (+4 credits).
- `webhook_url` (string) – URL for async result delivery.
# Crawl with AI extraction
curl -s -X POST https://api.scrapegraphai.com/v1/crawl \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"prompt": "Extract article title and summary from each page",
"max_pages": 5,
"depth": 2,
"rules": {
"include_paths": ["/blog/*"],
"same_domain": true
}
}'
# Crawl as markdown (cheaper)
curl -s -X POST https://api.scrapegraphai.com/v1/crawl \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com",
"extraction_mode": false,
"max_pages": 5,
"depth": 1
}'
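Crawls are the easiest way to burn through a balance, so it is worth comparing the worst case against the figure from /v1/credits first. A minimal sketch (the function is hypothetical; it uses the 10 vs. 2 credits/page rates above and ignores extras):

```shell
# Hypothetical pre-flight check: does the remaining balance cover the worst
# case of max_pages at the per-page rate (10 for AI extraction, 2 for markdown)?
crawl_budget_ok() {
  local remaining=$1 max_pages=$2 ai_mode=$3
  local per_page=2
  if [ "$ai_mode" = "true" ]; then per_page=10; fi
  [ $((max_pages * per_page)) -le "$remaining" ]
}

# Example: with 1000 credits remaining, a 5-page AI crawl costs at most 50.
crawl_budget_ok 1000 5 true && echo "enough credits"
```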
Sitemap – Get all URLs from a website’s sitemap
POST /v1/sitemap – Returns sitemap URLs synchronously.
Key parameters:
- `website_url` (string, required) – URL of the website.
curl -s -X POST https://api.scrapegraphai.com/v1/sitemap \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{ "website_url": "https://example.com" }'
Scrape – Get raw HTML content from a URL
POST /v1/scrape – Returns scraped HTML synchronously.
Key parameters:
- `website_url` (string, required) – URL to scrape.
- `render_heavy_js` (bool, default false) – Enable JS rendering (+1 credit).
- `stealth` (bool, default false) – Bypass bot detection (+4 credits).
- `branding` (bool, default false) – Extract branding info (+2 credits).
- `country_code` (string) – ISO country code for geo-targeting.
curl -s -X POST https://api.scrapegraphai.com/v1/scrape \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{ "website_url": "https://example.com", "stealth": true }'
Agentic Scraper – Browser automation with AI
POST /v1/agentic-scrapper – Run browser automation steps.
Key parameters:
- `url` (string, required) – Starting URL.
- `steps` (array of strings) – Browser action instructions (e.g., `["Click the login button", "Fill email with user@test.com"]`).
- `user_prompt` (string) – Extraction prompt (used with `ai_extraction`).
- `output_schema` (object) – JSON Schema for structured output.
- `ai_extraction` (bool, default false) – Enable AI extraction after the steps run.
- `use_session` (bool, default false) – Persist the browser session across requests.
Credits: 10 base + 1 per step.
curl -s -X POST https://api.scrapegraphai.com/v1/agentic-scrapper \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://example.com/login",
"steps": [
"Fill the email field with user@example.com",
"Fill the password field with password123",
"Click the Sign In button"
],
"ai_extraction": true,
"user_prompt": "Extract the dashboard data after login"
}'
Generate Schema – Generate a JSON Schema from a prompt
POST /v1/generate_schema – Start schema generation (async).
GET /v1/generate_schema/{request_id} – Poll for results.
Key parameters:
- `user_prompt` (string, required) – Describe the schema you need.
- `existing_schema` (object) – Existing schema to modify.
curl -s -X POST https://api.scrapegraphai.com/v1/generate_schema \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{ "user_prompt": "Schema for an e-commerce product with name, price, rating, and reviews" }'
Credits – Check credit balance
GET /v1/credits
curl -s https://api.scrapegraphai.com/v1/credits \
-H "SGAI-APIKEY: $SGAI_API_KEY"
# Response: { "remaining_credits": 1000, "total_credits_used": 500 }
Async Polling Pattern
Most endpoints are async: POST to start a job, then poll GET with the request_id until status is "completed" or "failed".
# 1. Start the job
RESPONSE=$(curl -s -X POST https://api.scrapegraphai.com/v1/smartscraper \
-H "SGAI-APIKEY: $SGAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{ "website_url": "https://example.com", "user_prompt": "Extract the main heading" }')
REQUEST_ID=$(echo "$RESPONSE" | jq -r '.request_id')
# 2. Poll until completed or failed
while true; do
RESULT=$(curl -s https://api.scrapegraphai.com/v1/smartscraper/$REQUEST_ID \
-H "SGAI-APIKEY: $SGAI_API_KEY")
STATUS=$(echo "$RESULT" | jq -r '.status')
if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
echo "$RESULT" | jq .
break
fi
sleep 3
done
Status values: queued → processing → completed | failed
This pattern applies to: SmartScraper, SearchScraper, Markdownify, Crawler, Generate Schema.
Synchronous endpoints (no polling needed): Sitemap, Scrape, Credits, Validate.
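The loop above can be wrapped in a reusable helper that works for any of the async endpoints. This is a sketch of our own (the function name and attempt cap are not part of the API); it takes the fetch command as arguments, which also makes it easy to test offline.

```shell
# Generic polling wrapper (our own sketch): re-run the given fetch command
# until .status is "completed" or "failed", up to a maximum attempt count.
poll_until_done() {
  local max_attempts=$1 result status i
  shift
  for i in $(seq 1 "$max_attempts"); do
    result=$("$@")
    status=$(echo "$result" | jq -r '.status')
    if [ "$status" = "completed" ] || [ "$status" = "failed" ]; then
      echo "$result"
      return 0
    fi
    sleep 3
  done
  echo "gave up after $max_attempts attempts" >&2
  return 1
}

# Usage with SmartScraper:
# poll_until_done 100 curl -s https://api.scrapegraphai.com/v1/smartscraper/$REQUEST_ID \
#   -H "SGAI-APIKEY: $SGAI_API_KEY"
```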
Key Guidelines
- Always check credits first before large operations (crawls, multi-page scrapes): `curl -s https://api.scrapegraphai.com/v1/credits -H "SGAI-APIKEY: $SGAI_API_KEY"`
- Never hardcode API keys – always use `$SGAI_API_KEY`.
- Use `output_schema` for structured data extraction – it ensures consistent JSON output.
- Prefer markdown mode when raw content suffices – set `extraction_mode: false` to use 2 credits/page instead of 10 for SearchScraper and Crawler.
- Use stealth only when needed – it adds 4 credits per request. Try without it first; enable it if you get blocked.
- Start crawls small – use a low `max_pages` and `depth` first, then scale up.
- Save results to files – write scraping results to a `.scrapegraph/` directory: `mkdir -p .scrapegraph`, then `echo "$RESULT" | jq . > .scrapegraph/products.json`.
- Add `.scrapegraph/` to `.gitignore` to avoid committing scrape results.
- Use the `generate_schema` endpoint when users need help defining output schemas.
- For SDK usage (Python/JS projects), see `references/sdk-examples.md`.
- For advanced features (scrolling, pagination, geo-targeting, cookies), see `references/advanced-features.md`.
- For full parameter tables, see `references/api-endpoints.md`.
Error Reference
| Status | Meaning | Action |
|---|---|---|
| 401 | Invalid or missing API key | Check $SGAI_API_KEY is set correctly |
| 402 | Insufficient credits | User needs to purchase more credits at dashboard |
| 422 | Invalid parameters | Check request body against parameter docs |
| 429 | Rate limited (10 req/min default) | Wait and retry with backoff |
| 500 | Server error | Retry after a short delay |
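For the retryable rows (429 and 500), a simple exponential backoff wrapper is enough. The helper below is our own sketch; curl's `-f` flag makes HTTP error responses return a nonzero exit status, which is what triggers the retry.

```shell
# Our own retry sketch: re-run a command with doubling delays until it
# succeeds or the attempt budget is exhausted.
retry_with_backoff() {
  local attempts=$1 delay=$2 i
  shift 2
  for i in $(seq 1 "$attempts"); do
    if "$@"; then return 0; fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
      delay=$((delay * 2))
    fi
  done
  return 1
}

# Usage: -f turns HTTP errors (429, 500) into nonzero exit codes.
# retry_with_backoff 5 2 curl -sf https://api.scrapegraphai.com/v1/credits \
#   -H "SGAI-APIKEY: $SGAI_API_KEY"
```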