bright-data
npx skills add https://github.com/vm0-ai/vm0-skills --skill bright-data
Bright Data Web Scraper API
Use the Bright Data API via direct curl calls for social media scraping, web data extraction, and account management.
Official docs:
https://docs.brightdata.com/
When to Use
Use this skill when you need to:
- Scrape social media – Twitter/X, Reddit, YouTube, Instagram, TikTok, LinkedIn
- Extract web data – Posts, profiles, comments, engagement metrics
- Monitor usage – Track bandwidth and request usage
- Manage account – Check status and zones
Prerequisites
- Sign up at Bright Data
- Get your API key from Settings > Users
- Create a Web Scraper dataset in the Control Panel to get your `dataset_id`
export BRIGHTDATA_API_KEY="your-api-key"
Base URL
https://api.brightdata.com
Important: When using `$VAR` in a command that pipes to another command, wrap the command containing `$VAR` in `bash -c '...'`. Due to a Claude Code bug, environment variables are silently cleared when pipes are used directly.

bash -c 'curl -s "https://api.example.com" -H "Authorization: Bearer $API_KEY"'
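The request bodies in the steps below are hand-written JSON files. As a sketch, they can also be generated with `jq -n`, which handles JSON escaping so URLs containing quotes or special characters stay valid (the two Twitter URLs here are placeholder examples):

```shell
# Build /tmp/brightdata_request.json programmatically instead of hand-writing it.
# jq performs the JSON escaping, so unusual characters in URLs cannot break the body.
jq -n \
  --arg u1 "https://twitter.com/username" \
  --arg u2 "https://twitter.com/username2" \
  '[{url: $u1}, {url: $u2}]' > /tmp/brightdata_request.json
cat /tmp/brightdata_request.json
```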
Social Media Scraping
Bright Data supports scraping these social media platforms:
| Platform | Profiles | Posts | Comments | Reels/Videos |
|---|---|---|---|---|
| Twitter/X | ✓ | ✓ | – | – |
| Reddit | – | ✓ | ✓ | – |
| YouTube | ✓ | ✓ | ✓ | – |
| Instagram | ✓ | ✓ | ✓ | ✓ |
| TikTok | ✓ | ✓ | ✓ | – |
| LinkedIn | ✓ | ✓ | – | – |
How to Use
1. Trigger Scraping (Asynchronous)
Trigger a data collection job and get a snapshot_id for later retrieval.
Write to /tmp/brightdata_request.json:
[
{"url": "https://twitter.com/username"},
{"url": "https://twitter.com/username2"}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
Response:
{
"snapshot_id": "s_m4x7enmven8djfqak"
}
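The `snapshot_id` from this response is what the progress, download, and cancel endpoints expect. A small sketch of capturing it with `jq`, using the documented sample response in place of live curl output:

```shell
# Extract snapshot_id from the trigger response for later /progress and /snapshot calls.
# The sample response above stands in for the real curl output here.
response='{"snapshot_id": "s_m4x7enmven8djfqak"}'
snapshot_id=$(echo "$response" | jq -r '.snapshot_id')
echo "snapshot_id: $snapshot_id"
```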
2. Trigger Scraping (Synchronous)
Get results immediately in the response (for small requests).
Write to /tmp/brightdata_request.json:
[
{"url": "https://www.reddit.com/r/technology/comments/xxxxx"}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
3. Monitor Progress
Check the status of a scraping job (replace <snapshot-id> with your actual snapshot ID):
bash -c 'curl -s "https://api.brightdata.com/datasets/v3/progress/<snapshot-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"'
Response:
{
"snapshot_id": "s_m4x7enmven8djfqak",
"dataset_id": "gd_xxxxx",
"status": "running"
}
Status values: running, ready, failed
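Putting the progress check into a loop gives a simple wait-until-ready pattern. In this sketch, `fetch_progress` stands in for the real curl call above and is stubbed with the documented sample statuses so the loop runs offline:

```shell
# Sketch: poll progress until the job leaves "running".
# fetch_progress is a stand-in for the real /datasets/v3/progress call;
# the stub returns "running" twice and then "ready".
fetch_progress() {
  # Real call (replace <snapshot-id>):
  # curl -s "https://api.brightdata.com/datasets/v3/progress/<snapshot-id>" \
  #   -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"
  if [ "$1" -lt 3 ]; then
    echo '{"status": "running"}'
  else
    echo '{"status": "ready"}'
  fi
}

attempt=0
status="running"
while [ "$status" = "running" ]; do
  attempt=$((attempt + 1))
  status=$(fetch_progress "$attempt" | jq -r '.status')
  # sleep 5   # back off between polls against the real API
done
echo "final status: $status"
```

Against the real API, also handle the `failed` status instead of looping only on `running`.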
4. Download Results
Once status is ready, download the collected data (replace <snapshot-id> with your actual snapshot ID):
bash -c 'curl -s "https://api.brightdata.com/datasets/v3/snapshot/<snapshot-id>?format=json" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"'
5. List Snapshots
Get all your snapshots:
bash -c 'curl -s "https://api.brightdata.com/datasets/v3/snapshots" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"' | jq '.[] | {snapshot_id, dataset_id, status}'
6. Cancel Snapshot
Cancel a running job (replace <snapshot-id> with your actual snapshot ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/cancel?snapshot_id=<snapshot-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"'
Platform-Specific Examples
Twitter/X – Scrape Profile
Write to /tmp/brightdata_request.json:
[
{"url": "https://twitter.com/elonmusk"}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
Returns: x_id, profile_name, biography, is_verified, followers, following, profile_image_link
Twitter/X – Scrape Posts
Write to /tmp/brightdata_request.json:
[
{"url": "https://twitter.com/username/status/123456789"}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
Returns: post_id, text, replies, likes, retweets, views, hashtags, media
Reddit – Scrape Subreddit Posts
Write to /tmp/brightdata_request.json:
[
{"url": "https://www.reddit.com/r/technology", "sort_by": "hot"}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
Parameters: url, sort_by (new/top/hot)
Returns: post_id, title, description, num_comments, upvotes, date_posted, community
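Downloaded records can be post-processed with `jq`. A sketch assuming the field names listed above; the two records are invented stand-ins, not real API output:

```shell
# Sort downloaded Reddit posts by upvotes, keeping only title and upvotes.
# The sample records are made up, using the documented field names.
posts='[{"post_id":"a1","title":"First","upvotes":10},{"post_id":"b2","title":"Second","upvotes":250}]'
echo "$posts" | jq 'sort_by(-.upvotes) | .[] | {title, upvotes}'
```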
Reddit – Scrape Comments
Write to /tmp/brightdata_request.json:
[
{"url": "https://www.reddit.com/r/technology/comments/xxxxx/post_title"}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
Returns: comment_id, user_posted, comment_text, upvotes, replies
YouTube – Scrape Video Info
Write to /tmp/brightdata_request.json:
[
{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
Returns: title, views, likes, num_comments, video_length, transcript, channel_name
YouTube – Search by Keyword
Write to /tmp/brightdata_request.json:
[
{"keyword": "artificial intelligence", "num_of_posts": 50}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
YouTube – Scrape Comments
Write to /tmp/brightdata_request.json:
[
{"url": "https://www.youtube.com/watch?v=xxxxx", "load_replies": 3}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
Returns: comment_text, likes, replies, username, date
Instagram – Scrape Profile
Write to /tmp/brightdata_request.json:
[
{"url": "https://www.instagram.com/username"}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
Returns: followers, post_count, profile_name, is_verified, biography
Instagram – Scrape Posts
Write to /tmp/brightdata_request.json:
[
{
"url": "https://www.instagram.com/username",
"num_of_posts": 20,
"start_date": "01-01-2024",
"end_date": "12-31-2024"
}
]
Then run (replace <dataset-id> with your actual dataset ID):
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json'
Account Management
Check Account Status
bash -c 'curl -s "https://api.brightdata.com/status" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"'
Response:
{
"status": "active",
"customer": "hl_xxxxxxxx",
"can_make_requests": true,
"ip": "x.x.x.x"
}
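The `can_make_requests` flag is a useful pre-flight check before triggering scrapes. A sketch using the sample payload above in place of a live response:

```shell
# Gate scraping on account status; status_json is the documented sample payload.
status_json='{"status":"active","customer":"hl_xxxxxxxx","can_make_requests":true,"ip":"x.x.x.x"}'
if [ "$(echo "$status_json" | jq -r '.can_make_requests')" = "true" ]; then
  echo "account can make requests"
else
  echo "account cannot make requests" >&2
fi
```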
Get Active Zones
bash -c 'curl -s "https://api.brightdata.com/zone/get_active_zones" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"' | jq '.[] | {name, type}'
Get Bandwidth Usage
bash -c 'curl -s "https://api.brightdata.com/customer/bw" \
-H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"'
Getting Dataset IDs
To use the scraping features, you need a dataset_id:
- Go to Bright Data Control Panel
- Create a new Web Scraper dataset or select an existing one
- Choose the platform (Twitter, Reddit, YouTube, etc.)
- Copy the `dataset_id` from the dataset settings
Dataset IDs can also be found in the bandwidth usage API response under the data field keys (e.g., v__ds_api_gd_xxxxx where gd_xxxxx is your dataset ID).
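Extracting those IDs from the key names can be sketched with `jq`'s `capture`. The payload below is an assumed shape based on the note above; the real `/customer/bw` response may differ:

```shell
# Pull dataset IDs out of response keys of the form v__ds_api_gd_xxxxx.
# The payload is an invented example of that shape, not real API output.
bw='{"data": {"v__ds_api_gd_abc123": 1048576, "v__ds_api_gd_def456": 2097152}}'
echo "$bw" | jq -r '.data | keys[] | capture("(?<id>gd_.*)").id'
```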
Common Parameters
| Parameter | Description | Example |
|---|---|---|
| `url` | Target URL to scrape | `https://twitter.com/user` |
| `keyword` | Search keyword | `"artificial intelligence"` |
| `num_of_posts` | Limit number of results | `50` |
| `start_date` | Filter by date (MM-DD-YYYY) | `"01-01-2024"` |
| `end_date` | Filter by date (MM-DD-YYYY) | `"12-31-2024"` |
| `sort_by` | Sort order (Reddit) | `new`, `top`, `hot` |
| `format` | Response format | `json`, `csv` |
Rate Limits
- Batch mode: up to 100 concurrent requests
- Maximum input size: 1GB per batch
- Exceeding limits returns a `429` error
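To stay under the 100-request batch limit, a large URL list can be pre-split into request files before triggering. This sketch generates 250 invented URLs and chunks them with `split`:

```shell
# Split a large list of target URLs into request bodies of at most 100 entries
# each, matching the batch limit. The URLs are invented placeholders.
rm -f /tmp/url_chunk_*
seq 1 250 | sed 's|.*|https://twitter.com/user&|' > /tmp/urls.txt
split -l 100 /tmp/urls.txt /tmp/url_chunk_
for chunk in /tmp/url_chunk_*; do
  # Wrap each chunk as the [{"url": ...}, ...] array the trigger endpoint expects.
  jq -R '{url: .}' "$chunk" | jq -s '.' > "${chunk}.json"
done
ls /tmp/url_chunk_*.json
```

Each `.json` file can then be posted to `/trigger` as a separate batch.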
Guidelines
- Create datasets first: Use the Control Panel to create scraper datasets
- Use async for large jobs: Use `/trigger` for discovery and batch operations
- Use sync for small jobs: Use `/scrape` for single-URL quick lookups
- Check status before download: Poll `/progress` until status is `ready`
- Respect rate limits: Don't exceed 100 concurrent requests
- Date format: Use MM-DD-YYYY for date parameters