web-scraper
Install command
npx skills add https://github.com/ivanvza/dspy-skills --skill web-scraper
Web Scraper
A toolkit for extracting content from web pages using Python.
When to Use This Skill
Activate this skill when the user needs to:
- Fetch the HTML content of a web page
- Extract all links from a page
- Get readable text content from HTML
- Scrape data from websites
- Download and analyze web content
Requirements
This skill requires external packages:
pip install requests beautifulsoup4
Available Scripts
Always run scripts with --help first to see all available options.
| Script | Purpose |
|---|---|
| fetch_page.py | Download HTML content from a URL |
| extract_links.py | Extract all links from a page |
| extract_text.py | Extract readable text from HTML |
Decision Tree
Task → What do you need?
│
├─ Raw HTML content?
│  └─ Use: fetch_page.py <url>
│
├─ List of links on a page?
│  └─ Use: extract_links.py <url>
│
└─ Text content (no HTML tags)?
   └─ Use: extract_text.py <url>
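The decision tree can also be written as a small lookup, which may help when a script is chosen programmatically. This is only a sketch: the `SCRIPTS` keys and the `build_command` helper are hypothetical names, and the `scripts/` paths are assumed from the example commands in this document.

```python
# Hypothetical mapping from "what do you need?" to the script that provides it.
SCRIPTS = {
    "html": "scripts/fetch_page.py",      # raw HTML content
    "links": "scripts/extract_links.py",  # list of links on a page
    "text": "scripts/extract_text.py",    # text content without HTML tags
}


def build_command(need, url, *extra_args):
    """Return the argument list for the script that matches `need`."""
    try:
        script = SCRIPTS[need]
    except KeyError:
        options = ", ".join(sorted(SCRIPTS))
        raise ValueError(f"unknown need {need!r}; expected one of: {options}") from None
    return ["python", script, url, *extra_args]


if __name__ == "__main__":
    print(build_command("links", "https://example.com", "--absolute"))
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting problems with URLs and regex filters.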
Quick Examples
Fetch page HTML:
python scripts/fetch_page.py https://example.com
python scripts/fetch_page.py https://example.com --output page.html
Extract all links:
python scripts/extract_links.py https://example.com
python scripts/extract_links.py https://example.com --absolute --filter "\.pdf$"
Extract text content:
python scripts/extract_text.py https://example.com
python scripts/extract_text.py https://example.com --paragraphs
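For readers who want to see roughly what link and text extraction involve, here is a minimal sketch using BeautifulSoup on an HTML string. It is not the source of the bundled scripts: `extract_links`, `extract_paragraphs`, and `SAMPLE_HTML` are illustrative names, and the `base_url` and `pattern` parameters only approximate what `--absolute` and `--filter` appear to do.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url=None, pattern=None):
    """Return href values of <a> tags, optionally made absolute and regex-filtered."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if base_url is not None:
            # Resolve relative links such as "/docs/report.pdf" against the page URL.
            href = urljoin(base_url, href)
        if pattern is None or re.search(pattern, href):
            links.append(href)
    return links


def extract_paragraphs(html):
    """Return the whitespace-normalized text of each non-empty <p> element."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for p in soup.find_all("p"):
        text = " ".join(p.get_text().split())
        if text:
            paragraphs.append(text)
    return paragraphs


# Small inline page so the sketch runs without network access.
SAMPLE_HTML = """
<html><body>
<p>First <b>paragraph</b>.</p>
<p>Second paragraph.</p>
<a href="/docs/report.pdf">Report</a>
<a href="https://example.com/about">About</a>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_HTML, base_url="https://example.com", pattern=r"\.pdf$"))
    print(extract_paragraphs(SAMPLE_HTML))
```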
Best Practices
- Respect robots.txt – Check if scraping is allowed
- Add delays – Don’t overwhelm servers with rapid requests
- Use appropriate User-Agent – Identify your scraper properly
- Handle errors gracefully – Websites may block or timeout
- Cache responses – Don’t re-fetch unchanged pages
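Several of these practices can be combined in a few lines of standard-library Python. The sketch below checks a robots.txt body with `urllib.robotparser`, spaces out requests, and caches responses in memory. `USER_AGENT`, `is_allowed`, and `PoliteFetcher` are hypothetical names, and the `fetch` callable is injected so the example runs without network access; the bundled scripts may handle these concerns differently or not at all.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identifier; a real scraper should name itself and give a contact.
USER_AGENT = "example-scraper/0.1 (+https://example.com/contact)"

SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class PoliteFetcher:
    """Wrap a fetch callable with a minimum delay between calls and an in-memory cache."""

    def __init__(self, fetch, min_delay=1.0):
        self._fetch = fetch          # callable taking a URL and returning the body
        self._min_delay = min_delay  # seconds to leave between real requests
        self._last_request = None
        self._cache = {}

    def get(self, url):
        if url in self._cache:
            # Cached pages cost nothing, so no delay is needed.
            return self._cache[url]
        if self._last_request is not None:
            wait = self._min_delay - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
        body = self._fetch(url)
        self._last_request = time.monotonic()
        self._cache[url] = body
        return body


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/data"))
    print(is_allowed(SAMPLE_ROBOTS, "https://example.com/index.html"))
```

An in-memory dictionary only avoids repeat fetches within one run; persisting the cache or honoring HTTP caching headers would be a separate design choice.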
Common Issues
- 403 Forbidden: Site may be blocking scrapers. Try with the --user-agent flag.
- Timeout: Site may be slow. Increase the --timeout value.
- Empty content: Page may require JavaScript. These scripts handle static HTML only.
- Encoding issues: Use the --encoding flag if text appears garbled.
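The same issues show up when calling `requests` directly. The sketch below shows one way to set a User-Agent, apply a timeout, override the encoding, and turn failures into readable messages. `fetch_html` and `DEFAULT_USER_AGENT` are hypothetical names, not part of the bundled scripts, and the optional `session` parameter exists only so the function can be exercised without a live site.

```python
import requests

DEFAULT_USER_AGENT = "example-scraper/0.1"  # hypothetical identifier


def fetch_html(url, user_agent=DEFAULT_USER_AGENT, timeout=10.0, encoding=None, session=None):
    """Return (html, error). Exactly one of the two is None."""
    http = session if session is not None else requests.Session()
    try:
        response = http.get(url, headers={"User-Agent": user_agent}, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx/5xx such as 403
    except requests.exceptions.Timeout:
        return None, f"timed out after {timeout}s; a larger timeout may help"
    except requests.exceptions.HTTPError as exc:
        # A 4xx/5xx Response is falsy, so compare against None explicitly.
        status = exc.response.status_code if exc.response is not None else "unknown"
        return None, f"HTTP {status}"
    except requests.exceptions.RequestException as exc:
        return None, f"request failed: {exc}"
    if encoding:
        # Override the detected charset when the text would otherwise be garbled.
        response.encoding = encoding
    return response.text, None


if __name__ == "__main__":
    html, error = fetch_html("https://example.com", timeout=5.0)
    print(error if error else f"fetched {len(html)} characters")
```

Returning an error string instead of raising keeps a batch run going when one site blocks or stalls; callers that prefer exceptions can simply let them propagate.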
Reference Files
See references/selectors.md for CSS selector syntax reference.
Ethical Considerations
- Only scrape public data
- Respect rate limits and robots.txt
- Don’t scrape personal/private information
- Check website terms of service
- Consider using official APIs when available