web-scraper

📁 ivanvza/dspy-skills 📅 10 days ago
Install command
npx skills add https://github.com/ivanvza/dspy-skills --skill web-scraper

Skill Documentation

Web Scraper

A toolkit for extracting content from web pages using Python.

When to Use This Skill

Activate this skill when the user needs to:

  • Fetch the HTML content of a web page
  • Extract all links from a page
  • Get readable text content from HTML
  • Scrape data from websites
  • Download and analyze web content

Requirements

This skill requires external packages:

pip install requests beautifulsoup4

Available Scripts

Always run scripts with --help first to see all available options.

Script            Purpose
fetch_page.py     Download HTML content from a URL
extract_links.py  Extract all links from a page
extract_text.py   Extract readable text from HTML

Decision Tree

Task → What do you need?
    │
    ├─ Raw HTML content?
    │   └─ Use: fetch_page.py <url>
    │
    ├─ List of links on a page?
    │   └─ Use: extract_links.py <url>
    │
    └─ Text content (no HTML tags)?
        └─ Use: extract_text.py <url>

Quick Examples

Fetch page HTML:

python scripts/fetch_page.py https://example.com
python scripts/fetch_page.py https://example.com --output page.html

Extract all links:

python scripts/extract_links.py https://example.com
python scripts/extract_links.py https://example.com --absolute --filter "\.pdf$"
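
The internals of extract_links.py are not shown here, so the following is only a minimal sketch of what link extraction with --absolute and --filter could look like. It assumes --absolute resolves relative URLs against the page URL and --filter keeps links matching a regular expression. It uses the standard library's html.parser rather than BeautifulSoup so it runs without extra packages; the names LinkCollector, extract_links, base_url and pattern are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url=None, pattern=None):
    """Return links found in html, in document order.

    base_url (hypothetical): if given, relative links are made absolute.
    pattern (hypothetical): if given, only links matching this regex are kept.
    """
    collector = LinkCollector()
    collector.feed(html)
    links = collector.links
    if base_url is not None:
        links = [urljoin(base_url, link) for link in links]
    if pattern is not None:
        regex = re.compile(pattern)
        links = [link for link in links if regex.search(link)]
    return links
```

Calling extract_links(html, base_url="https://example.com/", pattern=r"\.pdf$") would then roughly mirror the second command above, under the stated assumptions.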

Extract text content:

python scripts/extract_text.py https://example.com
python scripts/extract_text.py https://example.com --paragraphs
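
As a rough illustration of what "readable text" extraction involves, the sketch below drops script and style content and collects the remaining text. It assumes --paragraphs means one output line per <p> element, which the source does not confirm. It uses the standard library's html.parser instead of BeautifulSoup; TextExtractor and extract_text are hypothetical names, not the script's actual API.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript content."""

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.all_text = []       # every visible text fragment
        self.paragraphs = []     # one entry per <p> element
        self._skip_depth = 0
        self._current_p = None   # fragments of the open <p>, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            self._flush_paragraph()  # an unclosed <p> ends where the next starts
            self._current_p = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush_paragraph()

    def handle_data(self, data):
        if self._skip_depth:
            return
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        self.all_text.append(text)
        if self._current_p is not None:
            self._current_p.append(text)

    def _flush_paragraph(self):
        if self._current_p:
            self.paragraphs.append(" ".join(self._current_p))
        self._current_p = None

    def close(self):
        super().close()
        self._flush_paragraph()


def extract_text(html, paragraphs=False):
    """Return text as one string, or one line per <p> if paragraphs is True."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    if paragraphs:
        return "\n".join(extractor.paragraphs)
    return " ".join(extractor.all_text)
```

Fragments separated by inline tags are joined with single spaces, so spacing around punctuation may differ slightly from the rendered page.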

Best Practices

  1. Respect robots.txt – Check if scraping is allowed
  2. Add delays – Don’t overwhelm servers with rapid requests
  3. Use appropriate User-Agent – Identify your scraper properly
  4. Handle errors gracefully – Websites may block requests or time out
  5. Cache responses – Don’t re-fetch unchanged pages
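
The practices above can be sketched in a few lines of Python. This is an illustrative outline, not part of the skill's scripts: is_allowed checks a robots.txt body with the standard library's urllib.robotparser, and PoliteFetcher wraps any fetch function with a minimum delay and an in-memory cache. The USER_AGENT string, both names, and the one-second default delay are assumptions chosen for the example.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identifier; use one that points to real contact details.
USER_AGENT = "example-scraper/0.1"


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class PoliteFetcher:
    """Wrap a fetch(url) callable with a delay between requests and a cache."""

    def __init__(self, fetch, min_delay=1.0):
        self._fetch = fetch
        self._min_delay = min_delay   # seconds between network requests
        self._last_request = None
        self._cache = {}              # url -> body, kept for this process only

    def get(self, url):
        if url in self._cache:
            return self._cache[url]   # no network request, no delay needed
        if self._last_request is not None:
            wait = self._min_delay - (time.monotonic() - self._last_request)
            if wait > 0:
                time.sleep(wait)
        try:
            body = self._fetch(url)
        finally:
            self._last_request = time.monotonic()
        self._cache[url] = body
        return body
```

The cache here never expires and cannot tell whether a page has changed. Detecting unchanged pages would need something like conditional requests, which this sketch leaves out.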

Common Issues

  • 403 Forbidden: The site may be blocking scrapers. Try setting the --user-agent flag.
  • Timeout: The site may be slow. Increase the --timeout value.
  • Empty content: The page may require JavaScript. These scripts handle static HTML only.
  • Encoding issues: Use the --encoding flag if the text appears garbled.
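
To show how the --user-agent, --timeout and --encoding options relate to these failures, here is a hedged sketch of a fetch function with matching parameters. It is not the code of fetch_page.py: it uses urllib from the standard library rather than requests so it has no dependencies, and fetch_html, its defaults, and the RuntimeError wrapping are assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch_html(url, user_agent="example-scraper/0.1", timeout=10.0, encoding=None):
    """Fetch url and return its body as text.

    user_agent: sent as the User-Agent header (some sites reject the default).
    timeout: seconds to wait before giving up on a slow site.
    encoding: overrides the charset declared by the server when text is garbled.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            raw = response.read()
            declared = response.headers.get_content_charset()
    except urllib.error.HTTPError as exc:
        # Covers 403 Forbidden and other non-success status codes.
        raise RuntimeError(f"HTTP {exc.code} for {url}") from exc
    except (urllib.error.URLError, socket.timeout) as exc:
        # Covers DNS failures, refused connections, and timeouts.
        raise RuntimeError(f"Could not fetch {url}: {exc}") from exc
    charset = encoding or declared or "utf-8"
    # errors="replace" keeps going on bad bytes instead of raising.
    return raw.decode(charset, errors="replace")
```

None of these parameters help with the empty-content case: a page rendered by JavaScript returns the same near-empty HTML however it is fetched.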

Reference Files

See references/selectors.md for CSS selector syntax reference.

Ethical Considerations

  • Only scrape public data
  • Respect rate limits and robots.txt
  • Don’t scrape personal/private information
  • Check website terms of service
  • Consider using official APIs when available