web-scraper

📁 ivanvza/dspy-skills 📅 10 days ago

Site-wide rank: #53435

Install command

npx skills add https://github.com/ivanvza/dspy-skills --skill web-scraper

Agent install distribution

replit 1

windsurf 1

opencode 1

cursor 1

claude-code 1

Skill documentation

Web Scraper

A toolkit for extracting content from web pages using Python.

When to Use This Skill

Activate this skill when the user needs to:

Fetch the HTML content of a web page
Extract all links from a page
Get readable text content from HTML
Scrape data from websites
Download and analyze web content

Requirements

This skill requires external packages:

pip install requests beautifulsoup4

Available Scripts

Always run scripts with --help first to see all available options.

Script	Purpose
`fetch_page.py`	Download HTML content from a URL
`extract_links.py`	Extract all links from a page
`extract_text.py`	Extract readable text from HTML

Decision Tree

Task → What do you need?
    │
    ├─ Raw HTML content?
    │   └─ Use: fetch_page.py <url>
    │
    ├─ List of links on a page?
    │   └─ Use: extract_links.py <url>
    │
    └─ Text content (no HTML tags)?
        └─ Use: extract_text.py <url>

Quick Examples

Fetch page HTML:

python scripts/fetch_page.py https://example.com
python scripts/fetch_page.py https://example.com --output page.html

Extract all links:

python scripts/extract_links.py https://example.com
python scripts/extract_links.py https://example.com --absolute --filter "\.pdf$"
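
For readers curious what options like --absolute and --filter might do conceptually, here is a minimal sketch. It is not the source of extract_links.py, which is not shown here. It uses only Python's standard library so it runs without extra installs, whereas the skill itself lists requests and beautifulsoup4 as requirements. The names LinkCollector and extract_links and the parameters base_url and pattern are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html, base_url=None, pattern=None):
    """Return links found in html.

    base_url (hypothetical parameter): if given, resolve relative links
    against it, roughly what an "absolute" option would do.
    pattern (hypothetical parameter): if given, keep only links that
    match this regular expression.
    """
    collector = LinkCollector()
    collector.feed(html)
    links = collector.links
    if base_url is not None:
        links = [urljoin(base_url, link) for link in links]
    if pattern is not None:
        regex = re.compile(pattern)
        links = [link for link in links if regex.search(link)]
    return links


sample = '<a href="/docs/report.pdf">Report</a> <a href="about.html">About</a>'
print(extract_links(sample, base_url="https://example.com/", pattern=r"\.pdf$"))
# prints ['https://example.com/docs/report.pdf']
```

Resolving links before filtering means the pattern is matched against the full URL, so a filter such as "\.pdf$" behaves the same for relative and absolute links.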

Extract text content:

python scripts/extract_text.py https://example.com
python scripts/extract_text.py https://example.com --paragraphs
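
As an illustration of what "readable text" extraction involves, the sketch below drops script and style content and splits the remaining text at block-level tags. It is an assumption about the general technique, not the implementation of extract_text.py, and it uses the standard library's html.parser instead of BeautifulSoup so that it runs without installs. TextCollector and extract_paragraphs are hypothetical names.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gather visible text, starting a new chunk at each block-level tag."""

    SKIP = {"script", "style"}  # content that is never readable text
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.chunks = [[]]
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append([])

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append([])

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks[-1].append(data)


def extract_paragraphs(html):
    """Return a list of non-empty text blocks with whitespace collapsed."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    blocks = (" ".join("".join(chunk).split()) for chunk in parser.chunks)
    return [block for block in blocks if block]


sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Title</h1><p>Hello <b>world</b>.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_paragraphs(sample))
# prints ['Title', 'Hello world.']
```

Joining the returned list with newlines gives a single plain-text string; keeping it as a list is closer to a paragraph-by-paragraph view.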

Best Practices

Respect robots.txt – Check if scraping is allowed
Add delays – Don’t overwhelm servers with rapid requests
Use appropriate User-Agent – Identify your scraper properly
Handle errors gracefully – Websites may block requests or time out
Cache responses – Don’t re-fetch unchanged pages
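
Three of the practices above (checking robots.txt, adding delays, caching) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not code from the skill: USER_AGENT, is_allowed, RateLimiter, and fetch_cached are hypothetical names, and the fetch argument stands in for whatever function actually downloads a page.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-scraper/0.1"  # hypothetical identifier; use your own


def is_allowed(robots_txt, url, user_agent=USER_AGENT):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class RateLimiter:
    """Ensure at least min_interval seconds pass between calls to wait()."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


def fetch_cached(url, fetch, cache):
    """Return the cached body for url, calling fetch(url) only on a miss."""
    if url not in cache:
        cache[url] = fetch(url)
    return cache[url]


robots = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(robots, "https://example.com/index.html"))    # prints True
print(is_allowed(robots, "https://example.com/private/data"))  # prints False
```

In a real run, the robots.txt text would be downloaded once per site, limiter.wait() would be called before each request, and the cache could be a dict or an on-disk store. This in-memory cache never expires entries, so it does not detect pages that have changed.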

Common Issues

403 Forbidden: The site may be blocking scrapers. Try setting the --user-agent flag.
Timeout: The site may be slow. Increase the --timeout value.
Empty content: The page may require JavaScript. These scripts handle static HTML only.
Encoding issues: Use the --encoding flag if text appears garbled.
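
Garbled text usually means the page's bytes were decoded with the wrong character encoding. The sketch below shows one common approach: try an explicitly requested encoding first, then a few fallbacks. How the --encoding flag is handled inside the scripts is not documented here, so this is an assumption about the general technique; decode_html and its parameters are hypothetical.

```python
def decode_html(raw, encoding=None, fallbacks=("utf-8", "cp1252")):
    """Decode raw bytes, preferring an explicit encoding if one is given.

    encoding (hypothetical parameter): the caller's choice, tried first.
    fallbacks: encodings tried in order when the first attempt fails.
    """
    candidates = ([encoding] if encoding else []) + list(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: move on to the next candidate.
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


latin_bytes = "café".encode("latin-1")  # b'caf\xe9', which is not valid UTF-8
print(decode_html(latin_bytes))         # prints café
print(decode_html("café".encode("utf-8"), encoding="utf-8"))  # prints café
```

UTF-8 is tried before cp1252 because UTF-8 decoding fails loudly on most non-UTF-8 input, while single-byte encodings such as cp1252 accept almost any byte sequence and can silently produce the wrong characters.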

Reference Files

See references/selectors.md for CSS selector syntax reference.

Ethical Considerations

Only scrape public data
Respect rate limits and robots.txt
Don’t scrape personal/private information
Check website terms of service
Consider using official APIs when available
