llms-txt-crawler

Total installs: 8
Weekly installs: 8
Site-wide rank: #33882

Install command:

```bash
npx skills add https://github.com/agykit/agykit --skill llms-txt-crawler
```

Install distribution by agent:

- opencode: 6
- claude-code: 5
- openclaw: 5
- trae: 4
- kiro-cli: 4
- cursor: 4

Skill Documentation
llms.txt Crawler Skill
This skill enables you to fetch llms.txt files from websites and crawl all pages listed within them. The llms.txt format is a standard way for websites to provide LLM-friendly content listings.
Overview
The llms.txt file typically follows this format:

```markdown
# Site Name

## Section Name

- [Page Title](https://example.com/page.md): Description of the page
- [Another Page](https://example.com/another.md): Another description
```
This skill parses these files and downloads all linked content.
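As a sketch of how such a file might be parsed (the function name and the returned object shape are illustrative, not the skill's actual API):

```javascript
// Parse llms.txt content into a list of { title, url, description } entries.
// Illustrative sketch only, not the skill's actual implementation.
function parseLlmsTxt(text) {
  const entries = [];
  // Matches lines like: - [Page Title](https://example.com/page.md): Description
  const linkPattern = /^-\s*\[([^\]]+)\]\(([^)]+)\)(?::\s*(.*))?$/;
  for (const line of text.split("\n")) {
    const match = line.trim().match(linkPattern);
    if (match) {
      entries.push({
        title: match[1],
        url: match[2],
        description: match[3] ?? "",
      });
    }
  }
  return entries;
}
```

Heading lines (`#`, `##`) carry no links, so a line-by-line link match is enough to recover every crawlable page.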
Usage
Basic Usage
Run the crawl script with a target URL:

```bash
cd /path/to/skills/llms-txt-crawler/scripts
npm install   # First time only
node crawl.js --url https://example.com
```
Command Line Options
| Option | Short | Description | Default |
|---|---|---|---|
| `--url` | `-u` | Base URL of the site with llms.txt | Required |
| `--output` | `-o` | Output directory for crawled files | `./output` |
| `--format` | `-f` | Output format: `md`, `json`, or `txt` | `md` |
| `--delay` | `-d` | Delay between requests in milliseconds | `500` |
| `--concurrent` | `-c` | Maximum concurrent requests | `3` |
Examples
Crawl agentskills.io documentation:

```bash
node crawl.js --url https://agentskills.io --output ./agentskills-docs
```

Crawl with custom rate limiting:

```bash
node crawl.js --url https://example.com --delay 1000 --concurrent 2
```

Output as JSON:

```bash
node crawl.js --url https://example.com --format json
```
Output Structure
The script creates the following output structure:
```
output/
├── llms.txt     # Original llms.txt file
├── index.json   # Metadata about all crawled pages
└── pages/
    ├── page-1.md
    ├── page-2.md
    └── ...
```
Error Handling
- **Network errors**: Retries up to 3 times with exponential backoff
- **Rate limiting**: Respects delay settings between requests
- **Missing pages**: Logs warnings but continues crawling other pages
- **Invalid URLs**: Skips and logs invalid URLs
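The retry behavior described above can be sketched as follows. The helper name and base delay are illustrative; only the "up to 3 retries with exponential backoff" behavior comes from this document:

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry `fn` up to `maxRetries` times, doubling the wait after each failure.
// Illustrative sketch of "retries up to 3 times with exponential backoff".
async function withRetry(fn, maxRetries = 3, baseDelayMs = 500) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        await sleep(baseDelayMs * 2 ** attempt); // e.g. 500ms, 1s, 2s
      }
    }
  }
  throw lastError; // all attempts exhausted
}
```

Doubling the wait between attempts gives a struggling server progressively more breathing room instead of hammering it at a fixed rate.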
Integration Tips
When using this skill in an agent workflow:
- First run the crawler to download content
- The `index.json` file contains metadata about all pages
- Use the downloaded markdown files for context or analysis
See Also
- llms.txt Specification
- scripts/crawl.js – The main crawler script