html-parser-rule
npx skills add https://github.com/daqi/daqi-skills --skill html-parser-rule
HTML Parser Rule Writer Skill
Trigger Keywords
- “write html parser rule”
- “create html parser”
- “parse html source”
- “add html parser”
- “编写html解析规则”
- “创建html解析器”
- “添加html解析规则”
Skill Description
This skill helps you write and test HTML parsing rules for the article-flow project. It follows a systematic approach:
- Fetch the HTML content and save locally
- Write regex patterns step by step
- Test each regex individually
- Complete the full parsing rule
- Run final integration test
Instructions
Step 1: Fetch HTML Content
When the user provides a URL to parse, first fetch and save the HTML:
# Fetch HTML and save to temporary file
curl -L -H "User-Agent: Mozilla/5.0" "{URL}" > /tmp/source.html
# Show line count and first 100 lines to understand structure
wc -l /tmp/source.html
head -100 /tmp/source.html
Ask the user to confirm that the HTML looks correct and to identify the article listing structure.
Step 2: Identify Article Pattern
Examine the HTML structure and identify:
- Article container element (div, article, li, etc.)
- Title element and its pattern
- Link element and its pattern
- Date element and its pattern (if available)
- Description element and its pattern (if available)
Show the user example HTML snippets and ask for confirmation.
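One way to get a first impression of the listing structure is to count how often candidate container tags occur; a tag that appears roughly once per article is a likely container. The sketch below is only an illustration: `countOpeningTags` and the sample HTML are hypothetical and not part of article-flow.

```typescript
// Hypothetical helper: count opening tags for a set of candidate containers.
function countOpeningTags(html: string, tags: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const tag of tags) {
    // Require whitespace, ">" or "/" after the tag name so "<a" does not match "<article".
    const pattern = new RegExp(`<${tag}(?=[\\s>/])`, "gi");
    counts[tag] = (html.match(pattern) ?? []).length;
  }
  return counts;
}

// Made-up sample standing in for /tmp/source.html.
const sampleHtml = `
<article><h2><a href="/a">First</a></h2></article>
<article><h2><a href="/b">Second</a></h2></article>
<div class="footer"><a href="/about">About</a></div>`;

const tagCounts = countOpeningTags(sampleHtml, ["article", "li", "div", "h2", "a"]);
console.log(tagCounts);
```

Here `article` and `h2` occur equally often while `a` occurs more often, which suggests anchoring the title and link patterns on the `h2` rather than on bare `a` tags.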
Step 3: Write Title Regex
Create and test the title extraction regex:
// Example title regex
const titleRegex = /<h2[^>]*><a[^>]*>([^<]+)<\/a><\/h2>/g;
Test it immediately:
# Test title regex in Node.js
node -e "
const fs = require('fs');
const html = fs.readFileSync('/tmp/source.html', 'utf-8');
const titleRegex = /<h2[^>]*><a[^>]*>([^<]+)<\/a><\/h2>/g;
const matches = [...html.matchAll(titleRegex)];
console.log('Found', matches.length, 'titles:');
matches.slice(0, 5).forEach((m, i) => console.log(i+1, m[1]));
"
Wait for user confirmation before proceeding.
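If the test finds fewer titles than expected, one possible cause is markup nested inside the link (for example `<em>`), which `[^<]+` cannot cross. The following sketch shows one way to handle that case, assuming the title sits in an `<h2><a>` pair; `nestedTitleRegex`, `extractTitles` and the sample HTML are hypothetical names used for illustration.

```typescript
// Hypothetical variant: capture everything inside the link, then strip inner tags.
const nestedTitleRegex = /<h2[^>]*>\s*<a[^>]*>([\s\S]*?)<\/a>\s*<\/h2>/g;

function extractTitles(html: string): string[] {
  return [...html.matchAll(nestedTitleRegex)].map((m) =>
    (m[1] ?? "")
      .replace(/<[^>]+>/g, "") // drop nested tags such as <em>
      .replace(/\s+/g, " ")    // collapse line breaks and indentation
      .trim()
  );
}

// Made-up sample with one plain and one nested title.
const titleSample = `
<h2 class="post"><a href="/a">Plain title</a></h2>
<h2 class="post">
  <a href="/b"><em>New</em> release notes</a>
</h2>`;

const extractedTitles = extractTitles(titleSample);
console.log(extractedTitles);
```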
Step 4: Write Link Regex
Create and test the link extraction regex:
// Example link regex
const linkRegex = /<a[^>]*href="([^"]+)"[^>]*>/g;
Test it:
node -e "
const fs = require('fs');
const html = fs.readFileSync('/tmp/source.html', 'utf-8');
const linkRegex = /<a[^>]*href=\"([^\"]+)\"[^>]*>/g;
const matches = [...html.matchAll(linkRegex)];
console.log('Found', matches.length, 'links:');
matches.slice(0, 5).forEach((m, i) => console.log(i+1, m[1]));
"
Wait for user confirmation before proceeding.
Step 5: Write Date Regex (if applicable)
If the source has date information, create and test:
// Example date regex
const dateRegex = /<time[^>]*datetime="([^"]+)"[^>]*>/g;
Test it:
node -e "
const fs = require('fs');
const html = fs.readFileSync('/tmp/source.html', 'utf-8');
const dateRegex = /<time[^>]*datetime=\"([^\"]+)\"[^>]*>/g;
const matches = [...html.matchAll(dateRegex)];
console.log('Found', matches.length, 'dates:');
matches.slice(0, 5).forEach((m, i) => console.log(i+1, m[1]));
"
Step 6: Write Description Regex (if applicable)
If the source has description/summary, create and test:
// Example description regex
const descRegex = /<p class="excerpt">([^<]+)<\/p>/g;
Test it similarly.
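A test for the description regex can follow the same shape as the earlier steps. The sketch below runs against an inline, made-up sample instead of `/tmp/source.html` so that it is self-contained; the `excerpt` class comes from the example regex and may not match the real source.

```typescript
const descRegex = /<p class="excerpt">([^<]+)<\/p>/g;

// Made-up sample; in practice, read the saved HTML file instead.
const descSample = `
<p class="excerpt">First summary.</p>
<p class="intro">Not an excerpt.</p>
<p class="excerpt">Second summary.</p>`;

const descriptions = [...descSample.matchAll(descRegex)].map((m) => (m[1] ?? "").trim());
console.log("Found", descriptions.length, "descriptions:");
descriptions.slice(0, 5).forEach((d, i) => console.log(i + 1, d));
```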
Step 7: Create Complete Parser
Now create the complete parser function in the html-parser.ts file:
/**
 * Parse {SOURCE_NAME}
 */
private parseSourceName(html: string, source: DataSource): ContentItem[] {
  const items: ContentItem[] = [];
  // Your regex patterns
  const titleRegex = /pattern/g;
  const linkRegex = /pattern/g;
  const dateRegex = /pattern/g;
  const descRegex = /pattern/g;
  // Extract all matches
  const titles = [...html.matchAll(titleRegex)];
  const links = [...html.matchAll(linkRegex)];
  const dates = [...html.matchAll(dateRegex)];
  const descs = [...html.matchAll(descRegex)];
  // Combine into items by position (assumes the nth title, link, date and description belong together)
  const maxLength = Math.max(titles.length, links.length);
  for (let i = 0; i < maxLength; i++) {
    const title = titles[i]?.[1];
    const link = links[i]?.[1];
    const date = dates[i]?.[1];
    const desc = descs[i]?.[1];
    // toISOString() throws on an invalid Date, so validate before converting
    const parsedDate = date ? new Date(date) : undefined;
    const isoDate = parsedDate && !isNaN(parsedDate.getTime()) ? parsedDate.toISOString() : undefined;
    if (title && link) {
      items.push({
        title: this.decodeHtml(title.trim()),
        link: this.resolveUrl(link, source.url),
        source: source.name,
        sourceUrl: source.url,
        isoDate,
        contentSnippet: desc ? this.decodeHtml(desc.trim()) : undefined
      });
    }
  }
  return items;
}
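Pairing matches by index assumes every article carries every field; if one article lacks a date or description, later fields shift onto the wrong item. A possible alternative, sketched below, first isolates each article block and then extracts the fields inside it. This is an illustration under stated assumptions, not the project's method: `ParsedItem` is a simplified stand-in for `ContentItem`, and `parseByBlock`, the `<article>` container and the sample HTML are hypothetical.

```typescript
// Simplified stand-in for the project's ContentItem type (assumption).
interface ParsedItem {
  title: string;
  link: string;
  date: string | undefined;
  description: string | undefined;
}

// Extract fields per <article> block so that a missing date or excerpt
// cannot shift the fields of later articles.
function parseByBlock(html: string): ParsedItem[] {
  const items: ParsedItem[] = [];
  const blockRegex = /<article[^>]*>([\s\S]*?)<\/article>/g;
  for (const block of html.matchAll(blockRegex)) {
    const body = block[1] ?? "";
    const heading = body.match(/<h2[^>]*><a[^>]*href="([^"]+)"[^>]*>([^<]+)<\/a><\/h2>/);
    if (!heading) continue; // skip blocks without a title link
    const date = body.match(/<time[^>]*datetime="([^"]+)"/)?.[1];
    const description = body.match(/<p class="excerpt">([^<]+)<\/p>/)?.[1]?.trim();
    items.push({
      link: heading[1] ?? "",
      title: (heading[2] ?? "").trim(),
      date,
      description,
    });
  }
  return items;
}

// Made-up sample: the first article has no date, the second has no excerpt.
const listing = `
<article><h2><a href="/a">First</a></h2><p class="excerpt">Intro A</p></article>
<article><h2><a href="/b">Second</a></h2><time datetime="2024-05-01">May 1</time></article>`;

const parsed = parseByBlock(listing);
console.log(parsed);
```

With index-based pairing the only date in this sample would be attached to the first article; per-block extraction keeps it with the second.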
Step 8: Register Parser
Add the parser to the parse() method’s switch statement:
case "source-name":
return this.parseSourceName(html, source);
Step 9: Update Domain Config
Add the source to the domain configuration:
{
  name: "Source Name",
  url: "https://...",
  type: "html",
  parser: "source-name",
  tags: ["tag1", "tag2"]
}
Step 10: Integration Test
Run the full collection to test:
cd packages/ai-digest
pnpm run collect
Check the report to verify:
- Source status is “success”
- Items found count is reasonable (> 0)
- No errors in the log
- Generated articles contain content from this source
Step 11: Final Verification
Examine the output files:
# Check if articles from this source appear in output
grep "Source Name" outputs/$(date +%Y-%m-%d).en.md
If all tests pass, the parser rule is complete!
Important Notes
- Always test each regex individually before combining
- Use decodeHtml() for title and description to handle HTML entities
- Use resolveUrl() for links to handle relative URLs
- Handle missing fields gracefully with optional chaining (?.)
- Verify the parser name matches in both html-parser.ts and domain config
- Check the report after running to see actual results
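Handling missing fields gracefully also applies to dates: `Date.prototype.toISOString()` throws a `RangeError` on an invalid date, so a scraped date string is worth validating before conversion. A minimal sketch, assuming the raw value comes from a regex capture group; `toIsoDate` is a hypothetical helper, not an article-flow function.

```typescript
// Hypothetical helper: return an ISO 8601 string, or undefined if the value cannot be parsed.
function toIsoDate(raw: string | undefined): string | undefined {
  if (!raw) return undefined;
  const timestamp = Date.parse(raw.trim());
  return Number.isNaN(timestamp) ? undefined : new Date(timestamp).toISOString();
}

const validDate = toIsoDate("2024-05-01T08:30:00Z");
const invalidDate = toIsoDate("yesterday"); // not a parseable date string
const missingDate = toIsoDate(undefined);   // e.g. dates[i]?.[1] when no match exists
console.log(validDate, invalidDate, missingDate);
```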
Common Patterns
Blog Post Listing
// Often structured as:
// <article>
//   <h2><a href="...">Title</a></h2>
//   <time datetime="...">Date</time>
//   <p>Description</p>
// </article>
News Site
// Often structured as:
// <div class="article">
//   <a href="...">
//     <h3>Title</h3>
//   </a>
//   <span class="date">Date</span>
//   <div class="summary">Description</div>
// </div>
Tech Site (like VentureBeat)
// Often uses JSON-LD structured data:
const scriptRegex = /<script type="application\/ld\+json">(.*?)<\/script>/gs;
// Then parse the JSON
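The "then parse the JSON" step might look like the sketch below. It is an illustration only: `extractJsonLd` is a hypothetical helper, the sample page is made up, and the `headline` and `url` fields follow common schema.org usage rather than any confirmed VentureBeat markup.

```typescript
const scriptRegex = /<script type="application\/ld\+json">(.*?)<\/script>/gs;

// Collect every JSON-LD object on the page, skipping blocks that are not valid JSON.
function extractJsonLd(html: string): unknown[] {
  const results: unknown[] = [];
  for (const match of html.matchAll(scriptRegex)) {
    try {
      const data: unknown = JSON.parse(match[1] ?? "");
      // A block may hold a single object or an array of objects.
      if (Array.isArray(data)) results.push(...data);
      else results.push(data);
    } catch {
      // Ignore malformed JSON-LD rather than failing the whole parse.
    }
  }
  return results;
}

// Made-up sample: one valid block and one broken block.
const jsonLdPage = `
<script type="application/ld+json">{"@type":"NewsArticle","headline":"Example headline","url":"https://example.com/a"}</script>
<script type="application/ld+json">{ broken json </script>`;

const jsonLdEntries = extractJsonLd(jsonLdPage) as Array<{ headline?: string; url?: string }>;
console.log(jsonLdEntries.map((entry) => entry.headline));
```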
Troubleshooting
No matches found
- Check if HTML structure has changed
- Verify regex escaping
- Try broader patterns first, then narrow down
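To see where a pattern stops matching, it can help to count matches for a series of patterns ordered from broad to narrow. The sketch below assumes a made-up sample and hypothetical patterns; `countMatches` is not part of the project.

```typescript
// Patterns ordered from broad to narrow (all hypothetical examples).
const candidatePatterns: Array<[string, RegExp]> = [
  ["any link", /<a[^>]*>/g],
  ["link inside h2", /<h2[^>]*>\s*<a[^>]*>/g],
  ["link inside h2.post-title", /<h2 class="post-title"[^>]*>\s*<a[^>]*>/g],
];

function countMatches(html: string, patterns: Array<[string, RegExp]>): Array<[string, number]> {
  return patterns.map(
    ([label, regex]): [string, number] => [label, [...html.matchAll(regex)].length]
  );
}

// Made-up sample in which the class name differs from the narrowest pattern.
const debugSample = `
<h2 class="entry-title"><a href="/a">First</a></h2>
<h2 class="entry-title"><a href="/b">Second</a></h2>
<a href="/about">About</a>`;

const matchReport = countMatches(debugSample, candidatePatterns);
matchReport.forEach(([label, count]) => console.log(label, count));
```

The count drops to zero only at the narrowest pattern, which points at the class name as the part that no longer matches the page.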
Wrong content extracted
- Check capture groups in regex
- Verify you’re using the right index (m[1], not m[0])
- Test with html.match() to see full matches
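The difference between the full match and the capture group can be checked on a single snippet. A minimal sketch with a made-up sample:

```typescript
const matchSample = `<h2 class="post"><a href="/a">First title</a></h2>`;
const singleTitleRegex = /<h2[^>]*><a[^>]*>([^<]+)<\/a><\/h2>/;

const titleMatch = matchSample.match(singleTitleRegex);
const fullMatch = titleMatch?.[0]; // the entire matched markup, tags included
const captured = titleMatch?.[1];  // only the text inside the first capture group

console.log("m[0]:", fullMatch);
console.log("m[1]:", captured);
```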
Relative URLs not resolved
- Ensure you’re using this.resolveUrl()
- Pass the source.url as base URL
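For reference, resolution against a base URL can be sketched with the standard `URL` constructor. The function below is a hypothetical stand-in for `this.resolveUrl()`; the project's actual implementation is not shown in this document and may differ.

```typescript
// Hypothetical stand-in for this.resolveUrl(); the project's version may differ.
function resolveUrl(link: string, baseUrl: string): string {
  try {
    return new URL(link, baseUrl).href;
  } catch {
    return link; // leave unparseable links untouched
  }
}

const baseUrl = "https://example.com/blog/";
const resolvedLinks = [
  resolveUrl("/posts/1", baseUrl),                // root-relative
  resolveUrl("posts/2", baseUrl),                 // relative to the base path
  resolveUrl("https://other.example/x", baseUrl), // already absolute
];
console.log(resolvedLinks);
```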
HTML entities not decoded
- Ensure you’re using this.decodeHtml()
- Check for &amp;, &quot;, &lt;, &gt;, etc.
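To illustrate what entity decoding involves, here is a minimal sketch that handles a few named entities plus numeric references. It is a hypothetical stand-in for `this.decodeHtml()`, whose real implementation is not shown here, and it covers far fewer entities than a full HTML decoder.

```typescript
// Hypothetical stand-in for this.decodeHtml(); covers only a few common entities.
const NAMED_ENTITIES: Record<string, string> = {
  amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: "\u00a0",
};

function decodeHtml(text: string): string {
  return text.replace(/&(#x[0-9a-f]+|#\d+|[a-z]+);/gi, (entity: string, body: string) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Keep the original text if the code point is out of range.
      return code <= 0x10ffff ? String.fromCodePoint(code) : entity;
    }
    return NAMED_ENTITIES[body.toLowerCase()] ?? entity; // unknown names stay as-is
  });
}

const decodedText = decodeHtml("A &amp; B &lt;C&gt; &quot;D&quot; &#39;E&#x27;");
const unknownEntity = decodeHtml("&unknown;");
const singlePass = decodeHtml("&amp;lt;"); // decoded once, not twice
console.log(decodedText, unknownEntity, singlePass);
```

Decoding in a single pass matters: replacing `&amp;` first and `&lt;` afterwards in separate steps would turn `&amp;lt;` into `<` instead of the intended `&lt;`.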
Example Session
User: “编写html解析规则 https://venturebeat.com/”