html-parser-rule

📁 daqi/daqi-skills 📅 Jan 28, 2026

Install command

npx skills add https://github.com/daqi/daqi-skills --skill html-parser-rule

HTML Parser Rule Writer Skill

Trigger Keywords

“write html parser rule”
“create html parser”
“parse html source”
“add html parser”
“编写html解析规则” (write an HTML parsing rule)
“创建html解析器” (create an HTML parser)
“添加html解析规则” (add an HTML parsing rule)

Skill Description

This skill helps you write and test HTML parsing rules for the article-flow project. It follows a systematic approach:

Fetch the HTML content and save it locally
Write regex patterns step by step
Test each regex individually
Complete the full parsing rule
Run the final integration test

Instructions

Step 1: Fetch HTML Content

When the user provides a URL to parse, first fetch the HTML and save it:

# Fetch HTML and save to a temporary file
curl -L -H "User-Agent: Mozilla/5.0" "{URL}" > /tmp/source.html

# Show the line count and the first 100 lines to understand the structure
wc -l /tmp/source.html
head -100 /tmp/source.html

Ask the user to confirm that the HTML looks correct and to identify the article listing structure.

Step 2: Identify Article Pattern

Examine the HTML structure and identify:

Article container element (div, article, li, etc.)
Title element and its pattern
Link element and its pattern
Date element and its pattern (if available)
Description element and its pattern (if available)

Show the user example HTML snippets and ask for confirmation.
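
Once the container element is known, one option is to match each container first and extract the fields inside it, so that a title always stays paired with the link from the same block. The sketch below assumes `<article>` containers shaped like the blog-listing pattern; `RawArticle` and `extractArticles` are hypothetical names used for illustration, not part of article-flow.

```typescript
// Hypothetical shape for one extracted entry (not an article-flow type).
interface RawArticle {
  title: string;
  link: string;
  date: string | undefined;
}

// Match each <article> block first, then pull fields out of that block only,
// so a title is always paired with the link from the same container.
function extractArticles(html: string): RawArticle[] {
  const containerRegex = /<article[^>]*>([\s\S]*?)<\/article>/g;
  const titleLinkRegex = /<h2[^>]*>\s*<a[^>]*href="([^"]+)"[^>]*>([^<]+)<\/a>\s*<\/h2>/;
  const dateRegex = /<time[^>]*datetime="([^"]+)"/;

  const results: RawArticle[] = [];
  for (const container of html.matchAll(containerRegex)) {
    const block = container[1] ?? "";
    const titleLink = block.match(titleLinkRegex);
    if (!titleLink) continue; // skip containers without a linked heading
    const date = block.match(dateRegex);
    results.push({
      link: titleLink[1] ?? "",
      title: (titleLink[2] ?? "").trim(),
      date: date?.[1],
    });
  }
  return results;
}

const articleSample = `
<article><h2><a href="/a">First post</a></h2><time datetime="2026-01-02">Jan 2</time></article>
<article><p>Sponsored block with <a href="/ad">an ad link</a></p></article>
<article><h2><a href="/b">Second post</a></h2></article>`;

console.log(extractArticles(articleSample));
```

The sponsored block in the sample has a link but no heading, so it is skipped rather than shifting every later link by one position.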

Step 3: Write Title Regex

Create and test the title extraction regex:

// Example title regex
const titleRegex = /<h2[^>]*><a[^>]*>([^<]+)<\/a><\/h2>/g;

Test it immediately:

# Test title regex in Node.js
node -e "
const fs = require('fs');
const html = fs.readFileSync('/tmp/source.html', 'utf-8');
const titleRegex = /<h2[^>]*><a[^>]*>([^<]+)<\/a><\/h2>/g;
const matches = [...html.matchAll(titleRegex)];
console.log('Found', matches.length, 'titles:');
matches.slice(0, 5).forEach((m, i) => console.log(i+1, m[1]));
"

Wait for user confirmation before proceeding.

Step 4: Write Link Regex

Create and test the link extraction regex:

// Example link regex (matches every <a href>; narrow it to the article markup before use)
const linkRegex = /<a[^>]*href="([^"]+)"[^>]*>/g;

Test it:

node -e "
const fs = require('fs');
const html = fs.readFileSync('/tmp/source.html', 'utf-8');
const linkRegex = /<a[^>]*href=\"([^\"]+)\"[^>]*>/g;
const matches = [...html.matchAll(linkRegex)];
console.log('Found', matches.length, 'links:');
matches.slice(0, 5).forEach((m, i) => console.log(i+1, m[1]));
"

Wait for user confirmation before proceeding.

Step 5: Write Date Regex (if applicable)

If the source has date information, create and test:

// Example date regex
const dateRegex = /<time[^>]*datetime="([^"]+)"[^>]*>/g;

Test it:

node -e "
const fs = require('fs');
const html = fs.readFileSync('/tmp/source.html', 'utf-8');
const dateRegex = /<time[^>]*datetime=\"([^\"]+)\"[^>]*>/g;
const matches = [...html.matchAll(dateRegex)];
console.log('Found', matches.length, 'dates:');
matches.slice(0, 5).forEach((m, i) => console.log(i+1, m[1]));
"

Step 6: Write Description Regex (if applicable)

If the source has description/summary, create and test:

// Example description regex
const descRegex = /<p class="excerpt">([^<]+)<\/p>/g;

Test it similarly.
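
The description test can follow the same shape as the earlier steps. The sketch below runs against an inline sample instead of /tmp/source.html; `previewMatches` is a hypothetical helper name, not something the project provides.

```typescript
// Hypothetical helper: collect capture group 1 for every match of a global regex.
// Note that matchAll() throws a TypeError if the regex lacks the g flag.
function previewMatches(
  html: string,
  regex: RegExp,
  limit = 5
): { count: number; preview: string[] } {
  const matches = [...html.matchAll(regex)];
  return {
    count: matches.length,
    preview: matches.slice(0, limit).map((m) => (m[1] ?? "").trim()),
  };
}

const excerptSample = `
<p class="excerpt"> A short summary. </p>
<p class="meta">By someone</p>
<p class="excerpt">Another summary.</p>`;

const descRegex = /<p class="excerpt">([^<]+)<\/p>/g;
const descResult = previewMatches(excerptSample, descRegex);
console.log("Found", descResult.count, "descriptions:");
descResult.preview.forEach((text, i) => console.log(i + 1, text));
```

A description count that differs from the title count is an early sign that some articles lack an excerpt, which matters when fields are later combined by index.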

Step 7: Create Complete Parser

Now create the complete parser function in the html-parser.ts file:

/**
 * Parse {SOURCE_NAME}
 */
private parseSourceName(html: string, source: DataSource): ContentItem[] {
  const items: ContentItem[] = [];
  
  // Your regex patterns
  const titleRegex = /pattern/g;
  const linkRegex = /pattern/g;
  const dateRegex = /pattern/g;
  const descRegex = /pattern/g;
  
  // Extract all matches
  const titles = [...html.matchAll(titleRegex)];
  const links = [...html.matchAll(linkRegex)];
  const dates = [...html.matchAll(dateRegex)];
  const descs = [...html.matchAll(descRegex)];
  
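  // Combine into items by index. This assumes each pattern matches exactly once
  // per article, in document order; otherwise titles and links will misalign.
  const count = Math.min(titles.length, links.length);
  for (let i = 0; i < count; i++) {
    const title = titles[i]?.[1];
    const link = links[i]?.[1];
    const date = dates[i]?.[1];
    const desc = descs[i]?.[1];
    
    // toISOString() throws a RangeError on an invalid Date, so validate first
    const parsedDate = date ? new Date(date) : undefined;
    const isoDate = parsedDate && !isNaN(parsedDate.getTime()) ? parsedDate.toISOString() : undefined;
    
    if (title && link) {
      items.push({
        title: this.decodeHtml(title.trim()),
        link: this.resolveUrl(link, source.url),
        source: source.name,
        sourceUrl: source.url,
        isoDate,
        contentSnippet: desc ? this.decodeHtml(desc.trim()) : undefined
      });
    }
  }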
  
  return items;
}

Step 8: Register Parser

Add the parser to the parse() method’s switch statement:

case "source-name":
  return this.parseSourceName(html, source);

Step 9: Update Domain Config

Add the source to the domain configuration:

{
  name: "Source Name",
  url: "https://...",
  type: "html",
  parser: "source-name",
  tags: ["tag1", "tag2"]
}
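
A mismatch between the `parser` value in the config and the `case` label in `parse()` is easy to miss. The sketch below shows one way to catch it before running a collection. The `DataSourceConfig` interface mirrors the fields of the entry above and is an assumed shape; `registeredParsers` and `findUnregisteredParsers` are hypothetical names.

```typescript
// Mirrors the fields of the config entry shown above (assumed shape).
interface DataSourceConfig {
  name: string;
  url: string;
  type: string;
  parser: string;
  tags: string[];
}

// Hypothetical list of the case labels handled by parse().
const registeredParsers = new Set<string>(["source-name", "another-source"]);

// Return the HTML sources whose parser name has no matching case label.
function findUnregisteredParsers(sources: DataSourceConfig[]): DataSourceConfig[] {
  return sources.filter(
    (s) => s.type === "html" && !registeredParsers.has(s.parser)
  );
}

const exampleSources: DataSourceConfig[] = [
  { name: "Source Name", url: "https://example.com/blog", type: "html", parser: "source-name", tags: ["tag1"] },
  { name: "Typo Source", url: "https://example.com/news", type: "html", parser: "sourcename", tags: ["tag2"] },
];

for (const s of findUnregisteredParsers(exampleSources)) {
  console.log(`No parser registered for "${s.parser}" (source: ${s.name})`);
}
```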

Step 10: Integration Test

Run the full collection to test:

cd packages/ai-digest
pnpm run collect

Check the report to verify:

Source status is “success”
Items found count is reasonable (> 0)
No errors in the log
Generated articles contain content from this source

Step 11: Final Verification

Examine the output files:

# Check if articles from this source appear in output
grep "Source Name" outputs/$(date +%Y-%m-%d).en.md

If all tests pass, the parser rule is complete!

Important Notes

Always test each regex individually before combining
Use decodeHtml() for title and description to handle HTML entities
Use resolveUrl() for links to handle relative URLs
Handle missing fields gracefully with optional chaining (?.)
Verify the parser name matches in both html-parser.ts and domain config
Check the report after running to see actual results

Common Patterns

Blog Post Listing

// Often structured as:
// <article>
//   <h2><a href="...">Title</a></h2>
//   <time datetime="...">Date</time>
//   <p>Description</p>
// </article>

News Site

// Often structured as:
// <div class="article">
//   <a href="...">
//     <h3>Title</h3>
//   </a>
//   <span class="date">Date</span>
//   <div class="summary">Description</div>
// </div>
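
For this layout the `<a>` wraps the heading, so a single pattern can capture the link and the title together. A minimal sketch against made-up sample markup; the pattern would need adjusting to the real page.

```typescript
const newsSample = `
<div class="article">
  <a href="/news/one">
    <h3>First headline</h3>
  </a>
  <span class="date">Jan 28, 2026</span>
  <div class="summary">Summary one</div>
</div>
<div class="article">
  <a href="/news/two">
    <h3>Second headline</h3>
  </a>
</div>`;

// \s* tolerates the whitespace and newlines between the tags.
const newsRegex = /<a[^>]*href="([^"]+)"[^>]*>\s*<h3[^>]*>([^<]+)<\/h3>\s*<\/a>/g;

const newsItems = [...newsSample.matchAll(newsRegex)].map((m) => ({
  link: m[1] ?? "",
  title: (m[2] ?? "").trim(),
}));

console.log(newsItems);
```

Capturing both fields in one match avoids pairing separately matched titles and links by index.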

Tech Site (like VentureBeat)

// Often uses JSON-LD structured data:
const scriptRegex = /<script type="application\/ld\+json">(.*?)<\/script>/gs;
// Then parse the JSON
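
The "parse the JSON" step might look like the sketch below. It assumes the blocks carry the schema.org properties `headline`, `url`, and `datePublished`, and it does not handle `@graph` wrappers. `LdArticle` and `parseJsonLd` are hypothetical names. The pattern uses `[\s\S]*?` in place of the `s` flag and also tolerates extra attributes on the `<script>` tag.

```typescript
// Hypothetical output shape for this sketch.
interface LdArticle {
  title: string;
  link: string;
  date: string | undefined;
}

function parseJsonLd(html: string): LdArticle[] {
  const scriptRegex = /<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/g;
  const found: LdArticle[] = [];
  for (const m of html.matchAll(scriptRegex)) {
    let data: unknown;
    try {
      data = JSON.parse(m[1] ?? "");
    } catch {
      continue; // skip malformed JSON blocks
    }
    // A block may hold one object or an array of objects.
    const nodes: unknown[] = Array.isArray(data) ? data : [data];
    for (const node of nodes) {
      if (typeof node !== "object" || node === null) continue;
      const rec = node as Record<string, unknown>;
      const headline = rec["headline"];
      const url = rec["url"];
      const published = rec["datePublished"];
      if (typeof headline === "string" && typeof url === "string") {
        found.push({
          title: headline,
          link: url,
          date: typeof published === "string" ? published : undefined,
        });
      }
    }
  }
  return found;
}

const ldSample = `<script type="application/ld+json">
{"@type":"NewsArticle","headline":"Example headline","url":"https://example.com/a","datePublished":"2026-01-28"}
</script>
<script type="application/ld+json">{not valid json}</script>`;

console.log(parseJsonLd(ldSample));
```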

Troubleshooting

No matches found

Check if HTML structure has changed
Verify regex escaping
Try broader patterns first, then narrow down
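
One way to apply the broad-to-narrow advice is to count matches for a series of increasingly strict patterns and see where the count drops. A minimal sketch with made-up markup; `countMatches` and the candidate list are illustrative only.

```typescript
const candidates: Array<{ label: string; regex: RegExp }> = [
  { label: "any <h2>", regex: /<h2[^>]*>/g },
  { label: "<h2> containing a link", regex: /<h2[^>]*>\s*<a[^>]*>/g },
  { label: "<h2> link with plain-text title", regex: /<h2[^>]*>\s*<a[^>]*>([^<]+)<\/a>\s*<\/h2>/g },
];

function countMatches(html: string, regex: RegExp): number {
  return [...html.matchAll(regex)].length;
}

const headingSample = `
<h2 class="title"><a href="/a">Plain title</a></h2>
<h2 class="title"><a href="/b"><span>Wrapped</span> title</a></h2>
<h2>Section heading</h2>`;

for (const { label, regex } of candidates) {
  console.log(`${label}: ${countMatches(headingSample, regex)}`);
}
```

In this sample the counts fall from 3 to 2 to 1. The last drop shows that `[^<]+` rejects the title wrapped in a `<span>`, so that is the part of the pattern to loosen.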

Wrong content extracted

Check capture groups in regex
Verify you’re using the right index (m[1], not m[0])
Test with html.match() to see full matches
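
The difference between the full match and the capture group is easiest to see side by side. A short sketch using the title pattern from Step 3 on a made-up snippet:

```typescript
const headingSnippet = `<h2 class="t"><a href="/post">Hello &amp; welcome</a></h2>`;
const titlePattern = /<h2[^>]*><a[^>]*>([^<]+)<\/a><\/h2>/g;

for (const m of headingSnippet.matchAll(titlePattern)) {
  console.log("m[0] (full match):   ", m[0]);
  console.log("m[1] (capture group):", m[1]);
}

// With a global regex, String.prototype.match returns only the full matches,
// which shows exactly how much text the pattern consumed.
const fullMatches = headingSnippet.match(titlePattern);
console.log(fullMatches);
```

Here `m[0]` is the whole `<h2>` element while `m[1]` is only the text between the anchor tags, still containing the undecoded `&amp;`.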

Relative URLs not resolved

Ensure you’re using this.resolveUrl()
Pass the source.url as base URL
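
The project's `resolveUrl()` implementation is not shown here. As an illustration of what such a helper typically does, the hypothetical stand-in below relies on the standard `URL` constructor, which resolves a relative reference against a base URL.

```typescript
// Hypothetical stand-in for resolveUrl(); the real method may differ.
function resolveUrl(link: string, baseUrl: string): string {
  try {
    return new URL(link, baseUrl).href;
  } catch {
    return link; // leave unparseable values untouched
  }
}

console.log(resolveUrl("/2026/01/post", "https://example.com/blog/"));
// https://example.com/2026/01/post
console.log(resolveUrl("post-2", "https://example.com/blog/"));
// https://example.com/blog/post-2
console.log(resolveUrl("https://other.example/x", "https://example.com/"));
// https://other.example/x
```

Root-relative links replace the whole path of the base, while bare relative links are resolved against the base's directory, so a trailing slash on the source URL changes the result.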

HTML entities not decoded

Ensure you’re using this.decodeHtml()
Check for &amp;amp;, &amp;quot;, &amp;lt;, &amp;gt;, etc. in the raw HTML (that is, the entities for &, ", <, and >)
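
The project's `decodeHtml()` is likewise not shown. The hypothetical stand-in below illustrates the idea, covering a handful of named entities plus decimal and hexadecimal numeric references; a real implementation would likely support many more names.

```typescript
// Hypothetical stand-in for decodeHtml(); only a few named entities are covered.
const namedEntities: Record<string, string> = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
  nbsp: "\u00a0",
};

// Code points above U+10FFFF would make String.fromCodePoint throw,
// so fall back to the original text in that case.
function fromCodePoint(code: number, fallback: string): string {
  return code >= 0 && code <= 0x10ffff ? String.fromCodePoint(code) : fallback;
}

function decodeHtml(text: string): string {
  return text.replace(
    /&(#[xX][0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g,
    (entity: string, body: string) => {
      if (body.startsWith("#x") || body.startsWith("#X")) {
        return fromCodePoint(parseInt(body.slice(2), 16), entity);
      }
      if (body.startsWith("#")) {
        return fromCodePoint(parseInt(body.slice(1), 10), entity);
      }
      return namedEntities[body] ?? entity; // unknown names stay as-is
    }
  );
}

console.log(decodeHtml("Tom &amp; Jerry &quot;live&quot;")); // Tom & Jerry "live"
console.log(decodeHtml("It&#8217;s here"));                  // It’s here
```

The replacement runs in a single pass, so a double-encoded value such as `&amp;lt;` decodes one level to `&lt;` rather than all the way to `<`.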

Example Session

User: “编写html解析规则 https://venturebeat.com/” (write an HTML parsing rule for https://venturebeat.com/)
