hf-papers-reporter

📁 xdrshjr/jr-openclaw-skills 📅 7 days ago

Site rank: #75415

Install command

npx skills add https://github.com/xdrshjr/jr-openclaw-skills --skill hf-papers-reporter

Installs by agent

trae 2

gemini-cli 2

replit 2

antigravity 2

claude-code 2

codex 2

Skill documentation

Hugging Face Daily Papers Reporter

Generate professional Word reports from Hugging Face Daily Papers with full text extraction and image capture.

What This Skill Does

Scrapes huggingface.co/papers for the top papers
Downloads PDFs from arXiv
Extracts Abstract and Introduction sections
Extracts figures/images from PDFs
Generates a formatted Word document (.docx) with:
- Paper titles and arXiv links
- Cover images from HF
- Full abstracts
- Introduction sections
- Extracted figures from papers

Quick Start

Run the main script to generate today’s report:

cd /path/to/hf-papers-reporter
python3 scripts/process_papers.py

Output will be saved to output/HF_Daily_Papers_Report.docx

Dependencies

Install required packages:

pip3 install PyMuPDF python-docx Pillow beautifulsoup4 requests

How It Works

Step 1: Fetch Paper List

Scrapes huggingface.co/papers
Extracts arXiv IDs, titles, and cover image URLs
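As a rough illustration of Step 1, the sketch below pulls arXiv IDs and titles out of an already-downloaded HTML string. It is not the skill's actual code: the class and function names are hypothetical, it uses only the standard library rather than beautifulsoup4, and it assumes paper links have the form /papers/<arXiv id>. Cover image URLs are left out because their markup is not described here.

```python
import re
from html.parser import HTMLParser

# Assumption: paper links look like /papers/<arXiv id>, e.g. /papers/2401.12345
PAPER_HREF = re.compile(r"^/papers/(\d{4}\.\d{4,5})$")


class PaperLinkParser(HTMLParser):
    """Collects arXiv IDs and link text from <a href="/papers/..."> tags."""

    def __init__(self):
        super().__init__()
        self.papers = {}          # arXiv id -> title, in page order
        self._current_id = None   # id of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = PAPER_HREF.match(href)
        self._current_id = match.group(1) if match else None
        if self._current_id:
            self.papers.setdefault(self._current_id, "")

    def handle_data(self, data):
        # Keep the first non-empty link text seen for each paper as its title.
        text = data.strip()
        if self._current_id and text and not self.papers[self._current_id]:
            self.papers[self._current_id] = text

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_id = None


def parse_paper_list(html, limit=10):
    """Return up to `limit` papers as dicts with id, title and PDF URL."""
    parser = PaperLinkParser()
    parser.feed(html)
    items = list(parser.papers.items())[:limit]
    return [
        {"id": pid, "title": title, "pdf_url": f"https://arxiv.org/pdf/{pid}.pdf"}
        for pid, title in items
    ]
```

The HTML itself could be fetched with requests.get("https://huggingface.co/papers").text and passed to parse_paper_list. The limit of 10 mirrors the default paper count mentioned under Customization.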

Step 2: Download & Process (per paper)

Download PDF from arxiv.org/pdf/{id}.pdf
    ↓
Extract text (first 5 pages)
    - Abstract (regex match)
    - Introduction (regex match)
    ↓
Extract images (first 5 pages, max 3 per page)
    - Compress to 600x400
    ↓
Download cover image from HF CDN
    - Compress to 800x600
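The "regex match" steps above might look something like the following sketch. It assumes the text of the first pages has already been extracted into a single string (for example with PyMuPDF). The patterns and function names are hypothetical and are not taken from scripts/process_papers.py. Each section tries several patterns in order, since PDF layouts vary.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Hypothetical patterns, tried in order until one matches.
ABSTRACT_PATTERNS = [
    # "Abstract" up to a line that starts the Introduction heading.
    r"\babstract\b[\s.:\-]*(.+?)(?=\n\s*(?:1\.?\s+)?introduction\b)",
    # Fallback: "Abstract" up to the next blank line.
    r"\babstract\b[\s.:\-]*(.+?)(?=\n\s*\n)",
]
INTRO_PATTERNS = [
    # Introduction heading up to a line that looks like a "2 ..." heading.
    r"\n\s*(?:1\.?\s+)?introduction\s*\n(.+?)(?=\n\s*2\.?\s+[A-Z])",
    # Fallback: everything after the Introduction heading.
    r"\n\s*(?:1\.?\s+)?introduction\s*\n(.+)",
]


def first_match(patterns, text):
    """Return the first captured group that matches, with whitespace collapsed."""
    for pattern in patterns:
        match = re.search(pattern, text, FLAGS)
        if match:
            return " ".join(match.group(1).split())
    return None


def extract_sections(text):
    """Return the abstract and introduction, or None where nothing matched."""
    return {
        "abstract": first_match(ABSTRACT_PATTERNS, text),
        "introduction": first_match(INTRO_PATTERNS, text),
    }
```

Returning None rather than an empty string lets the caller tell "no abstract found" apart from an abstract that is genuinely empty.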

Step 3: Generate Word Document

Title page with report name and date
Each paper as a section with:
- Cover image (centered)
- Abstract section
- Introduction section
- Extracted figures (up to 4)

Output Structure

hf_papers/
├── pdfs/           # Downloaded PDFs
├── images/         # Cover images + extracted figures
└── output/
    ├── HF_Daily_Papers_Report.docx
    └── papers_data.json
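One way to set up this layout is sketched below. The helper names and the fields written to papers_data.json are assumptions for illustration only. The file's real schema is not documented here.

```python
import json
from pathlib import Path


def prepare_output_dirs(root):
    """Create pdfs/, images/ and output/ under `root` and return their paths."""
    root = Path(root)
    dirs = {name: root / name for name in ("pdfs", "images", "output")}
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs


def save_papers_data(dirs, papers):
    """Write the collected paper metadata next to the report (hypothetical schema)."""
    target = dirs["output"] / "papers_data.json"
    target.write_text(
        json.dumps(papers, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return target
```

Calling prepare_output_dirs("hf_papers") is safe to repeat, because exist_ok=True leaves existing directories untouched.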

Known Issues & Solutions

Issue	Cause	Fix
XML encoding error	PDF text contains control characters	Script auto-cleans 0x00-0x1F chars
No abstract found	PDF structure varies	Multiple regex patterns tried
Large PDFs	Some papers are 20MB+	Only first 5 pages processed
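The control-character fix in the first row could be implemented along these lines. This is a minimal sketch with a hypothetical function name, not the script's actual code. Unlike the 0x00-0x1F range quoted in the table, it keeps tab, line feed, and carriage return, since XML 1.0 allows those three.

```python
import re

# C0 control characters that XML 1.0 forbids.
# Tab (0x09), LF (0x0A) and CR (0x0D) are legal and therefore kept.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_for_docx(text):
    """Strip characters that would make the .docx XML invalid."""
    return _XML_ILLEGAL.sub("", text)
```

Running every extracted string through clean_for_docx before handing it to python-docx should avoid the encoding error without flattening paragraph breaks.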

Customization

To modify the number of papers (default: 10), edit the PAPERS list in scripts/process_papers.py.

To change image sizes, modify the thumbnail() calls in the script.
