mineru
Install Command
npx skills add https://github.com/nebutra/mineru-skill --skill mineru
MinerU Document Parser
Convert PDF, Word, PPT, and images to clean Markdown using MinerU's VLM engine, with LaTeX formulas, tables, and images all preserved.
Setup
- Get a free API token at https://mineru.net/user-center/api-token
export MINERU_TOKEN="your-token-here"
Limits: 2000 pages/day · 200 MB per file · 600 pages per file
Supported File Types
| Type | Formats |
|---|---|
| PDF | .pdf (papers, textbooks, scanned docs) |
| Word | .docx (reports, manuscripts) |
| PPT | .pptx (slides, presentations) |
| Image | .jpg, .jpeg, .png (OCR extraction) |
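The limits and file types above can be checked locally before anything is uploaded. The following is a minimal sketch, not part of the skill's scripts: the function name `preflight` is hypothetical, and it assumes "200 MB" means 200 MiB. The per-file page limit is not checked because counting pages would require a format-specific library.

```python
from pathlib import Path
from typing import Optional

# Values copied from the Setup and Supported File Types sections.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".jpg", ".jpeg", ".png"}
MAX_FILE_BYTES = 200 * 1024 * 1024  # assumption: "200 MB" read as 200 MiB


def preflight(path: Path) -> Optional[str]:
    """Return a reason the file would likely be rejected, or None if it looks fine.

    Hypothetical helper for illustration; the real scripts may validate differently.
    """
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        return f"unsupported extension: {path.suffix or '(none)'}"
    if not path.is_file():
        return "not a regular file"
    if path.stat().st_size > MAX_FILE_BYTES:
        return "larger than 200 MB"
    return None
```

Running this over a directory before a batch job gives a quick list of files that would be skipped or fail, without spending any of the daily page quota.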
Commands
Single File
python3 scripts/mineru_v2.py --file ./document.pdf --output ./output/
Batch Directory with Resume
python3 scripts/mineru_v2.py \
--dir ./docs/ \
--output ./output/ \
--workers 10 \
--resume
Direct to Obsidian
python3 scripts/mineru_v2.py \
--dir ./pdfs/ \
--output "$HOME/Library/Mobile Documents/com~apple~CloudDocs/Obsidian/VaultName/" \
--resume
Chinese Documents
python3 scripts/mineru_v2.py --dir ./papers/ --output ./output/ --language ch
Complex Layouts (Slow but Most Accurate)
python3 scripts/mineru_v2.py --file ./paper.pdf --output ./output/ --model vlm
CLI Options
--dir PATH Input directory (PDF/Word/PPT/images)
--file PATH Single file
--output PATH Output directory (default: ./output/)
--workers N Concurrent workers (default: 5, max: 15)
--resume Skip already processed files
--model MODEL Model version: pipeline | vlm | MinerU-HTML (default: vlm)
--language LANG Document language: auto | en | ch (default: auto)
--no-formula Disable formula recognition
--no-table Disable table extraction
--token TOKEN API token (overrides MINERU_TOKEN env var)
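For readers who want to see how these options fit together, here is a sketch of an `argparse` parser that mirrors the documented flags and defaults. It is an illustration only, not the actual source of `mineru_v2.py`. Two details are assumptions: that `--dir` and `--file` are mutually exclusive, and that `--workers` rejects values outside 1 to 15.

```python
import argparse
import os


def worker_count(value: str) -> int:
    """Parse --workers and enforce the documented maximum of 15."""
    count = int(value)
    if not 1 <= count <= 15:
        raise argparse.ArgumentTypeError("workers must be between 1 and 15")
    return count


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the CLI Options list (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="mineru_v2.py")
    # Assumption: exactly one input source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--dir", metavar="PATH", help="Input directory")
    source.add_argument("--file", metavar="PATH", help="Single file")
    parser.add_argument("--output", metavar="PATH", default="./output/")
    parser.add_argument("--workers", metavar="N", type=worker_count, default=5)
    parser.add_argument("--resume", action="store_true")
    parser.add_argument(
        "--model", choices=["pipeline", "vlm", "MinerU-HTML"], default="vlm"
    )
    parser.add_argument("--language", choices=["auto", "en", "ch"], default="auto")
    parser.add_argument("--no-formula", action="store_true")
    parser.add_argument("--no-table", action="store_true")
    # --token overrides the MINERU_TOKEN environment variable when supplied.
    parser.add_argument("--token", default=os.environ.get("MINERU_TOKEN"))
    return parser
```

With this sketch, `build_parser().parse_args(["--file", "a.pdf"])` yields the documented defaults: 5 workers, the `vlm` model, and `auto` language detection.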
Model Version Guide
| Model | Speed | Accuracy | Best For |
|---|---|---|---|
| `pipeline` | ⚡ Fast | High | Standard docs, most use cases |
| `vlm` | 🐢 Slow | Highest | Complex layouts, multi-column, mixed text+figures |
| `MinerU-HTML` | ⚡ Fast | High | Web-style output, HTML-ready content |
Script Selection
| Script | Use When |
|---|---|
| `mineru_v2.py` | Default: async parallel (up to 15 workers) |
| `mineru_async.py` | Fast network, need maximum throughput |
| `mineru_stable.py` | Unstable network: sequential, max retry |
Output Structure
output/
├── document-name/
│   ├── document-name.md    # Main Markdown
│   ├── images/             # Extracted images
│   └── content.json        # Metadata
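Given this layout, one plausible way to decide which inputs still need processing is to look for the expected Markdown file. This document does not say how `--resume` actually detects finished files, so the sketch below is an assumption, and the helper names `expected_markdown` and `pending_files` are hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def expected_markdown(source: Path, output_dir: Path) -> Path:
    """Path where the main Markdown for `source` would land, per the tree above."""
    return output_dir / source.stem / f"{source.stem}.md"


def pending_files(sources: Iterable[Path], output_dir: Path) -> List[Path]:
    """Return the sources whose Markdown output does not exist yet.

    Assumption: a present .md file means the document was fully processed.
    """
    return [s for s in sources if not expected_markdown(s, output_dir).is_file()]
```

A check like this is also useful after a batch run to confirm that every input produced an output folder.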
Performance
| Workers | Speed |
|---|---|
| 1 (sequential) | 1.2 files/min |
| 5 | 3.1 files/min |
| 15 | 5.6 files/min |
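The throughput figures above make it easy to estimate how long a batch will take and how much each worker setting actually helps. The short calculation below uses only the numbers from the table; the function names are illustrative, and real throughput will depend on file size and network conditions.

```python
# Throughput figures copied from the Performance table (files per minute).
FILES_PER_MINUTE = {1: 1.2, 5: 3.1, 15: 5.6}


def estimated_minutes(file_count: int, workers: int) -> float:
    """Estimated wall-clock minutes for a batch at a measured worker setting."""
    return file_count / FILES_PER_MINUTE[workers]


def speedup(workers: int) -> float:
    """Throughput relative to sequential (1 worker) processing."""
    return FILES_PER_MINUTE[workers] / FILES_PER_MINUTE[1]


if __name__ == "__main__":
    print(round(estimated_minutes(100, 1), 1))   # 83.3
    print(round(estimated_minutes(100, 15), 1))  # 17.9
    print(round(speedup(15), 2))                 # 4.67
```

Scaling is sublinear: 15 workers deliver roughly 4.7 times the sequential throughput, not 15 times, so raising the worker count gives diminishing returns.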
Error Handling
- 5x auto-retry with exponential backoff
- Use `--resume` to continue interrupted batches
- Failed files listed at end of run
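Retrying with exponential backoff can be sketched as follows. This is a generic illustration of the technique, not the scripts' actual code: the 1-second base delay, the doubling factor, and the reading of "5x" as five retries after the first attempt are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with doubling delays.

    Assumed schedule: waits of 1, 2, 4, 8, 16 seconds between attempts.
    After the final retry fails, the last exception is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == retries:
                raise
            # Delay doubles with each failed attempt: base * 2^attempt.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use it defaults to `time.sleep`.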
API Reference
For detailed API documentation, see references/api_reference.md.