document-parsers
npx skills add https://github.com/vanman2024/ai-dev-marketplace --skill document-parsers
Document Parsers
Purpose: Autonomously parse and extract content from multiple document formats (PDF, DOCX, HTML, Markdown) using industry-standard libraries and AI-powered parsing tools.
Activation Triggers:
- Building RAG (Retrieval-Augmented Generation) pipelines
- Extracting text, tables, or metadata from documents
- Processing large document collections
- Converting documents to structured formats
- Handling complex PDFs with tables and layouts
- OCR for scanned documents
- Chunking documents for vector embeddings
- Building document search systems
Key Resources:
- scripts/setup-llamaparse.sh – Install and configure LlamaParse (AI-powered parsing)
- scripts/setup-unstructured.sh – Install Unstructured.io library
- scripts/parse-pdf.py – Functional PDF parser with multiple backend options
- scripts/parse-docx.py – DOCX document parser
- scripts/parse-html.py – HTML to structured text parser
- templates/multi-format-parser.py – Universal document parser template
- templates/table-extraction.py – Specialized table extraction template
- examples/parse-research-paper.py – Research paper parsing with citations
- examples/parse-legal-document.py – Legal document parsing with sections
Parser Comparison & Selection Guide
1. LlamaParse (AI-Powered Premium)
Best For:
- Complex PDFs with tables, charts, and mixed layouts
- Scanned documents requiring OCR
- Documents where accuracy is critical
- Multi-column layouts and scientific papers
- Financial reports and invoices
Pros:
- AI-powered layout understanding
- Excellent table extraction accuracy
- Built-in OCR support
- Handles complex formatting
- Structured output (Markdown/JSON)
Cons:
- Requires API key (paid service)
- API rate limits
- Network dependency
- Slower than local parsers
Documentation: https://docs.cloud.llamaindex.ai/llamaparse
Setup:
./scripts/setup-llamaparse.sh
Usage Pattern:
from llama_parse import LlamaParse
parser = LlamaParse(
    api_key="llx-...",
    result_type="markdown",  # or "text"
    language="en",
    verbose=True
)
documents = parser.load_data("document.pdf")
for doc in documents:
    print(doc.text)
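Hard-coding the key as in the pattern above is easy to leak into version control. A minimal sketch of reading it from the environment instead is shown here; the helper names `resolve_llamaparse_key` and `build_parser` are hypothetical, and the variable name `LLAMA_CLOUD_API_KEY` is assumed to match what the LlamaParse client expects, so check it against the current documentation.

```python
import os


def resolve_llamaparse_key(env=None):
    """Return the LlamaParse API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    # Assumption: the key is exported as LLAMA_CLOUD_API_KEY.
    key = env.get("LLAMA_CLOUD_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set LLAMA_CLOUD_API_KEY before using LlamaParse")
    return key


def build_parser(env=None):
    """Create a LlamaParse client; imported lazily so this module loads without it."""
    from llama_parse import LlamaParse  # requires `pip install llama-parse`

    return LlamaParse(api_key=resolve_llamaparse_key(env), result_type="markdown")
```

Failing early with an explicit message tends to be easier to debug than an authentication error returned by the API halfway through a batch.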
2. Unstructured.io (Local Processing)
Best For:
- Batch processing many documents
- Multiple format support (PDF, DOCX, HTML, PPTX, Images)
- Local processing without API dependencies
- Structured element extraction
- Production RAG pipelines
Pros:
- Open-source and free
- Multi-format support
- Runs locally (no API keys)
- Good table detection
- Element-based chunking
Cons:
- Requires system dependencies (poppler, tesseract)
- Complex installation
- Less accurate than LlamaParse for complex layouts
Documentation: https://unstructured-io.github.io/unstructured/
Setup:
./scripts/setup-unstructured.sh
Usage Pattern:
from unstructured.partition.auto import partition
elements = partition("document.pdf")
for element in elements:
    print(f"{element.category}: {element.text}")
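Element-based output is usually more useful once it is grouped. The sketch below collects texts per category using only the two attributes from the loop above (`category` and `text`); the sample objects are stand-ins rather than real Unstructured elements, and `group_by_category` is a hypothetical helper.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_category(elements):
    """Map each element category to its non-empty texts, in document order."""
    grouped = defaultdict(list)
    for element in elements:
        text = (element.text or "").strip()
        if text:
            grouped[element.category].append(text)
    return dict(grouped)


# Stand-in objects exposing the same attributes as parsed elements.
sample = [
    SimpleNamespace(category="Title", text="Quarterly Report"),
    SimpleNamespace(category="NarrativeText", text="Revenue grew."),
    SimpleNamespace(category="NarrativeText", text="  "),
]
grouped = group_by_category(sample)
```

With real input, `partition("document.pdf")` would take the place of `sample`.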
3. PyPDF2 (Simple PDF Text Extraction)
Best For:
- Simple text-based PDFs
- Quick prototyping
- Metadata extraction
- PDF manipulation (merge, split)
Pros:
- Pure Python (no dependencies)
- Fast and lightweight
- Good for simple PDFs
- Maintained successor available: development continues in pypdf, while PyPDF2 itself is deprecated
Cons:
- Poor table extraction
- Struggles with complex layouts
- No OCR support
- Limited formatting preservation
Documentation: https://github.com/py-pdf/pypdf2
Setup:
pip install pypdf2
Usage Pattern:
from PyPDF2 import PdfReader
reader = PdfReader("document.pdf")
for page in reader.pages:
    print(page.extract_text())
4. PDFPlumber (Advanced PDF Analysis)
Best For:
- Table extraction from PDFs
- PDFs with tabular data
- Financial statements and reports
- Coordinate-based extraction
Pros:
- Excellent table extraction
- Visual debugging tools
- Coordinate-level control
- Metadata and layout info
Cons:
- Slower than PyPDF2
- Requires pdfminer.six dependency
- No OCR support
Documentation: https://github.com/jsvine/pdfplumber
Setup:
pip install pdfplumber
Usage Pattern:
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        text = page.extract_text()
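`extract_tables()` returns each table as a list of rows, where every row is a list of cell strings and empty cells come back as `None`. A small sketch of turning one such table into records follows; it assumes the first row is the header, which will not hold for every PDF, and `table_to_records` is a hypothetical helper.

```python
def table_to_records(table):
    """Convert one table (list of rows, first row = header) into a list of dicts."""
    if not table:
        return []
    # Name blank header cells by position so no column is lost.
    header = [(cell or "").strip() or f"col_{i}" for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        # Pad short rows so every record has every header key.
        cells += [""] * (len(header) - len(cells))
        records.append(dict(zip(header, cells)))
    return records


raw = [["Item", None, "Total"], ["Widget", "2", " 10.00 "], [None, None, None]]
records = table_to_records(raw)
```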
5. python-docx (Word Documents)
Best For:
- Microsoft Word (.docx) documents
- Extracting paragraphs, tables, headers
- Document metadata
- Template-based document generation
Pros:
- Native DOCX support
- Preserves structure (paragraphs, tables, sections)
- Access to styles and formatting
- Can also write/modify DOCX
Cons:
- Only works with .docx (not .doc)
- Limited image extraction
Documentation: https://github.com/python-openxml/python-docx
Setup:
pip install python-docx
Usage Pattern:
from docx import Document
doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])
Decision Matrix
| Use Case | Recommended Parser | Alternative |
|---|---|---|
| Simple PDF text extraction | PyPDF2 | Unstructured |
| Complex PDFs with tables | LlamaParse | PDFPlumber |
| Scanned documents (OCR) | LlamaParse | Unstructured + Tesseract |
| Word documents (.docx) | python-docx | Unstructured |
| HTML to text | parse-html.py | Unstructured |
| Multi-format batch processing | Unstructured | templates/multi-format-parser.py |
| Table extraction | PDFPlumber | LlamaParse |
| Research papers | LlamaParse | Unstructured |
| Legal documents | LlamaParse | PDFPlumber |
| Production RAG pipeline | Unstructured | LlamaParse |
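The matrix can also be expressed as a small routing function. This is only a sketch: the flags `scanned`, `has_tables`, and `complex_layout` are hypothetical inputs the caller would have to determine, and the function returns parser names rather than invoking anything.

```python
from pathlib import Path


def choose_parser(path, *, scanned=False, has_tables=False, complex_layout=False):
    """Return (recommended, alternative) parser names following the decision matrix."""
    suffix = Path(path).suffix.lower()
    if suffix == ".docx":
        return ("python-docx", "Unstructured")
    if suffix in {".html", ".htm"}:
        return ("parse-html.py", "Unstructured")
    if suffix == ".pdf":
        if scanned:
            return ("LlamaParse", "Unstructured + Tesseract")
        if complex_layout:
            return ("LlamaParse", "PDFPlumber")
        if has_tables:
            return ("PDFPlumber", "LlamaParse")
        return ("PyPDF2", "Unstructured")
    # Anything else goes down the multi-format route.
    return ("Unstructured", "multi-format-parser")
```

The order of the PDF checks is a design choice: OCR needs are tested first because no local text-layer parser can recover text from a scanned page.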
Functional Scripts
1. Parse PDF (scripts/parse-pdf.py)
Command-line PDF parser supporting multiple backends:
# Using PyPDF2 (default)
python scripts/parse-pdf.py document.pdf
# Using PDFPlumber (better for tables)
python scripts/parse-pdf.py document.pdf --backend pdfplumber
# Using LlamaParse (AI-powered)
python scripts/parse-pdf.py document.pdf --backend llamaparse --api-key llx-...
# Output to file
python scripts/parse-pdf.py document.pdf --output output.txt
# Extract tables as JSON
python scripts/parse-pdf.py document.pdf --backend pdfplumber --tables-only --output tables.json
Features:
- Multiple backend support (PyPDF2, PDFPlumber, LlamaParse)
- Table extraction
- Metadata extraction
- Page range selection
- JSON/Text output formats
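Page range selection typically means translating a user-facing spec into page indices. The sketch below assumes a 1-based spec such as `"1-3,5"`; the syntax actually accepted by `scripts/parse-pdf.py` is not shown in this document, so both the format and the helper name `parse_page_range` are illustrative.

```python
def parse_page_range(spec, page_count):
    """Turn a 1-based spec such as "1-3,5" into sorted 0-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        # Clamp to the document length instead of failing on overshoot.
        indices.update(range(start - 1, min(end, page_count)))
    return sorted(indices)
```

The resulting indices can be used directly with sequences such as `reader.pages` or `pdf.pages`.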
2. Parse DOCX (scripts/parse-docx.py)
Word document parser with structure preservation:
# Basic extraction
python scripts/parse-docx.py document.docx
# Extract with structure
python scripts/parse-docx.py document.docx --preserve-structure
# Extract tables only
python scripts/parse-docx.py document.docx --tables-only
# Output as JSON
python scripts/parse-docx.py document.docx --output output.json --format json
Features:
- Paragraph extraction with styles
- Table extraction
- Header/footer extraction
- Metadata (author, created date, etc.)
- Structured JSON output
3. Parse HTML (scripts/parse-html.py)
HTML to clean text converter:
# Basic HTML parsing
python scripts/parse-html.py document.html
# From URL
python scripts/parse-html.py https://example.com/article
# Preserve links
python scripts/parse-html.py document.html --preserve-links
# Extract specific selector
python scripts/parse-html.py document.html --selector "article.content"
Features:
- Clean text extraction (removes scripts, styles)
- Link preservation
- CSS selector support
- URL fetching
- Markdown output option
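To illustrate what "clean text extraction" involves, here is a standard-library sketch that drops the contents of `<script>` and `<style>` while keeping visible text. It is not the implementation of `scripts/parse-html.py` (the dependency list suggests that script uses BeautifulSoup); `TextExtractor` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every skipped element.
        if not self._skip_depth and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()
```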
Templates
Multi-Format Parser (templates/multi-format-parser.py)
Universal parser handling multiple formats with automatic format detection:
from multi_format_parser import MultiFormatParser
parser = MultiFormatParser(
    llamaparse_api_key="llx-...",  # Optional
    use_ocr=True,
    chunk_size=1000
)
# Automatic format detection
result = parser.parse_file("document.pdf")
print(result.text)
print(result.metadata)
print(result.tables)
# Batch processing
results = parser.parse_directory("./documents/")
for filename, result in results.items():
    print(f"{filename}: {len(result.text)} characters")
Supports:
- PDF, DOCX, HTML, Markdown, TXT
- Automatic chunking for RAG
- Metadata extraction
- Table extraction across all formats
- Error handling and fallbacks
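The fallback behaviour listed above could be structured as an ordered chain of backends. This is a sketch of the general pattern, not the template's actual code: `parse_with_fallbacks` is a hypothetical helper and each backend is assumed to be a callable that takes a path and returns text.

```python
def parse_with_fallbacks(path, backends):
    """Try each (name, callable) backend in order; return the first non-empty text."""
    errors = []
    for name, backend in backends:
        try:
            text = backend(path)
        except Exception as exc:  # a failing backend should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Returning the backend name alongside the text makes it possible to log which parser actually produced each document.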
Table Extraction (templates/table-extraction.py)
Specialized table extraction with multiple strategies:
from table_extraction import TableExtractor
extractor = TableExtractor(
    prefer_llamaparse=True,
    fallback_to_pdfplumber=True
)
# Extract all tables from document
tables = extractor.extract_tables("financial_report.pdf")
for i, table in enumerate(tables):
    print(f"Table {i + 1}:")
    print(table.to_markdown())  # or .to_csv(), .to_json()
    print(f"Confidence: {table.confidence}")
Features:
- Multiple extraction strategies
- Automatic fallback
- Table validation
- Format conversion (CSV, JSON, Markdown, DataFrame)
- Confidence scoring
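Format conversion is simple to reason about when a table is held as a list of rows. The helpers below are hypothetical and independent of the `TableExtractor` template; they show one way to render such rows as Markdown and CSV using only the standard library.

```python
import csv
import io


def rows_to_markdown(rows):
    """Render rows (first row = header) as a pipe table, escaping literal pipes."""
    def fmt(row):
        cells = [str(cell if cell is not None else "").replace("|", "\\|") for cell in row]
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "|" + "---|" * len(header)
    return "\n".join([fmt(header), separator, *(fmt(r) for r in body)])


def rows_to_csv(rows):
    """Serialize rows with the csv module so commas and quotes are handled."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```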
Examples
Research Paper Parsing (examples/parse-research-paper.py)
Complete example for parsing academic papers:
# Extracts title, abstract, sections, citations, tables, figures
python examples/parse-research-paper.py paper.pdf --output paper.json
Extracts:
- Title and authors
- Abstract
- Section structure (Introduction, Methods, Results, etc.)
- Citations and references
- Tables and figures with captions
- Metadata (DOI, publication date, journal)
Legal Document Parsing (examples/parse-legal-document.py)
Specialized parser for legal documents:
# Extracts clauses, sections, definitions, parties
python examples/parse-legal-document.py contract.pdf --output contract.json
Extracts:
- Document type (contract, agreement, etc.)
- Parties involved
- Definitions section
- Numbered clauses and sections
- Signature blocks
- Dates and deadlines
RAG Pipeline Integration
Document Chunking for Embeddings
from multi_format_parser import MultiFormatParser
parser = MultiFormatParser(chunk_size=512, chunk_overlap=50)
result = parser.parse_file("document.pdf")
# Chunks ready for embedding
for chunk in result.chunks:
    print(f"Chunk {chunk.id}: {chunk.text[:100]}...")
    print(f"Metadata: {chunk.metadata}")
    # Send to embedding model
Batch Processing Pipeline
import glob
from multi_format_parser import MultiFormatParser
parser = MultiFormatParser()
# Process all documents in directory
for filepath in glob.glob("./documents/**/*", recursive=True):
    try:
        result = parser.parse_file(filepath)
        # Store in vector database
        store_embeddings(result.chunks)
        print(f"✓ Processed {filepath}")
    except Exception as e:
        print(f"✗ Failed {filepath}: {e}")
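A recursive `**/*` glob also returns directories and unsupported files, which then surface as parse failures. One possible refinement is to filter candidates first; in this sketch the extension set is an assumption based on the formats this skill lists, and `iter_documents` is a hypothetical helper.

```python
from pathlib import Path

# Assumed extension set, derived from the supported formats listed in this document.
SUPPORTED = {".pdf", ".docx", ".html", ".md", ".txt"}


def iter_documents(root, extensions=SUPPORTED):
    """Yield regular files under root whose suffix is supported, in sorted order."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in extensions:
            yield path
```

Iterating over `iter_documents("./documents")` in place of the raw glob keeps the failure log limited to files the parser was actually expected to handle.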
Best Practices
Parser Selection:
- Start with PyPDF2 for simple PDFs, upgrade if needed
- Use LlamaParse for complex layouts (budget permitting)
- Use Unstructured for multi-format production systems
- Use PDFPlumber specifically for table extraction
Performance:
- Cache parsed results to avoid re-processing
- Use batch processing for multiple documents
- Consider async processing for large collections
- Monitor API rate limits for LlamaParse
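Caching parsed results can be as simple as keying the output by a hash of the file contents, so a document is only re-parsed when its bytes change. The sketch below assumes the parse result is JSON-serializable; `cached_parse` and its parameters are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def cached_parse(path, parse_fn, cache_dir):
    """Return parse_fn(path), reusing a JSON cache keyed by the file's SHA-256."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = parse_fn(path)  # must be JSON-serializable
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

This matters most for paid or rate-limited backends such as LlamaParse, where a cache hit avoids an API call entirely.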
Accuracy:
- Validate table extraction results
- Implement fallback strategies
- Log parsing errors for debugging
- Use confidence scores when available
RAG Optimization:
- Chunk size: 512-1024 tokens for embeddings
- Overlap: 10-20% for context preservation
- Preserve metadata (page numbers, sections) for retrieval
- Clean extracted text (remove headers/footers)
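The chunk size and overlap guidance can be sketched as a sliding window. This version counts whitespace-separated words as a rough stand-in for tokens, which is an approximation; a real pipeline would count with the embedding model's tokenizer. The default of 64 words of overlap on 512 is 12.5%, inside the 10-20% range above.

```python
def chunk_words(text, chunk_size=512, overlap=64):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```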
Troubleshooting
PyPDF2 returns garbled text:
- Try PDFPlumber or LlamaParse
- PDF may have non-standard encoding
- Check if PDF is scanned (needs OCR)
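One rough way to check for a scanned PDF is to look at how much text each page yields: image-only pages typically return little or nothing. The thresholds in this sketch are arbitrary assumptions to tune per corpus, and `looks_scanned` is a hypothetical helper that takes the per-page strings from any extractor.

```python
def looks_scanned(page_texts, min_chars_per_page=20, threshold=0.8):
    """Return True when most pages yield almost no extractable text."""
    if not page_texts:
        return False
    sparse = sum(
        1 for text in page_texts if len((text or "").strip()) < min_chars_per_page
    )
    # Flag the document when the share of near-empty pages reaches the threshold.
    return sparse / len(page_texts) >= threshold
```

For example, `looks_scanned([page.extract_text() for page in reader.pages])` could decide whether to route the file to an OCR-capable backend.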
Unstructured installation fails:
- Install system dependencies:
sudo apt-get install poppler-utils tesseract-ocr
- On macOS:
brew install poppler tesseract
LlamaParse API errors:
- Verify API key is correct
- Check rate limits in dashboard
- Ensure document size is within limits
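Transient failures such as rate limiting are often handled by retrying with exponential backoff. The helper below is a generic sketch, not part of the LlamaParse client; which exception types indicate a retryable error depends on the client version, so `retry_on` defaults to a broad placeholder that should be narrowed.

```python
import time


def retry_with_backoff(fn, *, attempts=4, base_delay=1.0, sleep=time.sleep,
                       retry_on=(Exception,)):
    """Call fn(); on a retryable error wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A call might look like `retry_with_backoff(lambda: parser.load_data("document.pdf"))`, where `parser` is the client from the usage pattern earlier in this document.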
Table extraction misses columns:
- Try a different parser (PDFPlumber vs LlamaParse)
- Adjust table detection settings
- Validate table structure manually
DOCX parsing fails:
- Ensure the file is .docx, not .doc
- Check that the file is not corrupted
- Try converting to .docx with LibreOffice
Dependencies
Core:
pip install pypdf2 pdfplumber python-docx beautifulsoup4 lxml markdown
Optional (Unstructured):
pip install "unstructured[local-inference]"
sudo apt-get install poppler-utils tesseract-ocr # Linux
brew install poppler tesseract # macOS
Optional (LlamaParse):
pip install llama-parse
# Requires API key from https://cloud.llamaindex.ai
Supported Formats: PDF, DOCX, HTML, Markdown, TXT
Parsers: LlamaParse, Unstructured.io, PyPDF2, PDFPlumber, python-docx
Version: 1.0.0