document-parsers

📁 vanman2024/ai-dev-marketplace 📅 Jan 28, 2026

Install command

npx skills add https://github.com/vanman2024/ai-dev-marketplace --skill document-parsers

Agent install distribution

codex 4

gemini-cli 3

amp 2

opencode 2

kimi-cli 2

github-copilot 2

Skill documentation

Document Parsers

Purpose: Autonomously parse and extract content from multiple document formats (PDF, DOCX, HTML, Markdown) using industry-standard libraries and AI-powered parsing tools.

Activation Triggers:

Building RAG (Retrieval-Augmented Generation) pipelines
Extracting text, tables, or metadata from documents
Processing large document collections
Converting documents to structured formats
Handling complex PDFs with tables and layouts
OCR for scanned documents
Chunking documents for vector embeddings
Building document search systems

Key Resources:

scripts/setup-llamaparse.sh – Install and configure LlamaParse (AI-powered parsing)
scripts/setup-unstructured.sh – Install Unstructured.io library
scripts/parse-pdf.py – Functional PDF parser with multiple backend options
scripts/parse-docx.py – DOCX document parser
scripts/parse-html.py – HTML to structured text parser
templates/multi-format-parser.py – Universal document parser template
templates/table-extraction.py – Specialized table extraction template
examples/parse-research-paper.py – Research paper parsing with citations
examples/parse-legal-document.py – Legal document parsing with sections

Parser Comparison & Selection Guide

1. LlamaParse (AI-Powered Premium)

Best For:

Complex PDFs with tables, charts, and mixed layouts
Scanned documents requiring OCR
Documents where accuracy is critical
Multi-column layouts and scientific papers
Financial reports and invoices

Pros:

AI-powered layout understanding
Excellent table extraction accuracy
Built-in OCR support
Handles complex formatting
Structured output (Markdown/JSON)

Cons:

Requires API key (paid service)
API rate limits
Network dependency
Slower than local parsers

Documentation: https://docs.cloud.llamaindex.ai/llamaparse

Setup:

./scripts/setup-llamaparse.sh

Usage Pattern:

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="llx-...",
    result_type="markdown",  # or "text"
    language="en",
    verbose=True
)

documents = parser.load_data("document.pdf")
for doc in documents:
    print(doc.text)

2. Unstructured.io (Local Processing)

Best For:

Batch processing many documents
Multiple format support (PDF, DOCX, HTML, PPTX, Images)
Local processing without API dependencies
Structured element extraction
Production RAG pipelines

Pros:

Open-source and free
Multi-format support
Runs locally (no API keys)
Good table detection
Element-based chunking

Cons:

Requires system dependencies (poppler, tesseract)
Complex installation
Less accurate than LlamaParse for complex layouts

Documentation: https://unstructured-io.github.io/unstructured/

Setup:

./scripts/setup-unstructured.sh

Usage Pattern:

from unstructured.partition.auto import partition

elements = partition("document.pdf")
for element in elements:
    print(f"{element.category}: {element.text}")

3. PyPDF2 (Simple PDF Text Extraction)

Best For:

Simple text-based PDFs
Quick prototyping
Metadata extraction
PDF manipulation (merge, split)

Pros:

Pure Python (no dependencies)
Fast and lightweight
Good for simple PDFs
Maintained under the successor package pypdf (the PyPDF2 package name itself is deprecated)

Cons:

Poor table extraction
Struggles with complex layouts
No OCR support
Limited formatting preservation

Documentation: https://github.com/py-pdf/pypdf2

Setup:

pip install pypdf2

Usage Pattern:

from PyPDF2 import PdfReader

reader = PdfReader("document.pdf")
for page in reader.pages:
    print(page.extract_text())

4. PDFPlumber (Advanced PDF Analysis)

Best For:

Table extraction from PDFs
PDFs with tabular data
Financial statements and reports
Coordinate-based extraction

Pros:

Excellent table extraction
Visual debugging tools
Coordinate-level control
Metadata and layout info

Cons:

Slower than PyPDF2
Requires pdfminer.six dependency
No OCR support

Documentation: https://github.com/jsvine/pdfplumber

Setup:

pip install pdfplumber

Usage Pattern:

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        text = page.extract_text()

5. python-docx (Word Documents)

Best For:

Microsoft Word (.docx) documents
Extracting paragraphs, tables, headers
Document metadata
Template-based document generation

Pros:

Native DOCX support
Preserves structure (paragraphs, tables, sections)
Access to styles and formatting
Can also write/modify DOCX

Cons:

Only works with .docx (not .doc)
Limited image extraction

Documentation: https://github.com/python-openxml/python-docx

Setup:

pip install python-docx

Usage Pattern:

from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(para.text)
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])

Decision Matrix

Use Case	Recommended Parser	Alternative
Simple PDF text extraction	PyPDF2	Unstructured
Complex PDFs with tables	LlamaParse	PDFPlumber
Scanned documents (OCR)	LlamaParse	Unstructured + Tesseract
Word documents (.docx)	python-docx	Unstructured
HTML to text	parse-html.py	Unstructured
Multi-format batch processing	Unstructured	Multi-format-parser
Table extraction	PDFPlumber	LlamaParse
Research papers	LlamaParse	Unstructured
Legal documents	LlamaParse	PDFPlumber
Production RAG pipeline	Unstructured	LlamaParse
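
The matrix above can also be kept as data so a pipeline picks a parser programmatically. The following is a minimal sketch: `DECISION_MATRIX` and `recommend()` are hypothetical names introduced here, not part of the skill, and the `has_api_key` switch is an assumption about how one might fall back when LlamaParse is unavailable.

```python
# Hypothetical lookup table mirroring the decision matrix above.
# Values are (recommended parser, alternative).
DECISION_MATRIX = {
    "simple pdf text extraction": ("PyPDF2", "Unstructured"),
    "complex pdfs with tables": ("LlamaParse", "PDFPlumber"),
    "scanned documents (ocr)": ("LlamaParse", "Unstructured + Tesseract"),
    "word documents (.docx)": ("python-docx", "Unstructured"),
    "html to text": ("parse-html.py", "Unstructured"),
    "multi-format batch processing": ("Unstructured", "Multi-format-parser"),
    "table extraction": ("PDFPlumber", "LlamaParse"),
    "research papers": ("LlamaParse", "Unstructured"),
    "legal documents": ("LlamaParse", "PDFPlumber"),
    "production rag pipeline": ("Unstructured", "LlamaParse"),
}


def recommend(use_case: str, has_api_key: bool = True) -> str:
    """Return the recommended parser, or the alternative if LlamaParse cannot be used."""
    primary, alternative = DECISION_MATRIX[use_case.strip().lower()]
    if primary == "LlamaParse" and not has_api_key:
        return alternative
    return primary


print(recommend("Table extraction"))
print(recommend("Research papers", has_api_key=False))
```

Keeping the table as a plain dictionary makes it easy to adjust the recommendations per project without touching the selection logic.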

Functional Scripts

1. Parse PDF (`scripts/parse-pdf.py`)

Command-line PDF parser supporting multiple backends:

# Using PyPDF2 (default)
python scripts/parse-pdf.py document.pdf

# Using PDFPlumber (better for tables)
python scripts/parse-pdf.py document.pdf --backend pdfplumber

# Using LlamaParse (AI-powered)
python scripts/parse-pdf.py document.pdf --backend llamaparse --api-key llx-...

# Output to file
python scripts/parse-pdf.py document.pdf --output output.txt

# Extract tables as JSON
python scripts/parse-pdf.py document.pdf --backend pdfplumber --tables-only --output tables.json

Features:

Multiple backend support (PyPDF2, PDFPlumber, LlamaParse)
Table extraction
Metadata extraction
Page range selection
JSON/Text output formats

2. Parse DOCX (`scripts/parse-docx.py`)

Word document parser with structure preservation:

# Basic extraction
python scripts/parse-docx.py document.docx

# Extract with structure
python scripts/parse-docx.py document.docx --preserve-structure

# Extract tables only
python scripts/parse-docx.py document.docx --tables-only

# Output as JSON
python scripts/parse-docx.py document.docx --output output.json --format json

Features:

Paragraph extraction with styles
Table extraction
Header/footer extraction
Metadata (author, created date, etc.)
Structured JSON output

3. Parse HTML (`scripts/parse-html.py`)

HTML to clean text converter:

# Basic HTML parsing
python scripts/parse-html.py document.html

# From URL
python scripts/parse-html.py https://example.com/article

# Preserve links
python scripts/parse-html.py document.html --preserve-links

# Extract specific selector
python scripts/parse-html.py document.html --selector "article.content"

Features:

Clean text extraction (removes scripts, styles)
Link preservation
CSS selector support
URL fetching
Markdown output option
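
To illustrate what "clean text extraction" and "link preservation" can mean in practice, here is a standard-library sketch. It is not the implementation of `scripts/parse-html.py` (which may well use beautifulsoup4 from the dependency list); `TextExtractor` and `html_to_text` are hypothetical names, and the `(url)` link format is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self, preserve_links: bool = False):
        super().__init__()
        self.preserve_links = preserve_links
        self._skip_depth = 0
        self._href = None
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a" and self.preserve_links:
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "a" and self._href:
            # Append the link target right after the link text.
            self.parts.append(f"({self._href})")
            self._href = None

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str, preserve_links: bool = False) -> str:
    extractor = TextExtractor(preserve_links)
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


page = "<body><script>var x = 1;</script><p>Hello <a href='https://example.com'>world</a></p></body>"
print(html_to_text(page, preserve_links=True))
```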

Templates

Multi-Format Parser (`templates/multi-format-parser.py`)

Universal parser handling multiple formats with automatic format detection:

from multi_format_parser import MultiFormatParser

parser = MultiFormatParser(
    llamaparse_api_key="llx-...",  # Optional
    use_ocr=True,
    chunk_size=1000
)

# Automatic format detection
result = parser.parse_file("document.pdf")
print(result.text)
print(result.metadata)
print(result.tables)

# Batch processing
results = parser.parse_directory("./documents/")
for filename, result in results.items():
    print(f"{filename}: {len(result.text)} characters")

Supports:

PDF, DOCX, HTML, Markdown, TXT
Automatic chunking for RAG
Metadata extraction
Table extraction across all formats
Error handling and fallbacks
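
Automatic format detection could be as simple as a suffix lookup. The sketch below assumes extension-based detection; the real template may inspect file contents instead, and `SUFFIX_TO_FORMAT` and `detect_format` are hypothetical names.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the formats listed above.
SUFFIX_TO_FORMAT = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
    ".md": "markdown",
    ".markdown": "markdown",
    ".txt": "text",
}


def detect_format(path: str) -> str:
    """Map a file path to a format label, case-insensitively."""
    suffix = Path(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported document format: {suffix or '(no extension)'}"
        ) from None


print(detect_format("Quarterly Report.PDF"))
```

Raising on unknown suffixes, rather than guessing, lets a batch loop log and skip unsupported files explicitly.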

Table Extraction (`templates/table-extraction.py`)

Specialized table extraction with multiple strategies:

from table_extraction import TableExtractor

extractor = TableExtractor(
    prefer_llamaparse=True,
    fallback_to_pdfplumber=True
)

# Extract all tables from document
tables = extractor.extract_tables("financial_report.pdf")

for i, table in enumerate(tables):
    print(f"Table {i + 1}:")
    print(table.to_markdown())  # or .to_csv(), .to_json()
    print(f"Confidence: {table.confidence}")

Features:

Multiple extraction strategies
Automatic fallback
Table validation
Format conversion (CSV, JSON, Markdown, DataFrame)
Confidence scoring
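
Format conversion is straightforward once a table is a list of rows. This sketch assumes the first row is the header and that empty cells arrive as `None` (as PDFPlumber's `extract_tables()` can return); `table_to_markdown` and `table_to_csv` are hypothetical helpers, not the template's own methods.

```python
import csv
import io


def table_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown table."""
    if not rows:
        return ""
    cells = [
        ["" if c is None else str(c).replace("\n", " ") for c in row]
        for row in rows
    ]
    header, *body = cells
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def table_to_csv(rows):
    """Render the same rows as CSV text, writing None cells as empty fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(["" if c is None else c for c in row] for row in rows)
    return buffer.getvalue()


rows = [["Item", "Cost"], ["Rent", None], ["Food", "12"]]
print(table_to_markdown(rows))
print(table_to_csv(rows))
```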

Examples

Research Paper Parsing (`examples/parse-research-paper.py`)

Complete example for parsing academic papers:

# Extracts title, abstract, sections, citations, tables, figures
python examples/parse-research-paper.py paper.pdf --output paper.json

Extracts:

Title and authors
Abstract
Section structure (Introduction, Methods, Results, etc.)
Citations and references
Tables and figures with captions
Metadata (DOI, publication date, journal)

Legal Document Parsing (`examples/parse-legal-document.py`)

Specialized parser for legal documents:

# Extracts clauses, sections, definitions, parties
python examples/parse-legal-document.py contract.pdf --output contract.json

Extracts:

Document type (contract, agreement, etc.)
Parties involved
Definitions section
Numbered clauses and sections
Signature blocks
Dates and deadlines

RAG Pipeline Integration

Document Chunking for Embeddings

from multi_format_parser import MultiFormatParser

parser = MultiFormatParser(chunk_size=512, chunk_overlap=50)
result = parser.parse_file("document.pdf")

# Chunks ready for embedding
for chunk in result.chunks:
    print(f"Chunk {chunk.id}: {chunk.text[:100]}...")
    print(f"Metadata: {chunk.metadata}")
    # Send to embedding model
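
For readers who want to see what `chunk_size` and `chunk_overlap` plausibly control, here is a minimal sliding-window sketch. It counts whitespace-separated words as a stand-in for tokens, which is an assumption; the template may use a real tokenizer, and `chunk_words` is a hypothetical name.

```python
def chunk_words(text: str, chunk_size: int = 512, chunk_overlap: int = 50):
    """Split text into overlapping chunks of whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk repeats the last `chunk_overlap` words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is short enough.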

Batch Processing Pipeline

import glob
import os
from multi_format_parser import MultiFormatParser

parser = MultiFormatParser()

# Process all documents in directory
for filepath in glob.glob("./documents/**/*", recursive=True):
    if os.path.isdir(filepath):
        continue  # the ** pattern also matches directories
    try:
        result = parser.parse_file(filepath)
        # Store in vector database (store_embeddings is your own function)
        store_embeddings(result.chunks)
        print(f"✓ Processed {filepath}")
    except Exception as e:
        print(f"✗ Failed {filepath}: {e}")

Best Practices

Parser Selection:

Start with PyPDF2 for simple PDFs, upgrade if needed
Use LlamaParse for complex layouts (budget permitting)
Use Unstructured for multi-format production systems
Use PDFPlumber specifically for table extraction

Performance:

Cache parsed results to avoid re-processing
Use batch processing for multiple documents
Consider async processing for large collections
Monitor API rate limits for LlamaParse
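
Caching can be keyed on file contents so that unchanged documents are never re-parsed. The following is a sketch under stated assumptions: `cached_parse` is a hypothetical helper, the parse function is assumed to return JSON-serializable data, and the `.parse_cache` directory name is arbitrary.

```python
import hashlib
import json
from pathlib import Path


def cached_parse(path, parse_fn, cache_dir=".parse_cache"):
    """Return parse_fn(path), reusing a stored result when the file bytes are unchanged."""
    source = Path(path)
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = parse_fn(str(source))  # must be JSON-serializable
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Hashing the bytes rather than the path means a renamed file still hits the cache, while an edited file is parsed again.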

Accuracy:

Validate table extraction results
Implement fallback strategies
Log parsing errors for debugging
Use confidence scores when available
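
A fallback strategy with error logging can be sketched as an ordered list of backends. `parse_with_fallback` and the logger name are hypothetical; each backend is assumed to be a callable that takes a path and returns extracted text.

```python
import logging

logger = logging.getLogger("document_parsers")


def parse_with_fallback(path, backends):
    """Try each (name, callable) backend in order; return the first non-empty result."""
    failed = []
    for name, parse in backends:
        try:
            text = parse(path)
        except Exception as exc:  # broad on purpose: backends raise different error types
            logger.warning("Backend %s failed on %s: %s", name, path, exc)
            failed.append(name)
            continue
        if text and text.strip():
            return name, text
        logger.warning("Backend %s returned no text for %s", name, path)
        failed.append(name)
    raise RuntimeError(f"All backends failed for {path}: {failed}")
```

With the parsers from this guide, `backends` might be ordered from cheapest to most capable, for example a PyPDF2-based function first, then PDFPlumber, then LlamaParse.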

RAG Optimization:

Chunk size: 512-1024 tokens for embeddings
Overlap: 10-20% for context preservation
Preserve metadata (page numbers, sections) for retrieval
Clean extracted text (remove headers/footers)
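
One simple way to remove headers and footers is to drop first or last lines that repeat across most pages. This is a heuristic sketch, not what the skill's scripts do: `strip_repeated_lines` and the `min_fraction` threshold are assumptions, and footers that change per page (such as "Page 3") are not caught.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop first/last lines that recur on at least min_fraction of pages."""
    split_pages = [[ln for ln in page.splitlines() if ln.strip()] for page in pages]
    edge_counts = Counter()
    for lines in split_pages:
        # Count each edge line once per page (a one-line page has a single edge).
        edge_counts.update({ln.strip() for ln in lines[:1] + lines[-1:]})
    threshold = max(2, min_fraction * len(pages))
    repeated = {ln for ln, n in edge_counts.items() if n >= threshold}
    cleaned = []
    for lines in split_pages:
        if lines and lines[0].strip() in repeated:
            lines = lines[1:]
        if lines and lines[-1].strip() in repeated:
            lines = lines[:-1]
        cleaned.append("\n".join(lines))
    return cleaned


pages = [
    "ACME Report\nIntro text\nConfidential",
    "ACME Report\nMore text\nConfidential",
    "ACME Report\nFinal text\nConfidential",
]
print(strip_repeated_lines(pages))
```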

Troubleshooting

PyPDF2 returns garbled text:

Try PDFPlumber or LlamaParse
PDF may have non-standard encoding
Check if PDF is scanned (needs OCR)
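
A quick way to check whether a PDF is scanned is to look at how much text each page yields. The sketch below is a heuristic with assumed thresholds (`min_chars_per_page`, more than half the pages); `looks_scanned` is a hypothetical helper that works on already-extracted page text.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is image-only from the text each page yielded."""
    if not page_texts:
        return False
    sparse_pages = sum(
        1 for text in page_texts if len((text or "").strip()) < min_chars_per_page
    )
    # Treat the document as scanned when most pages produced almost no text.
    return sparse_pages / len(page_texts) > 0.5


print(looks_scanned(["", None, "   "]))
print(looks_scanned(["A full paragraph of extracted text here.", ""]))
```

The input can be built from the PyPDF2 usage pattern shown earlier, for example `[page.extract_text() for page in reader.pages]`. If the check returns `True`, route the file to an OCR-capable parser.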

Unstructured installation fails:

Install system dependencies: sudo apt-get install poppler-utils tesseract-ocr
On macOS: brew install poppler tesseract
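
Before reinstalling, it can help to confirm which system tools are actually missing. This sketch checks `PATH` for `pdftoppm` (shipped with poppler) and `tesseract`; `missing_tools` is a hypothetical helper, and the `which` parameter exists only so the lookup can be swapped out in tests.

```python
import shutil

# Executables provided by the packages named above: poppler ships pdftoppm,
# and tesseract ships the tesseract binary.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("poppler and tesseract executables found")
```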

LlamaParse API errors:

Verify API key is correct
Check rate limits in dashboard
Ensure document size is within limits
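
When errors come from rate limiting, retrying with exponential backoff is a common mitigation. The sketch below is generic: `call_with_backoff` is a hypothetical helper, and it catches `Exception` broadly because the exact exception type LlamaParse raises is not stated here. Narrow it to the client's rate-limit error in real code.

```python
import time


def call_with_backoff(fn, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponentially growing delays when it raises."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... by default
```

Usage might look like `call_with_backoff(lambda: parser.load_data("document.pdf"))`, reusing the `parser` object from the LlamaParse usage pattern above.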

Table extraction misses columns:

Try a different parser (PDFPlumber vs LlamaParse)
Adjust table detection settings
Validate table structure manually
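
Part of the manual validation can be automated with a structural check. This sketch assumes a table is a list of rows with `None` for empty cells, as PDFPlumber's `extract_tables()` can return; `validate_table` is a hypothetical helper and the two checks (ragged rows, empty columns) are illustrative rather than exhaustive.

```python
def validate_table(rows):
    """Return a list of structural problems found in a table (list of rows)."""
    if not rows:
        return ["table is empty"]
    problems = []
    expected = len(rows[0])  # the header row defines the column count
    for index, row in enumerate(rows[1:], start=1):
        if len(row) != expected:
            problems.append(f"row {index} has {len(row)} cells, expected {expected}")
    for col in range(expected):
        column = [row[col] for row in rows if col < len(row)]
        if all(cell is None or not str(cell).strip() for cell in column):
            problems.append(f"column {col} is empty")
    return problems


print(validate_table([["Name", "Total"], ["Widgets"]]))
```

An empty result means the table passed both checks; a ragged row is often the first visible symptom of a missed column.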

DOCX parsing fails:

Ensure file is .docx not .doc
Check file is not corrupted
Try converting to .docx with LibreOffice
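
The file extension alone does not prove a file is a real .docx. A .docx is a ZIP archive containing `word/document.xml`, while a legacy .doc starts with the OLE2 signature. The sketch below checks both; `sniff_word_file` and its return labels are hypothetical.

```python
import zipfile

# Signature of the OLE2 container used by legacy .doc files.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_file(path):
    """Classify a file as 'docx', 'doc', 'zip-not-docx', or 'unknown'."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head == OLE_MAGIC:
        return "doc"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            if "word/document.xml" in archive.namelist():
                return "docx"
        return "zip-not-docx"
    return "unknown"
```

A result of `"doc"` means the file needs conversion before python-docx can open it; `"unknown"` usually points to a corrupted or mislabeled file.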

Dependencies

Core:

pip install pypdf2 pdfplumber python-docx beautifulsoup4 lxml markdown

Optional (Unstructured):

pip install "unstructured[local-inference]"  # quotes keep shells such as zsh from expanding the brackets
sudo apt-get install poppler-utils tesseract-ocr  # Linux
brew install poppler tesseract  # macOS

Optional (LlamaParse):

pip install llama-parse
# Requires API key from https://cloud.llamaindex.ai

Supported Formats: PDF, DOCX, HTML, Markdown, TXT
Parsers: LlamaParse, Unstructured.io, PyPDF2, PDFPlumber, python-docx
Version: 1.0.0
