convert-pdf-to-epub
npx skills add https://github.com/koreyba/claude-skill-pdf-to-epub --skill convert-pdf-to-epub
PDF to EPUB Converter
Convert PDF documents to high-quality EPUB files with automatic chapter detection, image optimization, and footnote hyperlinking.
Quick Start
# Run conversion (from skill directory)
python -m scripts.convert input.pdf output.epub
# Validate result
python -m scripts.validate input.pdf output.epub
Workflow Overview
The conversion follows a 3-phase process with an optional 4th phase for adaptation:
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│ 1. ANALYZE      │ ──► │ 2. CONVERT      │ ──► │ 3. VALIDATE     │
│ - PDF structure │     │ - Apply config  │     │ - Check quality │
│ - Generate cfg  │     │ - Build EPUB    │     │ - Report issues │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
                                                ┌─────────────────┐
                                                │ 4. ADAPT        │
                                                │ (if needed)     │
                                                │ - Tune config   │
                                                │ - Modify code   │
                                                └─────────────────┘
Phase 1: Analyze
Before converting, analyze the PDF to determine the best configuration:
# Open PDF and examine structure
import fitz # pymupdf
doc = fitz.open("input.pdf")
# Check for:
# 1. Number of pages
# 2. Presence of images
# 3. Multi-column layout (compare text block x-coordinates)
# 4. Footnotes/endnotes (numbers in margins or at page bottom)
# 5. Font sizes (for heading detection thresholds)
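Item 3 of the checklist (comparing text block x-coordinates) could look like the minimal sketch below. It is not the skill's own analyzer: `estimate_column_count` and its `min_share` parameter are hypothetical names, and the heuristic simply buckets block left edges into tenths of the page width. The input uses the tuple layout that PyMuPDF's `page.get_text("blocks")` returns.

```python
from collections import Counter


def estimate_column_count(blocks, page_width, min_share=0.25):
    """Guess the number of text columns from block left edges.

    `blocks` follows the PyMuPDF "blocks" layout:
    (x0, y0, x1, y1, text, block_no, block_type).
    `min_share` is an assumed tuning knob, not a skill parameter.
    """
    # Keep only text blocks (block_type 0) that contain visible text
    lefts = [b[0] for b in blocks if b[6] == 0 and b[4].strip()]
    if not lefts:
        return 0
    # Bucket each left edge into one of ten vertical strips of the page
    buckets = Counter(min(int(10 * x0 / page_width), 9) for x0 in lefts)
    # A strip counts as a column start if enough blocks begin there
    return sum(1 for n in buckets.values() if n / len(lefts) >= min_share)
```

With a real document, the inputs would come from `doc[0].get_text("blocks")` and `doc[0].rect.width`. A result of 2 or more suggests trying `xy_cut`.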
Generate initial config based on analysis:
- Fiction book: Use default `y_sort` reading order
- Academic paper: Enable `xy_cut` for columns
- Magazine: Enable image optimization, use `xy_cut`
Ask user to confirm the proposed configuration before proceeding.
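As a rough illustration of turning analysis results into a proposed configuration, the sketch below maps two findings (column count and image count) to a plain dict. The function name `propose_config` and its inputs are assumptions for illustration; the keys mirror the example configurations in this document.

```python
def propose_config(column_count, image_count):
    """Map analysis findings to a starting config (plain dict).

    Hypothetical helper: the skill's own analyzer may use other inputs.
    """
    # Single-column text reads correctly with the default strategy
    config = {"reading_order_strategy": "y_sort"}
    if column_count >= 2:
        # Columns need a layout-aware reading order
        config["reading_order_strategy"] = "xy_cut"
        config["multi_column"] = {"enabled": True}
    if image_count > 0:
        config["image_optimization"] = {"enabled": True}
    return config
```

Printing the returned dict gives the user something concrete to confirm or correct before conversion starts.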
Phase 2: Convert
Run the conversion with the generated config:
from conversion.converter import Converter
# Assumption: the config helper classes live in conversion.models
# alongside ConversionConfig
from conversion.models import (
    ConversionConfig,
    ExcludeRegions,
    ImageOptimizationConfig,
    PageRanges,
)
pdf_path, epub_path = "input.pdf", "output.epub"
config = ConversionConfig(
    page_ranges=PageRanges(skip=[1, 2], content=(3, -3)),
    exclude_regions=ExcludeRegions(top=0.05, bottom=0.05),
    reading_order_strategy="y_sort",  # or "xy_cut" for columns
    image_optimization=ImageOptimizationConfig(enabled=True),
)
converter = Converter(strategy="simple")
result = converter.convert(pdf_path, epub_path, config)
# Check confidence
if result.reading_order_confidence < 0.7:
    print("Warning: Low confidence in reading order")
Phase 3: Validate
Always validate the conversion result:
from validation.completeness_checker import CompletenessChecker
from validation.order_checker import OrderChecker
# Check text completeness
completeness = CompletenessChecker().check(pdf_text, epub_text)
print(f"Completeness: {completeness.score:.1%}") # Should be > 95%
# Check reading order
order = OrderChecker().check(pdf_chunks, epub_chunks)
print(f"Order score: {order.score:.1%}") # Should be > 80%
Quality gates:
- Completeness < 95%: Text is being lost
- Order score < 80%: Reading order is wrong
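To make the completeness gate concrete, here is one plausible way such a score could be computed: the share of PDF words that also appear in the EPUB, counted as a multiset. This is an illustrative stand-in, not the algorithm used by `CompletenessChecker`, and `word_completeness` is a hypothetical name.

```python
import re
from collections import Counter


def word_completeness(pdf_text, epub_text):
    """Fraction of PDF words (with multiplicity) that survive in the EPUB."""
    pdf_words = Counter(re.findall(r"\w+", pdf_text.lower()))
    epub_words = Counter(re.findall(r"\w+", epub_text.lower()))
    total = sum(pdf_words.values())
    if total == 0:
        return 1.0  # nothing to lose in an empty source
    # Counter intersection keeps the minimum count of each shared word
    kept = sum((pdf_words & epub_words).values())
    return kept / total
```

A score below 0.95 under this measure would correspond to the "text is being lost" gate.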
Phase 4: Adapt (if validation fails)
See Decision Tree below.
Decision Tree: When Things Go Wrong
Validation failed?
│
├─► Text loss > 5%?
│   ├─► Check exclude_regions (headers/footers being cut?)
│   │   → Try: exclude_regions.top: 0.03 (reduce from 0.05)
│   ├─► Check page_ranges (skipping too many pages?)
│   │   → Try: page_ranges.skip: [] (don't skip any)
│   └─► Still failing? → See reference/troubleshooting.md#text-loss
│
├─► Wrong reading order?
│   ├─► PDF has columns?
│   │   → Try: reading_order_strategy: "xy_cut"
│   ├─► Columns detected but wrong?
│   │   → Try: multi_column.threshold: 0.3 (more sensitive)
│   └─► Still failing? → See reference/troubleshooting.md#order
│
├─► Headings not detected?
│   ├─► Headings only slightly larger than body?
│   │   → Try: heading_detection.font_size_threshold: 1.1
│   └─► Custom font patterns?
│       → May need to modify structure_classifier.py (ADAPTABLE)
│
├─► Footnotes not linking?
│   ├─► Non-standard format (not [1] or (1))?
│   │   → Add pattern to FootnoteDetector.PATTERNS
│   └─► See reference/troubleshooting.md#footnotes
│
└─► Other issue?
    └─► See reference/troubleshooting.md
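The "Try:" steps in the tree are all dotted config paths with a new value, so they can be applied mechanically to a JSON-style config. The sketch below assumes the config is held as a nested dict; `TWEAKS` and `apply_tweak` are hypothetical names, and the values are copied from the tree.

```python
import copy

# Config changes to try per symptom, in the order the tree lists them
TWEAKS = {
    "text_loss": [("exclude_regions.top", 0.03), ("page_ranges.skip", [])],
    "wrong_order": [
        ("reading_order_strategy", "xy_cut"),
        ("multi_column.threshold", 0.3),
    ],
    "headings_missed": [("heading_detection.font_size_threshold", 1.1)],
}


def apply_tweak(config, dotted_key, value):
    """Return a copy of `config` with one dotted key set to `value`."""
    new_config = copy.deepcopy(config)  # leave the caller's config intact
    node = new_config
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Create intermediate sections that do not exist yet
        node = node.setdefault(key, {})
    node[leaf] = value
    return new_config
```

Applying one tweak at a time and re-running validation after each keeps it clear which change fixed the problem.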
Three-Layer Architecture
The codebase is organized into three layers with different modification policies:
Layer 1: FROZEN (Do Not Modify)
These files implement fixed specifications or deterministic algorithms:
| File | Reason |
|---|---|
| `core/epub_builder.py` | EPUB3 spec is fixed |
| `core/text_segmenter.py` | Validation depends on identical chunking |
| `validation/*` | Metrics must be reproducible |
Never modify these files unless there's a fundamental bug.
Layer 2: CONFIGURABLE (Try Config First)
Before changing code, try adjusting configuration:
ConversionConfig:
├── page_ranges              # Which pages to process
├── exclude_regions          # Margins to ignore (headers/footers)
├── multi_column             # Column detection settings
├── reading_order_strategy   # "y_sort" or "xy_cut"
├── heading_detection        # Font size thresholds
├── footnote_processing      # Footnote patterns
├── image_optimization       # Compression settings
└── metadata                 # Title, author, language
See reference/config-tuning.md for all parameters.
Layer 3: ADAPTABLE (Can Modify If Config Fails)
These files contain heuristics that may need tuning for specific PDFs:
| File | What You Can Modify |
|---|---|
| `conversion/strategies/*` | Create new strategy subclass |
| `detectors/structure_classifier.py` | Heading detection heuristics |
| `detectors/reading_order/*` | Add custom sorter algorithm |
| `detectors/footnote_detector.py` | Add new footnote patterns |
See reference/code-adaptation.md for guidelines.
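As an example of the smallest kind of Layer 3 change, the sketch below adds a footnote marker pattern for markers written as `{1}`. The `PATTERNS` list here is a hypothetical stand-in: this document does not show the real shape of `FootnoteDetector.PATTERNS`, so treat the structure and the `find_markers` helper as assumptions.

```python
import re

# Hypothetical stand-in for FootnoteDetector.PATTERNS
PATTERNS = [
    re.compile(r"\[(\d+)\]"),  # [1]
    re.compile(r"\((\d+)\)"),  # (1)
]

# New pattern for a non-standard marker format such as {1}
PATTERNS.append(re.compile(r"\{(\d+)\}"))


def find_markers(text):
    """Return footnote numbers matched by any pattern, in text order."""
    hits = []
    for pattern in PATTERNS:
        hits.extend((m.start(), int(m.group(1))) for m in pattern.finditer(text))
    # Sort by position so markers come back in reading order
    return [number for _, number in sorted(hits)]
```

Keeping each format as its own pattern makes it easy to remove one again if it starts matching ordinary parenthesized numbers.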
Project Structure
<skill-directory>/
├── SKILL.md                      # This file
├── requirements.txt              # Python dependencies
├── core/                         # FROZEN: Core algorithms
│   ├── epub_builder.py           # EPUB3 file creation
│   ├── pdf_extractor.py          # PDF text/image extraction
│   ├── text_segmenter.py         # Deterministic chunking
│   └── image_optimizer.py        # Image compression
│
├── conversion/                   # Main conversion logic
│   ├── converter.py              # Orchestrator
│   ├── models.py                 # Data classes & configs
│   ├── strategies/               # ADAPTABLE: Conversion strategies
│   │   ├── base_strategy.py      # Template method pattern
│   │   └── simple_strategy.py
│   └── detectors/                # ADAPTABLE: Detection heuristics
│       ├── structure_classifier.py
│       ├── reading_order/
│       ├── footnote_detector.py
│       └── endnote_formatter.py
│
├── validation/                   # FROZEN: Quality checking
│   ├── completeness_checker.py
│   └── order_checker.py
│
├── scripts/                      # CLI entry points
│   ├── analyze.py
│   ├── convert.py
│   └── validate.py
│
├── reference/                    # Documentation
│   ├── workflow.md
│   ├── architecture.md
│   ├── troubleshooting.md
│   ├── config-tuning.md
│   └── code-adaptation.md
│
└── examples/                     # Example configurations
    ├── fiction-simple.json
    ├── academic-multicol.json
    └── magazine-images.json
Example Configurations
Fiction Book (simple layout)
{
  "page_ranges": {"skip": [1, 2], "content": [3, -3]},
  "exclude_regions": {"top": 0.05, "bottom": 0.05},
  "reading_order_strategy": "y_sort",
  "heading_detection": {"font_size_threshold": 1.2}
}
Academic Paper (2-column)
{
  "page_ranges": {"skip": [1], "content": [2, -1]},
  "exclude_regions": {"top": 0.08, "bottom": 0.08},
  "reading_order_strategy": "xy_cut",
  "multi_column": {"enabled": true, "threshold": 0.4}
}
Magazine (images + columns)
{
  "reading_order_strategy": "xy_cut",
  "multi_column": {"enabled": true, "column_count": 2},
  "image_optimization": {
    "enabled": true,
    "max_width": 800,
    "jpeg_quality": 75
  }
}
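The `page_ranges` values in the examples above use negative numbers such as `-3`. The sketch below shows one reading of that convention: pages are 1-based, a negative bound counts back from the last page (`-1` is the last page), and `skip` removes individual pages. That interpretation is an assumption, not confirmed by this document, and `resolve_pages` is a hypothetical helper.

```python
def resolve_pages(total_pages, skip=(), content=None):
    """Expand a page_ranges-style setting into a list of 1-based pages.

    Assumes negative bounds count from the end (-1 = last page).
    """
    start, end = content if content else (1, -1)
    if start < 0:
        start = total_pages + start + 1
    if end < 0:
        end = total_pages + end + 1
    skipped = set(skip)
    return [page for page in range(start, end + 1) if page not in skipped]
```

Under this reading, the fiction example on a 10-page PDF keeps pages 3 through 8.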
Reference Documentation
For detailed information, see:
- Workflow Details: Complete phase-by-phase guide
- Architecture: Three-layer system explanation
- Troubleshooting: Common problems and solutions
- Config Tuning: All configuration parameters
- Code Adaptation: When and how to modify code
Common Commands
# Full conversion with validation (from skill directory)
python -m scripts.convert input.pdf output.epub && \
python -m scripts.validate input.pdf output.epub
# Analyze PDF structure
python -m scripts.analyze input.pdf
Quality Metrics
After conversion, always check:
| Metric | Good | Warning | Bad |
|---|---|---|---|
| Text completeness | > 98% | 95-98% | < 95% |
| Reading order | > 90% | 80-90% | < 80% |
| Confidence | > 0.8 | 0.6-0.8 | < 0.6 |
If any metric is in the "Warning" or "Bad" range, follow the Decision Tree above.
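The table can be applied programmatically with a small lookup. In the sketch below, scores are fractions (0.96 for 96%), the metric keys and the `rate` function are hypothetical names, and values that fall exactly on a boundary are treated as "warning", which is a choice the table itself leaves open.

```python
# (good_above, bad_below) per metric, copied from the table above
THRESHOLDS = {
    "completeness": (0.98, 0.95),
    "reading_order": (0.90, 0.80),
    "confidence": (0.8, 0.6),
}


def rate(metric, value):
    """Classify a score as "good", "warning", or "bad"."""
    good_above, bad_below = THRESHOLDS[metric]
    if value > good_above:
        return "good"
    if value < bad_below:
        return "bad"
    # Everything in between, boundaries included, needs a closer look
    return "warning"
```

Any result other than "good" is a cue to walk the Decision Tree before accepting the EPUB.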