convert-pdf-to-epub

📁 koreyba/claude-skill-pdf-to-epub 📅 11 days ago

Install command

npx skills add https://github.com/koreyba/claude-skill-pdf-to-epub --skill convert-pdf-to-epub

Skill documentation

PDF to EPUB Converter

Convert PDF documents to high-quality EPUB files with automatic chapter detection, image optimization, and footnote hyperlinking.

Quick Start

# Run conversion (from skill directory)
python -m scripts.convert input.pdf output.epub

# Validate result
python -m scripts.validate input.pdf output.epub

Workflow Overview

The conversion follows a 3-phase process with an optional 4th phase for adaptation:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  1. ANALYZE     │ ──► │  2. CONVERT     │ ──► │  3. VALIDATE    │
│  - PDF structure│     │  - Apply config │     │  - Check quality│
│  - Generate cfg │     │  - Build EPUB   │     │  - Report issues│
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                        │
                                                        ▼
                                              ┌─────────────────┐
                                              │  4. ADAPT       │
                                              │  (if needed)    │
                                              │  - Tune config  │
                                              │  - Modify code  │
                                              └─────────────────┘

Phase 1: Analyze

Before converting, analyze the PDF to determine the best configuration:

# Open PDF and examine structure
import fitz  # pymupdf
doc = fitz.open("input.pdf")

# Check for:
# 1. Number of pages
# 2. Presence of images
# 3. Multi-column layout (compare text block x-coordinates)
# 4. Footnotes/endnotes (numbers in margins or at page bottom)
# 5. Font sizes (for heading detection thresholds)
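
Item 3 of the checklist can be approximated with a small heuristic. The sketch below is not the skill's own detector: it assumes you have already collected the left edge (x0) of each text block on a page, for example from PyMuPDF's `page.get_text("blocks")`, whose tuples begin with `(x0, y0, x1, y1, ...)`. The function name `estimate_column_count` and the `gap_ratio` parameter are hypothetical.

```python
def estimate_column_count(block_x0s, page_width, gap_ratio=0.2):
    """Estimate the number of text columns from block left edges.

    Left edges are sorted; a jump larger than gap_ratio * page_width
    between consecutive edges is treated as the start of a new column.
    """
    if not block_x0s:
        return 0
    edges = sorted(block_x0s)
    columns = 1
    for prev, curr in zip(edges, edges[1:]):
        if curr - prev > gap_ratio * page_width:
            columns += 1
    return columns


# Single column: every block starts near the left margin.
print(estimate_column_count([72, 72, 74, 72], page_width=612))        # 1
# Two columns: blocks start near x=72 and near x=320.
print(estimate_column_count([72, 320, 72, 321, 73], page_width=612))  # 2
```

A result of 2 or more on most pages suggests trying the "xy_cut" reading order rather than the default.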

Generate initial config based on analysis:

Fiction book: Use default y_sort reading order
Academic paper: Enable xy_cut for columns
Magazine: Enable image optimization, use xy_cut
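
The mapping above can be written down as a lookup table that produces a starting config to show the user. This is a minimal sketch using plain dicts shaped like the JSON example configurations in this document. The document-type labels and the `propose_config` helper are hypothetical, not part of the skill.

```python
import copy

# Starting points that mirror the three cases listed above.
# The keys "fiction", "academic", and "magazine" are hypothetical labels.
INITIAL_CONFIGS = {
    "fiction": {"reading_order_strategy": "y_sort"},
    "academic": {
        "reading_order_strategy": "xy_cut",
        "multi_column": {"enabled": True},
    },
    "magazine": {
        "reading_order_strategy": "xy_cut",
        "image_optimization": {"enabled": True},
    },
}


def propose_config(doc_type):
    """Return a fresh config dict for the given document type."""
    try:
        # Deep copy so later tuning never mutates the shared defaults.
        return copy.deepcopy(INITIAL_CONFIGS[doc_type])
    except KeyError:
        raise ValueError(f"Unknown document type: {doc_type!r}") from None


print(propose_config("academic"))
```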

Ask user to confirm the proposed configuration before proceeding.

Phase 2: Convert

Run the conversion with the generated config:

from conversion.converter import Converter
# Assumes the helper config classes live in conversion/models.py next to ConversionConfig
from conversion.models import (
    ConversionConfig,
    ExcludeRegions,
    ImageOptimizationConfig,
    PageRanges,
)

pdf_path, epub_path = "input.pdf", "output.epub"

config = ConversionConfig(
    page_ranges=PageRanges(skip=[1, 2], content=(3, -3)),
    exclude_regions=ExcludeRegions(top=0.05, bottom=0.05),
    reading_order_strategy="y_sort",  # or "xy_cut" for columns
    image_optimization=ImageOptimizationConfig(enabled=True),
)

converter = Converter(strategy="simple")
result = converter.convert(pdf_path, epub_path, config)

# Check confidence
if result.reading_order_confidence < 0.7:
    print("Warning: Low confidence in reading order")
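
The `page_ranges` values above use a negative end index. The sketch below shows one plausible reading of that setting. The semantics are an assumption, not confirmed by the skill's documentation: pages are taken as 1-based, `content` as an inclusive range, and a negative value as counting from the end (so -1 is the last page). `resolve_pages` is a hypothetical helper.

```python
def resolve_pages(page_count, skip=(), content=None):
    """Return the 1-based page numbers to process.

    Assumed semantics (not confirmed by the skill's docs):
    - pages are 1-based;
    - content=(start, end) is inclusive, and a negative value counts
      from the end, so -1 is the last page;
    - pages listed in skip are always dropped.
    """
    def absolute(n):
        return page_count + 1 + n if n < 0 else n

    start, end = (1, page_count) if content is None else map(absolute, content)
    skipped = set(skip)
    return [p for p in range(start, end + 1) if p not in skipped]


# A 10-page PDF with skip=[1, 2] and content=(3, -3)
print(resolve_pages(10, skip=[1, 2], content=(3, -3)))  # [3, 4, 5, 6, 7, 8]
```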

Phase 3: Validate

Always validate the conversion result:

from validation.completeness_checker import CompletenessChecker
from validation.order_checker import OrderChecker

# Check text completeness (pdf_text and epub_text are the texts already extracted from each file)
completeness = CompletenessChecker().check(pdf_text, epub_text)
print(f"Completeness: {completeness.score:.1%}")  # Should be > 95%

# Check reading order (pdf_chunks and epub_chunks are the segmented texts of each file)
order = OrderChecker().check(pdf_chunks, epub_chunks)
print(f"Order score: {order.score:.1%}")  # Should be > 80%

Quality gates:

Completeness < 95%: Text is being lost
Order score < 80%: Reading order is wrong
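
To make the completeness gate concrete, here is one plausible way such a score could be computed. This is an illustrative assumption, not the algorithm used by `CompletenessChecker`: it measures the fraction of word occurrences in the PDF text that also appear in the EPUB text. The function name `completeness_score` is hypothetical.

```python
from collections import Counter


def completeness_score(pdf_text, epub_text):
    """Fraction of PDF word occurrences that also appear in the EPUB text."""
    pdf_words = Counter(pdf_text.lower().split())
    epub_words = Counter(epub_text.lower().split())
    total = sum(pdf_words.values())
    if total == 0:
        return 1.0  # nothing to lose in an empty source
    # Counter & Counter keeps the minimum count of each shared word.
    preserved = sum((pdf_words & epub_words).values())
    return preserved / total


score = completeness_score("the quick brown fox jumps", "the quick fox jumps")
print(f"Completeness: {score:.1%}")  # Completeness: 80.0%
if score < 0.95:
    print("Text is being lost")
```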

Phase 4: Adapt (if validation fails)

See Decision Tree below.

Decision Tree: When Things Go Wrong

Validation failed?
│
├─► Text loss > 5%?
│   ├─► Check exclude_regions (headers/footers being cut?)
│   │   → Try: exclude_regions.top: 0.03 (reduce from 0.05)
│   ├─► Check page_ranges (skipping too many pages?)
│   │   → Try: page_ranges.skip: [] (don't skip any)
│   └─► Still failing? → See reference/troubleshooting.md#text-loss
│
├─► Wrong reading order?
│   ├─► PDF has columns?
│   │   → Try: reading_order_strategy: "xy_cut"
│   ├─► Columns detected but wrong?
│   │   → Try: multi_column.threshold: 0.3 (more sensitive)
│   └─► Still failing? → See reference/troubleshooting.md#order
│
├─► Headings not detected?
│   ├─► Headings only slightly larger than body?
│   │   → Try: heading_detection.font_size_threshold: 1.1
│   └─► Custom font patterns?
│       → May need to modify structure_classifier.py (ADAPTABLE)
│
├─► Footnotes not linking?
│   ├─► Non-standard format (not [1] or (1))?
│   │   → Add pattern to FootnoteDetector.PATTERNS
│   └─► See reference/troubleshooting.md#footnotes
│
└─► Other issue?
    └─► See reference/troubleshooting.md
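
The "Try:" suggestions in the tree are all config overrides, so they can be applied mechanically before any code is touched. The sketch below works on plain dicts shaped like the JSON example configurations. The issue labels and the helpers `deep_merge` and `adapt_config` are hypothetical, and only the override values come from the tree.

```python
import copy

# First-try overrides from the decision tree, keyed by hypothetical issue labels.
FIRST_TRY_OVERRIDES = {
    "text_loss": {"exclude_regions": {"top": 0.03}},
    "wrong_order": {"reading_order_strategy": "xy_cut"},
    "wrong_columns": {"multi_column": {"threshold": 0.3}},
    "missed_headings": {"heading_detection": {"font_size_threshold": 1.1}},
}


def deep_merge(base, overrides):
    """Return a copy of base with overrides applied, merging nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def adapt_config(config, issue):
    """Apply the first suggested override for the given issue."""
    return deep_merge(config, FIRST_TRY_OVERRIDES[issue])


config = {
    "exclude_regions": {"top": 0.05, "bottom": 0.05},
    "reading_order_strategy": "y_sort",
}
print(adapt_config(config, "text_loss"))
```

Merging rather than replacing keeps sibling settings such as `exclude_regions.bottom` intact, and the original config is left unchanged so each attempt can be compared against the baseline.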

Three-Layer Architecture

The codebase is organized into three layers with different modification policies:

Layer 1: FROZEN (Do Not Modify)

These files implement fixed specifications or deterministic algorithms:

File	Reason
`core/epub_builder.py`	EPUB3 spec is fixed
`core/text_segmenter.py`	Validation depends on identical chunking
`validation/*`	Metrics must be reproducible

Never modify these files unless there’s a fundamental bug.

Layer 2: CONFIGURABLE (Try Config First)

Before changing code, try adjusting configuration:

ConversionConfig:
├── page_ranges         # Which pages to process
├── exclude_regions     # Margins to ignore (headers/footers)
├── multi_column        # Column detection settings
├── reading_order_strategy  # "y_sort" or "xy_cut"
├── heading_detection   # Font size thresholds
├── footnote_processing # Footnote patterns
├── image_optimization  # Compression settings
└── metadata           # Title, author, language

See reference/config-tuning.md for all parameters.
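
The difference between the two `reading_order_strategy` values is easiest to see on a two-column page. The sketch below is a simplified illustration, not the skill's sorters: `y_sort` orders blocks top to bottom, while `single_vertical_cut` makes only one vertical split at the widest gutter, a reduced form of the recursive XY-cut idea. Blocks are assumed to be `(x0, y0, x1, y1, text)` tuples, and the `min_gap` parameter is hypothetical.

```python
def y_sort(blocks):
    """Order blocks top to bottom, then left to right."""
    return sorted(blocks, key=lambda b: (b[1], b[0]))


def single_vertical_cut(blocks, min_gap=20):
    """Split blocks at the widest empty vertical strip, then y-sort each side."""
    by_left = sorted(blocks, key=lambda b: b[0])
    best_gap, cut_index = 0, None
    right_edge = by_left[0][2] if by_left else 0
    for i in range(1, len(by_left)):
        # Gap between everything seen so far and the next block's left edge.
        gap = by_left[i][0] - right_edge
        if gap > best_gap:
            best_gap, cut_index = gap, i
        right_edge = max(right_edge, by_left[i][2])
    if cut_index is None or best_gap < min_gap:
        return y_sort(blocks)  # no clear gutter: treat as one column
    return y_sort(by_left[:cut_index]) + y_sort(by_left[cut_index:])


# Two columns, two blocks each: (x0, y0, x1, y1, text)
blocks = [
    (50, 100, 280, 200, "L1"), (320, 100, 550, 200, "R1"),
    (50, 220, 280, 320, "L2"), (320, 220, 550, 320, "R2"),
]
print([b[4] for b in y_sort(blocks)])               # ['L1', 'R1', 'L2', 'R2']
print([b[4] for b in single_vertical_cut(blocks)])  # ['L1', 'L2', 'R1', 'R2']
```

The first ordering interleaves the columns, which is the typical symptom behind a low order score on multi-column PDFs.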

Layer 3: ADAPTABLE (Can Modify If Config Fails)

These files contain heuristics that may need tuning for specific PDFs:

File	What You Can Modify
`conversion/strategies/*`	Create new strategy subclass
`conversion/detectors/structure_classifier.py`	Heading detection heuristics
`conversion/detectors/reading_order/*`	Add custom sorter algorithm
`conversion/detectors/footnote_detector.py`	Add new footnote patterns

See reference/code-adaptation.md for guidelines.
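
As an illustration of the smallest kind of code adaptation, the sketch below adds a footnote pattern for a non-standard marker. The `PATTERNS` list here is a hypothetical stand-in: the real shape of `FootnoteDetector.PATTERNS` is not documented in this file and may differ, so check `footnote_detector.py` before copying anything.

```python
import re

# Hypothetical stand-in for FootnoteDetector.PATTERNS; the real attribute's
# shape is not documented here and may differ.
PATTERNS = [
    re.compile(r"\[(\d+)\]"),  # [1]
    re.compile(r"\((\d+)\)"),  # (1)
]

# Extra pattern for a non-standard marker such as "{1}".
PATTERNS.append(re.compile(r"\{(\d+)\}"))


def find_footnote_refs(text):
    """Return footnote numbers found by any pattern, in order of appearance."""
    hits = []
    for pattern in PATTERNS:
        hits.extend((m.start(), int(m.group(1))) for m in pattern.finditer(text))
    return [number for _, number in sorted(hits)]


print(find_footnote_refs("First claim[1], second claim{2}, third claim(3)."))  # [1, 2, 3]
```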

Project Structure

<skill-directory>/
├── SKILL.md                     # This file
├── requirements.txt             # Python dependencies
├── core/                        # FROZEN: Core algorithms
│   ├── epub_builder.py          # EPUB3 file creation
│   ├── pdf_extractor.py         # PDF text/image extraction
│   ├── text_segmenter.py        # Deterministic chunking
│   └── image_optimizer.py       # Image compression
│
├── conversion/                  # Main conversion logic
│   ├── converter.py             # Orchestrator
│   ├── models.py                # Data classes & configs
│   ├── strategies/              # ADAPTABLE: Conversion strategies
│   │   ├── base_strategy.py     # Template method pattern
│   │   └── simple_strategy.py
│   └── detectors/               # ADAPTABLE: Detection heuristics
│       ├── structure_classifier.py
│       ├── reading_order/
│       ├── footnote_detector.py
│       └── endnote_formatter.py
│
├── validation/                  # FROZEN: Quality checking
│   ├── completeness_checker.py
│   └── order_checker.py
│
├── scripts/                     # CLI entry points
│   ├── analyze.py
│   ├── convert.py
│   └── validate.py
│
├── reference/                   # Documentation
│   ├── workflow.md
│   ├── architecture.md
│   ├── troubleshooting.md
│   ├── config-tuning.md
│   └── code-adaptation.md
│
└── examples/                    # Example configurations
    ├── fiction-simple.json
    ├── academic-multicol.json
    └── magazine-images.json

Example Configurations

Fiction Book (simple layout)

{
  "page_ranges": {"skip": [1, 2], "content": [3, -3]},
  "exclude_regions": {"top": 0.05, "bottom": 0.05},
  "reading_order_strategy": "y_sort",
  "heading_detection": {"font_size_threshold": 1.2}
}

Academic Paper (2-column)

{
  "page_ranges": {"skip": [1], "content": [2, -1]},
  "exclude_regions": {"top": 0.08, "bottom": 0.08},
  "reading_order_strategy": "xy_cut",
  "multi_column": {"enabled": true, "threshold": 0.4}
}

Magazine (images + columns)

{
  "reading_order_strategy": "xy_cut",
  "multi_column": {"enabled": true, "column_count": 2},
  "image_optimization": {
    "enabled": true,
    "max_width": 800,
    "jpeg_quality": 75
  }
}
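
Before handing a JSON config like the ones above to the converter, a quick sanity check can catch typos in key names. This is a minimal sketch, not the skill's own loader: it checks only the top-level keys listed under ConversionConfig in this document and the two documented reading-order values. `load_config` is a hypothetical helper.

```python
import json

# Top-level keys listed under ConversionConfig in this document.
KNOWN_KEYS = {
    "page_ranges", "exclude_regions", "multi_column", "reading_order_strategy",
    "heading_detection", "footnote_processing", "image_optimization", "metadata",
}
VALID_STRATEGIES = {"y_sort", "xy_cut"}


def load_config(json_text):
    """Parse a JSON config and reject unknown keys or strategies."""
    config = json.loads(json_text)
    unknown = set(config) - KNOWN_KEYS
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    strategy = config.get("reading_order_strategy", "y_sort")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"Unknown reading_order_strategy: {strategy!r}")
    return config


academic = """
{
  "page_ranges": {"skip": [1], "content": [2, -1]},
  "exclude_regions": {"top": 0.08, "bottom": 0.08},
  "reading_order_strategy": "xy_cut",
  "multi_column": {"enabled": true, "threshold": 0.4}
}
"""
print(load_config(academic)["multi_column"])  # {'enabled': True, 'threshold': 0.4}
```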

Reference Documentation

For detailed information, see:

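Workflow Details (reference/workflow.md) – Complete phase-by-phase guide
Architecture (reference/architecture.md) – Three-layer system explanation
Troubleshooting (reference/troubleshooting.md) – Common problems and solutions
Config Tuning (reference/config-tuning.md) – All configuration parameters
Code Adaptation (reference/code-adaptation.md) – When and how to modify code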

Common Commands

# Full conversion with validation (from skill directory)
python -m scripts.convert input.pdf output.epub && \
python -m scripts.validate input.pdf output.epub

# Analyze PDF structure
python -m scripts.analyze input.pdf

Quality Metrics

After conversion, always check:

Metric	Good	Warning	Bad
Text completeness	> 98%	95-98%	< 95%
Reading order	> 90%	80-90%	< 80%
Confidence	> 0.8	0.6-0.8	< 0.6

If any metric is in “Warning” or “Bad” range, follow the Decision Tree above.
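
The three bands in the table can be encoded as a small rating function so that a script can decide when to enter the Adapt phase. This is a minimal sketch. The thresholds are taken from the table, while the metric key names and the helpers `rate` and `rate_all` are hypothetical.

```python
def rate(value, good, bad):
    """Return "Good" above `good`, "Bad" below `bad`, otherwise "Warning"."""
    if value > good:
        return "Good"
    if value < bad:
        return "Bad"
    return "Warning"


# metric: (good above, bad below), taken from the Quality Metrics table
THRESHOLDS = {
    "completeness": (0.98, 0.95),
    "reading_order": (0.90, 0.80),
    "confidence": (0.8, 0.6),
}


def rate_all(metrics):
    """Rate each metric against its thresholds."""
    return {name: rate(value, *THRESHOLDS[name]) for name, value in metrics.items()}


report = rate_all({"completeness": 0.99, "reading_order": 0.85, "confidence": 0.5})
print(report)  # {'completeness': 'Good', 'reading_order': 'Warning', 'confidence': 'Bad'}

# Any rating other than "Good" means the decision tree should be consulted.
needs_adaptation = any(r != "Good" for r in report.values())
print(needs_adaptation)  # True
```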
