pdf-to-txt

📁 lucas-acc/sancho-skills 📅 9 days ago
2
总安装量
2
周安装量
#72073
全站排名
安装命令
npx skills add https://github.com/lucas-acc/sancho-skills --skill pdf-to-txt

Agent 安装分布

replit 2
openclaw 2
mcpjam 1
claude-code 1
windsurf 1
zencoder 1

Skill 文档

PDF to Text Skill

Convert PDF files to plain text format using PyMuPDF4LLM. This tool extracts text content from PDF documents while preserving the reading order and basic formatting.

Features

  • Extract text from any PDF file
  • Preserve reading order and structure
  • Support for multi-page documents
  • Optional Markdown output with formatting hints

Usage

Basic Conversion

Convert a PDF to text file:

python {baseDir}/scripts/convert.py "<pdf_path>"

Output will be saved as <pdf_filename>.txt in the same directory as the PDF.

Specify Output Path

python {baseDir}/scripts/convert.py "<pdf_path>" --output "~/Documents/output.txt"

Convert with Markdown Formatting

Use --markdown flag to get Markdown-formatted output with headers, lists, and other formatting hints:

python {baseDir}/scripts/convert.py "<pdf_path>" --markdown

Page Range Selection

Convert only specific pages:

# Convert pages 1-10 only
python {baseDir}/scripts/convert.py "<pdf_path>" --pages 1-10

# Convert single page
python {baseDir}/scripts/convert.py "<pdf_path>" --pages 5

Examples

# Basic conversion
python {baseDir}/scripts/convert.py "~/Documents/paper.pdf"
# Output: ~/Documents/paper.txt

# With custom output path
python {baseDir}/scripts/convert.py "~/Documents/paper.pdf" --output "~/Notes/paper_content.txt"

# Markdown output
python {baseDir}/scripts/convert.py "~/Documents/paper.pdf" --markdown --output "~/Notes/paper.md"

# Convert first 5 pages only
python {baseDir}/scripts/convert.py "~/Documents/book.pdf" --pages 1-5 --output "~/Notes/chapter1.txt"

Output Format

Plain Text (default)

  • Clean text extraction
  • Preserves paragraph breaks
  • Removes decorative formatting

Markdown (--markdown)

  • Headers marked with #
  • Lists preserved with - or *
  • Bold/italic formatting hints where detectable
  • Better for documents with complex structure

Troubleshooting

“No text found in PDF”

  • The PDF may be scanned images without OCR
  • Try using OCR tools first to add a text layer

Garbled text

  • The PDF may use custom fonts without proper encoding
  • Some PDFs have text stored in unexpected ways

Missing content

  • Complex layouts (multi-column, sidebars) may lose some positioning
  • Forms and interactive elements may not extract cleanly