document-processing

📁 eyadsibai/ltk 📅 Jan 28, 2026
Total installs: 25
Weekly installs: 25
Site rank: #7911
Install command:
npx skills add https://github.com/eyadsibai/ltk --skill document-processing

Agent install distribution

claude-code 17
opencode 15
gemini-cli 14
openclaw 14
codex 13
cursor 12

Skill documentation

Document Processing Guide

Work with office documents: PDF, Excel, Word, and PowerPoint.


Format Overview

| Format | Extension | Structure | Best For |
| --- | --- | --- | --- |
| PDF | .pdf | Binary/text | Reports, forms, archives |
| Excel | .xlsx | XML in ZIP | Data, calculations, models |
| Word | .docx | XML in ZIP | Text documents, contracts |
| PowerPoint | .pptx | XML in ZIP | Presentations, slides |

Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
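
As a minimal sketch of this idea, Python's standard `zipfile` module can list the parts stored inside any of these files. The helper name `list_parts` is illustrative and not part of the skill.

```python
import zipfile


def list_parts(path):
    """Return the sorted names of every part stored in an OOXML file.

    `path` may be a file path or a binary file-like object pointing at
    an .xlsx, .docx, or .pptx file.
    """
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())
```

Calling `list_parts` on a Word file would typically return names such as `word/document.xml`, while a PowerPoint file would show entries under `ppt/slides/`.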


PDF Processing

PDF Tools

| Task | Best Tool |
| --- | --- |
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |

Common Operations

| Operation | Approach |
| --- | --- |
| Merge | Loop through files, add each page to one writer |
| Split | Create a new writer per page |
| Extract tables | Use pdfplumber, convert to a DataFrame |
| Rotate | Call .rotate(degrees) on the page (multiples of 90) |
| Encrypt | Use the writer's .encrypt() method |
| OCR | Convert pages to images, run pytesseract |

Excel Processing

Excel Tools

| Task | Best Tool |
| --- | --- |
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |

Critical Rule: Use Formulas

| Approach | Result |
| --- | --- |
| Wrong: Calculate in Python, write value | Static number, breaks when data changes |
| Right: Write Excel formula | Dynamic, recalculates automatically |
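
A minimal sketch of the "right" approach with openpyxl is shown here. The helper `build_sales_sheet` and its column layout are assumptions made for the example, not something the skill prescribes.

```python
from openpyxl import Workbook


def build_sales_sheet(amounts):
    """Write the inputs to column A and a SUM formula beneath them."""
    workbook = Workbook()
    sheet = workbook.active
    for row, amount in enumerate(amounts, start=1):
        sheet.cell(row=row, column=1, value=amount)
    total_row = len(amounts) + 1
    # Store the formula text, not a Python-computed total, so the
    # spreadsheet application recalculates when an input changes.
    sheet.cell(row=total_row, column=1, value=f"=SUM(A1:A{len(amounts)})")
    return workbook
```

Saving the result with `workbook.save("sales.xlsx")` stores the formula text only. openpyxl does not evaluate formulas, so the computed total appears once the file is opened in Excel or LibreOffice.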

Financial Model Standards

| Convention | Meaning |
| --- | --- |
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
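
One way to apply the text-color convention programmatically is to classify each cell value before styling it. The sketch below is a heuristic: the hex codes and the name `font_color_for` are assumptions, and treating any formula that contains `!` as a cross-sheet link can misfire on formulas with `!` inside a string literal.

```python
# Hex colors are assumed values for this sketch; the convention names
# only the color families (blue, black, green).
INPUT_BLUE = "0000FF"
FORMULA_BLACK = "000000"
LINK_GREEN = "008000"


def font_color_for(value):
    """Pick a font color for a data cell using the convention above."""
    if isinstance(value, str) and value.startswith("="):
        # A "!" in a formula marks a reference to another sheet,
        # for example =Assumptions!B2.
        return LINK_GREEN if "!" in value else FORMULA_BLACK
    # Anything that is not a formula is treated as a hardcoded input.
    return INPUT_BLUE
```

With openpyxl, the result could be applied as `cell.font = Font(color=font_color_for(cell.value))`, where `Font` comes from `openpyxl.styles`.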

Common Formula Errors

| Error | Cause |
| --- | --- |
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
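
A small checker can scan a grid of cell values for these error strings and report the likely cause. This is a sketch in plain Python; `find_formula_errors` and the list-of-lists input shape are assumptions chosen for the example.

```python
ERROR_CAUSES = {
    "#REF!": "Invalid cell reference",
    "#DIV/0!": "Division by zero",
    "#VALUE!": "Wrong data type",
    "#NAME?": "Unknown function name",
}


def find_formula_errors(rows):
    """Return (row, column, error, cause) for each error value in a grid.

    `rows` is a list of lists of cell values; positions are 1-based.
    """
    found = []
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            if value in ERROR_CAUSES:
                found.append((r, c, value, ERROR_CAUSES[value]))
    return found
```

The input rows could come from openpyxl via `load_workbook(path, data_only=True)` and `sheet.iter_rows(values_only=True)`. Cached values are present only if the file was last saved by an application that calculated them.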

Word Processing

Word Tools

| Task | Best Tool |
| --- | --- |
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |

Document Structure

| File | Contains |
| --- | --- |
| word/document.xml | Main content |
| word/comments.xml | Comments |
| word/media/ | Images |
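
To illustrate how the main content part can be read without any third-party library, the sketch below pulls paragraph text out of `word/document.xml` using only the standard library. The function name `read_paragraphs` is hypothetical. It collects `w:t` text only, so text inside tracked deletions (`w:delText`) is skipped.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w:" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraphs(path):
    """Return the text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the pieces.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

For anything beyond quick inspection, pandoc or python-docx handle tables, headers, and footnotes that this sketch ignores.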

Tracked Changes (Redlining)

| Element | XML Tag |
| --- | --- |
| Deletion | `<w:del><w:r><w:delText>...</w:delText></w:r></w:del>` |
| Insertion | `<w:ins><w:r><w:t>...</w:t></w:r></w:ins>` |

Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.
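
A tracked replacement is a deletion followed by an insertion. The sketch below builds both elements with the standard library; in a real document the `w:delText` and `w:t` elements sit inside a run (`w:r`), and each change carries `w:id`, `w:author`, and `w:date` attributes. The function name `tracked_replacement` and its parameters are assumptions for illustration, and the caller is responsible for choosing revision ids that are unique within the document.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Serialize elements with the conventional "w:" prefix.
ET.register_namespace("w", W_NS)


def _w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def tracked_replacement(old, new, author, date, change_id=1):
    """Return (deletion, insertion) elements for a tracked text replacement."""
    meta = {_w("author"): author, _w("date"): date}
    deletion = ET.Element(_w("del"), {_w("id"): str(change_id), **meta})
    ET.SubElement(ET.SubElement(deletion, _w("r")), _w("delText")).text = old
    insertion = ET.Element(_w("ins"), {_w("id"): str(change_id + 1), **meta})
    ET.SubElement(ET.SubElement(insertion, _w("r")), _w("t")).text = new
    return deletion, insertion
```

The two elements would then be placed in the paragraph where the original run stood, so that a reviewer can accept or reject the change in Word.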


PowerPoint Processing

PowerPoint Tools

| Task | Best Tool |
| --- | --- |
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |

Slide Structure

| Path | Contains |
| --- | --- |
| ppt/slides/slide{N}.xml | Slide content |
| ppt/notesSlides/ | Speaker notes |
| ppt/slideMasters/ | Master templates |
| ppt/media/ | Images |
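
Given this layout, the slide parts of a deck can be located with the standard library alone. The sketch below sorts them by the number in the file name; `slide_paths` is an illustrative name.

```python
import re
import zipfile

# Matches slide parts only, not their .rels files or other ppt/ entries.
SLIDE_PATH = re.compile(r"ppt/slides/slide(\d+)\.xml")


def slide_paths(path):
    """Return slide part names sorted by their numeric suffix."""
    with zipfile.ZipFile(path) as archive:
        matches = [SLIDE_PATH.fullmatch(name) for name in archive.namelist()]
    found = [m for m in matches if m]
    return [m.group(0) for m in sorted(found, key=lambda m: int(m.group(1)))]
```

The numeric suffix does not always match the order shown in the presentation. The displayed order is defined by the slide list in `ppt/presentation.xml`, so treat the sorted names as a starting point.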

Design Principles

| Principle | Guideline |
| --- | --- |
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |

Converting Between Formats

| Conversion | Tool |
| --- | --- |
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | The extractor for the source format (pdftotext, pandoc, markitdown) |
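
The first three conversions can be driven from Python by building the command line and handing it to `subprocess`. This is a sketch under assumptions: the helper names are hypothetical, the tools must already be installed, and the LibreOffice executable is assumed to be `soffice` (some systems expose it as `libreoffice`).

```python
import subprocess
from pathlib import Path


def conversion_command(source, target_format, out_dir="."):
    """Build the command line for one of the conversions in the table."""
    source = Path(source)
    suffix = source.suffix.lower()
    if target_format == "pdf":
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", str(out_dir), str(source)]
    if target_format == "png" and suffix == ".pdf":
        # pdftoppm writes one image per page, named from this prefix.
        return ["pdftoppm", "-png", str(source),
                str(Path(out_dir) / source.stem)]
    if target_format == "markdown" and suffix == ".docx":
        return ["pandoc", str(source), "-t", "markdown",
                "-o", str(Path(out_dir) / f"{source.stem}.md")]
    raise ValueError(f"No conversion from {suffix} to {target_format}")


def convert(source, target_format, out_dir="."):
    """Run the conversion; raises CalledProcessError if the tool fails."""
    subprocess.run(conversion_command(source, target_format, out_dir),
                   check=True)
```

Keeping command construction separate from execution makes the mapping easy to inspect and test without the tools installed.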

Best Practices

| Practice | Why |
| --- | --- |
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |

Common Packages

| Language | Packages |
| --- | --- |
| Python | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| JavaScript | docx, pptxgenjs |
| CLI | pandoc, qpdf, pdftotext, libreoffice |