# long-document-llm-pipeline

```bash
npx skills add https://github.com/shimo4228/claude-code-learned-skills --skill long-document-llm-pipeline
```
## Long Document LLM Processing Pipeline

Extracted: 2026-02-08 (updated 2026-02-09)

Context: When processing documents over ~50K characters through LLM APIs for extraction, generation, or analysis tasks.
## Problem
Sending large documents (>50K chars) as a single LLM prompt causes:
- Lost in the Middle – LLMs lose attention on content in the middle of long inputs (30%+ accuracy drop, per Liu et al. 2023)
- High cost – Entire document becomes input tokens even if only portions are relevant
- No partial retry – If generation fails, must re-process the entire document
- No parallelism – Single sequential API call
## Solution: 6-Step Pipeline
```text
Document
  |
  v
[1] Text Extraction (pymupdf4llm, page_chunks=True)
  |
  v
[2] Structure Detection (Markdown headers, TOC, Japanese patterns)
  |
  v
[3] Section Splitting (5K-30K chars per section)
  |
  v
[4] Breadcrumb Context (prepend section path to each chunk)
  |
  v
[5] Batch API / Async Parallel (50% cost reduction with Batch)
  |
  v
[6] Merge + Deduplicate Results
```
## Step 1: Structured Extraction (pymupdf4llm)

Use `page_chunks=True` to get structured per-page data with metadata:
```python
import pymupdf4llm

# BAD: flat string, loses structure
text = pymupdf4llm.to_markdown("input.pdf")

# GOOD: structured per-page data with TOC and metadata
chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)

# Returns: list[dict] with keys:
# - "metadata":  {file_path, page_count, page_number, ...}
# - "toc_items": [[level, title, page_number], ...]
# - "text":      "# Heading\n\nContent..."
# - "tables": [...], "images": [...], "page_boxes": [...]
```
### Key Parameters

| Parameter | Type | Description |
|---|---|---|
| `page_chunks` | bool | Return a list of page dicts instead of a string |
| `hdr_info` | callable/None | Custom header detection; `None` = auto-detect by font size |
| `page_separators` | bool | Insert `--- end of page=n ---` markers |
| `margins` | float/seq | Page margins (exclude headers/footers) |
### Heading Detection

`hdr_info=None` auto-detects headings by font size via `IdentifyHeaders` and prefixes them with `#` markers. `toc_items` returns `[level, title, page_number]` entries from the PDF's built-in TOC.
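
Because the extracted text is markdown, Step 2 can recover headings with a simple scan. A minimal sketch (the function name is illustrative); its `(heading_text, level)` output shape matches the loop in Steps 2-3 below:

```python
import re

# Matches "# Title", "## Title", "### Title" at line start (levels 1-3).
HEADING_RE = re.compile(r"^(#{1,3})\s+(.+?)\s*$", re.MULTILINE)

def detect_markdown_headings(text: str) -> list[tuple[str, int]]:
    """Return (heading_text, level) pairs in document order."""
    return [(m.group(2), len(m.group(1))) for m in HEADING_RE.finditer(text)]
```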
## Steps 2-3: Heading-Stack Sectioning with Breadcrumb

Maintain a heading stack as a dict keyed by heading level. When a new heading appears, clear all entries at levels `>=` its own, then record the new heading.
```python
heading_stack: dict[int, str] = {}

for heading_text, level in headings:
    # Clear deeper/same levels (new H1 clears H2, H3; new H2 clears H3)
    keys_to_remove = [k for k in heading_stack if k >= level]
    for k in keys_to_remove:
        del heading_stack[k]
    heading_stack[level] = heading_text

    # Build breadcrumb from the remaining stack (sorted by level)
    stack_list = [document_title] if document_title else []
    for lvl in sorted(heading_stack):
        stack_list.append(heading_stack[lvl])
    breadcrumb = " > ".join(stack_list)
```
### Behavior

| Input | `heading_stack` | `breadcrumb` |
|---|---|---|
| `# 本論` | `{1: "本論"}` | `"本論"` |
| `## 第1章` | `{1: "本論", 2: "第1章"}` | `"本論 > 第1章"` |
| `### 第1節` | `{1: "本論", 2: "第1章", 3: "第1節"}` | `"本論 > 第1章 > 第1節"` |
| `## 第2章` → clears H3 | `{1: "本論", 2: "第2章"}` | `"本論 > 第2章"` |
| `# 結論` → clears H2, H3 | `{1: "結論"}` | `"結論"` |
### Fallback Chain

1. Markdown headings (`#`, `##`, `###`) → preferred
2. Japanese headings (第X章, 序論/本論/結論, `1.` numbering, etc.) → fallback; see the sketch below
3. Single preamble section (level=0) → last resort
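
A sketch of the Japanese-pattern fallback; the regexes are illustrative starting points, not an exhaustive rule set:

```python
import re

# Illustrative patterns: 第1章/第2節-style numbering plus 序論/本論/結論 markers.
JA_HEADING_PATTERNS: list[tuple[re.Pattern[str], int]] = [
    (re.compile(r"^第[0-9一二三四五六七八九十]+章.*$", re.MULTILINE), 1),
    (re.compile(r"^第[0-9一二三四五六七八九十]+節.*$", re.MULTILINE), 2),
    (re.compile(r"^(?:序論|本論|結論)$", re.MULTILINE), 1),
]

def detect_japanese_headings(text: str) -> list[tuple[str, int]]:
    """Return (heading_text, level) pairs in document order."""
    hits: list[tuple[int, str, int]] = []
    for pattern, level in JA_HEADING_PATTERNS:
        hits.extend((m.start(), m.group(0), level) for m in pattern.finditer(text))
    hits.sort()  # sort by match position to restore document order
    return [(heading, level) for _, heading, level in hits]
```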
### Oversized Section Sub-splitting

After heading-based splitting, any section exceeding `max_chars` is sub-split at `\n\n` paragraph boundaries, as sketched below. Sub-sections inherit the parent's breadcrumb.
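
A minimal sketch of the sub-split, assuming a plain-text body and greedy packing up to `max_chars` (the function name is illustrative):

```python
def subsplit_text(text: str, max_chars: int = 30_000) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars."""
    parts: list[str] = []
    current = ""
    for para in text.split("\n\n"):
        # Start a new chunk when adding this paragraph would overflow.
        # A single paragraph longer than max_chars is kept whole.
        if current and len(current) + len(para) + 2 > max_chars:
            parts.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        parts.append(current)
    return parts
```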
### Data Model

```python
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class Section:
    id: str          # "section-0", "section-1-2"
    heading: str     # "第1章 概要"
    level: int       # 1=H1, 2=H2, 3=H3, 0=preamble
    breadcrumb: str  # e.g. "<document title> > 本論 > 第1章"
    text: str        # section body (including the heading line)
    page_range: str  # "pp.3-18" or ""
    char_count: int  # len(text), precomputed
```
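
Sub-sections from the sub-split sketch above can inherit the parent's fields via `dataclasses.replace`, since `Section` is a frozen dataclass. Here `parent` stands for an existing oversized `Section`:

```python
from dataclasses import replace

# Only id, text, and char_count change; breadcrumb, heading, level,
# and page_range are inherited from the parent section.
subsections = [
    replace(parent, id=f"{parent.id}-{i}", text=part, char_count=len(part))
    for i, part in enumerate(subsplit_text(parent.text))
]
```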
## Step 4: Breadcrumb Context in Prompts

Always prepend the section hierarchy to LLM prompts:
```python
prompt = (
    f"Document: {title}\n"
    f"Section: {breadcrumb}\n"  # e.g., "Chapter 3 > Section 2"
    f"Pages: {page_range}\n\n"
    f"---\n\n{section_text}"
)
```
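
Applied over all sections, this yields the `section_prompts` list that the Step 5 batch code consumes (assuming `sections` is the `list[Section]` produced in Steps 2-3):

```python
section_prompts = [
    f"Document: {title}\n"
    f"Section: {s.breadcrumb}\n"
    f"Pages: {s.page_range}\n\n"
    f"---\n\n{s.text}"
    for s in sections
]
```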
## Step 5: API Call Strategy

### Decision Matrix: When to Chunk
| Document Size | Approach | Rationale |
|---|---|---|
| < 50K chars | Single prompt | Within attention sweet spot |
| 50K – 200K chars | Structure-aware chunking | Avoid Lost in the Middle |
| > 200K chars | Structure-aware + model routing | Cost optimization critical |
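
The matrix reduces to a small router; a sketch using the thresholds from the table (the return values are illustrative labels):

```python
def choose_strategy(doc_chars: int) -> str:
    """Map document size to a processing approach, per the decision matrix above."""
    if doc_chars < 50_000:
        return "single-prompt"
    if doc_chars <= 200_000:
        return "structure-aware-chunking"
    return "structure-aware-chunking+model-routing"
```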
### Anthropic Batch API (50% Cost Reduction)

For non-real-time processing:
```python
import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()

requests = [
    Request(
        custom_id=f"section-{i}",
        params=MessageCreateParamsNonStreaming(
            model="claude-sonnet-4-5-20250929",
            max_tokens=8192,
            system=[{
                "type": "text",
                "text": SYSTEM_PROMPT,  # shared task instructions, cached across requests
                "cache_control": {"type": "ephemeral"},
            }],
            messages=[{"role": "user", "content": section_prompt}],
        ),
    )
    for i, section_prompt in enumerate(section_prompts)
]

batch = client.messages.batches.create(requests=requests)
```
Key facts: the 50% discount applies to both input and output tokens; a batch holds at most 100K requests or 256MB; prompt caching stacks with the batch discount.
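
Retrieving results (Step 6): a sketch that polls the batch, then merges outputs in section order. The polling interval is arbitrary, and the code assumes each successful response contains a single text block:

```python
import time

# Poll until the batch finishes processing.
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(30)

# Collect successful outputs keyed by custom_id.
outputs: dict[str, str] = {}
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        outputs[entry.custom_id] = entry.result.message.content[0].text

# Merge in section order; deduplicating overlapping outputs (e.g., repeated
# Q&A pairs from adjacent sections) is task-specific.
merged = [
    outputs[f"section-{i}"]
    for i in range(len(section_prompts))
    if f"section-{i}" in outputs
]
```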
### Model Routing per Section
```python
def select_model(section_text: str) -> str:
    # Route short sections to Haiku, longer/denser ones to Sonnet.
    if len(section_text) < 5_000:
        return "claude-haiku-4-5-20251001"  # simple/short
    return "claude-sonnet-4-5-20250929"     # complex
```
### Cost Example

A 572K-char Japanese document split into 20 sections:
| Approach | Estimated Cost |
|---|---|
| Single chunk, Sonnet | ~$0.90 |
| Structured + Batch + Sonnet | ~$0.45 |
| Structured + Batch + mixed models | ~$0.35 |
## When to Use
- Processing PDFs/documents >50K characters through any LLM API
- Building document-to-X pipelines (flashcards, summaries, Q&A datasets)
- Japanese/multilingual documents with chapter/section structure
- Any task where hierarchical context improves LLM output quality