long-document-llm-pipeline

📁 shimo4228/claude-code-learned-skills 📅 14 days ago

Install command

npx skills add https://github.com/shimo4228/claude-code-learned-skills --skill long-document-llm-pipeline

Skill documentation

Long Document LLM Processing Pipeline

Extracted: 2026-02-08 (updated 2026-02-09)

Context: When processing documents over ~50K characters through LLM APIs for extraction, generation, or analysis tasks.

Problem

Sending large documents (>50K chars) as a single LLM prompt causes:

Lost in the Middle – LLMs attend poorly to content in the middle of long inputs (Liu et al. 2023 report accuracy drops of more than 20% when the relevant passage sits mid-context)
High cost – The entire document is billed as input tokens even if only portions are relevant
No partial retry – If generation fails, the entire document must be reprocessed
No parallelism – Everything runs as a single sequential API call

Solution: 6-Step Pipeline

Document
  |
  v
[1] Text Extraction (pymupdf4llm, page_chunks=True)
  |
  v
[2] Structure Detection (Markdown headers, TOC, Japanese patterns)
  |
  v
[3] Section Splitting (5K-30K chars per section)
  |
  v
[4] Breadcrumb Context (prepend section path to each chunk)
  |
  v
[5] Batch API / Async Parallel (50% cost reduction with Batch)
  |
  v
[6] Merge + Deduplicate Results

Step 1: Structured Extraction (pymupdf4llm)

Use page_chunks=True to get structured per-page data with metadata:

import pymupdf4llm

# BAD: Flat string, loses structure
text = pymupdf4llm.to_markdown("input.pdf")

# GOOD: Structured per-page data with TOC and metadata
chunks = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)
# Returns: list[dict] with keys:
#   - "metadata": {file_path, page_count, page_number, ...}
#   - "toc_items": [[level, title, page_number], ...]
#   - "text": "# Heading\n\nContent..."
#   - "tables": [...], "images": [...], "page_boxes": [...]

Key Parameters

Parameter	Type	Description
`page_chunks`	bool	Return list of page dicts instead of string
`hdr_info`	callable/None	Custom header detection. `None` = auto-detect by font size
`page_separators`	bool	Insert `--- end of page=n ---` markers
`margins`	float/seq	Page margins (exclude headers/footers)

Heading Detection

hdr_info=None auto-detects headings by font size via IdentifyHeaders and prefixes them with # markers. toc_items returns [level, title, page_number] from the PDF’s built-in TOC.
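
As a rough illustration of how the `#`-prefixed headings can be pulled back out of the per-page output, the sketch below scans each chunk's text for Markdown heading lines. It runs on hand-written stand-in data rather than a real PDF; the helper name `collect_headings` is hypothetical, and the key names it reads (`"text"`, `"metadata"`, `"page_number"`) are assumed to match the structure described above.

```python
import re

# Matches Markdown ATX headings such as "## Title" (levels 1-6).
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def collect_headings(chunks: list[dict]) -> list[tuple[str, int, int]]:
    """Return (heading_text, level, page_number) for every Markdown heading line."""
    found = []
    for fallback_page, chunk in enumerate(chunks, start=1):
        # Assumed key names; fall back to the enumeration index if absent.
        page = chunk.get("metadata", {}).get("page_number", fallback_page)
        for line in chunk.get("text", "").splitlines():
            match = HEADING_RE.match(line)
            if match:
                found.append((match.group(2), len(match.group(1)), page))
    return found


# Hand-written stand-in for the output of to_markdown(..., page_chunks=True).
sample_chunks = [
    {"metadata": {"page_number": 1},
     "text": "# Intro\n\nBody text.\n\n## Scope\n\nMore text."},
    {"metadata": {"page_number": 2},
     "text": "Continued text.\n\n# Methods\n"},
]
headings = collect_headings(sample_chunks)
```

One caveat of this simple scan: a `#` comment line inside a code listing in the extracted text would also be picked up as a heading, so real input may need extra filtering.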

Steps 2-3: Heading-Stack Sectioning with Breadcrumb

Use a dictionary as the heading stack, keyed by heading level. When a new heading appears, delete every entry whose level is greater than or equal to the new heading's level, then store the new heading at its own level.

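# Example inputs (illustrative): (heading_text, level) pairs in document order
headings = [("Chapter 1", 1), ("Section 1.1", 2), ("Chapter 2", 1)]
document_title = ""  # optional; prepended to every breadcrumb when set

heading_stack: dict[int, str] = {}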

for heading_text, level in headings:
    # Clear deeper/same levels (new H1 clears H2, H3; new H2 clears H3)
    keys_to_remove = [k for k in heading_stack if k >= level]
    for k in keys_to_remove:
        del heading_stack[k]
    heading_stack[level] = heading_text

    # Build breadcrumb from remaining stack (sorted by level)
    stack_list = [document_title] if document_title else []
    for lvl in sorted(heading_stack.keys()):
        stack_list.append(heading_stack[lvl])
    breadcrumb = " > ".join(stack_list)

Behavior

Input                        heading_stack                          breadcrumb
# 本論                       {1: "本論"}                            "本論"
## 第1章                     {1: "本論", 2: "第1章"}                "本論 > 第1章"
### 第1節                    {1: "本論", 2: "第1章", 3: "第1節"}    "本論 > 第1章 > 第1節"
## 第2章   ← clears H3       {1: "本論", 2: "第2章"}                "本論 > 第2章"
# 結論     ← clears H2, H3   {1: "結論"}                            "結論"
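
The heading-stack logic can be wrapped in a function and run against the same heading sequence to check the behavior described here. This is a minimal sketch; the function name `build_breadcrumbs` is hypothetical and not taken from the original skill.

```python
def build_breadcrumbs(headings, document_title=""):
    """Return one breadcrumb string per (heading_text, level) pair."""
    heading_stack: dict[int, str] = {}
    breadcrumbs = []
    for heading_text, level in headings:
        # Drop the entry at this level and every deeper level.
        for key in [k for k in heading_stack if k >= level]:
            del heading_stack[key]
        heading_stack[level] = heading_text

        # Join the surviving headings from shallowest to deepest.
        parts = [document_title] if document_title else []
        parts.extend(heading_stack[lvl] for lvl in sorted(heading_stack))
        breadcrumbs.append(" > ".join(parts))
    return breadcrumbs


headings = [("本論", 1), ("第1章", 2), ("第1節", 3), ("第2章", 2), ("結論", 1)]
crumbs = build_breadcrumbs(headings)
```

Passing a non-empty `document_title` prefixes every breadcrumb with it, which is how a path with the document title in front would be produced.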

Fallback Chain

1. Markdown headings (#, ##, ###) → preferred
2. Japanese headings (第X章, 序論/本論/結論, 1. etc.) → fallback
3. Single preamble section (level=0) → last resort
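
The chain above could be sketched as a single function that tries each detector in order. The regular expressions and the level assigned to each Japanese pattern are assumptions made for illustration, and numbered headings such as `1.` are left out because they are easy to confuse with ordinary list items.

```python
import re

MARKDOWN_RE = re.compile(r"^(#{1,3})\s+(.+)$")

# Hypothetical patterns and levels; tune them to the documents at hand.
NUMERAL = "[0-9０-９一二三四五六七八九十百]+"
JAPANESE_PATTERNS = [
    (re.compile(r"^(序論|本論|結論)\s*$"), 1),
    (re.compile(rf"^第{NUMERAL}章"), 2),
    (re.compile(rf"^第{NUMERAL}節"), 3),
]


def detect_headings(text: str) -> list[tuple[str, int]]:
    """Return (heading_text, level) pairs using the first detector that matches."""
    lines = text.splitlines()

    # 1. Markdown headings are preferred.
    markdown = [
        (m.group(2).strip(), len(m.group(1)))
        for line in lines
        if (m := MARKDOWN_RE.match(line))
    ]
    if markdown:
        return markdown

    # 2. Japanese chapter/section patterns are the fallback.
    japanese = []
    for line in lines:
        stripped = line.strip()
        for pattern, level in JAPANESE_PATTERNS:
            if pattern.match(stripped):
                japanese.append((stripped, level))
                break
    if japanese:
        return japanese

    # 3. Last resort: treat the whole text as one preamble section.
    return [("", 0)]
```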

Oversized Section Sub-splitting

After heading-based splitting, any section exceeding max_chars gets sub-split at \n\n paragraph boundaries. Sub-sections inherit the parent’s breadcrumb.
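
A minimal sketch of the sub-splitting step, assuming a greedy strategy that packs whole paragraphs into each part until the limit would be exceeded. The function name `split_oversized` is hypothetical, and a single paragraph longer than `max_chars` is deliberately kept whole rather than cut mid-sentence.

```python
def split_oversized(text: str, max_chars: int) -> list[str]:
    """Split text at blank-line paragraph boundaries into parts of <= max_chars."""
    if len(text) <= max_chars:
        return [text]

    parts: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            # Either it still fits, or this is the first paragraph of a part
            # (an oversized lone paragraph is kept intact).
            current = candidate
        else:
            parts.append(current)
            current = paragraph
    if current:
        parts.append(current)
    return parts


parts = split_oversized("aaaa\n\nbbbb\n\ncccc", max_chars=10)
```

Each returned part would then be wrapped in its own section record carrying the parent's breadcrumb unchanged.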

Data Model

from dataclasses import dataclass


@dataclass(frozen=True, slots=True)  # slots=True requires Python 3.10+
class Section:
    id: str           # "section-0", "section-1-2"
    heading: str      # "第1章 概要"
    level: int        # 1=H1, 2=H2, 3=H3, 0=preamble
    breadcrumb: str   # "<document title> > 本論 > 第1章"
    text: str         # Section body (including heading line)
    page_range: str   # "pp.3-18" or ""
    char_count: int   # len(text), precomputed

Step 4: Breadcrumb Context in Prompts

Always prepend section hierarchy to LLM prompts:

prompt = (
    f"Document: {title}\n"
    f"Section: {breadcrumb}\n"  # e.g., "Chapter 3 > Section 2"
    f"Pages: {page_range}\n\n"
    f"---\n\n{section_text}"
)

Step 5: API Call Strategy

Decision Matrix: When to Chunk

Document Size	Approach	Rationale
< 50K chars	Single prompt	Within attention sweet spot
50K – 200K chars	Structure-aware chunking	Avoid Lost in the Middle
> 200K chars	Structure-aware + model routing	Cost optimization critical
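
The matrix translates directly into a small dispatch function. The strategy labels returned here are hypothetical names chosen for the sketch; only the thresholds come from the table.

```python
def choose_strategy(char_count: int) -> str:
    """Map a document's character count to one of the approaches in the matrix."""
    if char_count < 50_000:
        return "single-prompt"
    if char_count <= 200_000:
        return "structure-aware-chunking"
    return "structure-aware-chunking+model-routing"
```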

Anthropic Batch API (50% Cost Reduction)

For non-real-time processing:

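import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# SYSTEM_PROMPT (str) and section_prompts (list[str], one prompt per section)
# are assumed to be defined by the earlier pipeline steps.
requests = [
    Request(
        custom_id=f"section-{i}",  # used later to match results to sections
        params=MessageCreateParamsNonStreaming(
            model="claude-sonnet-4-5-20250929",
            max_tokens=8192,
            system=[{
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache the shared system prompt
            }],
            messages=[{"role": "user", "content": section_prompt}],
        ),
    )
    for i, section_prompt in enumerate(section_prompts)
]
batch = client.messages.batches.create(requests=requests)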

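Key facts: 50% discount on both input and output tokens; a batch holds at most 100,000 requests or 256 MB, whichever limit is reached first; prompt-caching discounts stack with the batch discount, though cache hits within a batch are best-effort.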
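
Once a batch is submitted, the results still have to be polled for and matched back to sections. The sketch below shows one way to do that with the SDK's `batches.retrieve` and `batches.results` calls; the helper name `collect_batch_results` and the polling interval are assumptions, and the loop has no timeout, which a production version would likely want.

```python
import time


def collect_batch_results(client, batch_id: str, poll_seconds: float = 30.0):
    """Wait for a Message Batch to end, then split results into successes and failures.

    `client` is expected to be an anthropic.Anthropic() instance.
    Returns ({custom_id: text}, [custom_id, ...]).
    """
    while True:
        batch = client.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(poll_seconds)

    succeeded: dict[str, str] = {}
    failed: list[str] = []
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            # Join the text blocks of the returned message.
            blocks = entry.result.message.content
            succeeded[entry.custom_id] = "".join(
                block.text for block in blocks if block.type == "text"
            )
        else:  # "errored", "canceled", or "expired"
            failed.append(entry.custom_id)
    return succeeded, failed
```

Because every request carries its own `custom_id`, the `failed` list identifies exactly which sections to resubmit, so a partial failure no longer means reprocessing the whole document.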

Model Routing per Section

def select_model(section_text: str) -> str:
    if len(section_text) < 5_000:
        return "claude-haiku-4-5-20251001"  # Simple/short
    return "claude-sonnet-4-5-20250929"     # Complex
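
Step 6 of the pipeline (Merge + Deduplicate) is only named in the diagram, so the following is a sketch under stated assumptions rather than the skill's own method: each section is assumed to yield a list of text items (for example flashcard questions), and two items count as duplicates when they are equal after Unicode NFKC normalization, case folding, and whitespace collapsing. The names `normalize` and `merge_results` are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Fold width variants, case, and whitespace so near-identical items compare equal."""
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())


def merge_results(results_by_section: dict[str, list[str]],
                  section_order: list[str]) -> list[str]:
    """Concatenate per-section items in document order, keeping first occurrences."""
    seen: set[str] = set()
    merged: list[str] = []
    for section_id in section_order:
        for item in results_by_section.get(section_id, []):
            key = normalize(item)
            if key and key not in seen:
                seen.add(key)
                merged.append(item)
    return merged


merged = merge_results(
    {"section-1": ["alpha ", "Beta"], "section-0": ["Alpha", "Gamma"]},
    section_order=["section-0", "section-1"],
)
```

NFKC normalization is chosen here because Japanese text often mixes full-width and half-width characters; semantic near-duplicates would need a stronger comparison than this exact-match key.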

Cost Example

572K char Japanese document (20 sections):

Approach	Estimated Cost
Single chunk, Sonnet	~$0.90
Structured + Batch + Sonnet	~$0.45
Structured + Batch + mixed models	~$0.35
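
To make the arithmetic behind estimates like these explicit, the sketch below computes input-token cost only. The characters-per-token ratio and the per-million-token price are placeholder assumptions chosen for illustration, not measured tokenizer ratios or quoted prices, and output tokens are ignored; only the 50% batch discount comes from the text above.

```python
def estimate_input_cost(char_count: int,
                        chars_per_token: float,
                        usd_per_million_tokens: float,
                        batch: bool = False) -> float:
    """Rough input-side cost in USD; halved when the Batch API discount applies."""
    tokens = char_count / chars_per_token
    cost = tokens / 1_000_000 * usd_per_million_tokens
    return cost * 0.5 if batch else cost


# Placeholder assumptions: 1.9 characters per token, $3.00 per million input tokens.
full_price = estimate_input_cost(572_000, chars_per_token=1.9, usd_per_million_tokens=3.0)
batch_price = estimate_input_cost(572_000, chars_per_token=1.9, usd_per_million_tokens=3.0,
                                  batch=True)
```

Under those placeholder values the batch figure is exactly half the full-price figure; real costs depend on the actual tokenizer, the current price list, and how many output tokens each section generates.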

When to Use

Processing PDFs/documents >50K characters through any LLM API
Building document-to-X pipelines (flashcards, summaries, Q&A datasets)
Japanese/multilingual documents with chapter/section structure
Any task where hierarchical context improves LLM output quality
