llm-caching

📁 rshvr/llm-caching 📅 8 days ago
Total installs: 1
Weekly installs: 1
Site-wide rank: #46622

Install command
npx skills add https://github.com/rshvr/llm-caching --skill llm-caching

Agent install distribution

replit 1
amp 1
opencode 1
kimi-cli 1
github-copilot 1

Skill Documentation

LLM Caching

Maximize KV cache reuse to reduce costs and latency.

Core Concept

LLMs compute Key (K) and Value (V) vectors for every token during inference; these vectors encode the model’s “understanding” of the context so far. Caching them avoids recomputing that state for tokens the model has already processed.

Level 1: KV Cache (inference)     - Within one generation, reuse previous tokens' K,V
Level 2: Prompt Cache (API)       - Across requests, persist KV state server-side
Level 3: Prefix Sharing (batch)   - Across users/requests, share common prefixes
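
To make Level 1 concrete, the toy single-head attention loop below keeps K and V for every token seen so far and only computes them for the newest one. It is an illustrative numpy sketch with made-up dimensions and random stand-in weights, not how any production inference engine is written:

import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))  # stand-in weights

K_cache, V_cache = [], []                # Level 1: persists across decode steps

def decode_step(x):
    """Attend the newest token to everything seen so far, reusing cached K,V."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv     # computed only for the new token
    K_cache.append(k)
    V_cache.append(v)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)          # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the history
    return weights @ V                   # context vector for the new token

for x in rng.standard_normal((5, d)):    # five fake token embeddings
    out = decode_step(x)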

The Golden Rule

Static content first, variable content last.

[System prompt]         <- cacheable, same every request
[Tool definitions]      <- cacheable
[Few-shot examples]     <- cacheable (same order!)
[Reference documents]   <- cacheable if stable
[User message]          <- variable, at the end

Cache hits require the prefix (beginning) to match exactly. Any difference breaks caching for everything after.
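
A character-level analogy of that rule (providers actually match token prefixes and enforce a minimum cacheable length, so treat this as illustration only): a per-request value at the start kills the shared prefix, while the same value at the end leaves almost everything shared.

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

SYSTEM = "You are a support agent. Follow the policy below.\n" * 50  # static

# Timestamp first: requests diverge within the first few characters.
bad_a = "[10:00:01] " + SYSTEM + "User: reset my password"
bad_b = "[10:05:42] " + SYSTEM + "User: close my account"

# Static prefix first, variable message last: the whole system prompt is shared.
good_a = SYSTEM + "User: reset my password"
good_b = SYSTEM + "User: close my account"

print(shared_prefix_len(bad_a, bad_b))    # tiny: breaks at the timestamp
print(shared_prefix_len(good_a, good_b))  # ~2,500 characters: the full prefix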

Prompt Structure Template

┌─────────────────────────────────────┐
│  1. System instructions (static)    │  <- cache_control
├─────────────────────────────────────┤
│  2. Tool definitions (static)       │  <- cache_control
├─────────────────────────────────────┤
│  3. Few-shot examples (static)      │  <- cache_control
├─────────────────────────────────────┤
│  4. Documents/context (semi-static) │  <- cache_control if reused
├─────────────────────────────────────┤
│  5. Conversation history (growing)  │  <- cache after N turns
├─────────────────────────────────────┤
│  6. Current user message (variable) │  <- no caching
└─────────────────────────────────────┘
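
As a concrete sketch of this template, the call below uses the Anthropic Python SDK's prompt caching, with cache_control breakpoints at the end of the static tool and system blocks. The model string, tool, and prompt contents are placeholders; verify field names, breakpoint limits, and minimum cacheable lengths against the current provider documentation (some providers apply prefix caching automatically with no markup at all).

import anthropic

client = anthropic.Anthropic()          # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "..."                   # 1. static instructions (placeholder)
FEW_SHOT_EXAMPLES = "..."               # 3. static examples, always in the same order
TOOLS = [                               # 2. static tool definitions
    {
        "name": "get_weather",          # hypothetical tool for illustration
        "description": "Look up the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
        # Breakpoint on the last tool caches the whole tool block as prefix.
        "cache_control": {"type": "ephemeral"},
    },
]

response = client.messages.create(
    model="claude-sonnet-4-5",          # placeholder model name
    max_tokens=1024,
    tools=TOOLS,
    system=[
        {"type": "text", "text": SYSTEM_PROMPT},
        {
            "type": "text",
            "text": FEW_SHOT_EXAMPLES,
            # Everything up to this breakpoint is written to / read from the cache.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": "What's the weather in Lisbon?"},  # 6. variable
    ],
)
print(response.usage)  # cache_creation_input_tokens vs cache_read_input_tokens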

Anti-Patterns

Anti-Pattern                    Why It Breaks Caching
Variable content early          Prefix changes every request
Randomizing few-shot order      Different order = different prefix
Timestamps in system prompt     Changes every request (fix sketched below)
User ID in prefix               Per-user cache = no sharing
Prompts < minimum threshold     Too small to cache (1,024 tokens for Claude)
Shuffling tool definitions      Tool order is part of the prefix
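
A minimal sketch of fixing the timestamp and user-ID anti-patterns: keep the system prompt byte-for-byte identical across requests and carry the per-request values in the final, uncached user turn. The names below are hypothetical.

from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a billing assistant. Follow the refund policy strictly."

def build_messages(user_id: str, question: str) -> list[dict]:
    """Variable data rides in the last message, so the cached prefix never changes."""
    dynamic = (
        f"(request time: {datetime.now(timezone.utc).isoformat()}, user: {user_id})\n\n"
        f"{question}"
    )
    return [{"role": "user", "content": dynamic}]

# Every request shares the same SYSTEM_PROMPT prefix; only the final message varies.
messages = build_messages("u_123", "Why was I charged twice this month?")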

Cost Impact

Operation     Typical Pricing    Notes
Cache write   ~1.25x input       One-time; stores the KV state
Cache read    ~0.1x input        90% savings on a cache hit
No caching    1x input           Full recomputation every time

Example: 50k token system prompt, 100 requests

  • Without cache: 50k × 100 × $3/1M = $15.00
  • With cache: 50k × $3.75/1M + 50k × 99 × $0.30/1M = $1.67 (89% savings)
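
The same arithmetic as a small helper (the multipliers mirror the table above; actual rates and minimums vary by provider and model, so treat the defaults as assumptions):

def prefix_cost(prefix_tokens: int, requests: int, price_per_mtok: float,
                write_mult: float = 1.25, read_mult: float = 0.10) -> tuple[float, float]:
    """Return (uncached, cached) dollar cost of resending a cacheable prefix."""
    uncached = prefix_tokens * requests * price_per_mtok / 1e6
    cached = (prefix_tokens * write_mult * price_per_mtok / 1e6            # one cache write
              + prefix_tokens * (requests - 1) * read_mult * price_per_mtok / 1e6)
    return uncached, cached

uncached, cached = prefix_cost(50_000, 100, 3.00)
print(f"${uncached:.2f} vs ${cached:.2f} ({1 - cached / uncached:.0%} savings)")
# -> $15.00 vs $1.67 (89% savings)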

Provider References

Cookbooks

Practical examples: references/cookbooks.md

Pattern              Key Insight
Web scraping agent   Same tools + system prompt, different URLs
RAG pipeline         Cache document chunks, vary queries
Multi-turn chat      Growing prefix; cache conversation history (see sketch below)
Batch processing     Same prompt template, different inputs
Agentic tool use     Cache tool definitions + examples
Multi-tenant SaaS    Shared base prompt, tenant-specific suffix
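
For the multi-turn chat pattern above, one hedged sketch with the Anthropic SDK: move a cache breakpoint to the newest user turn on every request, so each call reuses the prefix cached by the previous one and only pays full price for the new turns. Field names, the per-request breakpoint limit, and the model string should be checked against current docs.

import anthropic

client = anthropic.Anthropic()
history: list[dict] = []                         # grows turn by turn

def chat(user_text: str) -> str:
    # Move the breakpoint forward: drop cache_control from older turns so we
    # stay within the provider's per-request limit on cache markers.
    for turn in history:
        for block in turn["content"]:
            if isinstance(block, dict):
                block.pop("cache_control", None)

    history.append({
        "role": "user",
        "content": [{
            "type": "text",
            "text": user_text,
            "cache_control": {"type": "ephemeral"},   # cache everything up to here
        }],
    })

    response = client.messages.create(
        model="claude-sonnet-4-5",                    # placeholder model name
        max_tokens=512,
        system=[{
            "type": "text",
            "text": "You are a helpful assistant.",   # static, cached prefix
            "cache_control": {"type": "ephemeral"},
        }],
        messages=history,
    )
    history.append({"role": "assistant", "content": response.content})
    return response.content[0].text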