llm-caching
Install command
npx skills add https://github.com/rshvr/llm-caching --skill llm-caching
Skill Documentation
LLM Caching
Maximize KV cache reuse to reduce costs and latency.
Core Concept
During inference, LLMs compute Key (K) and Value (V) vectors for every token; these tensors encode the model's contextual "understanding" of the prompt so far. Caching them avoids recomputing them for tokens the model has already processed.
- Level 1: KV Cache (inference) - within one generation, reuse previous tokens' K,V
- Level 2: Prompt Cache (API) - across requests, persist KV state server-side
- Level 3: Prefix Sharing (batch) - across users/requests, share common prefixes
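Level 3 is easy to picture in code. A minimal sketch, assuming illustrative token IDs rather than a real tokenizer: two requests that share a [system + tools] prefix can reuse the KV entries up to the point where their token sequences diverge.

```python
# Sketch: estimating KV reuse from shared prefixes (Level 3).
# Token IDs are illustrative; a real system would tokenize with the model's tokenizer.

def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

request_a = [101, 7, 7, 9, 42, 5]   # [system + tools] + user A's message
request_b = [101, 7, 7, 9, 13, 8]   # same [system + tools] + user B's message

reusable = shared_prefix_len(request_a, request_b)
print(reusable)  # KV entries for the first 4 tokens can be shared across users
```

This is the mechanism behind features like vLLM's automatic prefix caching: the engine detects the common prefix and serves its KV blocks from cache.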
The Golden Rule
Static content first, variable content last.
[System prompt] <- cacheable, same every request
[Tool definitions] <- cacheable
[Few-shot examples] <- cacheable (same order!)
[Reference documents] <- cacheable if stable
[User message] <- variable, at the end
Cache hits require the prefix (beginning) to match exactly. Any difference breaks caching for everything after.
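The ordering above maps directly onto an Anthropic-style request. A minimal sketch, building only the payload dict (sending it would require the `anthropic` SDK and an API key); the model name, tool name, and prompt text are placeholder assumptions:

```python
# Sketch: a request body that follows the Golden Rule.
# Static blocks come first and carry cache_control; the variable
# user message comes last so the cached prefix stays identical.

CACHE = {"type": "ephemeral"}

payload = {
    "model": "claude-sonnet-4-20250514",  # example model name
    "max_tokens": 1024,
    "system": [
        # Static system prompt: identical on every request, so it is cacheable.
        {"type": "text",
         "text": "You are a support agent for Acme Corp.",
         "cache_control": CACHE},
    ],
    "tools": [
        # Marking the last tool caches all tool definitions up to this point.
        {"name": "lookup_order",
         "description": "Look up an order by ID.",
         "input_schema": {"type": "object"},
         "cache_control": CACHE},
    ],
    "messages": [
        # Variable content goes last and is never marked for caching.
        {"role": "user", "content": "Where is my order?"},
    ],
}
```

On the second and later requests with the same `system` and `tools`, everything before `messages` is served from cache.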
Prompt Structure Template
┌────────────────────────────────────┐
│ 1. System instructions (static)    │ <- cache_control
├────────────────────────────────────┤
│ 2. Tool definitions (static)       │ <- cache_control
├────────────────────────────────────┤
│ 3. Few-shot examples (static)      │ <- cache_control
├────────────────────────────────────┤
│ 4. Documents/context (semi-static) │ <- cache_control if reused
├────────────────────────────────────┤
│ 5. Conversation history (growing)  │ <- cache after N turns
├────────────────────────────────────┤
│ 6. Current user message (variable) │ <- no caching
└────────────────────────────────────┘
Anti-Patterns
| Anti-Pattern | Why It Breaks Caching |
|---|---|
| Variable content early | Prefix changes every request |
| Randomizing few-shot order | Different order = different prefix |
| Timestamps in system prompt | Changes every request |
| User ID in prefix | Per-user cache = no sharing |
| Prompts < minimum threshold | Too small to cache (1024 tokens for Claude) |
| Shuffling tool definitions | Tool order is part of prefix |
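The timestamp anti-pattern is worth seeing concretely. A minimal sketch (the prompt text is illustrative): two requests 30 seconds apart produce different prefixes, so the second request cannot hit the cache written by the first.

```python
# Sketch: why a timestamp in the system prompt defeats caching.
from datetime import datetime, timedelta

def build_prefix(now: datetime) -> str:
    # Anti-pattern: the prefix changes every time `now` changes.
    return f"You are a helpful assistant. Current time: {now.isoformat()}\n"

t1 = datetime(2025, 1, 1, 12, 0, 0)
t2 = t1 + timedelta(seconds=30)

print(build_prefix(t1) == build_prefix(t2))  # False: different prefix, cache miss

# Fix: keep the prefix static; pass the time later, e.g. in the user message.
static_prefix = "You are a helpful assistant.\n"
```

The same reasoning applies to user IDs, request IDs, or shuffled few-shot examples: anything that varies must move after the cached prefix.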
Cost Impact
| Operation | Typical Pricing | Notes |
|---|---|---|
| Cache write | ~1.25x input | One-time, stores KV state |
| Cache read | ~0.1x input | 90% savings on cache hit |
| No caching | 1x input | Full recomputation every time |
Example: 50k token system prompt, 100 requests
- Without cache: 50k à 100 à $3/1M = $15.00
- With cache: 50k à $3.75/1M + 50k à 99 à $0.30/1M = $1.67 (89% savings)
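The arithmetic above can be checked directly (prices assumed from the table: $3/1M input, 1.25x for cache writes, 0.1x for cache reads):

```python
# Reproducing the cost example: 50k-token system prompt, 100 requests.
TOKENS = 50_000
REQUESTS = 100
INPUT_PRICE = 3.00 / 1_000_000       # $ per input token
WRITE_PRICE = INPUT_PRICE * 1.25     # cache write: $3.75/1M
READ_PRICE = INPUT_PRICE * 0.10      # cache read:  $0.30/1M

# Every request recomputes the full prompt.
no_cache = TOKENS * REQUESTS * INPUT_PRICE
# One cache write, then 99 cache reads.
with_cache = TOKENS * WRITE_PRICE + TOKENS * (REQUESTS - 1) * READ_PRICE

print(f"${no_cache:.2f}")                        # $15.00
print(f"${with_cache:.2f}")                      # $1.67
print(f"{1 - with_cache / no_cache:.0%} saved")  # 89% saved
```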
Provider References
- Anthropic Claude (recommended): references/claude.md
- Cohere: references/cohere.md
- Self-hosted (vLLM, SGLang, Ollama, HuggingFace): references/self-hosted.md
- OpenAI: references/openai.md
- Google Gemini: references/gemini.md
Cookbooks
Practical examples: references/cookbooks.md
| Pattern | Key Insight |
|---|---|
| Web scraping agent | Same tools + system prompt, different URLs |
| RAG pipeline | Cache document chunks, vary queries |
| Multi-turn chat | Growing prefix, cache conversation history |
| Batch processing | Same prompt template, different inputs |
| Agentic tool use | Cache tool definitions + examples |
| Multi-tenant SaaS | Shared base prompt, tenant-specific suffix |
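The multi-turn chat pattern can be sketched as follows, assuming Anthropic-style content blocks (the helper name and message text are illustrative, not part of the skill): move the cache breakpoint forward each turn so the entire conversation so far is cached and each new request only pays for the newest tokens.

```python
# Sketch: incremental caching for multi-turn chat.
# Marking the latest message caches the whole prefix up to and including it,
# so the next turn reads that prefix from cache.

CACHE = {"type": "ephemeral"}

def with_cache_breakpoint(history: list[dict]) -> list[dict]:
    """Return API-shaped messages with cache_control on the final block."""
    out = []
    for i, msg in enumerate(history):
        block = {"type": "text", "text": msg["text"]}
        if i == len(history) - 1:
            block["cache_control"] = CACHE  # breakpoint moves forward each turn
        out.append({"role": msg["role"], "content": [block]})
    return out

history = [
    {"role": "user", "text": "Hi"},
    {"role": "assistant", "text": "Hello! How can I help?"},
    {"role": "user", "text": "Summarize our plan."},
]
messages = with_cache_breakpoint(history)
```

The growing prefix stays byte-identical across turns, which is exactly what exact-prefix matching requires.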