rag-architecture
Install command
npx skills add https://github.com/melodic-software/claude-code-plugins --skill rag-architecture
Skill Documentation
RAG Architecture
When to Use This Skill
Use this skill when:
- Designing RAG pipelines for LLM applications
- Choosing chunking and embedding strategies
- Optimizing retrieval quality and relevance
- Building knowledge-grounded AI systems
- Implementing hybrid search (dense + sparse)
- Designing multi-stage retrieval pipelines
Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval
RAG Architecture Overview
┌──────────────────────────────────────────────────────────────────────┐
│                             RAG Pipeline                             │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐    │
│  │  Ingestion   │    │   Indexing   │    │    Vector Store      │    │
│  │  Pipeline    │───▶│   Pipeline   │───▶│    (Embeddings)      │    │
│  └──────────────┘    └──────────────┘    └──────────────────────┘    │
│         ▲                   │                       │                │
│     Documents           Chunks +                 Indexed             │
│                        Embeddings                Vectors             │
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐    │
│  │    Query     │    │  Retrieval   │    │  Context Assembly    │    │
│  │  Processing  │───▶│   Engine     │───▶│  + Generation        │    │
│  └──────────────┘    └──────────────┘    └──────────────────────┘    │
│         ▲                   │                       │                │
│    User Query          Top-K Chunks            LLM Response          │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
Document Ingestion Pipeline
Document Processing Steps
Raw Documents
      │
      ▼
┌─────────────┐
│  Extract    │  ← PDF, HTML, DOCX, Markdown
│  Content    │
└─────────────┘
      │
      ▼
┌─────────────┐
│  Clean &    │  ← Remove boilerplate, normalize
│  Normalize  │
└─────────────┘
      │
      ▼
┌─────────────┐
│  Chunk      │  ← Split into retrievable units
│  Documents  │
└─────────────┘
      │
      ▼
┌─────────────┐
│  Generate   │  ← Create vector representations
│  Embeddings │
└─────────────┘
      │
      ▼
┌─────────────┐
│  Store      │  ← Persist vectors + metadata
│  in Index   │
└─────────────┘
Chunking Strategies
Strategy Comparison
| Strategy | Description | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |
Chunking Decision Tree
What type of content?
├── Code
│   └── AST-based or function-level chunking
├── Tables/Structured
│   └── Keep tables intact, chunk surrounding text
├── Long narrative
│   └── Semantic or recursive chunking
├── Short documents (<1 page)
│   └── Whole document as chunk
└── Mixed content
    └── Recursive with type-specific handlers
Chunk Overlap
Without Overlap:
[Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
                            ↑
               Information lost at boundary

With Overlap (20%):
[Chunk 1: "The quick brown fox"]
                 [Chunk 2: "brown fox jumps over"]
                            ↑
            Context preserved across boundaries

Recommended overlap: 10-20% of chunk size
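A fixed-size chunker with overlap can be sketched in a few lines of Python. The word-level splitting and the `chunk_size`/`overlap` values are illustrative only; a production pipeline would count tokens with the embedding model's tokenizer rather than splitting on whitespace.

```python
def chunk_with_overlap(words, chunk_size=8, overlap=2):
    """Split a word sequence into fixed-size chunks with overlap.

    The last `overlap` words of each chunk are repeated at the start
    of the next one, so context spanning a boundary survives intact
    in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

text = "the quick brown fox jumps over the lazy dog again and again".split()
for c in chunk_with_overlap(text, chunk_size=6, overlap=2):
    print(" ".join(c))
```

With `chunk_size=6, overlap=2`, each chunk repeats the final two words of its predecessor, matching the 20% overlap illustrated above.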
Chunk Size Trade-offs
Smaller Chunks (128-256 tokens)       Larger Chunks (512-1024 tokens)
├── More precise retrieval            ├── More context per chunk
├── Less context per chunk            ├── May include irrelevant content
├── More chunks to search             ├── Fewer chunks to search
├── Better for factoid Q&A            ├── Better for summarization
└── Higher retrieval recall           └── Higher retrieval precision
Embedding Models
Model Comparison
| Model | Dimensions | Context | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |
Embedding Selection
Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
    └── Need self-hosted/open source?
        ├── Yes → BGE-large or E5-large-v2
        └── No
            └── Need multilingual?
                ├── Yes → Cohere embed-v3
                └── No → OpenAI text-embedding-3-small
Embedding Optimization
| Technique | Description | When to Use |
|---|---|---|
| Matryoshka embeddings | Truncatable to smaller dims | Memory-constrained |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix with task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
Retrieval Strategies
Dense Retrieval (Semantic Search)
Query: "How to deploy containers"
              │
              ▼
         ┌─────────┐
         │  Embed  │
         │  Query  │
         └─────────┘
              │
              ▼
┌─────────────────────────────────┐
│    Vector Similarity Search     │
│   (Cosine, Dot Product, L2)     │
└─────────────────────────────────┘
              │
              ▼
Top-K semantically similar chunks
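Dense retrieval reduces to nearest-neighbour search over embedding vectors. The brute-force sketch below uses toy 3-dimensional vectors standing in for real embeddings (which have 768-3072 dimensions); production systems replace the linear scan with an ANN index such as HNSW or IVF.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    """Exact (brute-force) top-k search by cosine similarity.
    `doc_vecs` maps doc_id -> vector."""
    scored = [(cosine(query_vec, v), doc_id) for doc_id, v in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy "embeddings": the first axis loosely encodes deployment topics.
docs = {
    "deploy":  [0.9, 0.1, 0.0],
    "cooking": [0.0, 0.2, 0.9],
    "docker":  [0.8, 0.3, 0.1],
}
print(top_k([1.0, 0.0, 0.0], docs, k=2))  # → ['deploy', 'docker']
```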
Sparse Retrieval (BM25/TF-IDF)
Query: "Kubernetes pod deployment YAML"
              │
              ▼
        ┌──────────┐
        │ Tokenize │
        │ + Score  │
        └──────────┘
              │
              ▼
┌─────────────────────────────────┐
│          BM25 Ranking           │
│    (Term frequency × IDF)       │
└─────────────────────────────────┘
              │
              ▼
Top-K lexically matching chunks
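A minimal Okapi BM25 scorer, using the standard `k1`/`b` parameterization. The toy corpus and whitespace tokenization are illustrative only; real systems use proper analyzers (lowercasing, stemming, stop words).

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.
    `docs` maps doc_id -> token list; `query` is a token list."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs.values()) / N
    df = Counter()                      # document frequency per term
    for tokens in docs.values():
        for term in set(tokens):
            df[term] += 1
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(tokens) / avgdl))
            score += idf * norm
        scores[doc_id] = score
    return scores

docs = {
    "d1": "kubernetes pod deployment yaml example".split(),
    "d2": "pasta recipe with tomato sauce".split(),
    "d3": "deployment strategies for kubernetes clusters".split(),
}
scores = bm25_scores("kubernetes pod deployment yaml".split(), docs)
print(max(scores, key=scores.get))  # → d1 (matches the most query terms)
```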
Hybrid Search (Best of Both)
Query ──┬──▶ Dense Search  ──┬──▶ Fusion ──▶ Final Ranking
        │                    │
        └──▶ Sparse Search ──┘

Fusion Methods:
• RRF (Reciprocal Rank Fusion)
• Linear combination
• Learned reranking
Reciprocal Rank Fusion (RRF)
RRF Score = Σ 1 / (k + rank_i)
Where:
- k = constant (typically 60)
- rank_i = rank in each retrieval result
Example:
Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
Result: Doc B ranks higher (better combined relevance)
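The worked example above can be reproduced in a few lines of Python; the document ids and the filler entries (X, Y, Z) in each ranked list are made up for illustration.

```python
def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked lists with Reciprocal Rank Fusion.
    `rankings` is a list of doc-id lists, best first; ranks are
    1-based, and each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Doc A: dense rank 1, sparse rank 5. Doc B: dense rank 3, sparse rank 1.
dense  = ["A", "X", "B", "Y", "Z"]
sparse = ["B", "X", "Y", "Z", "A"]
print(rrf_fuse([dense, sparse])[:2])
```

As in the hand calculation, B's combined score (1/63 + 1/61 ≈ 0.0323) edges out A's (1/61 + 1/65 ≈ 0.0318), so B ranks first.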
Multi-Stage Retrieval
Two-Stage Pipeline
┌──────────────────────────────────────────────────────────┐
│  Stage 1: Recall (Fast, High Recall)                     │
│  • ANN search (HNSW, IVF)                                │
│  • Retrieve top-100 candidates                           │
│  • Latency: 10-50ms                                      │
└──────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│  Stage 2: Rerank (Slow, High Precision)                  │
│  • Cross-encoder or LLM reranking                        │
│  • Score top-100 → return top-10                         │
│  • Latency: 100-500ms                                    │
└──────────────────────────────────────────────────────────┘
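The two-stage shape can be sketched with injected scoring functions standing in for the real components: in practice stage 1 is an ANN vector search and stage 2 a cross-encoder or LLM. The word-overlap scorers below are toy stand-ins, not real retrieval models.

```python
def two_stage_retrieve(query, corpus, recall_k, final_k,
                       cheap_score, rerank_score):
    """Stage 1: rank the whole corpus with a cheap score and keep
    `recall_k` candidates. Stage 2: re-score only those candidates
    with an expensive function and return the top `final_k`."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:recall_k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d),
                      reverse=True)
    return reranked[:final_k]

# Toy stand-ins: raw word overlap as the cheap recall score, overlap
# normalized by document length as the "expensive" rerank score.
cheap = lambda q, d: len(set(q.split()) & set(d.split()))
expensive = lambda q, d: cheap(q, d) / (1 + len(d.split()))

corpus = [
    "deploy containers with docker",
    "deploy deploy deploy containers docker swarm kubernetes",
    "bake bread",
]
print(two_stage_retrieve("deploy containers", corpus,
                         recall_k=2, final_k=1,
                         cheap_score=cheap, rerank_score=expensive))
```

The key property is cost: the expensive scorer runs on `recall_k` documents, not the whole corpus, which is what makes cross-encoder reranking affordable.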
Reranking Options
| Reranker | Latency | Quality | Cost |
|---|---|---|---|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |
Context Assembly
Context Window Management
Context Budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)
Strategy: Maximize retrieved context quality within budget
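A greedy packer illustrating the "maximize quality within budget" strategy; the scores and token counts are assumed to come from upstream retrieval, and the greedy policy is one simple choice among several.

```python
def assemble_context(chunks, budget_tokens):
    """Greedily pack the highest-scoring chunks into a token budget.
    `chunks` is a list of (score, token_count, text) tuples; chunks
    that would overflow the remaining budget are skipped."""
    selected, used = [], 0
    for score, n_tokens, text in sorted(chunks, reverse=True):
        if used + n_tokens <= budget_tokens:
            selected.append(text)
            used += n_tokens
    return selected, used

chunks = [
    (0.9, 400, "chunk A"),
    (0.8, 500, "chunk B"),
    (0.7, 200, "chunk C"),
    (0.6, 300, "chunk D"),
]
ctx, used = assemble_context(chunks, budget_tokens=1000)
print(ctx, used)  # → ['chunk A', 'chunk B'] 900
```

Note the skip-and-continue behaviour: chunk C (200 tokens) would overflow the 1000-token budget after A and B, so it is dropped even though later chunks might still fit.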
Context Assembly Strategies
| Strategy | Description | When to Use |
|---|---|---|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |
Lost-in-the-Middle Problem
LLM Attention Pattern:
┌──────────────────────────────────────────────────────────┐
│   Beginning            Middle               End          │
│     ████                ░░░░               ████          │
│  High attention     Low attention     High attention     │
└──────────────────────────────────────────────────────────┘
Mitigation:
1. Put most relevant at beginning AND end
2. Use shorter context windows when possible
3. Use hierarchical summarization
4. Fine-tune for long-context attention
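Mitigation 1 (most relevant at beginning and end) can be implemented by alternately dealing chunks to the front and back of the context, so the weakest chunks land in the low-attention middle:

```python
def reorder_for_attention(chunks_best_first):
    """Place the most relevant chunks at the start and end of the
    context, burying the least relevant in the middle, to counter
    the lost-in-the-middle attention pattern."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["r1", "r2", "r3", "r4", "r5"]  # best first
print(reorder_for_attention(ranked))  # → ['r1', 'r3', 'r5', 'r4', 'r2']
```

The top-ranked chunk opens the context and the second-ranked closes it, matching the attention pattern sketched above.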
Advanced RAG Patterns
Query Transformation
Original Query: "Tell me about the project"
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
  ┌───────────┐   ┌────────────┐   ┌────────────┐
  │   HyDE    │   │   Query    │   │ Sub-query  │
  │   (Hypo   │   │ Expansion  │   │  Decomp.   │
  │   Doc)    │   │            │   │            │
  └───────────┘   └────────────┘   └────────────┘
        │                │                │
        ▼                ▼                ▼
  Hypothetical     "project,        "What is the
  answer to         goals,           project scope?"
  embed             timeline,       "What are the
                    deliverables"    deliverables?"
HyDE (Hypothetical Document Embeddings)
Query: "How does photosynthesis work?"
        │
        ▼
┌────────────────┐
│ LLM generates  │
│ hypothetical   │
│ answer         │
└────────────────┘
        │
        ▼
"Photosynthesis is the process by which
 plants convert sunlight into energy..."
        │
        ▼
┌────────────────┐
│ Embed the      │
│ hypothetical   │
│ document       │
└────────────────┘
        │
        ▼
Search with the hypothetical embedding
(Better matches actual documents)
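The core of HyDE is a one-line substitution: embed a generated answer rather than the query itself, since answers tend to live closer to real documents in embedding space. The sketch below injects toy stand-ins for the LLM and the embedding model; the canned "LLM" and bag-of-words "embedder" are illustrative only.

```python
def hyde_query_vector(query, generate, embed):
    """Return the vector to search with under HyDE: embed a
    generated hypothetical answer instead of the raw query.
    `generate` stands in for an LLM call, `embed` for an
    embedding model."""
    hypothetical = generate(query)
    return embed(hypothetical)

# Toy stand-ins: a canned "LLM" response and an "embedder" that
# counts occurrences of a fixed vocabulary.
vocab = ["photosynthesis", "sunlight", "energy", "plants"]
fake_llm = lambda q: "photosynthesis lets plants convert sunlight into energy"
fake_embed = lambda text: [text.split().count(w) for w in vocab]

print(hyde_query_vector("How does photosynthesis work?", fake_llm, fake_embed))
# → [1, 1, 1, 1]
```

The resulting vector then feeds the same similarity search as plain dense retrieval; only the input to the embedder changes.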
Self-RAG (Retrieval-Augmented LM with Self-Reflection)
┌──────────────────────────────────────────────────────────┐
│  1. Generate initial response                            │
│  2. Decide: Need more retrieval? (critique token)        │
│     ├── Yes → Retrieve more, regenerate                  │
│     └── No → Check factuality (isRel, isSup tokens)      │
│  3. Verify claims against sources                        │
│  4. Regenerate if needed                                 │
│  5. Return verified response                             │
└──────────────────────────────────────────────────────────┘
Agentic RAG
Query: "Compare Q3 revenue across regions"
          │
          ▼
  ┌────────────────┐
  │  Query Agent   │
  │  (Plan steps)  │
  └────────────────┘
          │
    ┌─────┼─────────┐
    ▼     ▼         ▼
┌───────┐ ┌───────┐ ┌───────┐
│Search │ │Search │ │Search │
│ EMEA  │ │ APAC  │ │ AMER  │
│ docs  │ │ docs  │ │ docs  │
└───────┘ └───────┘ └───────┘
    │     │         │
    └─────┼─────────┘
          ▼
  ┌────────────────┐
  │  Synthesize    │
  │  Comparison    │
  └────────────────┘
Evaluation Metrics
Retrieval Metrics
| Metric | Description | Target |
|---|---|---|
| Recall@K | % relevant docs in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant | >0.5 |
| NDCG | Graded relevance ranking | >0.7 |
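Recall@K and MRR from the table above are straightforward to compute; a minimal sketch (the doc ids in the example are made up):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(queries):
    """Mean Reciprocal Rank: average over queries of 1/rank of the
    first relevant result (0 when nothing relevant is retrieved).
    `queries` is a list of (retrieved_list, relevant_set) pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))  # 1 of 2 relevant found → 0.5
print(mrr([(retrieved, relevant)]))           # first hit at rank 2 → 0.5
```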
End-to-End Metrics
| Metric | Description | Target |
|---|---|---|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is retrieved context relevant? | >80% |
Evaluation Framework
┌──────────────────────────────────────────────────────────┐
│                 RAG Evaluation Pipeline                  │
├──────────────────────────────────────────────────────────┤
│  1. Query Set: Representative questions                  │
│  2. Ground Truth: Expected answers + source docs         │
│  3. Metrics:                                             │
│     • Retrieval: Recall@K, MRR, NDCG                     │
│     • Generation: Correctness, Faithfulness              │
│  4. A/B Testing: Compare configurations                  │
│  5. Error Analysis: Identify failure patterns            │
└──────────────────────────────────────────────────────────┘
Common Failure Modes
| Failure Mode | Cause | Mitigation |
|---|---|---|
| Retrieval miss | Query-doc mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context issues | Hierarchical, summarization |
| Stale data | Outdated index | Incremental updates, TTL |
Scaling Considerations
Index Scaling
| Scale | Approach |
|---|---|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded |
| >100M docs | Distributed + aggressive filtering |
Latency Budget
Typical RAG Pipeline Latency:
Query embedding:      10-50ms
Vector search:        20-100ms
Reranking:           100-300ms
LLM generation:     500-2000ms
────────────────────────────────
Total:              630-2450ms
Target p95: <3 seconds for interactive use
Related Skills
- llm-serving-patterns – LLM inference infrastructure
- vector-databases – Vector store selection and optimization
- ml-system-design – End-to-end ML pipeline design
- estimation-techniques – Capacity planning for RAG systems
Version History
- v1.0.0 (2025-12-26): Initial release – RAG architecture patterns for systems design
Last Updated
Date: 2025-12-26