rag-architecture
Install command
npx skills add https://github.com/melodic-software/claude-code-plugins --skill rag-architecture
Skill Documentation
RAG Architecture
When to Use This Skill
Use this skill when:
- Designing RAG pipelines for LLM applications
- Choosing chunking and embedding strategies
- Optimizing retrieval quality and relevance
- Building knowledge-grounded AI systems
- Implementing hybrid search (dense + sparse)
- Designing multi-stage retrieval pipelines
Keywords: RAG, retrieval-augmented generation, embeddings, chunking, vector search, semantic search, context window, grounding, knowledge base, hybrid search, reranking, BM25, dense retrieval
RAG Architecture Overview
┌──────────────────────────────────────────────────────────────────────┐
│                             RAG Pipeline                             │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐    │
│  │  Ingestion   │    │   Indexing   │    │    Vector Store      │    │
│  │  Pipeline    │───▶│   Pipeline   │───▶│    (Embeddings)      │    │
│  └──────────────┘    └──────────────┘    └──────────────────────┘    │
│         ▲                   │                       │                │
│     Documents           Chunks +                 Indexed             │
│                        Embeddings                Vectors             │
│                                                                      │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────────────┐    │
│  │    Query     │    │  Retrieval   │    │  Context Assembly    │    │
│  │  Processing  │───▶│   Engine     │───▶│  + Generation        │    │
│  └──────────────┘    └──────────────┘    └──────────────────────┘    │
│         ▲                   │                       │                │
│    User Query          Top-K Chunks            LLM Response          │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘
Document Ingestion Pipeline
Document Processing Steps
Raw Documents
      │
      ▼
┌─────────────┐
│  Extract    │  ← PDF, HTML, DOCX, Markdown
│  Content    │
└─────────────┘
      │
      ▼
┌─────────────┐
│  Clean &    │  ← Remove boilerplate, normalize
│  Normalize  │
└─────────────┘
      │
      ▼
┌─────────────┐
│  Chunk      │  ← Split into retrievable units
│  Documents  │
└─────────────┘
      │
      ▼
┌─────────────┐
│  Generate   │  ← Create vector representations
│  Embeddings │
└─────────────┘
      │
      ▼
┌─────────────┐
│  Store      │  ← Persist vectors + metadata
│  in Index   │
└─────────────┘
Chunking Strategies
Strategy Comparison
| Strategy | Description | Best For | Chunk Size |
|---|---|---|---|
| Fixed-size | Split by token/character count | Simple documents | 256-512 tokens |
| Sentence-based | Split at sentence boundaries | Narrative text | Variable |
| Paragraph-based | Split at paragraph boundaries | Structured docs | Variable |
| Semantic | Split by topic/meaning | Long documents | Variable |
| Recursive | Hierarchical splitting | Mixed content | Configurable |
| Document-specific | Custom per doc type | Specialized (code, tables) | Variable |
Chunking Decision Tree
What type of content?
├── Code
│   └── AST-based or function-level chunking
├── Tables/Structured
│   └── Keep tables intact, chunk surrounding text
├── Long narrative
│   └── Semantic or recursive chunking
├── Short documents (<1 page)
│   └── Whole document as chunk
└── Mixed content
    └── Recursive with type-specific handlers
Chunk Overlap
Without Overlap:
[Chunk 1: "The quick brown"] [Chunk 2: "fox jumps over"]
                            ↑
               Information lost at boundary

With Overlap (20%):
[Chunk 1: "The quick brown fox"]
                 [Chunk 2: "brown fox jumps over"]
                            ↑
            Context preserved across boundaries

Recommended overlap: 10-20% of chunk size
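A fixed-size chunker with overlap can be sketched in a few lines of Python. The word-level splitting and the `chunk_size`/`overlap` values are illustrative only; a production pipeline would count tokens with the embedding model's tokenizer rather than splitting on whitespace.

```python
def chunk_with_overlap(words, chunk_size=8, overlap=2):
    """Split a word sequence into fixed-size chunks with overlap.

    The last `overlap` words of each chunk are repeated at the start
    of the next one, so context spanning a boundary survives intact
    in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

text = "the quick brown fox jumps over the lazy dog again and again".split()
for c in chunk_with_overlap(text, chunk_size=6, overlap=2):
    print(" ".join(c))
```

With `chunk_size=6, overlap=2`, each chunk repeats the final two words of its predecessor, matching the 20% overlap illustrated above.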
Chunk Size Trade-offs
Smaller Chunks (128-256 tokens)       Larger Chunks (512-1024 tokens)
├── More precise retrieval            ├── More context per chunk
├── Less context per chunk            ├── May include irrelevant content
├── More chunks to search             ├── Fewer chunks to search
├── Better for factoid Q&A            ├── Better for summarization
└── Higher retrieval recall           └── Higher retrieval precision
Embedding Models
Model Comparison
| Model | Dimensions | Context | Strengths |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K | High quality, expensive |
| OpenAI text-embedding-3-small | 1536 | 8K | Good quality/cost ratio |
| Cohere embed-v3 | 1024 | 512 | Multilingual, fast |
| BGE-large | 1024 | 512 | Open source, competitive |
| E5-large-v2 | 1024 | 512 | Open source, instruction-tuned |
| GTE-large | 1024 | 512 | Alibaba, good for Chinese |
| Sentence-BERT | 768 | 512 | Classic, well-understood |
Embedding Selection
Need best quality, cost OK?
├── Yes → OpenAI text-embedding-3-large
└── No
    └── Need self-hosted/open source?
        ├── Yes → BGE-large or E5-large-v2
        └── No
            └── Need multilingual?
                ├── Yes → Cohere embed-v3
                └── No → OpenAI text-embedding-3-small
Embedding Optimization
| Technique | Description | When to Use |
|---|---|---|
| Matryoshka embeddings | Truncatable to smaller dims | Memory-constrained |
| Quantized embeddings | INT8/binary embeddings | Large-scale search |
| Instruction-tuned | Prefix with task instruction | Specialized retrieval |
| Fine-tuned embeddings | Domain-specific training | Specialized domains |
Retrieval Strategies
Dense Retrieval (Semantic Search)
Query: "How to deploy containers"
              │
              ▼
         ┌─────────┐
         │  Embed  │
         │  Query  │
         └─────────┘
              │
              ▼
┌─────────────────────────────────┐
│    Vector Similarity Search     │
│   (Cosine, Dot Product, L2)     │
└─────────────────────────────────┘
              │
              ▼
Top-K semantically similar chunks
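Dense retrieval reduces to nearest-neighbour search over embedding vectors. The brute-force sketch below uses toy 3-dimensional vectors standing in for real embeddings (which have 768-3072 dimensions); production systems replace the linear scan with an ANN index such as HNSW or IVF.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    """Exact (brute-force) top-k search by cosine similarity.
    `doc_vecs` maps doc_id -> vector."""
    scored = [(cosine(query_vec, v), doc_id) for doc_id, v in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy "embeddings": the first axis loosely encodes deployment topics.
docs = {
    "deploy":  [0.9, 0.1, 0.0],
    "cooking": [0.0, 0.2, 0.9],
    "docker":  [0.8, 0.3, 0.1],
}
print(top_k([1.0, 0.0, 0.0], docs, k=2))  # → ['deploy', 'docker']
```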
Sparse Retrieval (BM25/TF-IDF)
Query: "Kubernetes pod deployment YAML"
              │
              ▼
        ┌──────────┐
        │ Tokenize │
        │ + Score  │
        └──────────┘
              │
              ▼
┌─────────────────────────────────┐
│          BM25 Ranking           │
│    (Term frequency × IDF)       │
└─────────────────────────────────┘
              │
              ▼
Top-K lexically matching chunks
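A minimal Okapi BM25 scorer, using the standard `k1`/`b` parameterization. The toy corpus and whitespace tokenization are illustrative only; real systems use proper analyzers (lowercasing, stemming, stop words).

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.
    `docs` maps doc_id -> token list; `query` is a token list."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs.values()) / N
    df = Counter()                      # document frequency per term
    for tokens in docs.values():
        for term in set(tokens):
            df[term] += 1
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(tokens) / avgdl))
            score += idf * norm
        scores[doc_id] = score
    return scores

docs = {
    "d1": "kubernetes pod deployment yaml example".split(),
    "d2": "pasta recipe with tomato sauce".split(),
    "d3": "deployment strategies for kubernetes clusters".split(),
}
scores = bm25_scores("kubernetes pod deployment yaml".split(), docs)
print(max(scores, key=scores.get))  # → d1 (matches the most query terms)
```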
Hybrid Search (Best of Both)
Query ──┬──▶ Dense Search  ──┬──▶ Fusion ──▶ Final Ranking
        │                    │
        └──▶ Sparse Search ──┘

Fusion Methods:
• RRF (Reciprocal Rank Fusion)
• Linear combination
• Learned reranking
Reciprocal Rank Fusion (RRF)
RRF Score = Σ 1 / (k + rank_i)
Where:
- k = constant (typically 60)
- rank_i = rank in each retrieval result
Example:
Doc A: Dense rank=1, Sparse rank=5
RRF(A) = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
Doc B: Dense rank=3, Sparse rank=1
RRF(B) = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
Result: Doc B ranks higher (better combined relevance)
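The worked example above can be reproduced in a few lines of Python; the document ids and the filler entries (X, Y, Z) in each ranked list are made up for illustration.

```python
def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked lists with Reciprocal Rank Fusion.
    `rankings` is a list of doc-id lists, best first; ranks are
    1-based, and each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Doc A: dense rank 1, sparse rank 5. Doc B: dense rank 3, sparse rank 1.
dense  = ["A", "X", "B", "Y", "Z"]
sparse = ["B", "X", "Y", "Z", "A"]
print(rrf_fuse([dense, sparse])[:2])
```

As in the hand calculation, B's combined score (1/63 + 1/61 ≈ 0.0323) edges out A's (1/61 + 1/65 ≈ 0.0318), so B ranks first.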
Multi-Stage Retrieval
Two-Stage Pipeline
┌──────────────────────────────────────────────────────────┐
│  Stage 1: Recall (Fast, High Recall)                     │
│  • ANN search (HNSW, IVF)                                │
│  • Retrieve top-100 candidates                           │
│  • Latency: 10-50ms                                      │
└──────────────────────────────────────────────────────────┘
                            │
                            ▼
┌──────────────────────────────────────────────────────────┐
│  Stage 2: Rerank (Slow, High Precision)                  │
│  • Cross-encoder or LLM reranking                        │
│  • Score top-100 → return top-10                         │
│  • Latency: 100-500ms                                    │
└──────────────────────────────────────────────────────────┘
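The two-stage shape can be sketched with injected scoring functions standing in for the real components: in practice stage 1 is an ANN vector search and stage 2 a cross-encoder or LLM. The word-overlap scorers below are toy stand-ins, not real retrieval models.

```python
def two_stage_retrieve(query, corpus, recall_k, final_k,
                       cheap_score, rerank_score):
    """Stage 1: rank the whole corpus with a cheap score and keep
    `recall_k` candidates. Stage 2: re-score only those candidates
    with an expensive function and return the top `final_k`."""
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d),
                        reverse=True)[:recall_k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d),
                      reverse=True)
    return reranked[:final_k]

# Toy stand-ins: raw word overlap as the cheap recall score, overlap
# normalized by document length as the "expensive" rerank score.
cheap = lambda q, d: len(set(q.split()) & set(d.split()))
expensive = lambda q, d: cheap(q, d) / (1 + len(d.split()))

corpus = [
    "deploy containers with docker",
    "deploy deploy deploy containers docker swarm kubernetes",
    "bake bread",
]
print(two_stage_retrieve("deploy containers", corpus,
                         recall_k=2, final_k=1,
                         cheap_score=cheap, rerank_score=expensive))
```

The key property is cost: the expensive scorer runs on `recall_k` documents, not the whole corpus, which is what makes cross-encoder reranking affordable.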
Reranking Options
| Reranker | Latency | Quality | Cost |
|---|---|---|---|
| Cross-encoder (local) | Medium | High | Compute |
| Cohere Rerank | Fast | High | API cost |
| LLM-based rerank | Slow | Highest | High API cost |
| BGE-reranker | Fast | Good | Compute |
Context Assembly
Context Window Management
Context Budget: 128K tokens
├── System prompt: 500 tokens (fixed)
├── Conversation history: 4K tokens (sliding window)
├── Retrieved context: 8K tokens (dynamic)
└── Generation buffer: ~115K tokens (available)
Strategy: Maximize retrieved context quality within budget
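A greedy packer illustrating the "maximize quality within budget" strategy; the scores and token counts are assumed to come from upstream retrieval, and the greedy policy is one simple choice among several.

```python
def assemble_context(chunks, budget_tokens):
    """Greedily pack the highest-scoring chunks into a token budget.
    `chunks` is a list of (score, token_count, text) tuples; chunks
    that would overflow the remaining budget are skipped."""
    selected, used = [], 0
    for score, n_tokens, text in sorted(chunks, reverse=True):
        if used + n_tokens <= budget_tokens:
            selected.append(text)
            used += n_tokens
    return selected, used

chunks = [
    (0.9, 400, "chunk A"),
    (0.8, 500, "chunk B"),
    (0.7, 200, "chunk C"),
    (0.6, 300, "chunk D"),
]
ctx, used = assemble_context(chunks, budget_tokens=1000)
print(ctx, used)  # → ['chunk A', 'chunk B'] 900
```

Note the skip-and-continue behaviour: chunk C (200 tokens) would overflow the 1000-token budget after A and B, so it is dropped even though later chunks might still fit.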
Context Assembly Strategies
| Strategy | Description | When to Use |
|---|---|---|
| Simple concatenation | Join top-K chunks | Small context, simple Q&A |
| Relevance-ordered | Most relevant first | General retrieval |
| Chronological | Time-ordered | Temporal queries |
| Hierarchical | Summary + details | Long-form generation |
| Interleaved | Mix sources | Multi-source queries |
Lost-in-the-Middle Problem
LLM Attention Pattern:
┌──────────────────────────────────────────────────────────┐
│   Beginning            Middle               End          │
│     ████                ░░░░               ████          │
│  High attention     Low attention     High attention     │
└──────────────────────────────────────────────────────────┘
Mitigation:
1. Put most relevant at beginning AND end
2. Use shorter context windows when possible
3. Use hierarchical summarization
4. Fine-tune for long-context attention
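Mitigation 1 (most relevant at beginning and end) can be implemented by alternately dealing chunks to the front and back of the context, so the weakest chunks land in the low-attention middle:

```python
def reorder_for_attention(chunks_best_first):
    """Place the most relevant chunks at the start and end of the
    context, burying the least relevant in the middle, to counter
    the lost-in-the-middle attention pattern."""
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["r1", "r2", "r3", "r4", "r5"]  # best first
print(reorder_for_attention(ranked))  # → ['r1', 'r3', 'r5', 'r4', 'r2']
```

The top-ranked chunk opens the context and the second-ranked closes it, matching the attention pattern sketched above.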
Advanced RAG Patterns
Query Transformation
Original Query: "Tell me about the project"
                         │
        ┌────────────────┼────────────────┐
        ▼                ▼                ▼
  ┌───────────┐   ┌────────────┐   ┌────────────┐
  │   HyDE    │   │   Query    │   │ Sub-query  │
  │   (Hypo   │   │ Expansion  │   │  Decomp.   │
  │   Doc)    │   │            │   │            │
  └───────────┘   └────────────┘   └────────────┘
        │                │                │
        ▼                ▼                ▼
  Hypothetical     "project,        "What is the
  answer to         goals,           project scope?"
  embed             timeline,       "What are the
                    deliverables"    deliverables?"
HyDE (Hypothetical Document Embeddings)
Query: "How does photosynthesis work?"
        │
        ▼
┌────────────────┐
│ LLM generates  │
│ hypothetical   │
│ answer         │
└────────────────┘
        │
        ▼
"Photosynthesis is the process by which
 plants convert sunlight into energy..."
        │
        ▼
┌────────────────┐
│ Embed the      │
│ hypothetical   │
│ document       │
└────────────────┘
        │
        ▼
Search with the hypothetical embedding
(Better matches actual documents)
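The core of HyDE is a one-line substitution: embed a generated answer rather than the query itself, since answers tend to live closer to real documents in embedding space. The sketch below injects toy stand-ins for the LLM and the embedding model; the canned "LLM" and bag-of-words "embedder" are illustrative only.

```python
def hyde_query_vector(query, generate, embed):
    """Return the vector to search with under HyDE: embed a
    generated hypothetical answer instead of the raw query.
    `generate` stands in for an LLM call, `embed` for an
    embedding model."""
    hypothetical = generate(query)
    return embed(hypothetical)

# Toy stand-ins: a canned "LLM" response and an "embedder" that
# counts occurrences of a fixed vocabulary.
vocab = ["photosynthesis", "sunlight", "energy", "plants"]
fake_llm = lambda q: "photosynthesis lets plants convert sunlight into energy"
fake_embed = lambda text: [text.split().count(w) for w in vocab]

print(hyde_query_vector("How does photosynthesis work?", fake_llm, fake_embed))
# → [1, 1, 1, 1]
```

The resulting vector then feeds the same similarity search as plain dense retrieval; only the input to the embedder changes.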
Self-RAG (Retrieval-Augmented LM with Self-Reflection)
┌──────────────────────────────────────────────────────────┐
│  1. Generate initial response                            │
│  2. Decide: Need more retrieval? (critique token)        │
│     ├── Yes → Retrieve more, regenerate                  │
│     └── No → Check factuality (isRel, isSup tokens)      │
│  3. Verify claims against sources                        │
│  4. Regenerate if needed                                 │
│  5. Return verified response                             │
└──────────────────────────────────────────────────────────┘
Agentic RAG
Query: "Compare Q3 revenue across regions"
          │
          ▼
  ┌────────────────┐
  │  Query Agent   │
  │  (Plan steps)  │
  └────────────────┘
          │
    ┌─────┼─────────┐
    ▼     ▼         ▼
┌───────┐ ┌───────┐ ┌───────┐
│Search │ │Search │ │Search │
│ EMEA  │ │ APAC  │ │ AMER  │
│ docs  │ │ docs  │ │ docs  │
└───────┘ └───────┘ └───────┘
    │     │         │
    └─────┼─────────┘
          ▼
  ┌────────────────┐
  │  Synthesize    │
  │  Comparison    │
  └────────────────┘
Evaluation Metrics
Retrieval Metrics
| Metric | Description | Target |
|---|---|---|
| Recall@K | % relevant docs in top-K | >80% |
| Precision@K | % of top-K that are relevant | >60% |
| MRR (Mean Reciprocal Rank) | 1/rank of first relevant | >0.5 |
| NDCG | Graded relevance ranking | >0.7 |
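Recall@K and MRR from the table above are straightforward to compute; a minimal sketch (the doc ids in the example are made up):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant documents that appear in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(queries):
    """Mean Reciprocal Rank: average over queries of 1/rank of the
    first relevant result (0 when nothing relevant is retrieved).
    `queries` is a list of (retrieved_list, relevant_set) pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

retrieved = ["d3", "d1", "d7"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, k=3))  # 1 of 2 relevant found → 0.5
print(mrr([(retrieved, relevant)]))           # first hit at rank 2 → 0.5
```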
End-to-End Metrics
| Metric | Description | Target |
|---|---|---|
| Answer correctness | Is the answer factually correct? | >90% |
| Faithfulness | Is the answer grounded in context? | >95% |
| Answer relevance | Does it answer the question? | >90% |
| Context relevance | Is retrieved context relevant? | >80% |
Evaluation Framework
┌──────────────────────────────────────────────────────────┐
│                 RAG Evaluation Pipeline                  │
├──────────────────────────────────────────────────────────┤
│  1. Query Set: Representative questions                  │
│  2. Ground Truth: Expected answers + source docs         │
│  3. Metrics:                                             │
│     • Retrieval: Recall@K, MRR, NDCG                     │
│     • Generation: Correctness, Faithfulness              │
│  4. A/B Testing: Compare configurations                  │
│  5. Error Analysis: Identify failure patterns            │
└──────────────────────────────────────────────────────────┘
Common Failure Modes
| Failure Mode | Cause | Mitigation |
|---|---|---|
| Retrieval miss | Query-doc mismatch | Hybrid search, query expansion |
| Wrong chunk | Poor chunking | Better segmentation, overlap |
| Hallucination | Poor grounding | Faithfulness training, citations |
| Lost context | Long-context issues | Hierarchical, summarization |
| Stale data | Outdated index | Incremental updates, TTL |
Scaling Considerations
Index Scaling
| Scale | Approach |
|---|---|
| <1M docs | Single node, exact search |
| 1-10M docs | Single node, HNSW |
| 10-100M docs | Distributed, sharded |
| >100M docs | Distributed + aggressive filtering |
Latency Budget
Typical RAG Pipeline Latency:
Query embedding:      10-50ms
Vector search:        20-100ms
Reranking:           100-300ms
LLM generation:     500-2000ms
────────────────────────────────
Total:              630-2450ms
Target p95: <3 seconds for interactive use
Related Skills
- llm-serving-patterns – LLM inference infrastructure
- vector-databases – Vector store selection and optimization
- ml-system-design – End-to-end ML pipeline design
- estimation-techniques – Capacity planning for RAG systems
Version History
- v1.0.0 (2025-12-26): Initial release – RAG architecture patterns for systems design
Last Updated
Date: 2025-12-26