# LangChain RAG Pipeline

📁 langchain-ai/langchain-skills

17 total installs · 0 weekly installs · #20885 site-wide rank

Install command:

npx skills add https://github.com/langchain-ai/langchain-skills --skill 'LangChain RAG Pipeline'

## Skill Documentation
Pipeline:

  1. Index: Load → Split → Embed → Store
  2. Retrieve: Query → Embed → Search → Return docs
  3. Generate: Docs + Query → LLM → Response

Key Components:

  • Document Loaders: Ingest data from files, web, databases
  • Text Splitters: Break documents into chunks
  • Embeddings: Convert text to vectors
  • Vector Stores: Store and search embeddings

| Vector Store | Use Case                | Persistence |
| ------------ | ----------------------- | ----------- |
| InMemory     | Testing                 | Memory only |
| FAISS        | Local, high performance | Disk        |
| Chroma       | Development             | Disk        |
| Pinecone     | Production, managed     | Cloud       |
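
The store choice is a drop-in swap: every store class exposes the same `from_documents` constructor, so moving between persistence tiers is a one-line change. A minimal sketch, assuming the `splits` and `embeddings` objects built in the pipeline below:

```python
from langchain_chroma import Chroma
from langchain_community.vectorstores import FAISS
from langchain_core.vectorstores import InMemoryVectorStore

# Identical interface, different persistence trade-offs
store = InMemoryVectorStore.from_documents(splits, embeddings)  # testing: memory only
store = FAISS.from_documents(splits, embeddings)                # local: save with save_local
store = Chroma.from_documents(                                  # development: persists to disk
    splits, embeddings, persist_directory="./chroma_db"
)
```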

---

## Complete RAG Pipeline

<python>
End-to-end RAG pipeline: load documents, split into chunks, embed, store, retrieve, and generate a response.
```python
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# 1. Load documents
docs = [
    Document(page_content="LangChain is a framework for LLM apps.", metadata={}),
    Document(page_content="RAG = Retrieval Augmented Generation.", metadata={}),
]

# 2. Split documents
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
splits = splitter.split_documents(docs)

# 3. Create embeddings and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = InMemoryVectorStore.from_documents(splits, embeddings)

# 4. Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 5. Use in RAG
model = ChatOpenAI(model="gpt-4.1")
query = "What is RAG?"
relevant_docs = retriever.invoke(query)

context = "\n\n".join([doc.page_content for doc in relevant_docs])
response = model.invoke([
    {"role": "system", "content": f"Use this context:\n\n{context}"},
    {"role": "user", "content": query},
])
```
</python>
<typescript>
End-to-end RAG pipeline: load documents, split into chunks, embed, store, retrieve, and generate a response.
```typescript
import { ChatOpenAI, OpenAIEmbeddings } from "@langchain/openai";
import { MemoryVectorStore } from "@langchain/classic/vectorstores/memory";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";

// 1. Load documents
const docs = [
  new Document({ pageContent: "LangChain is a framework for LLM apps.", metadata: {} }),
  new Document({ pageContent: "RAG = Retrieval Augmented Generation.", metadata: {} }),
];

// 2. Split documents
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 500, chunkOverlap: 50 });
const splits = await splitter.splitDocuments(docs);

// 3. Create embeddings and store
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-small" });
const vectorstore = await MemoryVectorStore.fromDocuments(splits, embeddings);

// 4. Create retriever
const retriever = vectorstore.asRetriever({ k: 4 });

// 5. Use in RAG
const model = new ChatOpenAI({ model: "gpt-4.1" });
const query = "What is RAG?";
const relevantDocs = await retriever.invoke(query);

const context = relevantDocs.map(doc => doc.pageContent).join("\n\n");
const response = await model.invoke([
  { role: "system", content: `Use this context:\n\n${context}` },
  { role: "user", content: query },
]);
```
</typescript>

---

## Document Loaders

<python>
Load a PDF file and extract each page as a separate document.
```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./document.pdf")
docs = loader.load()
print(f"Loaded {len(docs)} pages")
```

</python>
<typescript>
Load a PDF file and extract each page as a separate document.
```typescript
import { PDFLoader } from "langchain/document_loaders/fs/pdf";

const loader = new PDFLoader("./document.pdf");
const docs = await loader.load();
console.log(`Loaded ${docs.length} pages`);
```
</typescript>
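
For large files, loaders also expose a streaming variant; a minimal sketch using the base loader's `lazy_load` iterator (same `./document.pdf` path as above):

```python
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./document.pdf")

# Yields one page Document at a time instead of materializing the full list
for page in loader.lazy_load():
    print(page.metadata["page"], len(page.page_content))
```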
<python>
Fetch and parse content from a web URL into a document.
```python
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://docs.langchain.com")
docs = loader.load()
```

</python>
<typescript>
Fetch and parse content from a web URL into a document using Cheerio.
```typescript
import { CheerioWebBaseLoader } from "langchain/document_loaders/web/cheerio";

const loader = new CheerioWebBaseLoader("https://docs.langchain.com");
const docs = await loader.load();
```
</typescript>

<ex-loading-directory>
<python>
Load all text files from a directory, matching a glob pattern.
```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader(
    "path/to/documents",
    glob="**/*.txt",         # Pattern for files to load
    loader_cls=TextLoader,
)
docs = loader.load()
```

</python>
</ex-loading-directory>
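
DirectoryLoader also accepts loading knobs beyond the glob pattern; a sketch using the `show_progress` and `use_multithreading` parameters from `langchain_community` (worth verifying against your installed version):

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader(
    "path/to/documents",
    glob="**/*.txt",
    loader_cls=TextLoader,
    show_progress=True,       # progress bar while loading
    use_multithreading=True,  # parallelize file reads
)
docs = loader.load()
```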

---

## Text Splitting

<ex-text-splitting>
<python>
Split documents into chunks using RecursiveCharacterTextSplitter with configurable size and overlap.
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Characters per chunk
    chunk_overlap=200,      # Overlap for context continuity
    separators=["\n\n", "\n", " ", ""],  # Split hierarchy
)

splits = splitter.split_documents(docs)
```
</python>
</ex-text-splitting>
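
The `separators` hierarchy means the splitter prefers paragraph breaks, then line breaks, then spaces, and only falls back to splitting mid-word. A quick illustration on a plain string (sample text and sizes are made up for demonstration):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=0)

text = "First paragraph about LangChain.\n\nSecond paragraph about RAG pipelines."
for chunk in splitter.split_text(text):
    print(repr(chunk))
# Splits at the paragraph boundary first, keeping each chunk under 40 characters
```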
---

## Vector Stores

<python>
Create a persistent Chroma vector store on disk, then reload it.
```python
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(),
    persist_directory="./chroma_db",
    collection_name="my-collection",
)

# Load existing
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings(),
    collection_name="my-collection",
)
```

</python>
<typescript>
Create a Chroma vector store connected to a running Chroma server.
```typescript
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { OpenAIEmbeddings } from "@langchain/openai";

const vectorstore = await Chroma.fromDocuments(
  splits,
  new OpenAIEmbeddings(),
  { collectionName: "my-collection", url: "http://localhost:8000" }
);
```
</typescript>
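
Stores can also be updated incrementally after creation; a minimal sketch using the base vector-store API (the new document text is illustrative):

```python
from langchain_core.documents import Document

# Append new documents to the existing store; returns the assigned IDs
new_docs = [Document(page_content="LCEL composes runnables into chains.")]
ids = vectorstore.add_documents(new_docs)

# Delete by ID later if needed
vectorstore.delete(ids=ids)
```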
<python>
Create a FAISS vector store, save it to disk, and reload it.
```python
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(splits, embeddings)
vectorstore.save_local("./faiss_index")

# Load (requires allow_dangerous_deserialization)
loaded = FAISS.load_local(
    "./faiss_index",
    embeddings,
    allow_dangerous_deserialization=True,
)
```

</python>
<typescript>
Create a FAISS vector store, save it to disk, and reload it.
```typescript
import { FaissStore } from "@langchain/community/vectorstores/faiss";

const vectorstore = await FaissStore.fromDocuments(splits, embeddings);
await vectorstore.save("./faiss_index");

const loaded = await FaissStore.load("./faiss_index", embeddings);
```
</typescript>
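
Two FAISS stores built separately can be combined; a sketch using `merge_from`, assuming both were built with the same embedding model:

```python
from langchain_community.vectorstores import FAISS

store_a = FAISS.from_documents(splits[:1], embeddings)
store_b = FAISS.from_documents(splits[1:], embeddings)

# Folds store_b's vectors and documents into store_a in place
store_a.merge_from(store_b)
```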
---

## Retrieval

<python>
Perform similarity search and retrieve results with relevance scores.
```python
# With scores
results_with_score = vectorstore.similarity_search_with_score(query, k=5)
for doc, score in results_with_score:
    print(f"Score: {score}, Content: {doc.page_content}")
```

</python>
<typescript>
Perform similarity search and retrieve results with relevance scores.
```typescript
// Basic search
const results = await vectorstore.similaritySearch(query, 5);

// With scores
const resultsWithScore = await vectorstore.similaritySearchWithScore(query, 5);
for (const [doc, score] of resultsWithScore) {
  console.log(`Score: ${score}, Content: ${doc.pageContent}`);
}
```
</typescript>
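
Beyond plain similarity, the retriever interface also supports MMR (maximal marginal relevance), mentioned in the configurables list below, which re-ranks candidates for diversity; a minimal sketch:

```python
# MMR fetches fetch_k candidates, then selects k that balance
# relevance against redundancy with already-picked documents
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20},
)
diverse_docs = retriever.invoke("What is RAG?")
```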
<ex-metadata-filtering>
<python>
Restrict similarity search to documents whose metadata matches a filter.
```python
results = vectorstore.similarity_search(
    "programming",
    k=5,
    filter={"language": "python"},  # Only Python docs
)
```

</python>
</ex-metadata-filtering>
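
The same filter can be baked into a retriever so every downstream call is scoped automatically; a minimal sketch (filter syntax varies by vector store):

```python
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5, "filter": {"language": "python"}},
)
python_docs = retriever.invoke("programming")
```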

<ex-rag-with-agent>
<python>
Create an agent that uses RAG as a tool for answering questions.
```python
from langchain.agents import create_agent
from langchain.tools import tool

@tool
def search_docs(query: str) -> str:
    """Search documentation for relevant information."""
    docs = retriever.invoke(query)
    return "\n\n".join([d.page_content for d in docs])

agent = create_agent(
    model="gpt-4.1",
    tools=[search_docs],
)

result = agent.invoke({
    "messages": [{"role": "user", "content": "How do I create an agent?"}]
})
```
</python>
</ex-rag-with-agent>

---

## What You Can Configure

  • Chunk size/overlap
  • Embedding model
  • Number of results (k)
  • Metadata filters
  • Search algorithms: Similarity, MMR

## What You CANNOT Configure

  • Embedding dimensions (per model)
  • Mix embeddings from different models in same store
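
The "cannot" items follow from how embedding models work: each model emits vectors of one fixed dimensionality, so vectors from different models never line up in the same index. A quick way to check a model's dimension (assumes OpenAI credentials are configured):

```python
from langchain_openai import OpenAIEmbeddings

vec = OpenAIEmbeddings(model="text-embedding-3-small").embed_query("hello")
print(len(vec))  # 1536 for text-embedding-3-small
```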

<python>
Chunk size 500-1500 characters is typically good; chunks that are too small lose context.
```python
# CORRECT
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
```
</python>
<typescript>
Chunk size 500-1500 is typically good.
```typescript
// WRONG: Too small or too large
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 50 });

// CORRECT
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 200 });
```
</typescript>

<fix-chunk-overlap>
<python>
Use 10-20% chunk overlap so context carries across chunk boundaries.
```python
# CORRECT: 10-20% overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
```
</python>
</fix-chunk-overlap>

<fix-persist-vectorstore>
<python>
Use persistent vector store instead of in-memory to avoid data loss.
```python
# WRONG: InMemory - lost on restart
vectorstore = InMemoryVectorStore.from_documents(docs, embeddings)

# CORRECT
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")
```
</python>
<typescript>
Use a persistent vector store instead of in-memory to avoid data loss.
```typescript
// CORRECT
const vectorstore = await Chroma.fromDocuments(docs, embeddings, { collectionName: "my-collection" });
```
</typescript>
</fix-persist-vectorstore>

<fix-consistent-embeddings>
<python>
Use the same embedding model for indexing and querying.
```python
# WRONG: Different embeddings for index and query - incompatible!
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = vectorstore.as_retriever(embeddings=OpenAIEmbeddings(model="text-embedding-3-large"))

# CORRECT: Same model
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever()  # Uses same embeddings
```
</python>
</fix-consistent-embeddings>

<fix-faiss-deserialization>
<python>
Pass allow_dangerous_deserialization=True when loading a trusted, locally saved FAISS index.
```python
# CORRECT
loaded_store = FAISS.load_local("./faiss_index", embeddings, allow_dangerous_deserialization=True)
```
</python>
</fix-faiss-deserialization>

<fix-dimension-mismatch>
<python>
Ensure embedding dimensions match the vector store index dimensions.
```python
# WRONG: Index has 1536 dimensions but using 512-dim embeddings
pc.create_index(name="idx", dimension=1536, metric="cosine")
vectorstore = PineconeVectorStore.from_documents(
    docs, OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512), index=pc.Index("idx")
)  # Error: dimension mismatch!

# CORRECT: Match dimensions
embeddings = OpenAIEmbeddings()  # Default 1536
vectorstore = PineconeVectorStore.from_documents(docs, embeddings, index=pc.Index("idx"))
```
</python>
</fix-dimension-mismatch>

<fix-import-packages>
<python>
Import LangChain integrations from their dedicated packages.
```python
# CORRECT
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFLoader
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
```

</python>
</fix-import-packages>