ai-serving-apis
npx skills add https://github.com/lebsral/dspy-programming-not-prompting-lms-skills --skill ai-serving-apis
Put Your AI Behind an API
Guide the user through wrapping a DSPy program in a web API so other services (or a frontend) can call it over HTTP. Uses FastAPI for the web layer, with clean separation between DSPy logic and API code.
When you need this
- You built an AI feature and need to serve it as a web endpoint
- You’re adding AI capabilities to an existing backend
- Other services need to call your AI over HTTP
- You want to deploy your AI with Docker
Step 1: Understand the setup
Ask the user:
- What DSPy program are you serving? (classification, RAG, extraction, pipeline, etc.)
- Is it optimized? (do you have an `optimized.json` from `/ai-improving-accuracy`?)
- What endpoints do you need? (single query, batch, health check, etc.)
- Do you have an existing web framework? (FastAPI, Flask, Django — default to FastAPI)
Step 2: Project structure
Recommended layout — keep DSPy logic separate from API code:

```
project/
├── program.py        # DSPy module (already exists from /ai-kickoff)
├── server.py         # FastAPI app — routes and startup
├── models.py         # Pydantic request/response schemas
├── config.py         # Environment configuration
├── optimized.json    # Saved optimized program (if available)
├── requirements.txt
├── Dockerfile
└── .env.example
```
Step 3: Define request/response models
Use Pydantic models for all inputs and outputs. This gives you validation, documentation, and serialization for free.
```python
# models.py
from pydantic import BaseModel, Field


class QueryRequest(BaseModel):
    """Request to the AI endpoint."""
    query: str = Field(..., description="The input to process", min_length=1)
    # Optional: let callers override the model per request
    model: str | None = Field(None, description="Override the default LM")
    temperature: float | None = Field(None, ge=0, le=2, description="Override temperature")


class QueryResponse(BaseModel):
    """Response from the AI endpoint."""
    answer: str
    # Include whatever your DSPy program outputs
    # reasoning: str | None = None
    # confidence: float | None = None


class HealthResponse(BaseModel):
    status: str = "ok"
    model: str
    optimized: bool
```
Adapt fields to match your DSPy program's inputs and outputs. If your program takes `question` and returns `answer` and `reasoning`, mirror that in the schemas.
Step 4: Load the optimized program at startup
Load the DSPy program once when the server starts, not on every request. Use FastAPI’s lifespan handler:
```python
# server.py
from contextlib import asynccontextmanager

import dspy
from fastapi import FastAPI

from config import settings
from program import MyProgram


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load the DSPy program once at startup."""
    # Configure the default LM
    lm = dspy.LM(settings.model_name)
    dspy.configure(lm=lm)

    # Load the program (with optimization if available)
    app.state.program = MyProgram()
    app.state.optimized = False
    try:
        app.state.program.load(settings.program_path)
        app.state.optimized = True
        print(f"Loaded optimized program from {settings.program_path}")
    except FileNotFoundError:
        print("Running unoptimized program")

    yield  # Server runs here


app = FastAPI(title="My AI API", lifespan=lifespan)
```
Why lifespan? Loading a program and configuring an LM takes time. Doing it once at startup keeps requests fast. The `app.state` object makes the program available to all route handlers.
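Stripped of FastAPI, the lifespan pattern is just an async context manager. A minimal stdlib sketch of the load-once behavior (all names here are stand-ins, not FastAPI APIs):

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace

events = []


@asynccontextmanager
async def lifespan(state):
    events.append("load")      # expensive setup runs exactly once
    state.program = "loaded"
    yield                      # requests are served while suspended here
    events.append("shutdown")  # cleanup runs once on server exit


async def main():
    state = SimpleNamespace()  # stand-in for app.state
    async with lifespan(state):
        for _ in range(3):     # many requests reuse the same loaded program
            assert state.program == "loaded"


asyncio.run(main())
print(events)  # ['load', 'shutdown']
```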
Step 5: Create endpoints
Query endpoint
```python
# server.py (continued)
from fastapi import HTTPException

from models import HealthResponse, QueryRequest, QueryResponse


@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
    """Run the AI program on the input."""
    program = app.state.program

    # If the caller overrides the model or temperature, apply it with
    # dspy.context so the change is scoped to this request only
    if request.model or request.temperature is not None:
        lm_kwargs = {}
        if request.temperature is not None:
            lm_kwargs["temperature"] = request.temperature
        override_lm = dspy.LM(request.model or settings.model_name, **lm_kwargs)
        with dspy.context(lm=override_lm):
            result = program(query=request.query)
    else:
        result = program(query=request.query)

    return QueryResponse(answer=result.answer)
```
Health check
```python
@app.get("/health", response_model=HealthResponse)
async def health():
    return HealthResponse(
        model=settings.model_name,
        optimized=app.state.optimized,
    )
```
Batch endpoint
For processing multiple inputs at once:
```python
@app.post("/query/batch", response_model=list[QueryResponse])
async def query_batch(requests: list[QueryRequest]):
    """Process multiple inputs."""
    program = app.state.program
    results = []
    for req in requests:
        result = program(query=req.query)
        results.append(QueryResponse(answer=result.answer))
    return results
```
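The loop above processes inputs one at a time. If your DSPy program call is synchronous and thread-safe, one way to fan the batch out concurrently is `asyncio.to_thread`; a sketch, with `program` a stand-in for the real call:

```python
import asyncio


def program(query: str) -> str:
    # stand-in for a synchronous DSPy program call
    return query.upper()


async def query_batch(queries: list[str]) -> list[str]:
    # run the blocking calls concurrently in worker threads,
    # preserving input order in the results
    return await asyncio.gather(*(asyncio.to_thread(program, q) for q in queries))


results = asyncio.run(query_batch(["alpha", "beta"]))
print(results)  # ['ALPHA', 'BETA']
```

Whether this actually helps depends on your provider's rate limits; the sequential loop is a fine default.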
Step 6: Handle errors
Map DSPy errors to appropriate HTTP status codes:
```python
from dspy.primitives.assertions import DSPyAssertionError


@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
    program = app.state.program
    try:
        if request.model or request.temperature is not None:
            override_lm = dspy.LM(
                request.model or settings.model_name,
                temperature=request.temperature,
            )
            with dspy.context(lm=override_lm):
                result = program(query=request.query)
        else:
            result = program(query=request.query)
        return QueryResponse(answer=result.answer)
    except DSPyAssertionError as e:
        # AI output failed validation (from dspy.Assert)
        raise HTTPException(status_code=422, detail=f"Output validation failed: {e}")
    except Exception as e:
        error_msg = str(e).lower()
        if "rate limit" in error_msg or "429" in error_msg:
            raise HTTPException(status_code=429, detail="Rate limited by AI provider")
        if "timeout" in error_msg:
            raise HTTPException(status_code=504, detail="AI provider timed out")
        raise HTTPException(status_code=500, detail="Internal error processing request")
```
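The string matching in the final `except` block can be factored into a small, independently testable helper. A sketch (the substrings to match depend on your provider's actual error messages, so treat these as assumptions):

```python
def status_for_provider_error(exc: Exception) -> int:
    """Map a provider exception to an HTTP status code by inspecting its message."""
    msg = str(exc).lower()
    if "rate limit" in msg or "429" in msg:
        return 429  # too many requests
    if "timeout" in msg:
        return 504  # upstream timed out
    return 500      # anything unrecognized is an internal error


print(status_for_provider_error(RuntimeError("Rate limit exceeded")))  # 429
print(status_for_provider_error(TimeoutError("connection timeout")))  # 504
print(status_for_provider_error(ValueError("boom")))                  # 500
```

The route handler then becomes a one-liner: `raise HTTPException(status_code=status_for_provider_error(e), ...)`.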
Step 7: Environment configuration
Use pydantic-settings to manage configuration:
```python
# config.py
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    model_name: str = "openai/gpt-4o-mini"
    program_path: str = "optimized.json"
    api_key: str = ""  # Set via environment variable

    model_config = {"env_prefix": "AI_"}


settings = Settings()
```
```
# .env.example
AI_MODEL_NAME=openai/gpt-4o-mini
AI_PROGRAM_PATH=optimized.json
AI_API_KEY=your-api-key-here
```
Step 8: Run and deploy
Run locally
```bash
pip install dspy fastapi uvicorn pydantic-settings
uvicorn server:app --reload --port 8000
```
Visit http://localhost:8000/docs for auto-generated API docs.
Dockerfile
```dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
```
requirements.txt
```
dspy>=2.5
fastapi>=0.100
uvicorn[standard]
pydantic-settings>=2.0
```
Add provider-specific packages as needed (e.g., openai, anthropic).
Docker Compose (optional)
```yaml
# docker-compose.yml
services:
  api:
    build: .
    ports:
      - "8000:8000"
    env_file: .env
    volumes:
      - ./optimized.json:/app/optimized.json:ro
```
Key patterns
- Load once, serve many. Load the program and LM at startup via lifespan, not per request.
- Pydantic everything. Request/response models give you validation, docs, and serialization.
- `dspy.context()` for overrides. Let callers switch models or temperature without affecting other requests.
- Separate DSPy from API code. Keep `program.py` independent — the same module runs in scripts, tests, and the API.
- Map errors to HTTP codes. Assertion failures → 422, rate limits → 429, timeouts → 504.
Additional resources
- For worked examples (RAG API, classification API, streaming), see examples.md
- Use `/ai-kickoff` to scaffold a new project (add `--api` structure with Step 2b)
- Use `/ai-searching-docs` to build the RAG program to serve
- Use `/ai-monitoring` to monitor your deployed API
- Use `/ai-cutting-costs` to optimize API costs in production