runpod-serverless-builder
npx skills add https://github.com/avivk5498/my-claude-code-skills --skill runpod-serverless-builder
RunPod Serverless Builder
Build end-to-end RunPod serverless endpoints optimized for extremely short cold start times.
Capabilities
Create production-ready RunPod serverless workers for:
- vLLM – High-performance LLM inference
- ComfyUI – Image/video generation with workflow support
- Custom Inference – User-provided Python inference code
Loading Strategies:
- Baked Models: Models embedded in Docker image for fastest cold starts (<5s)
- Dynamic Loading: Models loaded from network storage at runtime (shared across workers)
Quick Start
Use the interactive project generator:
python3 scripts/init_project.py
This generates a complete project with:
- Optimized Dockerfile
- RunPod handler (worker.py)
- Startup scripts (for dynamic loading)
- Configuration files
- Documentation
Project Generation Workflow
Step 1: Run the Generator
Execute the script and answer prompts:
import subprocess
skill_dir = "/path/to/runpod-serverless-builder"
subprocess.run(["python3", f"{skill_dir}/scripts/init_project.py"])
The script prompts for:
- Project name – e.g., “my-vllm-worker”
- Workload type – vLLM, ComfyUI, or Custom
- Loading strategy – Baked or Dynamic
- Model configuration – Model name, quantization, etc.
- Output directory – Where to generate files
Step 2: Customize Generated Files
The generator creates a complete project structure:
my-runpod-worker/
├── Dockerfile           # Optimized for cold starts
├── worker.py            # RunPod handler function
├── startup.sh           # Dynamic loading (if applicable)
├── requirements.txt     # Python dependencies
├── .dockerignore        # Build optimization
├── .env.example         # Environment variables
└── README.md            # Project documentation
Review and customize:
- worker.py: Modify handler logic, add custom processing
- Dockerfile: Add custom dependencies, adjust configurations
- startup.sh: Add custom initialization steps
- requirements.txt: Add additional Python packages
Step 3: Build and Deploy
# Build Docker image
docker build -t my-worker:latest .
# Tag and push to your registry
docker tag my-worker:latest registry/my-worker:latest
docker push registry/my-worker:latest
# Deploy via the RunPod dashboard:
# 1. Create template with image
# 2. Set environment variables
# 3. Create endpoint
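Once the endpoint is live, you can sanity-check it with RunPod's synchronous run API. A minimal sketch, assuming a chat-style worker, a placeholder endpoint ID, and a RUNPOD_API_KEY environment variable:
import os
import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"
headers = {"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"}
payload = {"input": {"messages": [{"role": "user", "content": "Hello!"}]}}  # shape depends on your handler

response = requests.post(url, json=payload, headers=headers, timeout=300)
response.raise_for_status()
print(response.json())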
Manual Implementation (Without Generator)
If you prefer manual implementation or need to understand the patterns:
vLLM Worker
Baked Model Approach:
- Copy Dockerfile template:
shutil.copy("assets/dockerfiles/vllm_baked.dockerfile", "Dockerfile")
- Copy worker template:
shutil.copy("assets/workers/worker_vllm.py", "worker.py")
- Build with model:
docker build -t my-vllm:latest \
--build-arg MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" \
--build-arg BASE_PATH="/models" \
.
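Baking means the model weights are downloaded at image build time so they ship inside the image. The template's exact mechanism may differ; conceptually, the build step does something like this with the MODEL_NAME and BASE_PATH build args:
# Conceptual sketch only – runs during docker build, not at request time
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",             # MODEL_NAME
    local_dir="/models/meta-llama/Llama-3.1-8B-Instruct",   # under BASE_PATH
)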
Dynamic Loading Approach:
- Copy Dockerfile and startup script:
shutil.copy("assets/dockerfiles/vllm_dynamic.dockerfile", "Dockerfile")
shutil.copy("assets/startup_scripts/startup_vllm.sh", "startup.sh")
- Set environment variables in RunPod:
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=hf_your_token
GPU_MEMORY_UTILIZATION=0.95
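For reference, a rough sketch of how a worker might consume these variables when loading from network storage. The /runpod-volume/models path is an assumption, not taken from the templates; HF_TOKEN is picked up from the environment by huggingface_hub for gated models:
import os
from vllm import LLM

llm = LLM(
    model=os.environ["MODEL_NAME"],
    gpu_memory_utilization=float(os.environ.get("GPU_MEMORY_UTILIZATION", "0.95")),
    download_dir="/runpod-volume/models",  # assumed network volume mount point
)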
ComfyUI Worker
Baked Model Approach:
- Use ComfyUI baked template:
shutil.copy("assets/dockerfiles/comfyui_baked.dockerfile", "Dockerfile")
shutil.copy("assets/workers/worker_comfyui.py", "worker.py")
- Modify Dockerfile to download models:
# Add model downloads
RUN aria2c -x 16 -s 16 https://huggingface.co/... \
-d /ComfyUI/models/checkpoints
Dynamic Loading Approach:
- Use dynamic template with startup script:
shutil.copy("assets/dockerfiles/comfyui_dynamic.dockerfile", "Dockerfile")
shutil.copy("assets/startup_scripts/startup_comfyui.sh", "startup.sh")
shutil.copy("assets/config/extra_model_paths.yaml", "extra_model_paths.yaml")
- Configure network storage paths in extra_model_paths.yaml
- Set environment variables (a sketch of how the startup logic uses them follows below):
GITHUB_PAT=ghp_token # For private repos
CUSTOM_NODES=https://github.com/org/node1.git,https://github.com/org/node2.git
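The shipped startup script handles custom-node installation; purely to illustrate what these variables drive, here is a rough Python equivalent (the clone destination and URL rewriting are assumptions):
import os
import subprocess

pat = os.environ.get("GITHUB_PAT", "")
for repo in filter(None, os.environ.get("CUSTOM_NODES", "").split(",")):
    # Inject the PAT so private repos can be cloned
    url = repo.replace("https://", f"https://{pat}@") if pat else repo
    subprocess.run(
        ["git", "clone", "--depth", "1", url],
        cwd="/ComfyUI/custom_nodes",  # assumed ComfyUI install path
        check=True,
    )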
Custom Inference Worker
- Use custom templates:
shutil.copy("assets/dockerfiles/custom_inference.dockerfile", "Dockerfile")
shutil.copy("assets/workers/worker_custom.py", "worker.py")
- Implement your inference logic in worker.py:
def initialize_model():
    # Load your model
    return your_model

def handler(job):
    model = initialize_model()
    # Your inference logic
    result = model.predict(job["input"])
    return {"result": result}
Handler Patterns
vLLM Handler
import os

from vllm import LLM, SamplingParams

MODEL_NAME = os.environ.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct")

llm = None

def initialize_model():
    global llm
    if llm is None:
        llm = LLM(model=MODEL_NAME, gpu_memory_utilization=0.95)
    return llm

def handler(job):
    model = initialize_model()
    messages = job["input"]["messages"]
    # Apply chat template
    tokenizer = model.get_tokenizer()
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Generate
    params = SamplingParams(max_tokens=512, temperature=0.7)  # adjust or expose via job["input"]
    outputs = model.generate([prompt], params)
    return {"text": outputs[0].outputs[0].text}
ComfyUI Handler
import json

def update_workflow(workflow, parameters):
    # Update prompt node
    workflow[parameters["prompt_node_id"]]["inputs"]["text"] = parameters["prompt"]
    # Update seed node
    workflow[parameters["seed_node_id"]]["inputs"]["seed"] = parameters.get("seed", 42)
    return workflow

def handler(job):
    # Load workflow JSON (exported from ComfyUI in API format)
    with open(job["input"]["workflow_path"]) as f:
        workflow = json.load(f)
    # Update with parameters
    workflow = update_workflow(workflow, job["input"])
    # Execute with the ComfyUI API
    output = execute_comfyui_workflow(workflow)
    return {"image_base64": output}
Cold Start Optimization
Key strategies (see references/cold_start_optimization.md for details):
Baked Models Strategy
- Models embedded in image
- Target: <5 second cold starts
- Best for: Small-medium models, latency-critical workloads
Dynamic Loading Strategy
- Models on network storage
- Target: <60 second cold starts
- Best for: Large models, shared across workers
Dockerfile Optimization
# Use BuildKit cache mounts
RUN --mount=type=cache,target=/root/.cache/pip pip install ...
# Order from least to most frequently changing
COPY requirements.txt /
RUN pip install -r requirements.txt
COPY worker.py /
# Combine commands to reduce layers
RUN apt-get update && apt-get install -y pkg1 pkg2 && apt-get clean
Worker Optimization
# Module-level initialization (runs once per container)
MODEL = load_model() # Cached across warm starts
def handler(job):
    # MODEL already loaded for warm starts
    return MODEL.predict(job["input"])
Reference Documentation
Consult reference files for detailed guidance:
- references/cold_start_optimization.md – Comprehensive cold start optimization strategies
- references/vllm_guide.md – vLLM configuration, API patterns, troubleshooting
- references/comfyui_guide.md – ComfyUI workflow management, custom nodes, video workflows
Load references when needed:
# For cold start optimization questions
with open("references/cold_start_optimization.md") as f:
    cold_start_guide = f.read()

# For vLLM-specific configuration
with open("references/vllm_guide.md") as f:
    vllm_guide = f.read()

# For ComfyUI workflow patterns
with open("references/comfyui_guide.md") as f:
    comfyui_guide = f.read()
Common Scenarios
Scenario 1: vLLM with Baked Model
User request: “Create a RunPod endpoint for Llama 3.1 8B with the fastest possible cold starts”
Implementation:
- Run init_project.py or copy vllm_baked templates
- Set MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct" in the Dockerfile (or via the build argument)
- Build with model baked in
- Deploy to RunPod
Scenario 2: ComfyUI with Dynamic Loading
User request: “Build a ComfyUI video generation endpoint that loads models from network storage”
Implementation:
- Run init_project.py selecting ComfyUI + Dynamic
- Configure extra_model_paths.yaml for network storage
- Implement workflow update logic in worker.py
- Deploy with CUSTOM_NODES environment variable
Scenario 3: Custom Inference with User Code
User request: “I have a custom object detection model, help me deploy it to RunPod”
Implementation:
- Run init_project.py selecting Custom
- Copy the user's model code into the inference/ directory
- Implement initialize_model() and handler() in worker.py
- Add dependencies to requirements.txt
- Build and deploy
Troubleshooting
Slow Cold Starts
- Check if models are baked vs downloaded at runtime
- Review Dockerfile layer caching
- Minimize dependencies in requirements.txt
- Consult references/cold_start_optimization.md
Worker Errors
- Check logs in RunPod dashboard
- Test worker.py locally: python3 worker.py (see the local test sketch below)
- Verify that the environment variables listed in .env.example are set
- Check model loading in initialize_model()
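As a quick local check that bypasses the RunPod queue entirely, you can import the handler and call it with a hand-built job. This assumes the serverless start call is guarded by if __name__ == "__main__": and that your handler expects a chat-style payload (placeholder below):
# local_test.py – hypothetical direct invocation of the handler
from worker import handler

job = {"input": {"messages": [{"role": "user", "content": "ping"}]}}  # placeholder payload
print(handler(job))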
Build Failures
- Verify base image compatibility
- Check requirements.txt for conflicting versions
- Test the Dockerfile locally: docker build .
Best Practices
- Always use the generator first – It implements proven patterns
- Start with baked models – Optimize for cold starts, then consider dynamic loading if needed
- Pin dependency versions – Avoid “latest” tags and unpinned packages
- Profile cold starts – Measure and optimize based on actual metrics
- Test locally before deploying – Run worker.py and docker build locally
- Consult references – Load reference docs for detailed guidance on specific topics
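To act on "Profile cold starts", even a coarse timer around model loading in worker.py gives a useful signal in the endpoint logs; a minimal sketch:
import time

t0 = time.perf_counter()
MODEL = load_model()  # your model loading call
print(f"Model load took {time.perf_counter() - t0:.1f}s")  # visible in the RunPod logs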