runpod
npx skills add https://docs.runpod.io
Capabilities
RunPod enables agents to deploy, manage, and scale AI/ML workloads on cloud GPU infrastructure. Agents can provision compute resources, deploy containerized applications, manage persistent storage, and integrate with existing ML frameworks and tools. The platform supports three primary compute models: Serverless endpoints for pay-per-second inference, Pods for persistent GPU instances, and Instant Clusters for distributed multi-node training.
Skills
Serverless Endpoints
- Deploy inference endpoints: Create REST API endpoints that automatically scale from zero to hundreds of workers based on demand
- Queue-based job processing: Submit asynchronous (/run) or synchronous (/runsync) jobs with automatic queuing and result retrieval
- Stream results: Enable incremental output streaming for long-running tasks using the /stream endpoint
- Monitor job status: Check job execution state, queue position, and retrieve results via the /status endpoint
- Manage job lifecycle: Cancel in-progress jobs (/cancel), retry failed jobs (/retry), purge queues (/purge-queue)
- Health monitoring: Check endpoint operational status, including worker availability and job statistics, via /health
- Webhook notifications: Configure webhooks to receive job completion notifications
- Rate limiting: Handle dynamic rate limits that scale with worker count (base limit + per-worker limits)
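The queue-based job flow above can be sketched with the standard library alone. This builds the documented async submission (/run) and polling (/status) requests; the endpoint ID, API key, and payload fields are placeholders, and the request is not actually sent here:

```python
import json
import urllib.request

API_BASE = "https://api.runpod.ai/v2"  # Serverless API base path from the docs

def build_run_request(endpoint_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build an async job submission (POST /run); the response body carries a job id."""
    return urllib.request.Request(
        url=f"{API_BASE}/{endpoint_id}/run",
        data=json.dumps({"input": payload}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def status_url(endpoint_id: str, job_id: str) -> str:
    """URL to poll (GET) until the job reaches COMPLETED, FAILED, or TIMED_OUT."""
    return f"{API_BASE}/{endpoint_id}/status/{job_id}"

# To actually send (not executed in this sketch):
# with urllib.request.urlopen(build_run_request("my-endpoint", "my-key", {"prompt": "hi"})) as r:
#     job = json.load(r)
```

Synchronous submission works the same way against /runsync, returning the result directly instead of a job id.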
Handler Functions
- Standard handlers: Process inputs synchronously and return results on completion
- Streaming handlers: Yield results incrementally as they become available
- Async handlers: Process operations concurrently using Python async/await patterns
- Concurrent handlers: Handle multiple requests simultaneously within a single worker
- Progress updates: Send progress notifications during job execution
- Worker refresh: Clear worker state after job completion for clean execution environment
- Error handling: Capture exceptions and return custom error responses
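A standard handler is an ordinary function that receives the job event and returns a result; a minimal sketch (the prompt/output field names are illustrative, not required by the platform):

```python
def handler(event):
    """Process one job. event["input"] holds the JSON payload sent to /run."""
    user_input = event.get("input", {})
    prompt = user_input.get("prompt", "")
    if not prompt:
        # Returning a dict with an "error" key marks the job FAILED
        # with a custom error response.
        return {"error": "missing 'prompt' in input"}
    return {"output": prompt.upper()}

if __name__ == "__main__":
    # On a Serverless worker you would hand the function to the SDK instead:
    #   import runpod
    #   runpod.serverless.start({"handler": handler})
    print(handler({"input": {"prompt": "hello"}}))
```

Streaming handlers follow the same shape but use `yield` to emit partial results, and async handlers declare the function with `async def`.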
vLLM Workers
- Deploy large language models: Serve any Hugging Face model with minimal configuration
- OpenAI API compatibility: Use existing OpenAI client code by changing endpoint URL and API key
- PagedAttention optimization: Leverage memory-efficient KV cache management for higher throughput
- Continuous batching: Process multiple requests simultaneously for higher throughput and lower average latency
- Model caching: Pre-cache models to reduce cold start times
- Quantization support: Deploy quantized models (AWQ, GPTQ) for reduced memory usage
- Tensor parallelism: Distribute large models across multiple GPUs
- Environment configuration: Customize model parameters via environment variables
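OpenAI compatibility means existing client code needs only a new base URL and API key. A small sketch of the URL swap, assuming the documented OpenAI-compatible route for Serverless vLLM workers (the endpoint ID and model name are placeholders):

```python
def vllm_base_url(endpoint_id: str) -> str:
    """OpenAI-compatible base URL for a RunPod Serverless vLLM endpoint."""
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1"

# Hypothetical usage with the official OpenAI client (not executed here):
# import os
# from openai import OpenAI
# client = OpenAI(base_url=vllm_base_url("my-endpoint-id"),
#                 api_key=os.environ["RUNPOD_API_KEY"])
# resp = client.chat.completions.create(
#     model="my-model", messages=[{"role": "user", "content": "Hello"}])
```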
Pods (Persistent Instances)
- Deploy GPU instances: Launch persistent compute instances with configurable GPU types and quantities
- Container management: Deploy custom Docker containers or use pre-configured templates
- SSH access: Connect directly to Pods for development and debugging
- JupyterLab integration: Access web-based IDE for data science workflows
- IDE integration: Connect VSCode or Cursor for remote development
- Port exposure: Configure HTTP/TCP ports for web services and applications
- Storage management: Attach persistent volume disks and network volumes
- Pod templates: Use pre-configured environments for common frameworks (PyTorch, TensorFlow, etc.)
- Lifecycle control: Start, stop, restart, and reset Pods programmatically
Storage
- Network volumes: Create persistent, portable storage independent of compute resources
- Volume attachment: Attach network volumes to Pods, Serverless endpoints, and Instant Clusters
- S3-compatible API: Access and manage files via S3 API without running compute resources
- Data migration: Transfer files between network volumes using runpodctl or rsync
- Cloud sync: Synchronize Pod data with major cloud providers
- File transfer: Upload/download files between local machine and Pods
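The S3-compatible API lets standard S3 tooling read and write network volumes without a running Pod. A sketch assuming the per-datacenter endpoint pattern from the docs (the datacenter ID, volume ID, and file names are examples):

```python
def s3_endpoint(datacenter: str) -> str:
    """S3-compatible endpoint for network volumes in a given datacenter,
    e.g. "eu-ro-1". Verify the exact pattern against the current docs."""
    return f"https://s3api-{datacenter}.runpod.io"

# Hypothetical usage with boto3 (not executed here); credentials are the
# S3 API keys generated in the RunPod console, not the REST API key:
# import boto3
# s3 = boto3.client("s3", endpoint_url=s3_endpoint("eu-ro-1"),
#                   aws_access_key_id="...", aws_secret_access_key="...")
# s3.upload_file("model.safetensors", "my-volume-id", "models/model.safetensors")
```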
Instant Clusters
- Multi-node clusters: Deploy 2-8 node clusters (16-64 GPUs) with high-speed networking
- Distributed training: Run PyTorch distributed training across multiple GPUs
- Slurm clusters: Deploy managed Slurm clusters for job scheduling and resource allocation
- Axolotl fine-tuning: Fine-tune large language models across multiple GPUs
- High-speed networking: Leverage 1600-3200 Gbps inter-node connectivity
- Environment variables: Access pre-configured cluster metadata (PRIMARY_ADDR, NODE_RANK, WORLD_SIZE, etc.)
- NCCL configuration: Automatic setup for multi-node communication
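The pre-configured cluster variables map directly onto torch.distributed initialization. A sketch of that mapping; gpus_per_node and local_rank come from the launcher (e.g. torchrun), and the exact semantics of WORLD_SIZE depend on the launcher, so it is read as-is:

```python
import os

def torch_dist_args(gpus_per_node: int, local_rank: int) -> dict:
    """Map RunPod Instant Cluster env vars to torch.distributed init values."""
    node_rank = int(os.environ.get("NODE_RANK", "0"))
    return {
        "master_addr": os.environ.get("PRIMARY_ADDR", "127.0.0.1"),
        "node_rank": node_rank,
        # Global rank of this process across all nodes:
        "rank": node_rank * gpus_per_node + local_rank,
        "world_size": int(os.environ.get("WORLD_SIZE", str(gpus_per_node))),
    }
```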
REST API
- Pod management: Create, list, update, delete, start, stop, restart, and reset Pods
- Endpoint management: Deploy, configure, and manage Serverless endpoints
- Network volume management: Create and manage persistent storage volumes
- Template management: Save and reuse Pod/endpoint configurations
- Container registry auth: Securely connect to private Docker registries
- Billing queries: Access detailed billing and usage information
- OpenAPI schema: Retrieve complete API specification for integration
SDKs
- Python SDK: Full-featured SDK for endpoint management, job submission, and status polling
- JavaScript SDK: Node.js SDK for endpoint integration and job management
- Go SDK: Go SDK for endpoint operations and job handling
- GraphQL API: Query and mutate Pods, endpoints, and templates via GraphQL
CLI (runpodctl)
- Pod creation: Deploy Pods with custom configurations via command line
- Pod management: List, get details, start, stop, and remove Pods
- File transfer: Send and receive files between local machine and Pods
- SSH key management: Add and list SSH keys for Pod access
- Remote execution: Execute commands on Pods
- Configuration management: Store and manage API keys and settings
Public Endpoints
- Pre-deployed models: Access ready-to-use AI models without deployment
- OpenAI-compatible API: Use standard OpenAI client libraries
- Image generation: Stable Diffusion and other image models
- Text generation: Large language models for chat and completion
- Pay-per-use: Only pay for actual model usage
Hub Integration
- Model repository: Browse and deploy pre-configured AI models
- GitHub integration: Deploy directly from GitHub repositories with automatic rebuilds
- Community solutions: Access community-created tools and workflows
- ComfyUI-to-API: Convert ComfyUI workflows to Serverless endpoints
Workflows
Deploy a Serverless Endpoint
- Write a handler function that processes input and returns results
- Test the handler locally by running python handler.py with test input
- Create a Dockerfile packaging the handler and dependencies
- Build and push Docker image to registry (Docker Hub, GitHub Container Registry, etc.)
- Deploy endpoint via console or REST API with image URL
- Configure endpoint settings (GPU type, worker count, scaling parameters)
- Send requests to the endpoint using /run (async) or /runsync (sync)
- Monitor job status and retrieve results via /status
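The packaging step might look like the following minimal Dockerfile; the file names handler.py and requirements.txt are assumptions, and requirements.txt must include the runpod SDK:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY handler.py .
# -u keeps logs unbuffered so they appear in the endpoint's log stream
CMD ["python", "-u", "handler.py"]
```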
Deploy a vLLM Endpoint
- Navigate to Runpod Hub and find vLLM worker repository
- Click Deploy and select desired GPU type
- Configure model via environment variables (model name, max length, quantization, etc.)
- Create endpoint and wait for initialization
- Send requests using OpenAI-compatible API or Runpod native API
- Use existing OpenAI client code with only endpoint URL and API key changes
Train Models on Instant Cluster
- Deploy Instant Cluster with desired number of nodes and GPU type
- Access cluster environment variables (PRIMARY_ADDR, NODE_RANK, WORLD_SIZE, etc.)
- Configure NCCL for multi-node communication: export NCCL_SOCKET_IFNAME=ens1
- Launch the distributed training script using PyTorch DistributedDataParallel or similar
- Monitor training progress across nodes
- Retrieve trained models from network volume
Manage Persistent Development Environment
- Deploy Pod with desired GPU type and template
- Attach network volume for persistent storage
- Connect via SSH, JupyterLab, or VSCode
- Install dependencies and configure environment
- Save work to network volume
- Stop Pod when not in use (data persists)
- Restart Pod later with same environment and data
Migrate Data Between Datacenters
- Create network volumes in source and destination datacenters
- Deploy Pods in each datacenter with volumes attached
- Use runpodctl send on the source Pod to initiate the transfer
- Copy the receive command from the output
- Use runpodctl receive on the destination Pod to complete the transfer
- Verify data integrity with disk usage checks
Integration
RunPod integrates with:
- Hugging Face: Deploy any Hugging Face model directly via vLLM
- GitHub: Automatic deployment and rebuilds from GitHub repositories
- Docker registries: Pull images from Docker Hub, GitHub Container Registry, Amazon ECR
- OpenAI libraries: Drop-in replacement for OpenAI API via vLLM endpoints
- S3-compatible storage: MinIO, Backblaze B2, DigitalOcean Spaces, AWS S3
- Cloud providers: Sync Pod data with AWS, Google Cloud, Azure
- dstack: Simplified Pod orchestration for AI/ML workloads
- SkyPilot: Multi-cloud execution framework
- Mods: AI-powered command-line tool
- PyTorch: Distributed training via Instant Clusters
- TensorFlow: Multi-node training support
- Slurm: Job scheduling on Instant Clusters
- Axolotl: LLM fine-tuning framework
Context
Billing Models
- Serverless: Pay-per-second when workers are active, no idle costs
- Pods: Billed by the minute, continuous availability
- Network volumes: $0.07/GB/month for first 1TB, $0.05/GB/month beyond
- Instant Clusters: Custom pricing for enterprise workloads
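The tiered network-volume pricing above works out as follows (a helper written for illustration, using the listed rates):

```python
def network_volume_monthly_cost(gb: float) -> float:
    """Monthly cost in USD: $0.07/GB for the first 1 TB, $0.05/GB beyond."""
    first_tier = min(gb, 1000) * 0.07
    overflow = max(gb - 1000, 0) * 0.05
    return round(first_tier + overflow, 2)

# Example: a 1.5 TB volume costs 1000 * 0.07 + 500 * 0.05 = $95.00/month.
```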
GPU Pools
Available GPU types organized by memory: AMPERE_16 (A4000, RTX 4000), AMPERE_24 (L4, A5000), ADA_24 (4090), AMPERE_48 (A6000, A40), ADA_48_PRO (L40, L40S), AMPERE_80 (A100), ADA_80_PRO (H100), HOPPER_141 (H200)
Job States
Jobs progress through states: IN_QUEUE → IN_PROGRESS → COMPLETED/FAILED/TIMED_OUT. Results are retained for 30 minutes (async) or 1 minute (sync).
Cold Starts
Initial worker startup time depends on model size and dependencies. Reduce via model caching, FlashBoot, or pre-warming workers.
Rate Limits
Dynamic rate limiting scales with worker count. Base limits: 2000 req/10s for /runsync, 1000 req/10s for /run, with additional per-worker allowances.
Payload Limits
Maximum request sizes: 10 MB for /run, 20 MB for /runsync. Store large results in cloud storage and return links.
Pod Types
Secure Cloud (T3/T4 datacenters, high reliability) vs Community Cloud (peer-to-peer, competitive pricing)
Networking
Pods support TCP and HTTP connections. UDP not supported. Global networking available for cross-datacenter connectivity.
For additional documentation and navigation, see: https://docs.runpod.io/llms.txt