runpod
npx skills add https://docs.runpod.io
Capabilities
RunPod enables agents to deploy, manage, and scale AI/ML workloads on cloud GPU infrastructure. Agents can provision compute resources, deploy containerized applications, manage persistent storage, and integrate with existing ML frameworks and tools. The platform supports three primary compute models: Serverless endpoints for pay-per-second inference, Pods for persistent GPU instances, and Instant Clusters for distributed multi-node training.
Skills
Serverless Endpoints
- Deploy inference endpoints: Create REST API endpoints that automatically scale from zero to hundreds of workers based on demand
- Queue-based job processing: Submit asynchronous (/run) or synchronous (/runsync) jobs with automatic queuing and result retrieval
- Stream results: Enable incremental output streaming for long-running tasks using the /stream endpoint
- Monitor job status: Check job execution state, queue position, and retrieve results via the /status endpoint
- Manage job lifecycle: Cancel in-progress jobs (/cancel), retry failed jobs (/retry), purge queues (/purge-queue)
- Health monitoring: Check endpoint operational status, including worker availability and job statistics, via /health
- Webhook notifications: Configure webhooks to receive job completion notifications
- Rate limiting: Handle dynamic rate limits that scale with worker count (base limit + per-worker limits)
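The queue-based job flow above can be sketched with the standard library alone. This builds the documented async submission (/run) and polling (/status) requests; the endpoint ID, API key, and payload fields are placeholders, and the request is not actually sent here:

```python
import json
import urllib.request

API_BASE = "https://api.runpod.ai/v2"  # Serverless API base path from the docs

def build_run_request(endpoint_id: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build an async job submission (POST /run); the response body carries a job id."""
    return urllib.request.Request(
        url=f"{API_BASE}/{endpoint_id}/run",
        data=json.dumps({"input": payload}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def status_url(endpoint_id: str, job_id: str) -> str:
    """URL to poll (GET) until the job reaches COMPLETED, FAILED, or TIMED_OUT."""
    return f"{API_BASE}/{endpoint_id}/status/{job_id}"

# To actually send (not executed in this sketch):
# with urllib.request.urlopen(build_run_request("my-endpoint", "my-key", {"prompt": "hi"})) as r:
#     job = json.load(r)
```

Synchronous submission works the same way against /runsync, returning the result directly instead of a job id.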
Handler Functions
- Standard handlers: Process inputs synchronously and return results on completion
- Streaming handlers: Yield results incrementally as they become available
- Async handlers: Process operations concurrently using Python async/await patterns
- Concurrent handlers: Handle multiple requests simultaneously within a single worker
- Progress updates: Send progress notifications during job execution
- Worker refresh: Clear worker state after job completion for clean execution environment
- Error handling: Capture exceptions and return custom error responses
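A standard handler is an ordinary function that receives the job event and returns a result; a minimal sketch (the prompt/output field names are illustrative, not required by the platform):

```python
def handler(event):
    """Process one job. event["input"] holds the JSON payload sent to /run."""
    user_input = event.get("input", {})
    prompt = user_input.get("prompt", "")
    if not prompt:
        # Returning a dict with an "error" key marks the job FAILED
        # with a custom error response.
        return {"error": "missing 'prompt' in input"}
    return {"output": prompt.upper()}

if __name__ == "__main__":
    # On a Serverless worker you would hand the function to the SDK instead:
    #   import runpod
    #   runpod.serverless.start({"handler": handler})
    print(handler({"input": {"prompt": "hello"}}))
```

Streaming handlers follow the same shape but use `yield` to emit partial results, and async handlers declare the function with `async def`.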
vLLM Workers
- Deploy large language models: Serve any Hugging Face model with minimal configuration
- OpenAI API compatibility: Use existing OpenAI client code by changing endpoint URL and API key
- PagedAttention optimization: Leverage memory-efficient KV cache management for higher throughput
- Continuous batching: Process multiple requests simultaneously for higher throughput and lower average latency
- Model caching: Pre-cache models to reduce cold start times
- Quantization support: Deploy quantized models (AWQ, GPTQ) for reduced memory usage
- Tensor parallelism: Distribute large models across multiple GPUs
- Environment configuration: Customize model parameters via environment variables
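OpenAI compatibility means existing client code needs only a new base URL and API key. A small sketch of the URL swap, assuming the documented OpenAI-compatible route for Serverless vLLM workers (the endpoint ID and model name are placeholders):

```python
def vllm_base_url(endpoint_id: str) -> str:
    """OpenAI-compatible base URL for a RunPod Serverless vLLM endpoint."""
    return f"https://api.runpod.ai/v2/{endpoint_id}/openai/v1"

# Hypothetical usage with the official OpenAI client (not executed here):
# import os
# from openai import OpenAI
# client = OpenAI(base_url=vllm_base_url("my-endpoint-id"),
#                 api_key=os.environ["RUNPOD_API_KEY"])
# resp = client.chat.completions.create(
#     model="my-model", messages=[{"role": "user", "content": "Hello"}])
```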
Pods (Persistent Instances)
- Deploy GPU instances: Launch persistent compute instances with configurable GPU types and quantities
- Container management: Deploy custom Docker containers or use pre-configured templates
- SSH access: Connect directly to Pods for development and debugging
- JupyterLab integration: Access web-based IDE for data science workflows
- IDE integration: Connect VSCode or Cursor for remote development
- Port exposure: Configure HTTP/TCP ports for web services and applications
- Storage management: Attach persistent volume disks and network volumes
- Pod templates: Use pre-configured environments for common frameworks (PyTorch, TensorFlow, etc.)
- Lifecycle control: Start, stop, restart, and reset Pods programmatically
Storage
- Network volumes: Create persistent, portable storage independent of compute resources
- Volume attachment: Attach network volumes to Pods, Serverless endpoints, and Instant Clusters
- S3-compatible API: Access and manage files via S3 API without running compute resources
- Data migration: Transfer files between network volumes using runpodctl or rsync
- Cloud sync: Synchronize Pod data with major cloud providers
- File transfer: Upload/download files between local machine and Pods
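The S3-compatible API lets standard S3 tooling read and write network volumes without a running Pod. A sketch assuming the per-datacenter endpoint pattern from the docs (the datacenter ID, volume ID, and file names are examples):

```python
def s3_endpoint(datacenter: str) -> str:
    """S3-compatible endpoint for network volumes in a given datacenter,
    e.g. "eu-ro-1". Verify the exact pattern against the current docs."""
    return f"https://s3api-{datacenter}.runpod.io"

# Hypothetical usage with boto3 (not executed here); credentials are the
# S3 API keys generated in the RunPod console, not the REST API key:
# import boto3
# s3 = boto3.client("s3", endpoint_url=s3_endpoint("eu-ro-1"),
#                   aws_access_key_id="...", aws_secret_access_key="...")
# s3.upload_file("model.safetensors", "my-volume-id", "models/model.safetensors")
```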
Instant Clusters
- Multi-node clusters: Deploy 2-8 node clusters (16-64 GPUs) with high-speed networking
- Distributed training: Run PyTorch distributed training across multiple GPUs
- Slurm clusters: Deploy managed Slurm clusters for job scheduling and resource allocation
- Axolotl fine-tuning: Fine-tune large language models across multiple GPUs
- High-speed networking: Leverage 1600-3200 Gbps inter-node connectivity
- Environment variables: Access pre-configured cluster metadata (PRIMARY_ADDR, NODE_RANK, WORLD_SIZE, etc.)
- NCCL configuration: Automatic setup for multi-node communication
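The pre-configured cluster variables map directly onto torch.distributed initialization. A sketch of that mapping; gpus_per_node and local_rank come from the launcher (e.g. torchrun), and the exact semantics of WORLD_SIZE depend on the launcher, so it is read as-is:

```python
import os

def torch_dist_args(gpus_per_node: int, local_rank: int) -> dict:
    """Map RunPod Instant Cluster env vars to torch.distributed init values."""
    node_rank = int(os.environ.get("NODE_RANK", "0"))
    return {
        "master_addr": os.environ.get("PRIMARY_ADDR", "127.0.0.1"),
        "node_rank": node_rank,
        # Global rank of this process across all nodes:
        "rank": node_rank * gpus_per_node + local_rank,
        "world_size": int(os.environ.get("WORLD_SIZE", str(gpus_per_node))),
    }
```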
REST API
- Pod management: Create, list, update, delete, start, stop, restart, and reset Pods
- Endpoint management: Deploy, configure, and manage Serverless endpoints
- Network volume management: Create and manage persistent storage volumes
- Template management: Save and reuse Pod/endpoint configurations
- Container registry auth: Securely connect to private Docker registries
- Billing queries: Access detailed billing and usage information
- OpenAPI schema: Retrieve complete API specification for integration
SDKs
- Python SDK: Full-featured SDK for endpoint management, job submission, and status polling
- JavaScript SDK: Node.js SDK for endpoint integration and job management
- Go SDK: Go SDK for endpoint operations and job handling
- GraphQL API: Query and mutate Pods, endpoints, and templates via GraphQL
CLI (runpodctl)
- Pod creation: Deploy Pods with custom configurations via command line
- Pod management: List, get details, start, stop, and remove Pods
- File transfer: Send and receive files between local machine and Pods
- SSH key management: Add and list SSH keys for Pod access
- Remote execution: Execute commands on Pods
- Configuration management: Store and manage API keys and settings
Public Endpoints
- Pre-deployed models: Access ready-to-use AI models without deployment
- OpenAI-compatible API: Use standard OpenAI client libraries
- Image generation: Stable Diffusion and other image models
- Text generation: Large language models for chat and completion
- Pay-per-use: Only pay for actual model usage
Hub Integration
- Model repository: Browse and deploy pre-configured AI models
- GitHub integration: Deploy directly from GitHub repositories with automatic rebuilds
- Community solutions: Access community-created tools and workflows
- ComfyUI-to-API: Convert ComfyUI workflows to Serverless endpoints
Workflows
Deploy a Serverless Endpoint
- Write a handler function that processes input and returns results
- Test the handler locally by running python handler.py with test input
- Create a Dockerfile packaging the handler and dependencies
- Build and push Docker image to registry (Docker Hub, GitHub Container Registry, etc.)
- Deploy endpoint via console or REST API with image URL
- Configure endpoint settings (GPU type, worker count, scaling parameters)
- Send requests to the endpoint using /run (async) or /runsync (sync)
- Monitor job status and retrieve results via /status
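The packaging step might look like the following minimal Dockerfile; the file names handler.py and requirements.txt are assumptions, and requirements.txt must include the runpod SDK:

```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY handler.py .
# -u keeps logs unbuffered so they appear in the endpoint's log stream
CMD ["python", "-u", "handler.py"]
```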
Deploy a vLLM Endpoint
- Navigate to Runpod Hub and find vLLM worker repository
- Click Deploy and select desired GPU type
- Configure model via environment variables (model name, max length, quantization, etc.)
- Create endpoint and wait for initialization
- Send requests using OpenAI-compatible API or Runpod native API
- Use existing OpenAI client code with only endpoint URL and API key changes
Train Models on Instant Cluster
- Deploy Instant Cluster with desired number of nodes and GPU type
- Access cluster environment variables (PRIMARY_ADDR, NODE_RANK, WORLD_SIZE, etc.)
- Configure NCCL for multi-node communication: export NCCL_SOCKET_IFNAME=ens1
- Launch the distributed training script using PyTorch DistributedDataParallel or similar
- Monitor training progress across nodes
- Retrieve trained models from network volume
Manage Persistent Development Environment
- Deploy Pod with desired GPU type and template
- Attach network volume for persistent storage
- Connect via SSH, JupyterLab, or VSCode
- Install dependencies and configure environment
- Save work to network volume
- Stop Pod when not in use (data persists)
- Restart Pod later with same environment and data
Migrate Data Between Datacenters
- Create network volumes in source and destination datacenters
- Deploy Pods in each datacenter with volumes attached
- Use runpodctl send on the source Pod to initiate the transfer
- Copy the receive command from the output
- Use runpodctl receive on the destination Pod to complete the transfer
- Verify data integrity with disk usage checks
Integration
RunPod integrates with:
- Hugging Face: Deploy any Hugging Face model directly via vLLM
- GitHub: Automatic deployment and rebuilds from GitHub repositories
- Docker registries: Pull images from Docker Hub, GitHub Container Registry, Amazon ECR
- OpenAI libraries: Drop-in replacement for OpenAI API via vLLM endpoints
- S3-compatible storage: MinIO, Backblaze B2, DigitalOcean Spaces, AWS S3
- Cloud providers: Sync Pod data with AWS, Google Cloud, Azure
- dstack: Simplified Pod orchestration for AI/ML workloads
- SkyPilot: Multi-cloud execution framework
- Mods: AI-powered command-line tool
- PyTorch: Distributed training via Instant Clusters
- TensorFlow: Multi-node training support
- Slurm: Job scheduling on Instant Clusters
- Axolotl: LLM fine-tuning framework
Context
Billing Models
- Serverless: Pay-per-second when workers are active, no idle costs
- Pods: Billed by the minute, continuous availability
- Network volumes: $0.07/GB/month for first 1TB, $0.05/GB/month beyond
- Instant Clusters: Custom pricing for enterprise workloads
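The tiered network-volume pricing above works out as follows (a helper written for illustration, using the listed rates):

```python
def network_volume_monthly_cost(gb: float) -> float:
    """Monthly cost in USD: $0.07/GB for the first 1 TB, $0.05/GB beyond."""
    first_tier = min(gb, 1000) * 0.07
    overflow = max(gb - 1000, 0) * 0.05
    return round(first_tier + overflow, 2)

# Example: a 1.5 TB volume costs 1000 * 0.07 + 500 * 0.05 = $95.00/month.
```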
GPU Pools
Available GPU types organized by memory: AMPERE_16 (A4000, RTX 4000), AMPERE_24 (L4, A5000), ADA_24 (4090), AMPERE_48 (A6000, A40), ADA_48_PRO (L40, L40S), AMPERE_80 (A100), ADA_80_PRO (H100), HOPPER_141 (H200)
Job States
Jobs progress through states: IN_QUEUE → IN_PROGRESS → COMPLETED/FAILED/TIMED_OUT. Results are retained for 30 minutes (async) or 1 minute (sync).
Cold Starts
Initial worker startup time depends on model size and dependencies. Reduce via model caching, FlashBoot, or pre-warming workers.
Rate Limits
Dynamic rate limiting scales with worker count. Base limits: 2000 req/10s for /runsync, 1000 req/10s for /run, with additional per-worker allowances.
Payload Limits
Maximum request sizes: 10 MB for /run, 20 MB for /runsync. Store large results in cloud storage and return links.
Pod Types
Secure Cloud (T3/T4 datacenters, high reliability) vs Community Cloud (peer-to-peer, competitive pricing)
Networking
Pods support TCP and HTTP connections. UDP not supported. Global networking available for cross-datacenter connectivity.
For additional documentation and navigation, see: https://docs.runpod.io/llms.txt