Install the agent skill:

`npx skills add https://visionagents.ai`
Capabilities
Vision Agents enables developers to build and deploy intelligent real-time voice and video applications: voice assistants, video AI systems, phone bots, and knowledge-powered applications, all on a modular architecture that supports 25+ AI provider integrations. The framework provides two operational modes: realtime speech-to-speech models for the lowest latency, or custom pipelines combining any STT, LLM, and TTS providers.
Skills
Voice Agent Development
- Realtime Speech-to-Speech: Build voice agents using native realtime models (OpenAI Realtime, Gemini Live, AWS Nova, Qwen Omni) that handle speech recognition, response generation, and synthesis natively via WebRTC or WebSocket
- Custom Voice Pipelines: Mix and match STT providers (Deepgram, Fast-Whisper, Fish, Wizper), LLMs (OpenAI, Gemini, Anthropic, xAI, OpenRouter, HuggingFace), and TTS services (ElevenLabs, Cartesia, Deepgram, Pocket, AWS Polly)
- Turn Detection: Implement conversation turn management with built-in providers (Deepgram, Vogent, Smart Turn) or use realtime models with native turn detection
- Voice Configuration: Customize voice characteristics, response latency, and audio quality across all supported providers
Video Agent Development
- Realtime Video Streaming: Stream video directly to models with native vision support (OpenAI Realtime, Gemini Live) at configurable FPS rates
- Vision Language Models (VLMs): Use VLMs for video understanding and analysis with automatic frame buffering (NVIDIA Cosmos, HuggingFace, OpenRouter, Moondream)
- Video Processing: Implement computer vision tasks with YOLO pose detection, object detection, segmentation, and custom processors
- Frame Analysis: Process video frames at independent FPS rates with custom ML models and publish transformed video back to calls
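The independent-FPS behavior above amounts to frame decimation. The function below is an illustrative stand-in (not a Vision Agents API) showing which source frames a slower processor would receive:

```python
def frames_to_forward(source_fps: int, target_fps: int, n_frames: int) -> list[int]:
    """Indices of source frames a processor running at target_fps
    receives from a source_fps stream (simple decimation)."""
    if target_fps >= source_fps:
        return list(range(n_frames))  # no decimation needed
    step = source_fps / target_fps    # e.g. 30 / 3 -> every 10th frame
    picked, next_pick = [], 0.0
    for i in range(n_frames):
        if i >= next_pick:
            picked.append(i)
            next_pick += step
    return picked

# One second of 30 FPS video feeding a 3 FPS processor:
print(frames_to_forward(30, 3, 30))  # -> [0, 10, 20]
```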
Function Calling & Tool Integration
- Function Registration: Register Python functions with the `@llm.register_function()` decorator for automatic LLM invocation with type-safe parameters
- Model Context Protocol (MCP): Connect to local or remote MCP servers for external tool access (GitHub, weather, databases, etc.)
- Multi-Round Tool Calling: Support multiple tool-calling rounds where models can call additional tools based on previous results
- Tool Events: Monitor tool execution with `ToolStartEvent` and `ToolEndEvent` for debugging and metrics
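As a rough sketch of how decorator-based registration with type-safe parameters can work, the stand-in `ToolRegistry` below derives a parameter spec from Python type hints. The real `@llm.register_function()` implementation belongs to Vision Agents and will differ; everything here is illustrative:

```python
import inspect
from typing import get_type_hints

class ToolRegistry:
    """Illustrative stand-in for an LLM tool registry."""
    def __init__(self):
        self.tools = {}

    def register_function(self):
        def decorator(fn):
            hints = get_type_hints(fn)
            hints.pop("return", None)
            # Derive a simple, type-safe parameter spec from the hints.
            self.tools[fn.__name__] = {
                "doc": inspect.getdoc(fn) or "",
                "params": {name: t.__name__ for name, t in hints.items()},
                "fn": fn,
            }
            return fn
        return decorator

    def call(self, name: str, **kwargs):
        # In a real framework the LLM chooses the tool and arguments.
        return self.tools[name]["fn"](**kwargs)

llm = ToolRegistry()

@llm.register_function()
def get_weather(city: str, unit: str) -> str:
    """Return a canned weather report."""
    return f"Weather in {city}: 21 degrees ({unit})"

print(llm.tools["get_weather"]["params"])  # -> {'city': 'str', 'unit': 'str'}
print(llm.call("get_weather", city="Paris", unit="celsius"))
```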
Knowledge & RAG
- Gemini File Search: Automatic document chunking, embedding, and retrieval with content deduplication via SHA-256 hashing
- TurboPuffer Integration: Hybrid search combining vector (semantic) and BM25 (keyword) search with Reciprocal Rank Fusion
- Custom RAG Pipelines: Build document gathering, parsing, chunking, embedding, and reranking workflows
- Knowledge Base Registration: Register search functions with LLMs for knowledge-powered conversations
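Reciprocal Rank Fusion itself is small enough to sketch. This generic implementation (with k=60, the constant from the original RRF paper, not necessarily what TurboPuffer uses) merges a vector-search ranking with a BM25 ranking:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum of 1 / (k + rank)
    over every ranked list the doc appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic ranking
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # keyword ranking
print(rrf_merge([vector_hits, bm25_hits]))
# -> ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents ranked well by both searches (like `doc_b`) rise to the top even when neither list puts them first.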
Phone Integration
- Twilio Integration: Connect agents to phone calls via Twilio with inbound and outbound calling support
- WebSocket Audio Streaming: Bidirectional audio streaming between Twilio and agents via WebSocket
- Call Management: Create call registries, manage tokens, and handle webhook validation for incoming calls
- Phone + RAG: Combine phone calling with knowledge bases for customer support and information retrieval
Event System & Monitoring
- Event Subscription: Subscribe to events using the `@agent.events.subscribe` decorator with type hints for specific event types
- Core Events: Monitor audio reception, transcription (STT), LLM responses, turn detection, participant joins/leaves, and call lifecycle
- Component Events: Track events from STT, TTS, LLM, turn detection, and custom processors
- Error Events: Handle recoverable and non-recoverable errors from each component with retry information
- Custom Events: Emit custom events from processors and subscribe to them across the application
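A minimal, generic event bus shows how type-hint-driven subscription can dispatch events. This is an illustrative sketch, not the Vision Agents event system:

```python
import inspect
from collections import defaultdict
from dataclasses import dataclass

class EventBus:
    """Generic sketch: handlers keyed by their parameter's type hint."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, fn):
        # The first parameter's annotation decides which events fn gets.
        first = next(iter(inspect.signature(fn).parameters.values()))
        self.handlers[first.annotation].append(fn)
        return fn

    def emit(self, event):
        for fn in self.handlers[type(event)]:
            fn(event)

@dataclass
class TranscriptEvent:
    text: str

bus = EventBus()
seen = []

@bus.subscribe
def on_transcript(event: TranscriptEvent):
    seen.append(event.text)

bus.emit(TranscriptEvent("hello"))
print(seen)  # -> ['hello']
```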
Agent Orchestration
- Agent Class: Central orchestrator managing conversation flow, audio/video processing, response coordination, and external tool integration
- Lifecycle Management: Control agent joining calls, finishing gracefully, and resource cleanup
- Simple Response: Send text prompts to LLMs for processing and response generation
- Participant Management: Track participant joins/leaves and customize agent behavior based on call composition
Server & Deployment
- HTTP Server: Run agents as a production HTTP server with session management and concurrent agent support
- Session Management: Create, monitor, and close agent sessions via REST API with metrics endpoints
- Health Checks: Liveness (`/health`) and readiness (`/ready`) probes for Kubernetes integration
- CORS Configuration: Customize allowed origins, methods, headers, and credentials for cross-origin requests
- Authentication & Permissions: Implement custom user identification and permission callbacks for session access control
- Session Limits: Control maximum concurrent sessions, sessions per call, session duration, and idle timeouts
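The limit-enforcement logic can be illustrated with a small stand-in; the class and method names below are hypothetical, not framework APIs:

```python
class SessionLimiter:
    """Hypothetical sketch of concurrent-session and idle-timeout limits."""
    def __init__(self, max_sessions: int, idle_timeout: float):
        self.max_sessions = max_sessions
        self.idle_timeout = idle_timeout
        self.last_active: dict[str, float] = {}

    def reap(self, now: float):
        # Drop sessions that have been idle longer than the timeout.
        for sid in [s for s, t in self.last_active.items()
                    if now - t > self.idle_timeout]:
            del self.last_active[sid]

    def open(self, session_id: str, now: float) -> bool:
        self.reap(now)
        if len(self.last_active) >= self.max_sessions:
            return False  # reject: server at capacity
        self.last_active[session_id] = now
        return True

limiter = SessionLimiter(max_sessions=2, idle_timeout=60.0)
print(limiter.open("a", now=0.0))    # True
print(limiter.open("b", now=1.0))    # True
print(limiter.open("c", now=2.0))    # False: at capacity
print(limiter.open("c", now=120.0))  # True: "a" and "b" idled out
```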
Deployment & Scaling
- Docker Support: CPU and GPU Dockerfiles with Python 3.13 slim base and PyTorch CUDA variants
- Kubernetes Ready: Health probes, resource limits, rolling updates, and horizontal scaling support
- Environment Configuration: Load API keys from `.env` files with automatic provider registration
- Metrics Export: Prometheus metrics integration for monitoring latency, token usage, and error rates
- Session Metrics: Per-session performance metrics including LLM latency, STT/TTS latency, and token counts
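Per-session latency metrics typically reduce to percentile summaries; a minimal nearest-rank percentile over recorded samples (illustrative only, with hypothetical latency values):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of recorded latency samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical per-session LLM latency samples in milliseconds.
llm_latencies_ms = [120, 95, 310, 150, 88, 102, 450, 130, 99, 140]
print(percentile(llm_latencies_ms, 50))  # -> 120
print(percentile(llm_latencies_ms, 95))  # -> 450
```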
Video Debugging
- Local Video Override: Test video processing without live cameras using local video files
- CLI Override: Pass `--video-track-override=/path/to/video.mp4` when running agents
- Programmatic Override: Set the video override path via `agent.set_video_track_override_path()`
- Looping Playback: Local video files play in a loop at 30 FPS for reproducible testing
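The looping behavior amounts to modular arithmetic over elapsed time. A sketch (not framework code) of which frame a 30 FPS looping clip would show:

```python
def loop_frame_index(elapsed_s: float, total_frames: int, fps: int = 30) -> int:
    """Which frame a looping clip shows after elapsed_s seconds at fps."""
    return int(elapsed_s * fps) % total_frames

# A 2-second clip (60 frames at 30 FPS), looping:
print(loop_frame_index(0.5, 60))  # -> 15
print(loop_frame_index(2.0, 60))  # -> 0 (wrapped back to the start)
print(loop_frame_index(3.5, 60))  # -> 45 (second loop)
```

Because the mapping depends only on elapsed time and clip length, repeated test runs see the same frames, which is what makes local-file playback reproducible.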
Workflows
Building a Voice Agent
- Create an Agent instance with `edge=getstream.Edge()` for transport
- Configure LLM: Choose a realtime model (e.g., `gemini.Realtime()`) or a custom pipeline with `llm=gemini.LLM()`, `stt=deepgram.STT()`, `tts=elevenlabs.TTS()`
- Set system instructions with `instructions="Your agent behavior"`
- Register functions with `@llm.register_function()` for tool calling
- Create a call with `call = await agent.create_call(call_type, call_id)`
- Join the call with `async with agent.join(call):`
- Send an initial response with `await agent.simple_response("greeting")`
- Wait for the call to end with `await agent.finish()`
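The control flow of the steps above can be mirrored with stub classes. Everything below is a stand-in written for illustration, not the real Agent API; only the shape (create a call, join as an async context manager, respond, finish) follows the workflow:

```python
import asyncio
from contextlib import asynccontextmanager

class StubAgent:
    """Stand-in mirroring the create_call/join/simple_response/finish
    shape of the workflow; not the real Vision Agents Agent class."""
    def __init__(self):
        self.log = []

    async def create_call(self, call_type: str, call_id: str):
        return (call_type, call_id)

    @asynccontextmanager
    async def join(self, call):
        self.log.append(f"joined {call[1]}")
        try:
            yield self
        finally:
            self.log.append("left")  # leave the call on context exit

    async def simple_response(self, text: str):
        self.log.append(f"said: {text}")

    async def finish(self):
        self.log.append("finished")  # wait for the call to wrap up

async def main():
    agent = StubAgent()
    call = await agent.create_call("default", "demo-call")
    async with agent.join(call):
        await agent.simple_response("Hello!")
        await agent.finish()
    return agent.log

print(asyncio.run(main()))
# -> ['joined demo-call', 'said: Hello!', 'finished', 'left']
```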
Building a Video Agent with Processing
- Create an Agent with a video-capable LLM (e.g., `gemini.Realtime(fps=3)`)
- Add video processors: `processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")]`
- Subscribe to detection events: `@agent.events.subscribe` on `async def on_detection(event: VideoProcessorDetectionEvent):`
- Join the call and process video frames automatically
- LLM receives annotated frames for analysis and response
Implementing RAG with Agents
- Create a RAG store: `store = gemini.GeminiFilesearchRAG(name="knowledge-base")`
- Populate it with documents: `await store.add_directory("./knowledge")`
- Register a search function: `@llm.register_function()` on `async def search_knowledge(query: str):`
- Use TurboPuffer for hybrid search: `rag = turbopuffer.TurboPufferRAG(namespace="my-knowledge")`
- The agent automatically calls search when relevant to the conversation
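The content deduplication mentioned under Knowledge & RAG can be sketched with stdlib hashing; this is a generic example of SHA-256 dedup during ingestion, not the framework's implementation:

```python
import hashlib

def dedup_chunks(chunks: list[str]) -> list[str]:
    """Keep only chunks whose SHA-256 content hash is new."""
    seen: set[str] = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept

chunks = ["Refund policy: 30 days.",
          "Shipping takes 3-5 days.",
          "Refund policy: 30 days."]  # duplicated across source files
print(dedup_chunks(chunks))  # the duplicate is ingested only once
```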
Deploying Agent as HTTP Server
- Define a `create_agent()` factory function returning a configured Agent
- Define a `join_call()` function specifying agent behavior when joining calls
- Create an `AgentLauncher(create_agent=create_agent, join_call=join_call)`
- Wrap it with `Runner(launcher)` for CLI support
- Run with `uv run agent.py serve --host 0.0.0.0 --port 8000`
- Create sessions via `POST /sessions` with `call_type` and `call_id`
- Monitor with `GET /sessions/{session_id}/metrics`
Integrating Phone Calling
- Set up a Twilio account with a phone number and ngrok for local development
- Create a `TwilioCallRegistry` to manage pending calls
- Implement a webhook handler at `/twilio/voice` to validate the Twilio signature
- Return TwiML with a media stream URL for bidirectional audio
- Use `attach_phone_to_call()` to connect phone audio to the Stream call
- For outbound calls, use the Twilio REST API with media stream TwiML
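Webhook signature validation follows Twilio's documented scheme: HMAC-SHA1 over the full URL plus the sorted POST parameters, base64-encoded. The sketch below is generic (not framework code) and all values are hypothetical:

```python
import base64
import hashlib
import hmac

def compute_twilio_signature(auth_token: str, url: str, params: dict) -> str:
    # Twilio's documented scheme: append each POST parameter (sorted by
    # key) as key+value to the full webhook URL, HMAC-SHA1 the result
    # with the account's auth token, and base64-encode the digest.
    payload = url + "".join(k + v for k, v in sorted(params.items()))
    digest = hmac.new(auth_token.encode(), payload.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

def validate_signature(auth_token: str, url: str, params: dict,
                       signature_header: str) -> bool:
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, signature_header)

# Hypothetical values for illustration only.
token = "test_auth_token"
url = "https://example.ngrok.io/twilio/voice"
params = {"CallSid": "CA123", "From": "+15550001111"}

sig = compute_twilio_signature(token, url, params)
print(validate_signature(token, url, params, sig))        # True
print(validate_signature(token, url, params, sig + "x"))  # False
```

Using `hmac.compare_digest` rather than `==` avoids leaking timing information about how much of the signature matched.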
Monitoring with Events
- Subscribe to participant events: `@agent.events.subscribe` on `async def on_join(event: CallSessionParticipantJoinedEvent):`
- Track transcription: `@agent.events.subscribe` on `async def on_transcript(event: STTTranscriptEvent):`
- Monitor LLM responses: `@agent.events.subscribe` on `async def on_response(event: LLMResponseCompletedEvent):`
- Handle tool execution: `@agent.events.subscribe` on `async def on_tool_end(event: ToolEndEvent):`
- Catch errors: `@agent.events.subscribe` on `async def on_error(event: STTErrorEvent):`
Integration
Vision Agents integrates with 25+ AI providers through a plugin architecture:
LLM Providers: OpenAI (GPT-4, GPT-Realtime), Google Gemini, Anthropic Claude, xAI Grok, OpenRouter, HuggingFace, AWS Bedrock
Realtime Models: OpenAI Realtime (WebRTC), Gemini Live (WebSocket), AWS Nova, Qwen Omni
Speech Services: Deepgram (STT/TTS), ElevenLabs (TTS), Cartesia (TTS), AWS Polly (TTS), Fast-Whisper (STT), Fish (STT), Wizper (STT), Pocket (TTS)
Vision Models: NVIDIA Cosmos, HuggingFace VLMs, Moondream, Roboflow, Ultralytics YOLO
External Tools: Model Context Protocol (MCP) servers for GitHub, weather, databases, and custom services
Transport: Stream’s global edge network (default) or custom transport layers
Deployment: Docker, Kubernetes, FastAPI, Prometheus metrics
Phone: Twilio for inbound/outbound calling
Context
Vision Agents follows a “bring your own keys” (BYOK) model where developers provide their own API credentials for each service. The framework is transport-agnostic, defaulting to Stream’s low-latency edge network but supporting custom transports.
The architecture is modular and event-driven, allowing independent configuration of each component (LLM, STT, TTS, VAD, processors). Realtime models offer the lowest latency for voice interactions by handling speech-to-speech natively, while custom pipelines provide flexibility to mix providers and customize behavior.
Video processing uses a shared VideoForwarder that distributes frames to multiple processors at independent FPS rates, enabling efficient multi-model inference. Processors can be chained for sequential processing (detection → analysis → response).
The framework supports both development (console mode with local video files) and production (HTTP server with session management, metrics, and scaling). Session limits and resource management prevent runaway costs and server exhaustion. All components emit events for monitoring, debugging, and building reactive workflows.
Turn detection can be built-in (realtime models) or external (Deepgram, Vogent, Smart Turn), with configurable sensitivity and latency. Function calling is automatic: LLMs invoke registered functions when relevant, with support for multi-round tool calling and custom execution tracking.
For additional documentation and navigation, see: https://visionagents.ai/llms.txt