Install the agent skill:

`npx skills add https://visionagents.ai`
Capabilities
Vision Agents enables developers to build and deploy intelligent real-time voice and video applications: voice assistants, video AI systems, phone bots, and knowledge-powered applications, all on a modular architecture that supports 25+ AI provider integrations. The framework provides two operational modes: realtime speech-to-speech models for the lowest latency, or custom pipelines combining any STT, LLM, and TTS providers.
Skills
Voice Agent Development
- Realtime Speech-to-Speech: Build voice agents using native realtime models (OpenAI Realtime, Gemini Live, AWS Nova, Qwen Omni) that handle speech recognition, response generation, and synthesis natively via WebRTC or WebSocket
- Custom Voice Pipelines: Mix and match STT providers (Deepgram, Fast-Whisper, Fish, Wizper), LLMs (OpenAI, Gemini, Anthropic, xAI, OpenRouter, HuggingFace), and TTS services (ElevenLabs, Cartesia, Deepgram, Pocket, AWS Polly)
- Turn Detection: Implement conversation turn management with built-in providers (Deepgram, Vogent, Smart Turn) or use realtime models with native turn detection
- Voice Configuration: Customize voice characteristics, response latency, and audio quality across all supported providers
Video Agent Development
- Realtime Video Streaming: Stream video directly to models with native vision support (OpenAI Realtime, Gemini Live) at configurable FPS rates
- Vision Language Models (VLMs): Use VLMs for video understanding and analysis with automatic frame buffering (NVIDIA Cosmos, HuggingFace, OpenRouter, Moondream)
- Video Processing: Implement computer vision tasks with YOLO pose detection, object detection, segmentation, and custom processors
- Frame Analysis: Process video frames at independent FPS rates with custom ML models and publish transformed video back to calls
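The independent-FPS behavior above amounts to frame decimation. The function below is an illustrative stand-in (not a Vision Agents API) showing which source frames a slower processor would receive:

```python
def frames_to_forward(source_fps: int, target_fps: int, n_frames: int) -> list[int]:
    """Indices of source frames a processor running at target_fps
    receives from a source_fps stream (simple decimation)."""
    if target_fps >= source_fps:
        return list(range(n_frames))  # no decimation needed
    step = source_fps / target_fps    # e.g. 30 / 3 -> every 10th frame
    picked, next_pick = [], 0.0
    for i in range(n_frames):
        if i >= next_pick:
            picked.append(i)
            next_pick += step
    return picked

# One second of 30 FPS video feeding a 3 FPS processor:
print(frames_to_forward(30, 3, 30))  # -> [0, 10, 20]
```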
Function Calling & Tool Integration
- Function Registration: Register Python functions with the `@llm.register_function()` decorator for automatic LLM invocation with type-safe parameters
- Model Context Protocol (MCP): Connect to local or remote MCP servers for external tool access (GitHub, weather, databases, etc.)
- Multi-Round Tool Calling: Support multiple tool-calling rounds where models can call additional tools based on previous results
- Tool Events: Monitor tool execution with `ToolStartEvent` and `ToolEndEvent` for debugging and metrics
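As a rough sketch of how decorator-based registration with type-safe parameters can work, the stand-in `ToolRegistry` below derives a parameter spec from Python type hints. The real `@llm.register_function()` implementation belongs to Vision Agents and will differ; everything here is illustrative:

```python
import inspect
from typing import get_type_hints

class ToolRegistry:
    """Illustrative stand-in for an LLM tool registry."""
    def __init__(self):
        self.tools = {}

    def register_function(self):
        def decorator(fn):
            hints = get_type_hints(fn)
            hints.pop("return", None)
            # Derive a simple, type-safe parameter spec from the hints.
            self.tools[fn.__name__] = {
                "doc": inspect.getdoc(fn) or "",
                "params": {name: t.__name__ for name, t in hints.items()},
                "fn": fn,
            }
            return fn
        return decorator

    def call(self, name: str, **kwargs):
        # In a real framework the LLM chooses the tool and arguments.
        return self.tools[name]["fn"](**kwargs)

llm = ToolRegistry()

@llm.register_function()
def get_weather(city: str, unit: str) -> str:
    """Return a canned weather report."""
    return f"Weather in {city}: 21 degrees ({unit})"

print(llm.tools["get_weather"]["params"])  # -> {'city': 'str', 'unit': 'str'}
print(llm.call("get_weather", city="Paris", unit="celsius"))
```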
Knowledge & RAG
- Gemini File Search: Automatic document chunking, embedding, and retrieval with content deduplication via SHA-256 hashing
- TurboPuffer Integration: Hybrid search combining vector (semantic) and BM25 (keyword) search with Reciprocal Rank Fusion
- Custom RAG Pipelines: Build document gathering, parsing, chunking, embedding, and reranking workflows
- Knowledge Base Registration: Register search functions with LLMs for knowledge-powered conversations
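Reciprocal Rank Fusion itself is small enough to sketch. This generic implementation (with k=60, the constant from the original RRF paper, not necessarily what TurboPuffer uses) merges a vector-search ranking with a BM25 ranking:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum of 1 / (k + rank)
    over every ranked list the doc appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic ranking
bm25_hits = ["doc_b", "doc_d", "doc_a"]    # keyword ranking
print(rrf_merge([vector_hits, bm25_hits]))
# -> ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents ranked well by both searches (like `doc_b`) rise to the top even when neither list puts them first.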
Phone Integration
- Twilio Integration: Connect agents to phone calls via Twilio with inbound and outbound calling support
- WebSocket Audio Streaming: Bidirectional audio streaming between Twilio and agents via WebSocket
- Call Management: Create call registries, manage tokens, and handle webhook validation for incoming calls
- Phone + RAG: Combine phone calling with knowledge bases for customer support and information retrieval
Event System & Monitoring
- Event Subscription: Subscribe to events using the `@agent.events.subscribe` decorator with type hints for specific event types
- Core Events: Monitor audio reception, transcription (STT), LLM responses, turn detection, participant joins/leaves, and call lifecycle
- Component Events: Track events from STT, TTS, LLM, turn detection, and custom processors
- Error Events: Handle recoverable and non-recoverable errors from each component with retry information
- Custom Events: Emit custom events from processors and subscribe to them across the application
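A minimal, generic event bus shows how type-hint-driven subscription can dispatch events. This is an illustrative sketch, not the Vision Agents event system:

```python
import inspect
from collections import defaultdict
from dataclasses import dataclass

class EventBus:
    """Generic sketch: handlers keyed by their parameter's type hint."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, fn):
        # The first parameter's annotation decides which events fn gets.
        first = next(iter(inspect.signature(fn).parameters.values()))
        self.handlers[first.annotation].append(fn)
        return fn

    def emit(self, event):
        for fn in self.handlers[type(event)]:
            fn(event)

@dataclass
class TranscriptEvent:
    text: str

bus = EventBus()
seen = []

@bus.subscribe
def on_transcript(event: TranscriptEvent):
    seen.append(event.text)

bus.emit(TranscriptEvent("hello"))
print(seen)  # -> ['hello']
```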
Agent Orchestration
- Agent Class: Central orchestrator managing conversation flow, audio/video processing, response coordination, and external tool integration
- Lifecycle Management: Control agent joining calls, finishing gracefully, and resource cleanup
- Simple Response: Send text prompts to LLMs for processing and response generation
- Participant Management: Track participant joins/leaves and customize agent behavior based on call composition
Server & Deployment
- HTTP Server: Run agents as a production HTTP server with session management and concurrent agent support
- Session Management: Create, monitor, and close agent sessions via REST API with metrics endpoints
- Health Checks: Liveness (`/health`) and readiness (`/ready`) probes for Kubernetes integration
- CORS Configuration: Customize allowed origins, methods, headers, and credentials for cross-origin requests
- Authentication & Permissions: Implement custom user identification and permission callbacks for session access control
- Session Limits: Control maximum concurrent sessions, sessions per call, session duration, and idle timeouts
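The limit-enforcement logic can be illustrated with a small stand-in; the class and method names below are hypothetical, not framework APIs:

```python
class SessionLimiter:
    """Hypothetical sketch of concurrent-session and idle-timeout limits."""
    def __init__(self, max_sessions: int, idle_timeout: float):
        self.max_sessions = max_sessions
        self.idle_timeout = idle_timeout
        self.last_active: dict[str, float] = {}

    def reap(self, now: float):
        # Drop sessions that have been idle longer than the timeout.
        for sid in [s for s, t in self.last_active.items()
                    if now - t > self.idle_timeout]:
            del self.last_active[sid]

    def open(self, session_id: str, now: float) -> bool:
        self.reap(now)
        if len(self.last_active) >= self.max_sessions:
            return False  # reject: server at capacity
        self.last_active[session_id] = now
        return True

limiter = SessionLimiter(max_sessions=2, idle_timeout=60.0)
print(limiter.open("a", now=0.0))    # True
print(limiter.open("b", now=1.0))    # True
print(limiter.open("c", now=2.0))    # False: at capacity
print(limiter.open("c", now=120.0))  # True: "a" and "b" idled out
```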
Deployment & Scaling
- Docker Support: CPU and GPU Dockerfiles with Python 3.13 slim base and PyTorch CUDA variants
- Kubernetes Ready: Health probes, resource limits, rolling updates, and horizontal scaling support
- Environment Configuration: Load API keys from `.env` files with automatic provider registration
- Metrics Export: Prometheus metrics integration for monitoring latency, token usage, and error rates
- Session Metrics: Per-session performance metrics including LLM latency, STT/TTS latency, and token counts
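Per-session latency metrics typically reduce to percentile summaries; a minimal nearest-rank percentile over recorded samples (illustrative only, with hypothetical latency values):

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile of recorded latency samples."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical per-session LLM latency samples in milliseconds.
llm_latencies_ms = [120, 95, 310, 150, 88, 102, 450, 130, 99, 140]
print(percentile(llm_latencies_ms, 50))  # -> 120
print(percentile(llm_latencies_ms, 95))  # -> 450
```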
Video Debugging
- Local Video Override: Test video processing without live cameras using local video files
- CLI Override: Pass `--video-track-override=/path/to/video.mp4` when running agents
- Programmatic Override: Set the video override path via `agent.set_video_track_override_path()`
- Looping Playback: Local video files play in a loop at 30 FPS for reproducible testing
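The looping behavior amounts to modular arithmetic over elapsed time. A sketch (not framework code) of which frame a 30 FPS looping clip would show:

```python
def loop_frame_index(elapsed_s: float, total_frames: int, fps: int = 30) -> int:
    """Which frame a looping clip shows after elapsed_s seconds at fps."""
    return int(elapsed_s * fps) % total_frames

# A 2-second clip (60 frames at 30 FPS), looping:
print(loop_frame_index(0.5, 60))  # -> 15
print(loop_frame_index(2.0, 60))  # -> 0 (wrapped back to the start)
print(loop_frame_index(3.5, 60))  # -> 45 (second loop)
```

Because the mapping depends only on elapsed time and clip length, repeated test runs see the same frames, which is what makes local-file playback reproducible.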
Workflows
Building a Voice Agent
- Create an Agent instance with `edge=getstream.Edge()` for transport
- Configure LLM: Choose a realtime model (e.g., `gemini.Realtime()`) or a custom pipeline with `llm=gemini.LLM()`, `stt=deepgram.STT()`, `tts=elevenlabs.TTS()`
- Set system instructions with `instructions="Your agent behavior"`
- Register functions with `@llm.register_function()` for tool calling
- Create a call with `call = await agent.create_call(call_type, call_id)`
- Join the call with `async with agent.join(call):`
- Send an initial response with `await agent.simple_response("greeting")`
- Wait for the call to end with `await agent.finish()`
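The control flow of the steps above can be mirrored with stub classes. Everything below is a stand-in written for illustration, not the real Agent API; only the shape (create a call, join as an async context manager, respond, finish) follows the workflow:

```python
import asyncio
from contextlib import asynccontextmanager

class StubAgent:
    """Stand-in mirroring the create_call/join/simple_response/finish
    shape of the workflow; not the real Vision Agents Agent class."""
    def __init__(self):
        self.log = []

    async def create_call(self, call_type: str, call_id: str):
        return (call_type, call_id)

    @asynccontextmanager
    async def join(self, call):
        self.log.append(f"joined {call[1]}")
        try:
            yield self
        finally:
            self.log.append("left")  # leave the call on context exit

    async def simple_response(self, text: str):
        self.log.append(f"said: {text}")

    async def finish(self):
        self.log.append("finished")  # wait for the call to wrap up

async def main():
    agent = StubAgent()
    call = await agent.create_call("default", "demo-call")
    async with agent.join(call):
        await agent.simple_response("Hello!")
        await agent.finish()
    return agent.log

print(asyncio.run(main()))
# -> ['joined demo-call', 'said: Hello!', 'finished', 'left']
```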
Building a Video Agent with Processing
- Create an Agent with a video-capable LLM (e.g., `gemini.Realtime(fps=3)`)
- Add video processors: `processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")]`
- Subscribe to detection events: `@agent.events.subscribe` on `async def on_detection(event: VideoProcessorDetectionEvent):`
- Join the call and process video frames automatically
- LLM receives annotated frames for analysis and response
Implementing RAG with Agents
- Create a RAG store: `store = gemini.GeminiFilesearchRAG(name="knowledge-base")`
- Populate it with documents: `await store.add_directory("./knowledge")`
- Register a search function: `@llm.register_function()` on `async def search_knowledge(query: str):`
- Use TurboPuffer for hybrid search: `rag = turbopuffer.TurboPufferRAG(namespace="my-knowledge")`
- The agent automatically calls search when relevant to the conversation
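The content deduplication mentioned under Knowledge & RAG can be sketched with stdlib hashing; this is a generic example of SHA-256 dedup during ingestion, not the framework's implementation:

```python
import hashlib

def dedup_chunks(chunks: list[str]) -> list[str]:
    """Keep only chunks whose SHA-256 content hash is new."""
    seen: set[str] = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept

chunks = ["Refund policy: 30 days.",
          "Shipping takes 3-5 days.",
          "Refund policy: 30 days."]  # duplicated across source files
print(dedup_chunks(chunks))  # the duplicate is ingested only once
```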
Deploying Agent as HTTP Server
- Define a `create_agent()` factory function returning a configured Agent
- Define a `join_call()` function specifying agent behavior when joining calls
- Create an `AgentLauncher(create_agent=create_agent, join_call=join_call)`
- Wrap it with `Runner(launcher)` for CLI support
- Run with `uv run agent.py serve --host 0.0.0.0 --port 8000`
- Create sessions via `POST /sessions` with `call_type` and `call_id`
- Monitor with `GET /sessions/{session_id}/metrics`
Integrating Phone Calling
- Set up a Twilio account with a phone number and ngrok for local development
- Create a `TwilioCallRegistry` to manage pending calls
- Implement a webhook handler at `/twilio/voice` to validate the Twilio signature
- Return TwiML with a media stream URL for bidirectional audio
- Use `attach_phone_to_call()` to connect phone audio to the Stream call
- For outbound calls, use the Twilio REST API with media stream TwiML
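Webhook signature validation follows Twilio's documented scheme: HMAC-SHA1 over the full URL plus the sorted POST parameters, base64-encoded. The sketch below is generic (not framework code) and all values are hypothetical:

```python
import base64
import hashlib
import hmac

def compute_twilio_signature(auth_token: str, url: str, params: dict) -> str:
    # Twilio's documented scheme: append each POST parameter (sorted by
    # key) as key+value to the full webhook URL, HMAC-SHA1 the result
    # with the account's auth token, and base64-encode the digest.
    payload = url + "".join(k + v for k, v in sorted(params.items()))
    digest = hmac.new(auth_token.encode(), payload.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

def validate_signature(auth_token: str, url: str, params: dict,
                       signature_header: str) -> bool:
    expected = compute_twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, signature_header)

# Hypothetical values for illustration only.
token = "test_auth_token"
url = "https://example.ngrok.io/twilio/voice"
params = {"CallSid": "CA123", "From": "+15550001111"}

sig = compute_twilio_signature(token, url, params)
print(validate_signature(token, url, params, sig))        # True
print(validate_signature(token, url, params, sig + "x"))  # False
```

Using `hmac.compare_digest` rather than `==` avoids leaking timing information about how much of the signature matched.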
Monitoring with Events
- Subscribe to participant events: `@agent.events.subscribe` on `async def on_join(event: CallSessionParticipantJoinedEvent):`
- Track transcription: `@agent.events.subscribe` on `async def on_transcript(event: STTTranscriptEvent):`
- Monitor LLM responses: `@agent.events.subscribe` on `async def on_response(event: LLMResponseCompletedEvent):`
- Handle tool execution: `@agent.events.subscribe` on `async def on_tool_end(event: ToolEndEvent):`
- Catch errors: `@agent.events.subscribe` on `async def on_error(event: STTErrorEvent):`
Integration
Vision Agents integrates with 25+ AI providers through a plugin architecture:
LLM Providers: OpenAI (GPT-4, GPT-Realtime), Google Gemini, Anthropic Claude, xAI Grok, OpenRouter, HuggingFace, AWS Bedrock
Realtime Models: OpenAI Realtime (WebRTC), Gemini Live (WebSocket), AWS Nova, Qwen Omni
Speech Services: Deepgram (STT/TTS), ElevenLabs (TTS), Cartesia (TTS), AWS Polly (TTS), Fast-Whisper (STT), Fish (STT), Wizper (STT), Pocket (TTS)
Vision Models: NVIDIA Cosmos, HuggingFace VLMs, Moondream, Roboflow, Ultralytics YOLO
External Tools: Model Context Protocol (MCP) servers for GitHub, weather, databases, and custom services
Transport: Stream’s global edge network (default) or custom transport layers
Deployment: Docker, Kubernetes, FastAPI, Prometheus metrics
Phone: Twilio for inbound/outbound calling
Context
Vision Agents follows a “bring your own keys” (BYOK) model where developers provide their own API credentials for each service. The framework is transport-agnostic, defaulting to Stream’s low-latency edge network but supporting custom transports.
The architecture is modular and event-driven, allowing independent configuration of each component (LLM, STT, TTS, VAD, processors). Realtime models offer the lowest latency for voice interactions by handling speech-to-speech natively, while custom pipelines provide flexibility to mix providers and customize behavior.
Video processing uses a shared VideoForwarder that distributes frames to multiple processors at independent FPS rates, enabling efficient multi-model inference. Processors can be chained for sequential processing (detection → analysis → response).
The framework supports both development (console mode with local video files) and production (HTTP server with session management, metrics, and scaling). Session limits and resource management prevent runaway costs and server exhaustion. All components emit events for monitoring, debugging, and building reactive workflows.
Turn detection can be built-in (realtime models) or external (Deepgram, Vogent, Smart Turn), with configurable sensitivity and latency. Function calling is automatic: LLMs invoke registered functions when relevant, with support for multi-round tool calling and custom execution tracking.
For additional documentation and navigation, see: https://visionagents.ai/llms.txt