agent

📁 visionagents/ai 📅 14 days ago
Total installs: 3
Weekly installs: 3
Site rank: #31588

Install command
npx skills add https://visionagents.ai

Agent install distribution

opencode 3
gemini-cli 2
antigravity 2
windsurf 2
claude-code 2

Skill documentation

Capabilities

Vision Agents enables developers to build and deploy intelligent real-time voice and video agents. It supports voice assistants, video AI systems, phone bots, and knowledge-powered applications through a modular architecture with 25+ AI provider integrations. The framework offers two operational modes: realtime speech-to-speech models for the lowest latency, or custom pipelines combining any STT, LLM, and TTS providers.

Skills

Voice Agent Development

  • Realtime Speech-to-Speech: Build voice agents using native realtime models (OpenAI Realtime, Gemini Live, AWS Nova, Qwen Omni) that handle speech recognition, response generation, and synthesis natively via WebRTC or WebSocket
  • Custom Voice Pipelines: Mix and match STT providers (Deepgram, Fast-Whisper, Fish, Wizper), LLMs (OpenAI, Gemini, Anthropic, xAI, OpenRouter, HuggingFace), and TTS services (ElevenLabs, Cartesia, Deepgram, Pocket, AWS Polly)
  • Turn Detection: Implement conversation turn management with built-in providers (Deepgram, Vogent, Smart Turn) or use realtime models with native turn detection
  • Voice Configuration: Customize voice characteristics, response latency, and audio quality across all supported providers

Video Agent Development

  • Realtime Video Streaming: Stream video directly to models with native vision support (OpenAI Realtime, Gemini Live) at configurable FPS rates
  • Vision Language Models (VLMs): Use VLMs for video understanding and analysis with automatic frame buffering (NVIDIA Cosmos, HuggingFace, OpenRouter, Moondream)
  • Video Processing: Implement computer vision tasks with YOLO pose detection, object detection, segmentation, and custom processors
  • Frame Analysis: Process video frames at independent FPS rates with custom ML models and publish transformed video back to calls
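Forwarding frames to multiple consumers at independent FPS rates boils down to frame decimation. The sketch below is a hypothetical illustration of that selection logic, not the framework's forwarder implementation:

```python
# Which source frames get forwarded to a consumer running at a lower FPS?
# Integer arithmetic avoids floating-point drift. (Illustrative only;
# `select_frames` is not part of the Vision Agents API.)

def select_frames(n_frames: int, source_fps: int, target_fps: int) -> list[int]:
    """Indices of source frames forwarded to a target_fps consumer."""
    picked, emitted = [], 0
    for i in range(n_frames):
        # Emit when the consumer's timeline has fallen behind the source's.
        if emitted * source_fps <= i * target_fps:
            picked.append(i)
            emitted += 1
    return picked

# One second of 30 FPS video, a VLM consumer at 3 FPS:
print(select_frames(30, source_fps=30, target_fps=3))  # [0, 10, 20]
```

The same source stream can feed a pose detector at 15 FPS and a VLM at 3 FPS without either consumer affecting the other's schedule.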

Function Calling & Tool Integration

  • Function Registration: Register Python functions with @llm.register_function() decorator for automatic LLM invocation with type-safe parameters
  • Model Context Protocol (MCP): Connect to local or remote MCP servers for external tool access (GitHub, weather, databases, etc.)
  • Multi-Round Tool Calling: Support multiple tool-calling rounds where models can call additional tools based on previous results
  • Tool Events: Monitor tool execution with ToolStartEvent and ToolEndEvent for debugging and metrics
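The registration mechanism can be pictured as a decorator that inspects the function's type hints to build a parameter spec the LLM can call against. This is a toy stand-in for `@llm.register_function()`, not its actual implementation:

```python
# Toy function registry: derive a parameter spec from Python type hints
# so a model can invoke the function by name with typed arguments.
import inspect

TOOLS = {}  # name -> {"fn": callable, "parameters": spec}

PY_TO_JSON = {int: "integer", float: "number", str: "string", bool: "boolean"}

def register_function(func=None):
    def wrap(f):
        sig = inspect.signature(f)
        params = {
            name: PY_TO_JSON.get(p.annotation, "string")
            for name, p in sig.parameters.items()
        }
        TOOLS[f.__name__] = {"fn": f, "parameters": params}
        return f
    return wrap if func is None else wrap(func)

@register_function()
def get_weather(city: str, celsius: bool) -> str:
    """Toy tool; the model would supply city and celsius as arguments."""
    return f"Weather in {city} ({'C' if celsius else 'F'}): sunny"

# A tool call arriving from the model is dispatched by name:
result = TOOLS["get_weather"]["fn"](city="Boulder", celsius=True)
print(TOOLS["get_weather"]["parameters"])  # {'city': 'string', 'celsius': 'boolean'}
print(result)
```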

Knowledge & RAG

  • Gemini File Search: Automatic document chunking, embedding, and retrieval with content deduplication via SHA-256 hashing
  • TurboPuffer Integration: Hybrid search combining vector (semantic) and BM25 (keyword) search with Reciprocal Rank Fusion
  • Custom RAG Pipelines: Build document gathering, parsing, chunking, embedding, and reranking workflows
  • Knowledge Base Registration: Register search functions with LLMs for knowledge-powered conversations
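Reciprocal Rank Fusion, the scheme named above for the TurboPuffer hybrid search, scores each document as the sum of 1/(k + rank) across the ranked lists. A minimal sketch (k=60 is the conventional smoothing constant; the doc IDs are made up):

```python
# Reciprocal Rank Fusion: merge a vector (semantic) ranking with a
# BM25 (keyword) ranking into one fused ordering.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """score(doc) = sum over lists of 1 / (k + rank), rank starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc-a", "doc-b", "doc-c"]  # semantic ranking
bm25_hits = ["doc-b", "doc-d", "doc-a"]    # keyword ranking
print(rrf([vector_hits, bm25_hits]))       # ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

doc-b wins because it ranks highly in both lists, even though neither list puts it first with a large margin; that robustness to score-scale differences is why RRF is used for hybrid search.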

Phone Integration

  • Twilio Integration: Connect agents to phone calls via Twilio with inbound and outbound calling support
  • WebSocket Audio Streaming: Bidirectional audio streaming between Twilio and agents via WebSocket
  • Call Management: Create call registries, manage tokens, and handle webhook validation for incoming calls
  • Phone + RAG: Combine phone calling with knowledge bases for customer support and information retrieval

Event System & Monitoring

  • Event Subscription: Subscribe to events using @agent.events.subscribe decorator with type hints for specific event types
  • Core Events: Monitor audio reception, transcription (STT), LLM responses, turn detection, participant joins/leaves, and call lifecycle
  • Component Events: Track events from STT, TTS, LLM, turn detection, and custom processors
  • Error Events: Handle recoverable and non-recoverable errors from each component with retry information
  • Custom Events: Emit custom events from processors and subscribe to them across the application
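The subscribe-by-type-hint pattern can be sketched as a small event bus that routes each emitted event to handlers whose first parameter is annotated with that event type. `EventBus` and the dataclasses below are illustrative stand-ins, not the `agent.events` implementation:

```python
# Toy event bus: handlers declare the event type they want via a type hint,
# and emit() dispatches by the runtime type of the event.
import inspect
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class STTTranscriptEvent:   # toy version of the framework's event
    text: str

@dataclass
class ToolEndEvent:         # toy version of the framework's event
    tool: str

class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, handler):
        # Route by the type annotation of the handler's first parameter.
        first = next(iter(inspect.signature(handler).parameters.values()))
        self.handlers[first.annotation].append(handler)
        return handler

    def emit(self, event):
        for h in self.handlers[type(event)]:
            h(event)

events = EventBus()
seen = []

@events.subscribe
def on_transcript(event: STTTranscriptEvent):
    seen.append(("stt", event.text))

@events.subscribe
def on_tool_end(event: ToolEndEvent):
    seen.append(("tool", event.tool))

events.emit(STTTranscriptEvent("hello"))
events.emit(ToolEndEvent("search_knowledge"))
print(seen)  # [('stt', 'hello'), ('tool', 'search_knowledge')]
```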

Agent Orchestration

  • Agent Class: Central orchestrator managing conversation flow, audio/video processing, response coordination, and external tool integration
  • Lifecycle Management: Control agent joining calls, finishing gracefully, and resource cleanup
  • Simple Response: Send text prompts to LLMs for processing and response generation
  • Participant Management: Track participant joins/leaves and customize agent behavior based on call composition

Server & Deployment

  • HTTP Server: Run agents as a production HTTP server with session management and concurrent agent support
  • Session Management: Create, monitor, and close agent sessions via REST API with metrics endpoints
  • Health Checks: Liveness (/health) and readiness (/ready) probes for Kubernetes integration
  • CORS Configuration: Customize allowed origins, methods, headers, and credentials for cross-origin requests
  • Authentication & Permissions: Implement custom user identification and permission callbacks for session access control
  • Session Limits: Control maximum concurrent sessions, sessions per call, session duration, and idle timeouts
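The liveness/readiness split can be demonstrated with nothing but the standard library: `/health` answers as long as the process runs, while `/ready` returns 503 until the agent can accept sessions. This is a minimal stand-alone sketch, not the framework's server:

```python
# Minimal /health and /ready probes with stdlib http.server.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

READY = {"value": True}  # flip to False while the agent is still warming up

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":      # liveness: process is running
            self._reply(200, {"status": "ok"})
        elif self.path == "/ready":     # readiness: can accept sessions
            code = 200 if READY["value"] else 503
            self._reply(code, {"status": "ready" if code == 200 else "not ready"})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, body):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ProbeHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
with urllib.request.urlopen(f"http://127.0.0.1:{port}/health") as resp:
    status, body = resp.status, resp.read().decode()
print(status, body)
server.shutdown()
```

Kubernetes would poll these two paths as `livenessProbe` and `readinessProbe` respectively.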

Deployment & Scaling

  • Docker Support: CPU and GPU Dockerfiles with Python 3.13 slim base and PyTorch CUDA variants
  • Kubernetes Ready: Health probes, resource limits, rolling updates, and horizontal scaling support
  • Environment Configuration: Load API keys from .env files with automatic provider registration
  • Metrics Export: Prometheus metrics integration for monitoring latency, token usage, and error rates
  • Session Metrics: Per-session performance metrics including LLM latency, STT/TTS latency, and token counts

Video Debugging

  • Local Video Override: Test video processing without live cameras using local video files
  • CLI Override: Pass --video-track-override=/path/to/video.mp4 when running agents
  • Programmatic Override: Set video override path via agent.set_video_track_override_path()
  • Looping Playback: Local video files play in a loop at 30 FPS for reproducible testing
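Looping playback means the frame shown at any wall-clock time wraps modulo the file length. A hypothetical helper (the framework handles this internally):

```python
# Which frame a looping 30 FPS local file shows at time t.

def looped_frame_index(t_seconds: float, n_frames: int, fps: int = 30) -> int:
    """Playback loops, so the frame index wraps modulo the file length."""
    return int(t_seconds * fps) % n_frames

# A 90-frame clip (3 s at 30 FPS):
print(looped_frame_index(1.0, 90))  # 30
print(looped_frame_index(3.5, 90))  # second loop: frame 15
```

The wrap is what makes test runs reproducible: the same timestamps always see the same frames.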

Workflows

Building a Voice Agent

  1. Create an Agent instance with edge=getstream.Edge() for transport
  2. Configure LLM: Choose realtime model (e.g., gemini.Realtime()) or custom pipeline with llm=gemini.LLM(), stt=deepgram.STT(), tts=elevenlabs.TTS()
  3. Set system instructions with instructions="Your agent behavior"
  4. Register functions with @llm.register_function() for tool calling
  5. Create a call with call = await agent.create_call(call_type, call_id)
  6. Join the call with async with agent.join(call):
  7. Send initial response with await agent.simple_response("greeting")
  8. Wait for call to end with await agent.finish()

Building a Video Agent with Processing

  1. Create Agent with video-capable LLM (e.g., gemini.Realtime(fps=3))
  2. Add video processors: processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt")]
  3. Subscribe to detection events: @agent.events.subscribe async def on_detection(event: VideoProcessorDetectionEvent):
  4. Join call and process video frames automatically
  5. LLM receives annotated frames for analysis and response

Implementing RAG with Agents

  1. Create RAG store: store = gemini.GeminiFilesearchRAG(name="knowledge-base")
  2. Populate with documents: await store.add_directory("./knowledge")
  3. Register search function: @llm.register_function() async def search_knowledge(query: str):
  4. Use TurboPuffer for hybrid search: rag = turbopuffer.TurboPufferRAG(namespace="my-knowledge")
  5. Agent automatically calls search when relevant to conversation
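The SHA-256 deduplication mentioned for the Gemini File Search store can be sketched as hashing each chunk and skipping digests already seen. `DedupStore` and the fixed-width chunking are simplified stand-ins for illustration:

```python
# Content deduplication via SHA-256: re-ingesting a document whose chunks
# are already stored adds nothing new.
import hashlib

class DedupStore:
    def __init__(self):
        self.chunks: dict[str, str] = {}  # sha256 hex digest -> chunk text

    def add(self, text: str, chunk_size: int = 40) -> int:
        """Chunk the text; store only chunks not seen before. Returns count added."""
        added = 0
        for i in range(0, len(text), chunk_size):
            chunk = text[i:i + chunk_size]
            digest = hashlib.sha256(chunk.encode()).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = chunk
                added += 1
        return added

store = DedupStore()
doc = "Vision Agents supports realtime voice and video pipelines." * 2
first = store.add(doc)
second = store.add(doc)  # same content: every digest already present
print(first, second)     # second is 0
```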

Deploying Agent as HTTP Server

  1. Define create_agent() factory function returning configured Agent
  2. Define join_call() function specifying agent behavior when joining calls
  3. Create AgentLauncher(create_agent=create_agent, join_call=join_call)
  4. Wrap with Runner(launcher) for CLI support
  5. Run with uv run agent.py serve --host 0.0.0.0 --port 8000
  6. Create sessions via POST /sessions with call_type and call_id
  7. Monitor with GET /sessions/{session_id}/metrics

Integrating Phone Calling

  1. Set up Twilio account with phone number and ngrok for local development
  2. Create TwilioCallRegistry to manage pending calls
  3. Implement webhook handler at /twilio/voice to validate Twilio signature
  4. Return TwiML with media stream URL for bidirectional audio
  5. Use attach_phone_to_call() to connect phone audio to Stream call
  6. For outbound calls, use Twilio REST API with media stream TwiML

Monitoring with Events

  1. Subscribe to participant events: @agent.events.subscribe async def on_join(event: CallSessionParticipantJoinedEvent):
  2. Track transcription: @agent.events.subscribe async def on_transcript(event: STTTranscriptEvent):
  3. Monitor LLM responses: @agent.events.subscribe async def on_response(event: LLMResponseCompletedEvent):
  4. Handle tool execution: @agent.events.subscribe async def on_tool_end(event: ToolEndEvent):
  5. Catch errors: @agent.events.subscribe async def on_error(event: STTErrorEvent):

Integration

Vision Agents integrates with 25+ AI providers through a plugin architecture:

LLM Providers: OpenAI (GPT-4, GPT-Realtime), Google Gemini, Anthropic Claude, xAI Grok, OpenRouter, HuggingFace, AWS Bedrock

Realtime Models: OpenAI Realtime (WebRTC), Gemini Live (WebSocket), AWS Nova, Qwen Omni

Speech Services: Deepgram (STT/TTS), ElevenLabs (TTS), Cartesia (TTS), AWS Polly (TTS), Fast-Whisper (STT), Fish (STT), Wizper (STT), Pocket (TTS)

Vision Models: NVIDIA Cosmos, HuggingFace VLMs, Moondream, Roboflow, Ultralytics YOLO

External Tools: Model Context Protocol (MCP) servers for GitHub, weather, databases, and custom services

Transport: Stream’s global edge network (default) or custom transport layers

Deployment: Docker, Kubernetes, FastAPI, Prometheus metrics

Phone: Twilio for inbound/outbound calling

Context

Vision Agents follows a “bring your own keys” (BYOK) model where developers supply their own API credentials for each service. The framework is transport-agnostic, defaulting to Stream’s low-latency edge network but supporting custom transports.

The architecture is modular and event-driven, allowing independent configuration of each component (LLM, STT, TTS, VAD, processors). Realtime models offer the lowest latency for voice interactions by handling speech-to-speech natively, while custom pipelines provide flexibility to mix providers and customize behavior.

Video processing uses a shared VideoForwarder that distributes frames to multiple processors at independent FPS rates, enabling efficient multi-model inference. Processors can be chained for sequential processing (detection → analysis → response).

The framework supports both development (console mode with local video files) and production (HTTP server with session management, metrics, and scaling). Session limits and resource management prevent runaway costs and server exhaustion. All components emit events for monitoring, debugging, and building reactive workflows.

Turn detection can be built-in (realtime models) or external (Deepgram, Vogent, Smart Turn), with configurable sensitivity and latency. Function calling is automatic—LLMs invoke registered functions when relevant, with support for multi-round tool calling and custom execution tracking.
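The multi-round tool-calling loop described above can be pictured as: call the model, execute any tool it requests, feed the result back, and repeat until it answers. The "model" below is scripted so the sketch is self-contained; it is not the framework's loop:

```python
# Toy multi-round tool calling: the model chains two tools before answering.

def scripted_model(history):
    """Stand-in for an LLM: requests tools until it can answer from results."""
    results = [m for m in history if m["role"] == "tool"]
    if len(results) == 0:
        return {"tool": "lookup_city", "args": {"user": "sam"}}
    if len(results) == 1:
        return {"tool": "get_weather", "args": {"city": results[0]["content"]}}
    return {"answer": f"It is {results[1]['content']} in {results[0]['content']}."}

TOOLS = {
    "lookup_city": lambda user: "Boulder",  # toy tools with canned results
    "get_weather": lambda city: "sunny",
}

history = [{"role": "user", "content": "What's the weather where Sam lives?"}]
rounds = 0
while True:
    step = scripted_model(history)
    if "answer" in step:            # model is done calling tools
        break
    result = TOOLS[step["tool"]](**step["args"])
    history.append({"role": "tool", "content": result})
    rounds += 1

print(rounds, step["answer"])  # 2 It is sunny in Boulder.
```

The second tool call depends on the first tool's result, which is exactly the case multi-round calling exists for.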


For additional documentation and navigation, see: https://visionagents.ai/llms.txt