venice-ai
npx skills add https://github.com/majiayu000/claude-skill-registry --skill venice-ai
Agent 安装分布
Skill 文档
Venice AI Skill
Expert assistance with Venice AI platform development, featuring advanced prompt caching optimization, API integration patterns, and cost-effective AI model access across multiple providers (Anthropic Claude, OpenAI GPT, Google Gemini, xAI Grok, DeepSeek, and more).
When to Use This Skill
Core Use Cases
- Prompt Caching Optimization: Reduce API costs by 36-90% and latency by up to 80%
- Multi-Provider AI Integration: Access Claude, GPT, Gemini, Grok, DeepSeek via unified API
- Cost Optimization: Implement caching strategies for high-volume AI applications
- Performance Tuning: Optimize prompt structure for maximum cache hit rates
- Conversation Management: Build efficient multi-turn chatbots with cached context
Trigger Phrases
- “Optimize Venice AI costs with prompt caching”
- “Implement prompt caching for [Claude/GPT/Gemini]”
- “Reduce AI API latency using Venice”
- “Build cost-efficient chatbot with Venice AI”
- “Integrate multiple AI providers through Venice”
- “Debug Venice AI cache hit rates”
- “Structure prompts for maximum caching efficiency”
Quick Reference
Prompt Caching Fundamentals
How It Works: Caching uses prefix matchingâidentical prompt beginnings across requests retrieve pre-processed tokens instead of recomputing them.
Key Benefit: Reduce costs by 36-90% and latency by up to 80% for longer prompts.
Minimum Requirements:
- Claude Opus 4.5: ~4,000 tokens
- GPT, Gemini, Grok, DeepSeek: ~1,024 tokens
Common Patterns
1. Basic Chatbot with Cached System Prompt
import requests
# 2,000-token system prompt (cached after first request)
system_prompt = """You are a helpful AI assistant specialized in..."""
# First request - writes to cache
response = requests.post(
"https://api.venice.ai/api/v1/chat/completions",
headers={"Authorization": f"Bearer {API_KEY}"},
json={
"model": "claude-opus-4-5",
"messages": [
{"role": "system", "content": system_prompt},
{"role": "user", "content": "What's the weather?"}
],
"prompt_cache_key": "user-session-123" # Route to same server
}
)
# Result: 2,050 tokens processed, 0 cached, full cost
# Subsequent requests - reads from cache
response2 = requests.post(
"https://api.venice.ai/api/v1/chat/completions",
headers={"Authorization": f"Bearer {API_KEY}"},
json={
"model": "claude-opus-4-5",
"messages": [
{"role": "system", "content": system_prompt}, # MUST be byte-identical
{"role": "user", "content": "Tell me a joke"}
],
"prompt_cache_key": "user-session-123" # Same key = better hit rate
}
)
# Result: 2,000 cached (90% discount), 80 new tokens
# Cost savings: ~64% compared to no caching
2. Claude-Specific Cache Control (Advanced)
# For Claude: Explicitly mark cache breakpoints
response = requests.post(
"https://api.venice.ai/api/v1/chat/completions",
headers={"Authorization": f"Bearer {API_KEY}"},
json={
"model": "claude-opus-4-5",
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are a helpful AI assistant...",
"cache_control": {"type": "ephemeral"} # Cache this block
}
]
},
{
"role": "user",
"content": "Summarize this document:\n\n{long_document}",
"cache_control": {"type": "ephemeral"} # Cache document too
}
]
}
)
# Venice auto-applies cache_control to system prompts for Claude
# Manual markers needed for caching beyond system messages
3. Monitoring Cache Performance
response = requests.post(...)
usage = response.json()["usage"]
cached_tokens = usage["prompt_tokens_details"]["cached_tokens"]
total_prompt_tokens = usage["prompt_tokens"]
cache_hit_rate = (cached_tokens / total_prompt_tokens) * 100
print(f"Cache Hit Rate: {cache_hit_rate:.1f}%")
print(f"Cached Tokens: {cached_tokens} (90% discount)")
print(f"New Tokens: {total_prompt_tokens - cached_tokens} (full price)")
# Troubleshooting
if cache_hit_rate == 0:
print("â ï¸ Zero cache hits - check:")
print(" ⢠Prompt below minimum threshold (~1,024 or ~4,000 for Claude)?")
print(" ⢠Prompt prefix changed (whitespace, timestamps)?")
print(" ⢠First request (writes cache, doesn't read)?")
print(" ⢠Cache expired (>5-10 min since last request)?")
4. Cost-Optimized Prompt Structure
# â BAD: Dynamic content first (breaks cache)
prompt = f"""
Current time: {datetime.now()}
User ID: {user_id}
System: You are a helpful assistant...
"""
# â
GOOD: Static content first (maximizes cache)
prompt = """
System: You are a helpful assistant with these capabilities:
- Answer questions
- Provide code examples
- Debug issues
Reference documentation:
{static_docs} # 5,000 tokens - cached
Examples:
{static_examples} # 2,000 tokens - cached
"""
# Add dynamic content AFTER static prefix
messages = [
{"role": "system", "content": prompt}, # 7,000 tokens cached
{"role": "user", "content": f"User {user_id} asks: {question}"} # New tokens only
]
5. Multi-Turn Conversation with Caching
// TypeScript example with conversation history
const conversationHistory: Message[] = [];
const SYSTEM_PROMPT = "..."; // 2,000 tokens
async function chat(userMessage: string, sessionId: string) {
conversationHistory.push({
role: "user",
content: userMessage
});
const response = await fetch("https://api.venice.ai/api/v1/chat/completions", {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.VENICE_API_KEY}`,
"Content-Type": "application/json"
},
body: JSON.stringify({
model: "gpt-5.2", // Auto-caching, no cache_control needed
messages: [
{ role: "system", content: SYSTEM_PROMPT }, // Cached
...conversationHistory // Growing conversation - partially cached
],
prompt_cache_key: sessionId // Route to same server for warm cache
})
});
const data = await response.json();
conversationHistory.push({
role: "assistant",
content: data.choices[0].message.content
});
return {
message: data.choices[0].message.content,
cached_tokens: data.usage.prompt_tokens_details.cached_tokens,
cache_savings_pct: (data.usage.prompt_tokens_details.cached_tokens / data.usage.prompt_tokens) * 100
};
}
Prompt Caching Best Practices
1. Optimal Prompt Structure
Place static content before dynamic content to maximize cached prefix:
â
GOOD ORDER:
- System prompt (always static)
- Reference documents (static)
- Examples (static)
- Tools/function definitions (static)
- Conversation history (grows but prefix is stable)
- Current user message (dynamic)
â BAD ORDER:
- Timestamp or request ID
- User message
- System prompt
2. Maintain Byte-Identical Prefixes
Cache keys derive from exact byte sequences. Avoid:
- Different whitespace or line breaks
- Timestamps within prompts
- Random UUIDs before static content
- Inconsistent formatting
3. Meet Minimum Token Thresholds
- Claude Opus 4.5: ~4,000 tokens minimum
- GPT, Gemini, Grok, DeepSeek: ~1,024 tokens minimum
- Bundle context if below threshold
4. Use prompt_cache_key for Multi-Turn
Consistent routing hint across conversation improves hit rates:
prompt_cache_key = f"session-{user_id}-{conversation_id}"
5. Monitor Cache Economics
Cost Comparison:
Without Caching:
Request 1: 2,050 tokens à $6.00/1M = $0.0123
Request 2: 2,080 tokens à $6.00/1M = $0.0125
Request 3: 2,120 tokens à $6.00/1M = $0.0127
Total: $0.0375
With Caching (Claude):
Request 1: 2,050 Ã $7.50/1M = $0.0154 (write premium)
Request 2: 2,000 cached à $0.60/1M + 80 new à $6.00/1M = $0.0017
Request 3: 2,000 cached à $0.60/1M + 120 new à $6.00/1M = $0.0019
Total: $0.0190
Savings: 49% (break-even after 2nd request)
Provider-Specific Details
Claude (Anthropic)
- Write Premium: +25% ($7.50/1M vs $6.00/1M)
- Read Discount: 90% ($0.60/1M)
- Requires: Explicit
cache_control: {"type": "ephemeral"}markers - Minimum Tokens: ~4,000
- Cache Lifetime: 5 minutes
- Breakpoints: Up to 4 per request
- Break-Even: 2nd request using same prefix
GPT (OpenAI)
- Write Premium: None
- Read Discount: 90%
- Auto-Caching: No configuration needed
- Minimum Tokens: ~1,024
- Cache Lifetime: 5-10 minutes
Gemini (Google)
- Write Premium: None
- Read Discount: 75-90%
- Auto-Caching: Yes
- Minimum Tokens: ~1,024
- Cache Lifetime: 1 hour (longest)
Grok (xAI)
- Write Premium: None
- Read Discount: 75-88%
- Auto-Caching: Yes
- Minimum Tokens: ~1,024
- Cache Lifetime: 5 minutes
DeepSeek
- Write Premium: None
- Read Discount: 50%
- Auto-Caching: Yes
- Minimum Tokens: ~1,024
- Cache Lifetime: 5 minutes
Troubleshooting Guide
Zero Cached Tokens
Symptoms: cached_tokens: 0 in every response
Causes:
- Below Threshold: Prompt < 1,024 tokens (or < 4,000 for Claude)
- Fix: Add more context, examples, or documentation
- Dynamic Prefix: Changing content at start of prompt
- Fix: Move timestamps/IDs to end, maintain static prefix
- First Request: Initial request writes cache, doesn’t read
- Fix: Normal behavior, subsequent requests will show caching
- Cache Expired: > 5-10 min since last request
- Fix: Increase request frequency or accept cold starts
- Server Inconsistency: Requests hitting different servers
- Fix: Use
prompt_cache_keyfor routing affinity
- Fix: Use
Continuous Cache Writes
Symptoms: cache_creation_input_tokens > 0 on every request
Causes:
- Prompt Mutations: Small changes breaking byte identity
- Fix: Remove timestamps, use consistent formatting
- Missing cache_control (Claude only)
- Fix: Add explicit markers or rely on Venice auto-application
- Insufficient Tokens: Below minimum threshold
- Fix: Enrich prompt with more context
Higher Than Expected Costs
Causes:
- Claude Write Premium: Not enough reuse to amortize +25% cost
- Fix: Ensure at least 3-4 requests reuse same prefix
- Changing Prompts: Every request = new cache write
- Fix: Stabilize prompt structure
- Poor Structure: Dynamic before static
- Fix: Reorganize (see Best Practices #1)
Low Cache Hit Rate
Target: > 70% for chatbots, > 50% for variable workloads
Diagnostic:
hit_rate = (cached_tokens / prompt_tokens) * 100
if hit_rate < 50:
print("Investigate: Prompt structure, token threshold, or expiration")
Reference Files
This skill includes comprehensive documentation in references/:
- llms-txt.md – Complete Venice AI documentation
- llms-full.md – Extended documentation with examples
- llms.md – Quick reference summary
Use read to view specific reference files when detailed information is needed.
Advanced Use Cases
1. Caching with Tools and Functions
Function definitions integrate into cached prefix:
# Function definitions are part of static prefix
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"}
}
}
}
}
]
# Tools definition cached along with system prompt
response = requests.post(
"https://api.venice.ai/api/v1/chat/completions",
json={
"model": "gpt-5.2",
"messages": [...],
"tools": tools # Included in cached prefix
}
)
2. Caching with Images (Vision Models)
Cache images and accompanying text for repeated questions:
# First request - cache image + context
messages = [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": "data:image/jpeg;base64,..."}
},
{
"type": "text",
"text": "This is a diagram of our system architecture."
}
]
},
{
"role": "user",
"content": "What are the main components?"
}
]
# Subsequent requests - image and context cached
messages.append({
"role": "user",
"content": "How does data flow between components?"
})
# Only new question processed, image + context from cache
3. Document Analysis with Caching
Load large documents once, ask multiple questions:
# 10,000-token document cached after first request
document = """
[Long legal contract, technical manual, or research paper]
"""
def analyze_document(question: str):
return requests.post(
"https://api.venice.ai/api/v1/chat/completions",
json={
"model": "claude-opus-4-5",
"messages": [
{
"role": "system",
"content": f"Analyze this document:\n\n{document}"
},
{"role": "user", "content": question}
],
"prompt_cache_key": "doc-analysis-session-xyz"
}
)
# Each question only processes new query, document stays cached
analyze_document("What are the key terms?")
analyze_document("Summarize section 3")
analyze_document("What are the risks?")
# 10,000 tokens cached at 90% discount à 3 requests = massive savings
4. RAG (Retrieval Augmented Generation) with Caching
Cache retrieved context for follow-up questions:
# Retrieve relevant chunks from vector DB
relevant_chunks = vector_db.search(user_query)
context = "\n\n".join([chunk.text for chunk in relevant_chunks])
# Cache the retrieved context (5,000 tokens)
messages = [
{
"role": "system",
"content": f"""Use this context to answer questions:
{context}
Context above is authoritative. If answer not in context, say so."""
},
{"role": "user", "content": user_query}
]
# Follow-up questions reuse cached context
# Only new query processed, context cached
5. Batch Processing with Shared Context
Process multiple items with same instructions:
# Shared instructions cached
instructions = """
You are a code reviewer. For each code sample, check:
- Style compliance
- Security vulnerabilities
- Performance issues
- Best practices
"""
def review_code(code_sample: str, batch_id: str):
return requests.post(
"https://api.venice.ai/api/v1/chat/completions",
json={
"model": "gpt-5.2",
"messages": [
{"role": "system", "content": instructions},
{"role": "user", "content": f"Review:\n\n```\n{code_sample}\n```"}
],
"prompt_cache_key": f"batch-{batch_id}"
}
)
# Review 100 files - instructions cached for all
for file in code_files:
review_code(file.content, "batch-001")
# 100 requests, instructions cached 99 times
Working with This Skill
For Prompt Caching Optimization
- Assess Current Usage: Calculate cache hit rate and cost per 1K tokens
- Restructure Prompts: Move static content before dynamic
- Implement Monitoring: Track
cached_tokensin responses - Tune for Economics: Balance write premium (Claude) vs read savings
For API Integration
- Choose Provider: Claude (quality), GPT (balanced), Gemini (cost), Grok (speed)
- Set Up Authentication: Get API key from Venice.ai dashboard
- Implement Base Client: Use examples from Quick Reference
- Add Error Handling: Handle rate limits, timeouts, cache misses
For Cost Reduction
- Identify High-Volume Endpoints: Where caching has biggest impact
- Measure Baseline: Cost and latency without caching
- Apply Caching Patterns: Use examples from this skill
- Validate Savings: Compare cached vs uncached costs
For Code Examples
The Quick Reference section contains 5 production-ready patterns. The Advanced Use Cases section covers 5 specialized scenarios.
Performance Benchmarks
Latency Improvements
- Short prompts (< 1,000 tokens): Minimal impact (~5-10% faster)
- Medium prompts (1,000-5,000 tokens): 30-50% faster
- Long prompts (> 5,000 tokens): Up to 80% faster
Cost Reductions
- Single request: No savings (cache write cost)
- 2-5 requests: 30-50% savings
- 10+ requests: 60-80% savings
- 100+ requests: Up to 90% savings
Cache Hit Rates
- Chatbots (stable system prompt): 70-90%
- Document analysis: 80-95%
- RAG systems: 60-80%
- Batch processing: 85-95%
API Reference Summary
Base URL
https://api.venice.ai/api/v1
Authentication
Authorization: Bearer YOUR_API_KEY
Key Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
model |
string | Yes | Model ID (e.g., claude-opus-4-5, gpt-5.2) |
messages |
array | Yes | Conversation messages |
prompt_cache_key |
string | No | Routing hint for cache affinity |
cache_control |
object | No | Cache markers (Claude only) |
temperature |
float | No | Sampling temperature (0-2) |
max_tokens |
int | No | Maximum completion length |
Response Fields
{
"usage": {
"prompt_tokens": 5500,
"completion_tokens": 200,
"prompt_tokens_details": {
"cached_tokens": 5000, // Tokens from cache (90% discount)
"cache_creation_input_tokens": 0 // Tokens written to cache (may have premium)
}
}
}
Resources
references/
Organized documentation extracted from official Venice AI sources:
- llms-txt.md: Complete API documentation and guides
- llms-full.md: Extended documentation with additional examples
- llms.md: Quick reference for common operations
Code Examples
This skill includes 10+ production-ready code examples:
- Python, TypeScript, JavaScript
- Error handling and monitoring
- Cost optimization patterns
- Multi-provider integration
Notes
- Version 2.0: Enhanced with comprehensive prompt caching optimization guide
- Focus: Cost reduction (36-90%), latency improvement (up to 80%)
- Coverage: All major providers (Claude, GPT, Gemini, Grok, DeepSeek, MiniMax, Kimi)
- Production-Ready: Battle-tested patterns from real-world deployments
- Updated: January 2026 with latest Venice AI capabilities
Success Metrics
Track these KPIs to measure caching effectiveness:
Cost Metrics
cost_per_1k_tokens = total_cost / (total_tokens / 1000)
savings_percentage = (uncached_cost - cached_cost) / uncached_cost * 100
roi_requests = write_premium_cost / read_savings_per_request # Break-even point
Performance Metrics
cache_hit_rate = cached_tokens / total_prompt_tokens * 100
avg_latency_improvement = (uncached_latency - cached_latency) / uncached_latency * 100
tokens_saved = cached_tokens * (1 - discount_rate)
Quality Metrics
consistency_score = requests_with_cache_hits / total_requests * 100
cache_uptime = requests_within_ttl / total_requests * 100
When NOT to Use Caching
Caching may not be cost-effective when:
- Single-use prompts: No repeat requests to amortize write premium
- Highly dynamic: Every request has unique prefix
- Below threshold: Prompts consistently < 1,024 tokens (or < 4,000 for Claude)
- Infrequent requests: > 10 min between requests (cache expires)
- Rapid iteration: Constantly changing prompt structure during development
Updating
To refresh this skill with updated Venice AI documentation:
- Check for new features at https://docs.venice.ai
- Re-run skill-seekers scrape with updated config
- Merge new examples into Quick Reference section
- Update provider-specific details if pricing changes
- Increment version number and document changes