venice-ai

📁 majiayu000/claude-skill-registry 📅 9 days ago
Total installs: 1
Weekly installs: 1
Site-wide rank: #53908

Install command
npx skills add https://github.com/majiayu000/claude-skill-registry --skill venice-ai

Agent install distribution

kilo 1
amp 1
opencode 1
codex 1
github-copilot 1
claude-code 1

Skill documentation

Venice AI Skill

Expert assistance with Venice AI platform development, featuring advanced prompt caching optimization, API integration patterns, and cost-effective AI model access across multiple providers (Anthropic Claude, OpenAI GPT, Google Gemini, xAI Grok, DeepSeek, and more).

When to Use This Skill

Core Use Cases

  • Prompt Caching Optimization: Reduce API costs by 36-90% and latency by up to 80%
  • Multi-Provider AI Integration: Access Claude, GPT, Gemini, Grok, DeepSeek via unified API
  • Cost Optimization: Implement caching strategies for high-volume AI applications
  • Performance Tuning: Optimize prompt structure for maximum cache hit rates
  • Conversation Management: Build efficient multi-turn chatbots with cached context

Trigger Phrases

  • “Optimize Venice AI costs with prompt caching”
  • “Implement prompt caching for [Claude/GPT/Gemini]”
  • “Reduce AI API latency using Venice”
  • “Build cost-efficient chatbot with Venice AI”
  • “Integrate multiple AI providers through Venice”
  • “Debug Venice AI cache hit rates”
  • “Structure prompts for maximum caching efficiency”

Quick Reference

Prompt Caching Fundamentals

How It Works: Caching uses prefix matching—identical prompt beginnings across requests retrieve pre-processed tokens instead of recomputing them.

Key Benefit: Reduce costs by 36-90% and latency by up to 80% for longer prompts.

Minimum Requirements:

  • Claude Opus 4.5: ~4,000 tokens
  • GPT, Gemini, Grok, DeepSeek: ~1,024 tokens
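Because caching only activates above these thresholds, it can help to estimate prompt length before relying on cache hits. A minimal sketch, assuming a rough four-characters-per-token heuristic (an approximation, not the real tokenizer for any of these models):

CACHE_MINIMUMS = {
    "claude-opus-4-5": 4000,   # threshold quoted above for Claude Opus 4.5
    "default": 1024,           # GPT, Gemini, Grok, DeepSeek
}

def estimate_tokens(text: str) -> int:
    # ~4 characters per token is a rough heuristic only
    return len(text) // 4

def is_cacheable(prompt: str, model: str) -> bool:
    minimum = CACHE_MINIMUMS.get(model, CACHE_MINIMUMS["default"])
    return estimate_tokens(prompt) >= minimum

If a prompt falls short of the threshold, bundling extra static context (documentation, examples) can push it over, as noted in the Best Practices section.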

Common Patterns

1. Basic Chatbot with Cached System Prompt

import os
import requests

# Read the Venice API key from the environment (adjust to your setup)
API_KEY = os.environ["VENICE_API_KEY"]

# 2,000-token system prompt (cached after first request)
system_prompt = """You are a helpful AI assistant specialized in..."""

# First request - writes to cache
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What's the weather?"}
        ],
        "prompt_cache_key": "user-session-123"  # Route to same server
    }
)

# Result: 2,050 tokens processed, 0 cached, full cost

# Subsequent requests - reads from cache
response2 = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {"role": "system", "content": system_prompt},  # MUST be byte-identical
            {"role": "user", "content": "Tell me a joke"}
        ],
        "prompt_cache_key": "user-session-123"  # Same key = better hit rate
    }
)

# Result: 2,000 cached tokens (billed at the 90% discount), 80 new tokens at full price
# Roughly 86% cheaper than running the same request without caching

2. Claude-Specific Cache Control (Advanced)

# For Claude: Explicitly mark cache breakpoints
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "claude-opus-4-5",
        "messages": [
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "You are a helpful AI assistant...",
                        "cache_control": {"type": "ephemeral"}  # Cache this block
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Summarize this document:\n\n{long_document}",
                        "cache_control": {"type": "ephemeral"}  # Cache the document too
                    }
                ]
            }
        ]
    }
)

# Venice auto-applies cache_control to system prompts for Claude
# Manual markers needed for caching beyond system messages

3. Monitoring Cache Performance

response = requests.post(...)
usage = response.json()["usage"]

cached_tokens = usage["prompt_tokens_details"]["cached_tokens"]
total_prompt_tokens = usage["prompt_tokens"]
cache_hit_rate = (cached_tokens / total_prompt_tokens) * 100

print(f"Cache Hit Rate: {cache_hit_rate:.1f}%")
print(f"Cached Tokens: {cached_tokens} (90% discount)")
print(f"New Tokens: {total_prompt_tokens - cached_tokens} (full price)")

# Troubleshooting
if cache_hit_rate == 0:
    print("⚠️ Zero cache hits - check:")
    print("  • Prompt below minimum threshold (~1,024 or ~4,000 for Claude)?")
    print("  • Prompt prefix changed (whitespace, timestamps)?")
    print("  • First request (writes cache, doesn't read)?")
    print("  • Cache expired (>5-10 min since last request)?")

4. Cost-Optimized Prompt Structure

from datetime import datetime

# ❌ BAD: Dynamic content first (breaks cache)
prompt = f"""
Current time: {datetime.now()}
User ID: {user_id}
System: You are a helpful assistant...
"""

# ✅ GOOD: Static content first (maximizes cache)
# static_docs (~5,000 tokens) and static_examples (~2,000 tokens) are cached
prompt = f"""
System: You are a helpful assistant with these capabilities:
- Answer questions
- Provide code examples
- Debug issues

Reference documentation:
{static_docs}

Examples:
{static_examples}
"""

# Add dynamic content AFTER the static prefix
messages = [
    {"role": "system", "content": prompt},  # ~7,000 tokens cached
    {"role": "user", "content": f"User {user_id} asks: {question}"}  # New tokens only
]

5. Multi-Turn Conversation with Caching

// TypeScript example with conversation history
type Message = { role: "system" | "user" | "assistant"; content: string };

const conversationHistory: Message[] = [];
const SYSTEM_PROMPT = "...";  // 2,000 tokens

async function chat(userMessage: string, sessionId: string) {
  conversationHistory.push({
    role: "user",
    content: userMessage
  });

  const response = await fetch("https://api.venice.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.VENICE_API_KEY}`,
      "Content-Type": "application/json"
    },
    body: JSON.stringify({
      model: "gpt-5.2",  // Auto-caching, no cache_control needed
      messages: [
        { role: "system", content: SYSTEM_PROMPT },  // Cached
        ...conversationHistory  // Growing conversation - partially cached
      ],
      prompt_cache_key: sessionId  // Route to same server for warm cache
    })
  });

  const data = await response.json();
  conversationHistory.push({
    role: "assistant",
    content: data.choices[0].message.content
  });

  return {
    message: data.choices[0].message.content,
    cached_tokens: data.usage.prompt_tokens_details.cached_tokens,
    cache_savings_pct: (data.usage.prompt_tokens_details.cached_tokens / data.usage.prompt_tokens) * 100
  };
}

Prompt Caching Best Practices

1. Optimal Prompt Structure

Place static content before dynamic content to maximize the cached prefix; the sketch after these lists assembles messages in this order:

✅ GOOD ORDER:
- System prompt (always static)
- Reference documents (static)
- Examples (static)
- Tools/function definitions (static)
- Conversation history (grows but prefix is stable)
- Current user message (dynamic)

❌ BAD ORDER:
- Timestamp or request ID
- User message
- System prompt
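A minimal sketch of the GOOD ordering above, assuming hypothetical variables (reference_docs, examples, history, user_message). Tool definitions, when used, go in the separate tools parameter and are likewise part of the cached prefix (see Advanced Use Cases):

def build_messages(reference_docs: str, examples: str, history: list, user_message: str) -> list:
    static_system = (
        "You are a helpful assistant.\n\n"
        f"Reference documents:\n{reference_docs}\n\n"
        f"Examples:\n{examples}"
    )
    return [
        {"role": "system", "content": static_system},   # static, cacheable prefix
        *history,                                        # grows, but its prefix stays stable
        {"role": "user", "content": user_message},       # dynamic, processed fresh
    ]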

2. Maintain Byte-Identical Prefixes

Cache keys derive from exact byte sequences. Avoid:

  • Different whitespace or line breaks
  • Timestamps within prompts
  • Random UUIDs before static content
  • Inconsistent formatting
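One way to catch accidental prefix drift is to hash the static portion of the prompt and warn when it changes between requests. A small sketch; the hashing and warning approach are assumptions, not anything Venice requires:

import hashlib

_last_prefix_hash = None

def check_prefix_stability(static_prefix: str) -> None:
    """Warn if the supposedly static prefix changed since the last request."""
    global _last_prefix_hash
    current = hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()
    if _last_prefix_hash is not None and current != _last_prefix_hash:
        print("⚠️ Static prefix changed - the cache will be rewritten")
    _last_prefix_hash = current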

3. Meet Minimum Token Thresholds

  • Claude Opus 4.5: ~4,000 tokens minimum
  • GPT, Gemini, Grok, DeepSeek: ~1,024 tokens minimum
  • Bundle context if below threshold

4. Use prompt_cache_key for Multi-Turn

Consistent routing hint across conversation improves hit rates:

prompt_cache_key = f"session-{user_id}-{conversation_id}"

5. Monitor Cache Economics

Cost Comparison:

Without Caching:
  Request 1: 2,050 tokens × $6.00/1M = $0.0123
  Request 2: 2,080 tokens × $6.00/1M = $0.0125
  Request 3: 2,120 tokens × $6.00/1M = $0.0127
  Total: $0.0375

With Caching (Claude):
  Request 1: 2,050 × $7.50/1M = $0.0154 (write premium)
  Request 2: 2,000 cached × $0.60/1M + 80 new × $6.00/1M = $0.0017
  Request 3: 2,000 cached × $0.60/1M + 120 new × $6.00/1M = $0.0019
  Total: $0.0190

Savings: 49% (break-even after 2nd request)
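The comparison above can be reproduced with a few lines of arithmetic. This sketch uses the same prices; note it applies the +25% write premium only to the cached prefix, whereas the example applies it to the whole first request, so the result lands at roughly the same ~49-50%:

INPUT_PRICE = 6.00 / 1_000_000   # $/token, from the example above
WRITE_PRICE = 7.50 / 1_000_000   # Claude cache-write premium
READ_PRICE  = 0.60 / 1_000_000   # Claude cache-read price

def uncached_cost(prompt_tokens: int) -> float:
    return prompt_tokens * INPUT_PRICE

def cached_cost(cached: int, new: int, first_request: bool) -> float:
    if first_request:
        # the first request writes the cached prefix at the premium rate
        return cached * WRITE_PRICE + new * INPUT_PRICE
    return cached * READ_PRICE + new * INPUT_PRICE

turns = [(2000, 50), (2000, 80), (2000, 120)]   # (cached prefix, new tokens) per request
without = sum(uncached_cost(c + n) for c, n in turns)
with_cache = sum(cached_cost(c, n, first_request=(i == 0)) for i, (c, n) in enumerate(turns))
print(f"Savings: {(without - with_cache) / without:.0%}")   # ~50%, in line with the 49% above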

Provider-Specific Details

Claude (Anthropic)

  • Write Premium: +25% ($7.50/1M vs $6.00/1M)
  • Read Discount: 90% ($0.60/1M)
  • Requires: Explicit cache_control: {"type": "ephemeral"} markers
  • Minimum Tokens: ~4,000
  • Cache Lifetime: 5 minutes
  • Breakpoints: Up to 4 per request
  • Break-Even: 2nd request using same prefix

GPT (OpenAI)

  • Write Premium: None
  • Read Discount: 90%
  • Auto-Caching: No configuration needed
  • Minimum Tokens: ~1,024
  • Cache Lifetime: 5-10 minutes

Gemini (Google)

  • Write Premium: None
  • Read Discount: 75-90%
  • Auto-Caching: Yes
  • Minimum Tokens: ~1,024
  • Cache Lifetime: 1 hour (longest)

Grok (xAI)

  • Write Premium: None
  • Read Discount: 75-88%
  • Auto-Caching: Yes
  • Minimum Tokens: ~1,024
  • Cache Lifetime: 5 minutes

DeepSeek

  • Write Premium: None
  • Read Discount: 50%
  • Auto-Caching: Yes
  • Minimum Tokens: ~1,024
  • Cache Lifetime: 5 minutes
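The differences above fit into a small lookup table. A sketch using the figures from this section (Gemini and Grok use the low end of their quoted discount ranges; treat all values as documentation numbers that may change with pricing):

import math

PROVIDER_CACHE = {
    "claude":   {"write_premium": 0.25, "read_discount": 0.90, "min_tokens": 4000, "ttl_min": 5},
    "gpt":      {"write_premium": 0.00, "read_discount": 0.90, "min_tokens": 1024, "ttl_min": 10},
    "gemini":   {"write_premium": 0.00, "read_discount": 0.75, "min_tokens": 1024, "ttl_min": 60},
    "grok":     {"write_premium": 0.00, "read_discount": 0.75, "min_tokens": 1024, "ttl_min": 5},
    "deepseek": {"write_premium": 0.00, "read_discount": 0.50, "min_tokens": 1024, "ttl_min": 5},
}

def break_even_requests(provider: str) -> int:
    """Requests over the same prefix needed before caching pays for itself."""
    p = PROVIDER_CACHE[provider]
    if p["write_premium"] == 0:
        return 2   # no write cost, so the first reuse already saves money
    # The one-time write premium must be amortized by the per-reuse read savings
    return 1 + math.ceil(p["write_premium"] / p["read_discount"])

print(break_even_requests("claude"))   # 2, matching "Break-Even: 2nd request" above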

Troubleshooting Guide

Zero Cached Tokens

Symptoms: cached_tokens: 0 in every response

Causes:

  1. Below Threshold: Prompt < 1,024 tokens (or < 4,000 for Claude)
    • Fix: Add more context, examples, or documentation
  2. Dynamic Prefix: Changing content at start of prompt
    • Fix: Move timestamps/IDs to end, maintain static prefix
  3. First Request: Initial request writes cache, doesn’t read
    • Fix: Normal behavior, subsequent requests will show caching
  4. Cache Expired: > 5-10 min since last request
    • Fix: Increase request frequency or accept cold starts
  5. Server Inconsistency: Requests hitting different servers
    • Fix: Use prompt_cache_key for routing affinity

Continuous Cache Writes

Symptoms: cache_creation_input_tokens > 0 on every request

Causes:

  1. Prompt Mutations: Small changes breaking byte identity
    • Fix: Remove timestamps, use consistent formatting
  2. Missing cache_control (Claude only)
    • Fix: Add explicit markers or rely on Venice auto-application
  3. Insufficient Tokens: Below minimum threshold
    • Fix: Enrich prompt with more context

Higher Than Expected Costs

Causes:

  1. Claude Write Premium: Not enough reuse to amortize +25% cost
    • Fix: Ensure at least 3-4 requests reuse same prefix
  2. Changing Prompts: Every request = new cache write
    • Fix: Stabilize prompt structure
  3. Poor Structure: Dynamic before static
    • Fix: Reorganize (see Best Practices #1)

Low Cache Hit Rate

Target: > 70% for chatbots, > 50% for variable workloads

Diagnostic:

hit_rate = (cached_tokens / prompt_tokens) * 100
if hit_rate < 50:
    print("Investigate: Prompt structure, token threshold, or expiration")

Reference Files

This skill includes comprehensive documentation in references/:

  • llms-txt.md – Complete Venice AI documentation
  • llms-full.md – Extended documentation with examples
  • llms.md – Quick reference summary

Use the read tool to view specific reference files when detailed information is needed.

Advanced Use Cases

1. Caching with Tools and Functions

Function definitions integrate into cached prefix:

# Function definitions are part of static prefix
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                }
            }
        }
    }
]

# Tools definition cached along with system prompt
response = requests.post(
    "https://api.venice.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "gpt-5.2",
        "messages": [...],
        "tools": tools  # Included in cached prefix
    }
)

2. Caching with Images (Vision Models)

Cache images and accompanying text for repeated questions:

# First request - cache image + context
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": "data:image/jpeg;base64,..."}
            },
            {
                "type": "text",
                "text": "This is a diagram of our system architecture."
            }
        ]
    },
    {
        "role": "user",
        "content": "What are the main components?"
    }
]

# Subsequent requests - image and context cached
messages.append({
    "role": "user",
    "content": "How does data flow between components?"
})
# Only new question processed, image + context from cache

3. Document Analysis with Caching

Load large documents once, ask multiple questions:

# 10,000-token document cached after first request
document = """
[Long legal contract, technical manual, or research paper]
"""

def analyze_document(question: str):
    return requests.post(
        "https://api.venice.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "claude-opus-4-5",
            "messages": [
                {
                    "role": "system",
                    "content": f"Analyze this document:\n\n{document}"
                },
                {"role": "user", "content": question}
            ],
            "prompt_cache_key": "doc-analysis-session-xyz"
        }
    )

# Each question only processes new query, document stays cached
analyze_document("What are the key terms?")
analyze_document("Summarize section 3")
analyze_document("What are the risks?")
# 10,000 tokens cached at 90% discount × 3 requests = massive savings

4. RAG (Retrieval Augmented Generation) with Caching

Cache retrieved context for follow-up questions:

# Retrieve relevant chunks from vector DB
relevant_chunks = vector_db.search(user_query)
context = "\n\n".join([chunk.text for chunk in relevant_chunks])

# Cache the retrieved context (5,000 tokens)
messages = [
    {
        "role": "system",
        "content": f"""Use this context to answer questions:

{context}

Context above is authoritative. If answer not in context, say so."""
    },
    {"role": "user", "content": user_query}
]

# Follow-up questions reuse cached context
# Only new query processed, context cached

5. Batch Processing with Shared Context

Process multiple items with same instructions:

# Shared instructions cached
instructions = """
You are a code reviewer. For each code sample, check:
- Style compliance
- Security vulnerabilities
- Performance issues
- Best practices
"""

def review_code(code_sample: str, batch_id: str):
    return requests.post(
        "https://api.venice.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "gpt-5.2",
            "messages": [
                {"role": "system", "content": instructions},
                {"role": "user", "content": f"Review:\n\n```\n{code_sample}\n```"}
            ],
            "prompt_cache_key": f"batch-{batch_id}"
        }
    )

# Review 100 files - instructions cached for all
for file in code_files:
    review_code(file.content, "batch-001")
# 100 requests, instructions cached 99 times

Working with This Skill

For Prompt Caching Optimization

  1. Assess Current Usage: Calculate cache hit rate and cost per 1K tokens
  2. Restructure Prompts: Move static content before dynamic
  3. Implement Monitoring: Track cached_tokens in responses
  4. Tune for Economics: Balance write premium (Claude) vs read savings

For API Integration

  1. Choose Provider: Claude (quality), GPT (balanced), Gemini (cost), Grok (speed)
  2. Set Up Authentication: Get API key from Venice.ai dashboard
  3. Implement Base Client: Use examples from Quick Reference
  4. Add Error Handling: Handle rate limits, timeouts, cache misses
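For step 4, a minimal client sketch with retries is shown below. Only the endpoint and Authorization header come from this document; the timeout, backoff policy, and retry count are assumptions to adapt to your workload:

import os
import time
import requests

API_KEY = os.environ["VENICE_API_KEY"]
BASE_URL = "https://api.venice.ai/api/v1"

def chat(payload: dict, max_retries: int = 3) -> dict:
    """POST a chat completion, retrying on rate limits and timeouts."""
    for attempt in range(max_retries):
        try:
            resp = requests.post(
                f"{BASE_URL}/chat/completions",
                headers={"Authorization": f"Bearer {API_KEY}"},
                json=payload,
                timeout=60,
            )
            if resp.status_code == 429:   # rate limited - back off and retry
                time.sleep(2 ** attempt)
                continue
            resp.raise_for_status()
            return resp.json()
        except requests.Timeout:
            time.sleep(2 ** attempt)
    raise RuntimeError("Venice AI request failed after retries")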

For Cost Reduction

  1. Identify High-Volume Endpoints: Where caching has biggest impact
  2. Measure Baseline: Cost and latency without caching
  3. Apply Caching Patterns: Use examples from this skill
  4. Validate Savings: Compare cached vs uncached costs

For Code Examples

The Quick Reference section contains 5 production-ready patterns. The Advanced Use Cases section covers 5 specialized scenarios.

Performance Benchmarks

Latency Improvements

  • Short prompts (< 1,000 tokens): Minimal impact (~5-10% faster)
  • Medium prompts (1,000-5,000 tokens): 30-50% faster
  • Long prompts (> 5,000 tokens): Up to 80% faster

Cost Reductions

  • Single request: No savings (cache write cost)
  • 2-5 requests: 30-50% savings
  • 10+ requests: 60-80% savings
  • 100+ requests: Up to 90% savings

Cache Hit Rates

  • Chatbots (stable system prompt): 70-90%
  • Document analysis: 80-95%
  • RAG systems: 60-80%
  • Batch processing: 85-95%

API Reference Summary

Base URL

https://api.venice.ai/api/v1

Authentication

Authorization: Bearer YOUR_API_KEY

Key Parameters

Parameter        | Type   | Required | Description
model            | string | Yes      | Model ID (e.g., claude-opus-4-5, gpt-5.2)
messages         | array  | Yes      | Conversation messages
prompt_cache_key | string | No       | Routing hint for cache affinity
cache_control    | object | No       | Cache markers (Claude only)
temperature      | float  | No       | Sampling temperature (0-2)
max_tokens       | int    | No       | Maximum completion length
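A request payload combining these parameters might look like the following sketch (the values are illustrative, not recommendations):

payload = {
    "model": "claude-opus-4-5",
    "messages": [
        {"role": "system", "content": "You are a helpful AI assistant..."},
        {"role": "user", "content": "Summarize the attached report."},
    ],
    "prompt_cache_key": "session-42",   # routing hint for cache affinity
    "temperature": 0.2,
    "max_tokens": 512,
}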

Response Fields

{
  "usage": {
    "prompt_tokens": 5500,
    "completion_tokens": 200,
    "prompt_tokens_details": {
      "cached_tokens": 5000,          // Tokens from cache (90% discount)
      "cache_creation_input_tokens": 0 // Tokens written to cache (may have premium)
    }
  }
}

Resources

references/

Organized documentation extracted from official Venice AI sources:

  • llms-txt.md: Complete API documentation and guides
  • llms-full.md: Extended documentation with additional examples
  • llms.md: Quick reference for common operations

Code Examples

This skill includes 10+ production-ready code examples:

  • Python, TypeScript, JavaScript
  • Error handling and monitoring
  • Cost optimization patterns
  • Multi-provider integration

Notes

  • Version 2.0: Enhanced with comprehensive prompt caching optimization guide
  • Focus: Cost reduction (36-90%), latency improvement (up to 80%)
  • Coverage: All major providers (Claude, GPT, Gemini, Grok, DeepSeek, MiniMax, Kimi)
  • Production-Ready: Battle-tested patterns from real-world deployments
  • Updated: January 2026 with latest Venice AI capabilities

Success Metrics

Track these KPIs to measure caching effectiveness:

Cost Metrics

cost_per_1k_tokens = total_cost / (total_tokens / 1000)
savings_percentage = (uncached_cost - cached_cost) / uncached_cost * 100
roi_requests = write_premium_cost / read_savings_per_request  # Break-even point

Performance Metrics

cache_hit_rate = cached_tokens / total_prompt_tokens * 100
avg_latency_improvement = (uncached_latency - cached_latency) / uncached_latency * 100
tokens_saved = cached_tokens * discount_rate  # e.g., discount_rate = 0.9 for a 90% read discount

Quality Metrics

consistency_score = requests_with_cache_hits / total_requests * 100
cache_uptime = requests_within_ttl / total_requests * 100
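A small helper that computes these KPIs from a list of response usage payloads, assuming the OpenAI-style fields shown in the API Reference Summary:

def caching_metrics(usages: list[dict], discount_rate: float = 0.9) -> dict:
    """Aggregate cache KPIs from response 'usage' objects."""
    total_prompt = sum(u["prompt_tokens"] for u in usages)
    total_cached = sum(u["prompt_tokens_details"]["cached_tokens"] for u in usages)
    hits = sum(1 for u in usages if u["prompt_tokens_details"]["cached_tokens"] > 0)
    return {
        "cache_hit_rate": 100 * total_cached / total_prompt if total_prompt else 0.0,
        "consistency_score": 100 * hits / len(usages) if usages else 0.0,
        "token_cost_saved": total_cached * discount_rate,  # cost-equivalent tokens saved
    }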

When NOT to Use Caching

Caching may not be cost-effective when:

  • Single-use prompts: No repeat requests to amortize write premium
  • Highly dynamic: Every request has unique prefix
  • Below threshold: Prompts consistently < 1,024 tokens (or < 4,000 for Claude)
  • Infrequent requests: > 10 min between requests (cache expires)
  • Rapid iteration: Constantly changing prompt structure during development
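These conditions can be rolled into a quick go/no-go check. A sketch, with the token thresholds and 10-minute expiry window taken from the list above and everything else an assumption:

def caching_worthwhile(prompt_tokens: int, expected_reuses: int,
                       seconds_between_requests: float, model: str = "gpt") -> bool:
    """Rough heuristic for whether prompt caching is likely to pay off."""
    minimum = 4000 if model.startswith("claude") else 1024
    if prompt_tokens < minimum:
        return False   # below the caching threshold
    if expected_reuses < 2:
        return False   # single-use prompt, nothing to amortize
    if seconds_between_requests > 600:
        return False   # cache will likely expire between requests
    return True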

Updating

To refresh this skill with updated Venice AI documentation:

  1. Check for new features at https://docs.venice.ai
  2. Re-run skill-seekers scrape with updated config
  3. Merge new examples into Quick Reference section
  4. Update provider-specific details if pricing changes
  5. Increment version number and document changes