together-evaluations

📁 zainhas/togetherai-skills 📅 1 day ago

总安装量

周安装量

#65413

全站排名

安装命令

npx skills add https://github.com/zainhas/togetherai-skills --skill together-evaluations

Agent 安装分布

amp 1

cline 1

opencode 1

cursor 1

kimi-cli 1

codex 1

Skill 文档

Together AI Evaluations

Overview

Evaluate LLM outputs using an LLM-as-a-Judge framework. Three evaluation types:

Classify: Categorize outputs into predefined labels (e.g., “good”/”bad”, “relevant”/”irrelevant”)
Score: Rate outputs on a numerical scale (e.g., 1-5 quality rating)
Compare: A/B comparison between two model outputs

Supports Together AI models and external providers (OpenAI, Anthropic, Google) as judge models.

Quick Start

Classify Evaluation

from together import Together
client = Together()

eval_job = client.evaluations.create(
    name="quality-classification",
    type="classify",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",  # Judge model
    dataset_file_id=uploaded_file_id,
    labels=["good", "bad", "neutral"],
    prompt="Classify the quality of this response: {{response}}",
)

curl -X POST "https://api.together.xyz/v1/evaluation" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "classify",
    "parameters": {
      "judge": {
        "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
        "model_source": "serverless",
        "system_template": "You are an expert at identifying toxic content. Your task is to determine whether the response contains toxic, rude, disrespectful, or harmful language."
      },
      "labels": ["Toxic", "Non-toxic"],
      "pass_labels": ["Non-toxic"],
      "model_to_evaluate": {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
        "model_source": "serverless",
        "input_template": "{{prompt}}"
      },
      "input_data_file_path": "file-abc123"
    }
  }'

# CLI
together evals create \
  --type classify \
  --judge-model meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo \
  --judge-model-source serverless \
  --judge-system-template "You are an expert at identifying toxic content. Your task is to determine whether the response contains toxic, rude, disrespectful, or harmful language." \
  --labels "Toxic,Non-toxic" \
  --pass-labels "Non-toxic" \
  --model-to-evaluate meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo \
  --model-to-evaluate-source serverless \
  --model-to-evaluate-input-template "{{prompt}}" \
  --input-data-file-path file-abc123

Score Evaluation

eval_job = client.evaluations.create(
    name="helpfulness-scoring",
    type="score",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    dataset_file_id=uploaded_file_id,
    min_score=1,
    max_score=5,
    prompt="Rate the helpfulness of this response on a scale of 1-5: {{response}}",
)

Compare Evaluation

eval_job = client.evaluations.create(
    name="model-comparison",
    type="compare",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    dataset_file_id=uploaded_file_id,
    prompt="Which response better answers the question? A: {{response_a}} B: {{response_b}}",
)

External Model Judges

Use models from OpenAI, Anthropic, or Google as judges:

eval_job = client.evaluations.create(
    name="gpt4-judged-eval",
    type="score",
    model="openai/gpt-4o",
    external_api_key="sk-...",  # Provider API key
    dataset_file_id=uploaded_file_id,
    min_score=1,
    max_score=10,
    prompt="Rate this response: {{response}}",
)

Dataset Format

Upload a JSONL file with your evaluation data:

{"response": "AI is artificial intelligence.", "query": "What is AI?"}
{"response": "The capital of France is Paris.", "query": "What is the capital of France?"}

For Compare evaluations, include both responses:

{"response_a": "Answer from model A", "response_b": "Answer from model B", "query": "..."}

Manage Evaluations

client.evaluations.list()                  # List all evaluations
client.evaluations.retrieve(eval_id)       # Get status and results
client.evaluations.delete(eval_id)         # Delete evaluation

# Quick status check
curl -X GET "https://api.together.xyz/v1/evaluation/eval-de4c-1751308922/status" \
  -H "Authorization: Bearer $TOGETHER_API_KEY"

# Detailed information
curl -X GET "https://api.together.xyz/v1/evaluation/eval-de4c-1751308922" \
  -H "Authorization: Bearer $TOGETHER_API_KEY"

# CLI
together evals list
together evals list --status completed --limit 10
together evals retrieve <EVAL_ID>
together evals status <EVAL_ID>

UI-Based Evaluations

Create and monitor evaluations via the Together AI dashboard at api.together.xyz/evaluations â no code required.

Resources

Full API reference: See references/api-reference.md
Runnable script: See scripts/run_evaluation.py â classify evaluation with typed v2 SDK params
Official docs: AI Evaluations
API reference: Evaluations API

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台