together-evaluations
Total installs: 2
Weekly installs: 1
Site rank: #65413
Install command:
npx skills add https://github.com/zainhas/togetherai-skills --skill together-evaluations
Install distribution by agent:
amp: 1, cline: 1, opencode: 1, cursor: 1, kimi-cli: 1, codex: 1
Skill Documentation
Together AI Evaluations
Overview
Evaluate LLM outputs using an LLM-as-a-Judge framework. Three evaluation types:
- Classify: Categorize outputs into predefined labels (e.g., "good"/"bad", "relevant"/"irrelevant")
- Score: Rate outputs on a numerical scale (e.g., 1-5 quality rating)
- Compare: A/B comparison between two model outputs
Supports Together AI models and external providers (OpenAI, Anthropic, Google) as judge models.
Quick Start
Classify Evaluation
from together import Together

client = Together()

eval_job = client.evaluations.create(
    name="quality-classification",
    type="classify",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",  # Judge model
    dataset_file_id=uploaded_file_id,
    labels=["good", "bad", "neutral"],
    prompt="Classify the quality of this response: {{response}}",
)
curl -X POST "https://api.together.xyz/v1/evaluation" \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "classify",
    "parameters": {
      "judge": {
        "model": "meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",
        "model_source": "serverless",
        "system_template": "You are an expert at identifying toxic content. Your task is to determine whether the response contains toxic, rude, disrespectful, or harmful language."
      },
      "labels": ["Toxic", "Non-toxic"],
      "pass_labels": ["Non-toxic"],
      "model_to_evaluate": {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
        "model_source": "serverless",
        "input_template": "{{prompt}}"
      },
      "input_data_file_path": "file-abc123"
    }
  }'
# CLI
together evals create \
--type classify \
--judge-model meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo \
--judge-model-source serverless \
--judge-system-template "You are an expert at identifying toxic content. Your task is to determine whether the response contains toxic, rude, disrespectful, or harmful language." \
--labels "Toxic,Non-toxic" \
--pass-labels "Non-toxic" \
--model-to-evaluate meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo \
--model-to-evaluate-source serverless \
--model-to-evaluate-input-template "{{prompt}}" \
--input-data-file-path file-abc123
Score Evaluation
eval_job = client.evaluations.create(
    name="helpfulness-scoring",
    type="score",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    dataset_file_id=uploaded_file_id,
    min_score=1,
    max_score=5,
    prompt="Rate the helpfulness of this response on a scale of 1-5: {{response}}",
)
Compare Evaluation
eval_job = client.evaluations.create(
    name="model-comparison",
    type="compare",
    model="meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
    dataset_file_id=uploaded_file_id,
    prompt="Which response better answers the question? A: {{response_a}} B: {{response_b}}",
)
External Model Judges
Use models from OpenAI, Anthropic, or Google as judges:
eval_job = client.evaluations.create(
    name="gpt4-judged-eval",
    type="score",
    model="openai/gpt-4o",
    external_api_key="sk-...",  # Provider API key
    dataset_file_id=uploaded_file_id,
    min_score=1,
    max_score=10,
    prompt="Rate this response: {{response}}",
)
Dataset Format
Upload a JSONL file with your evaluation data:
{"response": "AI is artificial intelligence.", "query": "What is AI?"}
{"response": "The capital of France is Paris.", "query": "What is the capital of France?"}
For Compare evaluations, include both responses:
{"response_a": "Answer from model A", "response_b": "Answer from model B", "query": "..."}
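A dataset in the format above can be generated with the standard `json` module; this sketch writes the classify/score rows shown earlier (field names taken from the examples, file name arbitrary):

```python
import json

rows = [
    {"response": "AI is artificial intelligence.", "query": "What is AI?"},
    {"response": "The capital of France is Paris.", "query": "What is the capital of France?"},
]

# JSONL: exactly one JSON object per line, no enclosing array
with open("eval_dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Sanity-check: every line must parse back to a dict with the expected keys
with open("eval_dataset.jsonl", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]
assert all({"response", "query"} <= row.keys() for row in parsed)
```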
Manage Evaluations
client.evaluations.list() # List all evaluations
client.evaluations.retrieve(eval_id) # Get status and results
client.evaluations.delete(eval_id) # Delete evaluation
# Quick status check
curl -X GET "https://api.together.xyz/v1/evaluation/eval-de4c-1751308922/status" \
-H "Authorization: Bearer $TOGETHER_API_KEY"
# Detailed information
curl -X GET "https://api.together.xyz/v1/evaluation/eval-de4c-1751308922" \
-H "Authorization: Bearer $TOGETHER_API_KEY"
# CLI
together evals list
together evals list --status completed --limit 10
together evals retrieve <EVAL_ID>
together evals status <EVAL_ID>
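Evaluation jobs run asynchronously, so scripts typically poll the status endpoint above until the job reaches a terminal state. The status strings below are assumptions (confirm against the actual status response); a minimal polling sketch:

```python
import time

# Assumed terminal status values -- verify against the status endpoint
TERMINAL_STATUSES = {"completed", "error"}

def is_terminal(status: str) -> bool:
    """Return True when a job status means polling can stop."""
    return status.lower() in TERMINAL_STATUSES

def wait_for_eval(fetch_status, interval_s=10, timeout_s=600):
    """Poll `fetch_status` (a callable returning the current status
    string) until the job finishes or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = fetch_status()
        if is_terminal(status):
            return status
        time.sleep(interval_s)
    raise TimeoutError("evaluation did not finish in time")

# Usage (sketch): wait_for_eval(lambda: client.evaluations.retrieve(eval_id).status)
```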
UI-Based Evaluations
Create and monitor evaluations via the Together AI dashboard at api.together.xyz/evaluations; no code required.
Resources
- Full API reference: See references/api-reference.md
- Runnable script: See scripts/run_evaluation.py (classify evaluation with typed v2 SDK params)
- Official docs: AI Evaluations
- API reference: Evaluations API