aws-bedrock-evals

Install command
npx skills add https://github.com/antstackio/skills --skill aws-bedrock-evals


Skill Documentation

AWS Bedrock Evaluation Jobs


Overview

Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the “judge”) to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.

Pre-computed Inference vs Live Inference

| Mode | How it works | Use when |
|------|--------------|----------|
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |

Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.

Pipeline

Design Scenarios → Collect Responses → Upload to S3 → Run Eval Job → Parse Results → Act on Findings
       |                  |                  |               |               |               |
  scenarios.json    Your app's API     s3://bucket/     create-         s3 sync +       Fix prompt,
  (multi-turn)      → dataset JSONL    datasets/        evaluation-job  parse JSONL      retune metrics

Agent Behavior: Gather Inputs and Show Cost Estimate

Before generating any configs, scripts, or artifacts, you MUST gather the following from the user:

  1. AWS Region — Which region to use (default: us-east-1). Affects model availability and pricing.
  2. Target model — The model their application uses (e.g., amazon.nova-lite-v1:0, anthropic.claude-3-haiku).
  3. Evaluator (judge) model — The model to score responses (e.g., amazon.nova-pro-v1:0). Should be at least as capable as the target.
  4. Application type — Brief description of what the app does. Used to design test scenarios and derive custom metrics.
  5. Number of test scenarios — How many they plan to test (recommend 13-20 for first run).
  6. Estimated JSONL entries — Derived from scenarios x avg turns per scenario.
  7. Number of metrics — Total (built-in + custom). Recommend starting with 6 built-in + 3-5 custom.
  8. S3 bucket — Existing bucket name or confirm creation of a new one.
  9. IAM role — Existing role ARN or confirm creation of a new one.

Cost Estimate

After gathering inputs, you MUST display a cost estimate before proceeding:

## Estimated Cost Summary

| Item | Details | Est. Cost |
|------|---------|-----------|
| Response collection | {N} prompts x ~{T} tokens x {target_model_price} | ${X.XX} |
| Evaluation job | {N} prompts x {M} metrics x ~1,700 tokens x {judge_model_price} | ${X.XX} |
| S3 storage | < 1 MB | < $0.01 |
| **Total per run** | | **~${X.XX}** |

Scaling: Each additional run costs ~${X.XX}. Adding 1 custom metric adds ~${Y.YY}/run.

Cost formulas:

  • Response collection: num_prompts x avg_input_tokens x input_price + num_prompts x avg_output_tokens x output_price
  • Evaluation job: num_prompts x num_metrics x ~1,500 input_tokens x judge_input_price + num_prompts x num_metrics x ~200 output_tokens x judge_output_price

Model pricing reference:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|-----------------------|------------------------|
| amazon.nova-lite-v1:0 | $0.06 | $0.24 |
| amazon.nova-pro-v1:0 | $0.80 | $3.20 |
| anthropic.claude-3-haiku | $0.25 | $1.25 |
| anthropic.claude-3-sonnet | $3.00 | $15.00 |
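
These formulas are easy to turn into a quick estimator. A minimal TypeScript sketch, assuming the prices above and the ~1,500/~200 token split per judge call; the collection token counts are rough placeholder defaults, not measured values:

// Rough per-run cost estimator for a pre-computed-inference eval job.
// Prices are USD per 1M tokens, taken from the reference table above.
interface Pricing { input: number; output: number }

const PRICES: Record<string, Pricing> = {
  "amazon.nova-lite-v1:0": { input: 0.06, output: 0.24 },
  "amazon.nova-pro-v1:0": { input: 0.80, output: 3.20 },
};

function estimateEvalCost(opts: {
  numPrompts: number;
  numMetrics: number;
  targetModel: string;
  judgeModel: string;
  avgInputTokens?: number;   // per prompt during response collection (assumed default)
  avgOutputTokens?: number;  // per prompt during response collection (assumed default)
}): { collection: number; judge: number; total: number } {
  const { numPrompts, numMetrics } = opts;
  const target = PRICES[opts.targetModel];
  const judge = PRICES[opts.judgeModel];
  const inTok = opts.avgInputTokens ?? 500;
  const outTok = opts.avgOutputTokens ?? 200;

  // Response collection: prompts x tokens x target-model price
  const collection =
    (numPrompts * inTok * target.input + numPrompts * outTok * target.output) / 1_000_000;

  // Evaluation job: each prompt is judged once per metric (~1,500 in / ~200 out tokens)
  const judgeCost =
    (numPrompts * numMetrics * 1500 * judge.input +
      numPrompts * numMetrics * 200 * judge.output) / 1_000_000;

  return { collection, judge: judgeCost, total: collection + judgeCost };
}

console.log(estimateEvalCost({
  numPrompts: 30,
  numMetrics: 10,
  targetModel: "amazon.nova-lite-v1:0",
  judgeModel: "amazon.nova-pro-v1:0",
}));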

Prerequisites

# AWS CLI 2.33+ required (older versions silently drop customMetricConfig/precomputedInferenceSource fields)
aws --version

# Verify target model access
aws bedrock get-foundation-model --model-identifier "TARGET_MODEL_ID" --region REGION

# Verify evaluator model access
aws bedrock get-foundation-model --model-identifier "EVALUATOR_MODEL_ID" --region REGION

Good evaluator model choices: amazon.nova-pro-v1:0, anthropic.claude-3-sonnet, anthropic.claude-3-haiku. The evaluator should be at least as capable as your target model.


Step 1: Design Test Scenarios

List the application’s functional areas (e.g., greeting, booking-flow, error-handling). Each category should have 2-4 scenarios covering the happy path and edge cases.

Scenario JSON format:

[
  {
    "id": "greeting-known-user",
    "category": "greeting",
    "context": { "userId": "user-123" },
    "turns": ["hello"]
  },
  {
    "id": "multi-step-flow",
    "category": "core-flow",
    "context": { "userId": "user-456" },
    "turns": [
      "hello",
      "I need help with X",
      "yes, proceed with that",
      "thanks"
    ]
  }
]

The context field holds any session/user data your app needs. Each turn in the array is one user message; the collection step handles the multi-turn conversation loop.

Edge case coverage dimensions:

  • Happy path: standard usage that should work perfectly
  • Missing information: user omits required fields
  • Unavailable resources: requested item doesn’t exist
  • Out-of-scope requests: user asks something the app shouldn’t handle
  • Error recovery: bad input, invalid data
  • Tone stress tests: complaints, frustration

Recommended count: 13-20 scenarios producing 30-50 JSONL entries (multi-turn scenarios produce one entry per turn).
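
If you want to keep scenario files consistent across runs, a small type plus a sanity check catches malformed entries before you spend tokens on collection. An illustrative TypeScript sketch; the Scenario shape simply mirrors the JSON format above:

import { readFileSync } from "node:fs";

// Mirrors the scenario JSON format above.
interface Scenario {
  id: string;
  category: string;
  context: Record<string, unknown>;
  turns: string[];
}

// Load and sanity-check scenarios.json before collecting responses.
function loadScenarios(path: string): Scenario[] {
  const scenarios: Scenario[] = JSON.parse(readFileSync(path, "utf-8"));
  const ids = new Set<string>();
  for (const s of scenarios) {
    if (!s.id || !s.category || !Array.isArray(s.turns) || s.turns.length === 0) {
      throw new Error(`Malformed scenario: ${JSON.stringify(s)}`);
    }
    if (ids.has(s.id)) throw new Error(`Duplicate scenario id: ${s.id}`);
    ids.add(s.id);
  }
  return scenarios;
}

const scenarios = loadScenarios("scenarios.json");
console.log(`${scenarios.length} scenarios, ${scenarios.reduce((n, s) => n + s.turns.length, 0)} total turns`);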


Step 2: Collect Responses

Collect responses from your application however it runs. The goal is to produce a JSONL dataset file where each line contains the prompt, the model’s response, and metadata.

Example pattern: Converse API with tool-calling loop (TypeScript)

This applies when your application uses Bedrock with tool calling:

import {
  BedrockRuntimeClient,
  ConverseCommand,
  type Message,
  type SystemContentBlock,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

async function converseLoop(
  messages: Message[],
  systemPrompt: SystemContentBlock[],
  tools: any[]
): Promise<string> {
  const MAX_TOOL_ROUNDS = 10;

  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const response = await client.send(
      new ConverseCommand({
        modelId: "TARGET_MODEL_ID",
        system: systemPrompt,
        messages,
        toolConfig: { tools },
        inferenceConfig: { maxTokens: 1024, topP: 0.9, temperature: 0.7 },
      })
    );

    const assistantContent = response.output?.message?.content as any[];
    if (!assistantContent) return "[No response from model]";

    messages.push({ role: "assistant", content: assistantContent });

    const toolUseBlocks = assistantContent.filter(
      (block: any) => block.toolUse != null
    );

    if (toolUseBlocks.length === 0) {
      return assistantContent
        .filter((block: any) => block.text != null)
        .map((block: any) => block.text as string)
        .join("\n") || "[Empty response]";
    }

    const toolResultBlocks: any[] = [];
    for (const block of toolUseBlocks) {
      const { toolUseId, name, input } = block.toolUse;
      const result = await executeTool(name, input);
      toolResultBlocks.push({
        toolResult: { toolUseId, content: [{ json: result }] },
      });
    }

    messages.push({ role: "user", content: toolResultBlocks } as Message);
  }

  return "[Max tool rounds exceeded]";
}

Multi-turn handling: Maintain the messages array across turns and build the dataset prompt field with conversation history:

const entries: Array<Record<string, unknown>> = []; // dataset rows, written out as JSONL in the next step
const messages: Message[] = [];
const conversationHistory: { role: string; text: string }[] = [];

for (let i = 0; i < scenario.turns.length; i++) {
  const userTurn = scenario.turns[i];
  messages.push({ role: "user", content: [{ text: userTurn }] });

  const assistantText = await converseLoop(messages, systemPrompt, tools);

  conversationHistory.push({ role: "user", text: userTurn });
  conversationHistory.push({ role: "assistant", text: assistantText });

  let prompt: string;
  if (i === 0) {
    prompt = userTurn;
  } else {
    prompt = conversationHistory
      .map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.text}`)
      .join("\n");
  }

  entries.push({
    prompt,
    category: scenario.category,
    referenceResponse: "",
    modelResponses: [
      { response: assistantText, modelIdentifier: "my-app-v1" },
    ],
  });
}

Dataset JSONL Format

Each line must have this structure:

{
  "prompt": "User question or multi-turn history",
  "referenceResponse": "",
  "modelResponses": [
    {
      "response": "The model's actual output text",
      "modelIdentifier": "my-app-v1"
    }
  ]
}
| Field | Required | Notes |
|-------|----------|-------|
| prompt | Yes | User input. For multi-turn, concatenate: User: ...\nAssistant: ...\nUser: ... |
| referenceResponse | No | Expected/ideal response. Can be empty string. Needed for Builtin.Correctness and Builtin.Completeness to work properly. Maps to {{ground_truth}} template variable |
| modelResponses | Yes | Array with exactly one entry for pre-computed inference |
| modelResponses[0].response | Yes | The model’s actual output text |
| modelResponses[0].modelIdentifier | Yes | Any string label. Must match inferenceSourceIdentifier in inference-config.json |

Constraints: One model response per prompt. One unique modelIdentifier per job. Max 1000 prompts per job.
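
For reference, the same structure as a TypeScript type, matching what the collection code above produces, plus a quick pre-upload check against the constraints listed (the category field is optional metadata carried through for later aggregation):

// Shape of one JSONL line, as produced by the collection code above.
interface ModelResponse {
  response: string;          // the model's actual output text
  modelIdentifier: string;   // must match inferenceSourceIdentifier in inference-config.json
}

interface DatasetEntry {
  prompt: string;              // user input, or concatenated multi-turn history
  referenceResponse?: string;  // optional; improves Correctness/Completeness scoring
  category?: string;           // optional metadata used later for per-category aggregation
  modelResponses: ModelResponse[]; // exactly one entry for pre-computed inference
}

// Quick pre-upload check: one response per line, one consistent identifier, max 1000 lines.
function validateDataset(entries: DatasetEntry[]): void {
  if (entries.length > 1000) throw new Error("Max 1000 prompts per job");
  const ids = new Set(entries.map((e) => e.modelResponses[0].modelIdentifier));
  if (ids.size !== 1) throw new Error(`Expected one modelIdentifier, found: ${[...ids].join(", ")}`);
  for (const e of entries) {
    if (!e.prompt || e.modelResponses.length !== 1) {
      throw new Error(`Malformed entry: ${JSON.stringify(e).slice(0, 120)}`);
    }
  }
}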

Write JSONL:

const jsonl = entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
writeFileSync("datasets/collected-responses.jsonl", jsonl, "utf-8");

Step 3: Design Metrics

Built-In Metrics

Bedrock provides 11 built-in metrics requiring no configuration beyond listing them by name:

| Metric Name | What It Measures |
|-------------|------------------|
| Builtin.Correctness | Is the factual content accurate? (works best with referenceResponse) |
| Builtin.Completeness | Does the response fully cover the request? (works best with referenceResponse) |
| Builtin.Faithfulness | Is the response faithful to the provided context/source? |
| Builtin.Helpfulness | Is the response useful, actionable, and cooperative? |
| Builtin.Coherence | Is the response logically structured and easy to follow? |
| Builtin.Relevance | Does the response address the actual question? |
| Builtin.FollowingInstructions | Does the response follow explicit instructions in the prompt? |
| Builtin.ProfessionalStyleAndTone | Are spelling, grammar, and tone appropriate? |
| Builtin.Harmfulness | Does the response contain harmful content? |
| Builtin.Stereotyping | Does the response contain stereotypes or bias? |
| Builtin.Refusal | Does the response appropriately refuse harmful requests? |

Score interpretation: 1.0 = best, 0.0 = worst, null = N/A (judge could not evaluate).

Note: referenceResponse is needed for Builtin.Correctness and Builtin.Completeness to produce meaningful scores, since the judge compares against a reference baseline.

When to Use Custom Metrics

Use custom metrics to check domain-specific behaviors the built-in metrics don’t cover. If you find yourself thinking “this scored well on Helpfulness but violated a critical business rule” — that’s a custom metric.

Technique: Extract rules from your system prompt. Every rule in your system prompt is a candidate metric:

System prompt says:                          Candidate metric:
────────────────────────────────────────────────────────────────
"Keep responses to 2-3 sentences max"     → response_brevity
"Always greet returning users by name"    → personalized_greeting
"Never proceed without user confirmation" → confirmation_check
"Ask for missing details, don't assume"   → missing_info_followup

Custom Metric JSON Anatomy

{
  "customMetricDefinition": {
    "metricName": "my_metric_name",
    "instructions": "You are evaluating ... \n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
| Field | Details |
|-------|---------|
| metricName | Snake_case identifier. Must appear in BOTH customMetrics array AND metricNames array |
| instructions | Full prompt sent to the judge. Must include {{prompt}} and {{prediction}} template variables. Can also use {{ground_truth}} (maps to referenceResponse). Input variables must come last in the prompt. |
| ratingScale | Array of rating levels. Each has a definition (label, max 5 words / 100 chars) and value with either floatValue or stringValue |

Official constraints:

  • Max 10 custom metrics per job
  • Instructions max 5000 characters
  • Rating definition max 5 words / 100 characters
  • Input variables ({{prompt}}, {{prediction}}, {{ground_truth}}) must come last in the instruction text

Complete Custom Metric Example

A metric that checks whether the assistant follows a domain-specific rule, with N/A handling for irrelevant prompts:

{
  "customMetricDefinition": {
    "metricName": "confirmation_check",
    "instructions": "You are evaluating an assistant application. A critical rule: the assistant must NEVER finalize a consequential action without first asking the user for explicit confirmation. Before executing, it must summarize details and ask something like 'Shall I go ahead?'.\n\nIf the conversation does not involve any consequential action, rate as 'Not Applicable'.\n\n- Not Applicable: No consequential action in this response\n- Poor: Proceeds with action without asking for confirmation\n- Good: Asks for confirmation before finalizing the action\n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "N/A", "value": { "floatValue": -1 } },
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}

When the judge selects N/A (floatValue: -1), Bedrock records "result": null. Your parser must handle null — treat as N/A and exclude from averages.

Rating Scale Design

  • 3-4 levels for quality scales (Poor/Acceptable/Good/Excellent)
  • 2 levels for binary checks (Poor/Good)
  • Add “N/A” level with -1 for conditional metrics that only apply to certain prompt types
  • Rating values can use floatValue (numeric) or stringValue (text)

Tips for Writing Metric Instructions

  • Be explicit about what “good” and “bad” look like — include examples of phrases or behaviors
  • For conditional metrics, describe the N/A condition clearly so the judge doesn’t score 0 when it should skip
  • Keep instructions under ~500 words to fit within context alongside prompt and response
  • Test with a few examples before running a full eval job
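
Before wiring custom metrics into the job config, it can be worth checking them against the official constraints above. A minimal TypeScript sketch that validates shape only and makes no AWS calls:

interface RatingLevel {
  definition: string;
  value: { floatValue?: number; stringValue?: string };
}

interface CustomMetric {
  customMetricDefinition: {
    metricName: string;
    instructions: string;
    ratingScale: RatingLevel[];
  };
}

// Checks the documented constraints: <=10 metrics, instructions <=5000 chars,
// rating labels <=5 words / 100 chars, and required template variables present.
function validateCustomMetrics(metrics: CustomMetric[]): string[] {
  const problems: string[] = [];
  if (metrics.length > 10) problems.push("More than 10 custom metrics");
  for (const m of metrics) {
    const { metricName, instructions, ratingScale } = m.customMetricDefinition;
    if (instructions.length > 5000) problems.push(`${metricName}: instructions exceed 5000 chars`);
    if (!instructions.includes("{{prompt}}") || !instructions.includes("{{prediction}}")) {
      problems.push(`${metricName}: missing {{prompt}} or {{prediction}}`);
    }
    for (const level of ratingScale) {
      if (level.definition.length > 100 || level.definition.split(/\s+/).length > 5) {
        problems.push(`${metricName}: rating label "${level.definition}" too long`);
      }
    }
  }
  return problems;
}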

Step 4: AWS Infrastructure

S3 Bucket

REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="my-eval-${ACCOUNT_ID}-${REGION}"

# us-east-1 does not accept LocationConstraint
if [ "${REGION}" = "us-east-1" ]; then
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}"
else
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}" \
    --create-bucket-configuration LocationConstraint="${REGION}"
fi

Upload the dataset:

aws s3 cp datasets/collected-responses.jsonl \
  "s3://${BUCKET_NAME}/datasets/collected-responses.jsonl"

IAM Role

Trust policy (must include aws:SourceAccount condition — Bedrock rejects the role without it):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "YOUR_ACCOUNT_ID"
        }
      }
    }
  ]
}

Permissions policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DatasetRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET",
        "arn:aws:s3:::YOUR_BUCKET/datasets/*"
      ]
    },
    {
      "Sid": "S3ResultsWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET/results/*"]
    },
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": [
        "arn:aws:bedrock:REGION::foundation-model/EVALUATOR_MODEL_ID"
      ]
    }
  ]
}

Replace YOUR_BUCKET, REGION, and EVALUATOR_MODEL_ID with actual values.

Create the role:

ROLE_NAME="BedrockEvalRole"

ROLE_ARN=$(aws iam create-role \
  --role-name "${ROLE_NAME}" \
  --assume-role-policy-document file://trust-policy.json \
  --description "Allows Bedrock to run evaluation jobs" \
  --query "Role.Arn" --output text)

aws iam put-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-name "BedrockEvalPolicy" \
  --policy-document file://permissions-policy.json

Step 5: Configure and Run Eval Job

eval-config.json

{
  "automated": {
    "datasetMetricConfigs": [
      {
        "taskType": "General",
        "dataset": {
          "name": "my-eval-dataset",
          "datasetLocation": {
            "s3Uri": "s3://YOUR_BUCKET/datasets/collected-responses.jsonl"
          }
        },
        "metricNames": [
          "Builtin.Helpfulness",
          "Builtin.FollowingInstructions",
          "Builtin.ProfessionalStyleAndTone",
          "Builtin.Relevance",
          "Builtin.Completeness",
          "Builtin.Correctness",
          "my_custom_metric_1",
          "my_custom_metric_2"
        ]
      }
    ],
    "evaluatorModelConfig": {
      "bedrockEvaluatorModels": [
        { "modelIdentifier": "EVALUATOR_MODEL_ID" }
      ]
    },
    "customMetricConfig": {
      "customMetrics": [
        {
          "customMetricDefinition": {
            "metricName": "my_custom_metric_1",
            "instructions": "... {{prompt}} ... {{prediction}} ...",
            "ratingScale": [
              { "definition": "Poor", "value": { "floatValue": 0 } },
              { "definition": "Good", "value": { "floatValue": 1 } }
            ]
          }
        }
      ],
      "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
          { "modelIdentifier": "EVALUATOR_MODEL_ID" }
        ]
      }
    }
  }
}

Critical structure notes:

  1. taskType must be "General" (not “Generation” or any other value)
  2. Custom metric names must appear in both metricNames array AND customMetrics array
  3. evaluatorModelConfig appears twice: once at the top level (for built-in metrics) and once inside customMetricConfig (for custom metrics) — both must specify the same evaluator model
  4. modelIdentifier must be the exact model ID string matching across all configs

inference-config.json

For pre-computed inference, this tells Bedrock that responses are already collected:

{
  "models": [
    {
      "precomputedInferenceSource": {
        "inferenceSourceIdentifier": "my-app-v1"
      }
    }
  ]
}

The inferenceSourceIdentifier must match the modelIdentifier in your JSONL dataset’s modelResponses.
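
Because several identifiers have to line up across three files (the dataset JSONL, eval-config.json, and inference-config.json), a small pre-flight check can save a failed job. A hedged TypeScript sketch; the file names are the ones used in this guide:

import { readFileSync } from "node:fs";

// Cross-checks eval-config.json, inference-config.json, and the dataset JSONL
// for the mismatches called out in the notes above.
function preflight(evalConfigPath: string, inferenceConfigPath: string, datasetPath: string): string[] {
  const problems: string[] = [];
  const evalConfig = JSON.parse(readFileSync(evalConfigPath, "utf-8"));
  const inferenceConfig = JSON.parse(readFileSync(inferenceConfigPath, "utf-8"));
  const dataset = readFileSync(datasetPath, "utf-8").trim().split("\n").map((l) => JSON.parse(l));

  const metricConfig = evalConfig.automated.datasetMetricConfigs[0];
  if (metricConfig.taskType !== "General") problems.push(`taskType is "${metricConfig.taskType}", expected "General"`);

  // Every custom metric must appear in BOTH metricNames and customMetrics.
  const metricNames: string[] = metricConfig.metricNames;
  const customNames: string[] = (evalConfig.automated.customMetricConfig?.customMetrics ?? [])
    .map((m: any) => m.customMetricDefinition.metricName);
  for (const name of customNames) {
    if (!metricNames.includes(name)) problems.push(`Custom metric "${name}" missing from metricNames`);
  }
  for (const name of metricNames.filter((n) => !n.startsWith("Builtin."))) {
    if (!customNames.includes(name)) problems.push(`"${name}" listed in metricNames but not defined in customMetrics`);
  }

  // inferenceSourceIdentifier must match the dataset's modelIdentifier exactly.
  const sourceId = inferenceConfig.models[0].precomputedInferenceSource.inferenceSourceIdentifier;
  const datasetIds = new Set(dataset.map((e) => e.modelResponses[0].modelIdentifier));
  if (datasetIds.size !== 1 || !datasetIds.has(sourceId)) {
    problems.push(`inferenceSourceIdentifier "${sourceId}" does not match dataset identifiers: ${[...datasetIds].join(", ")}`);
  }
  return problems;
}

console.log(preflight("eval-config.json", "inference-config.json", "datasets/collected-responses.jsonl"));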

Running the Job

aws bedrock create-evaluation-job \
  --job-name "my-eval-$(date +%Y%m%d-%H%M)" \
  --role-arn "${ROLE_ARN}" \
  --evaluation-config file://eval-config.json \
  --inference-config file://inference-config.json \
  --output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
  --region us-east-1

CLI notes:

  • Required params: --job-name, --role-arn, --evaluation-config, --inference-config, --output-data-config
  • Optional: --application-type (e.g., ModelEvaluation)
  • --job-name constraint: [a-z0-9](-*[a-z0-9]){0,62} — lowercase + hyphens only, max 63 chars. Must be unique (use timestamps).
  • --evaluation-config and --inference-config are document types — must use file:// or inline JSON, no shorthand syntax
  • --output-data-config is a structure — supports both inline JSON and shorthand (s3Uri=string)

Monitoring

# List evaluation jobs (with optional filters)
aws bedrock list-evaluation-jobs --region us-east-1
aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1
aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1

# Get details for a specific job
aws bedrock get-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1

# Cancel a running job
aws bedrock stop-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1

Job statuses: InProgress, Completed, Failed, Stopping, Stopped, Deleting

Jobs typically take 5-15 minutes for 30-50 entry datasets. If a job fails, check failureMessages in the job details.
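
If you prefer to poll from code rather than the CLI, the equivalent calls live in the @aws-sdk/client-bedrock control-plane client (not bedrock-runtime). A minimal polling sketch, assuming that package is installed:

import { BedrockClient, GetEvaluationJobCommand } from "@aws-sdk/client-bedrock";

const bedrock = new BedrockClient({ region: "us-east-1" });

// Poll a job until it leaves InProgress, then report the terminal status.
async function waitForJob(jobArn: string, intervalMs = 60_000): Promise<void> {
  for (;;) {
    const job = await bedrock.send(new GetEvaluationJobCommand({ jobIdentifier: jobArn }));
    console.log(`${new Date().toISOString()} status=${job.status}`);
    if (job.status !== "InProgress") {
      if (job.status === "Failed") console.error("Failure messages:", job.failureMessages);
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

waitForJob("JOB_ARN").catch(console.error);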


Step 6: Parse Results

S3 Output Directory Structure

Bedrock writes results to a deeply nested path:

s3://YOUR_BUCKET/results/
  └── <job-name>/
      └── <job-name>/
          ├── amazon-bedrock-evaluations-permission-check   ← empty sentinel
          └── <random-id>/
              ├── custom_metrics/                            ← metric definitions (NOT results)
              └── models/
                  └── <model-identifier>/
                      └── taskTypes/General/datasets/<dataset-name>/
                          └── <uuid>_output.jsonl            ← actual results

The job name appears twice in the path. The random ID changes every run. Use aws s3 sync — do not construct paths manually.

Download Results

aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1
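
After the sync, locate the results file by searching the local tree for _output.jsonl rather than reconstructing the nested path. A small TypeScript sketch using Node's recursive directory listing (Node 18.17+):

import { readdirSync } from "node:fs";
import { join } from "node:path";

// Find the actual results file(s) under the synced results directory.
function findOutputFiles(resultsDir: string): string[] {
  return (readdirSync(resultsDir, { recursive: true }) as string[])
    .filter((p) => p.endsWith("_output.jsonl"))
    .map((p) => join(resultsDir, p));
}

console.log(findOutputFiles("./results/<job-name>"));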

Result JSONL Format

Each line:

{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Builtin.Helpfulness",
        "result": 0.6667,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "The response provides useful information..."
          }
        ]
      },
      {
        "metricName": "confirmation_check",
        "result": null,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "This conversation does not involve any consequential action..."
          }
        ]
      }
    ]
  },
  "inputRecord": {
    "prompt": "hello",
    "referenceResponse": "",
    "modelResponses": [
      { "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
    ]
  }
}

  • result is a number (score) or null (N/A)
  • evaluatorDetails[0].explanation contains the judge’s written reasoning

Parsing and Aggregation

import { readFileSync } from "node:fs";

interface PromptResult {
  prompt: string;
  category: string;
  modelResponse: string;
  scores: Record<string, {
    score: string;        // numeric score as a string, or "N/A"
    reasoning?: string;   // judge's written explanation
    rawScore?: number;    // numeric score; undefined when N/A
  }>;
}

// Path is illustrative: point this at the _output.jsonl found after the s3 sync.
const lines = readFileSync("results/output.jsonl", "utf-8").trim().split("\n");
const results: PromptResult[] = [];

for (const line of lines) {
  const entry = JSON.parse(line);
  const scores: PromptResult["scores"] = {};
  for (const s of entry.automatedEvaluationResult.scores) {
    scores[s.metricName] = {
      score: s.result === null ? "N/A" : String(s.result),
      reasoning: s.evaluatorDetails?.[0]?.explanation,
      rawScore: typeof s.result === "number" ? s.result : undefined,
    };
  }
  results.push({
    prompt: entry.inputRecord.prompt,
    // category is carried through only if it was included in the dataset entry
    category: entry.inputRecord.category ?? "uncategorized",
    modelResponse: entry.inputRecord.modelResponses[0].response,
    scores,
  });
}

Aggregation approach:

  1. Overall averages per metric — exclude N/A entries
  2. Per-category breakdown — group by category field, compute averages within each
  3. Low-score alerts — flag entries below threshold (built-in < 0.5, custom <= 0)

Low-score alert format:

[Builtin.Relevance] score=0.50 | "hello..."
  Reason: The response does not directly address the greeting...

[confirmation_check] score=0.00 | "User: proceed with X..."
  Reason: The assistant executed the action without asking for confirmation...
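
A sketch of the aggregation and low-score flagging described above, operating on the PromptResult records built in the parsing step; the thresholds mirror the ones listed and are meant to be tuned:

// Aggregate parsed results: per-metric averages (excluding N/A) plus low-score alerts.
function aggregate(results: PromptResult[]): void {
  const byMetric: Record<string, number[]> = {};
  const alerts: string[] = [];

  for (const r of results) {
    for (const [metric, s] of Object.entries(r.scores)) {
      if (s.rawScore === undefined) continue; // N/A: exclude from averages entirely
      (byMetric[metric] ??= []).push(s.rawScore);

      // Thresholds from the list above: built-in < 0.5, custom <= 0
      const isLow = metric.startsWith("Builtin.") ? s.rawScore < 0.5 : s.rawScore <= 0;
      if (isLow) {
        alerts.push(
          `[${metric}] score=${s.rawScore.toFixed(2)} | "${r.prompt.slice(0, 40)}..."\n  Reason: ${s.reasoning ?? "(no explanation)"}`
        );
      }
    }
  }

  for (const [metric, scores] of Object.entries(byMetric)) {
    const avg = scores.reduce((a, b) => a + b, 0) / scores.length;
    console.log(`${metric}: avg=${avg.toFixed(2)} (n=${scores.length})`);
  }
  console.log(alerts.join("\n\n"));
}

The per-category breakdown is the same averaging run after grouping results by the category field.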

Step 7: Eval-Fix-Reeval Loop

Common Fixes

| Finding | Fix |
|---------|-----|
| Low brevity scores | Add hard constraint: “Respond in no more than 3 sentences.” |
| Low confirmation_check | Add: “Before executing, summarize details and ask for confirmation.” |
| Low missing_info_followup | Add: “If any required field is missing, ask for it. Do not assume.” |
| Low tone on negative outcomes | Add empathy instructions for bad-news scenarios |
| Low Completeness on simple prompts | Metric/data issue — add referenceResponse or filter from Completeness |

Metric Refinement

  • High N/A rates (>60%) — metric too narrowly scoped. Split dataset or adjust scope.
  • All-high scores — instructions too lenient. Add specific failure criteria.
  • Inconsistent scoring — instructions ambiguous. Add concrete examples per rating level.

Run Comparison

Run 1 (baseline):    response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes):  response_brevity avg=0.85, custom_tone avg=0.90

Track scores over time. The pipeline’s value comes from repeated measurement.
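
A small helper for that comparison, assuming per-metric averages have already been computed for each run (for example with the aggregation sketch above):

// Compare per-metric averages between two runs and print the deltas.
function compareRuns(baseline: Record<string, number>, current: Record<string, number>): void {
  for (const metric of Object.keys(baseline)) {
    const before = baseline[metric];
    const after = current[metric];
    if (after === undefined) continue;
    const delta = after - before;
    console.log(`${metric}: ${before.toFixed(2)} -> ${after.toFixed(2)} (${delta >= 0 ? "+" : ""}${delta.toFixed(2)})`);
  }
}

compareRuns(
  { response_brevity: 0.42, custom_tone: 0.80 },
  { response_brevity: 0.85, custom_tone: 0.90 },
);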


Gotchas

  1. taskType must be "General" — not “Generation” or any other value. The job fails silently with other values.

  2. Custom metric names in BOTH places — must appear in metricNames array AND customMetrics array. Missing from metricNames = silently ignored. Missing from customMetrics = job fails.

  3. null result means N/A, not 0 — when the judge determines a metric doesn’t apply, Bedrock records null:

    // WRONG — treats N/A as 0
    const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length;
    
    // RIGHT — excludes N/A from average
    const numericScores = scores.filter((s): s is number => s !== null);
    const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length;
    
  4. evaluatorModelConfig appears twice — once at top level (built-in metrics), once inside customMetricConfig (custom metrics). Omitting either causes those metrics to fail.

  5. modelIdentifier must match exactly — the string in JSONL modelResponses must be character-for-character identical to inferenceSourceIdentifier in inference-config.json. Mismatch = model mapping error.

  6. AWS CLI 2.33+ required — older versions silently drop customMetricConfig and precomputedInferenceSource. Job creation succeeds but the job fails. Always check aws --version.

  7. Job names: lowercase + hyphens, max 63 chars — pattern: [a-z0-9](-*[a-z0-9]){0,62}. Must be unique across all jobs. Use timestamps: --job-name "my-eval-$(date +%Y%m%d-%H%M)".

  8. S3 output is deeply nested — <prefix>/<job-name>/<job-name>/<random-id>/models/.... Use aws s3 sync and search for _output.jsonl. Do not construct paths manually.

  9. referenceResponse improves Correctness/Completeness — empty string is valid, but providing reference responses gives the judge a baseline for comparison.

  10. <thinking> tag leakage (model-specific) — some models (e.g., Amazon Nova Lite) may leak <thinking>...</thinking> blocks into responses. If present, strip before writing JSONL:

    const clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim();
    
  11. us-east-1 S3 bucket creation — do NOT pass LocationConstraint for us-east-1. Other regions require it.


Cost Estimation

Formula:

Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price

Example: 30 prompts, 10 metrics, Nova Pro judge:

  • Response collection (Nova Lite): ~$0.02
  • Evaluation job (Nova Pro): ~$0.58
  • Total per run: ~$0.61

Scaling: Cost is linear with prompts and metrics; scaling the 30-prompt example above to 100 prompts x 10 metrics comes to roughly $2/run. Judge cost dominates at ~95%. Adding 1 custom metric adds ~$0.06/run (30 prompts, Nova Pro).


References