aws-bedrock-evals

Install command
npx skills add https://github.com/antstackio/skills --skill aws-bedrock-evals


Skill Documentation

AWS Bedrock Evaluation Jobs


Overview

Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the “judge”) to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.

Pre-computed Inference vs Live Inference

| Mode | How it works | Use when |
|------|--------------|----------|
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |

Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.

Pipeline

Design Scenarios → Collect Responses → Upload to S3 → Run Eval Job → Parse Results → Act on Findings
       |                  |                  |               |               |               |
  scenarios.json    Your app's API     s3://bucket/     create-         s3 sync +       Fix prompt,
  (multi-turn)      → dataset JSONL    datasets/        evaluation-job  parse JSONL      retune metrics

Agent Behavior: Gather Inputs and Show Cost Estimate

Before generating any configs, scripts, or artifacts, you MUST gather the following from the user:

  1. AWS Region — Which region to use (default: us-east-1). Affects model availability and pricing.
  2. Target model — The model their application uses (e.g., amazon.nova-lite-v1:0, anthropic.claude-3-haiku).
  3. Evaluator (judge) model — The model to score responses (e.g., amazon.nova-pro-v1:0). Should be at least as capable as the target.
  4. Application type — Brief description of what the app does. Used to design test scenarios and derive custom metrics.
  5. Number of test scenarios — How many they plan to test (recommend 13-20 for first run).
  6. Estimated JSONL entries — Derived from scenarios x avg turns per scenario.
  7. Number of metrics — Total (built-in + custom). Recommend starting with 6 built-in + 3-5 custom.
  8. S3 bucket — Existing bucket name or confirm creation of a new one.
  9. IAM role — Existing role ARN or confirm creation of a new one.

Cost Estimate

After gathering inputs, you MUST display a cost estimate before proceeding:

## Estimated Cost Summary

| Item | Details | Est. Cost |
|------|---------|-----------|
| Response collection | {N} prompts x ~{T} tokens x {target_model_price} | ${X.XX} |
| Evaluation job | {N} prompts x {M} metrics x ~1,700 tokens x {judge_model_price} | ${X.XX} |
| S3 storage | < 1 MB | < $0.01 |
| **Total per run** | | **~${X.XX}** |

Scaling: Each additional run costs ~${X.XX}. Adding 1 custom metric adds ~${Y.YY}/run.

Cost formulas:

  • Response collection: num_prompts x avg_input_tokens x input_price + num_prompts x avg_output_tokens x output_price
  • Evaluation job: num_prompts x num_metrics x ~1,500 input_tokens x judge_input_price + num_prompts x num_metrics x ~200 output_tokens x judge_output_price

Model pricing reference:

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|-------|-----------------------|------------------------|
| amazon.nova-lite-v1:0 | $0.06 | $0.24 |
| amazon.nova-pro-v1:0 | $0.80 | $3.20 |
| anthropic.claude-3-haiku | $0.25 | $1.25 |
| anthropic.claude-3-sonnet | $3.00 | $15.00 |
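
These formulas are easy to turn into a quick estimator. A minimal TypeScript sketch, assuming the prices above and the ~1,500/~200 token split per judge call; the collection token counts are rough placeholder defaults, not measured values:

// Rough per-run cost estimator for a pre-computed-inference eval job.
// Prices are USD per 1M tokens, taken from the reference table above.
interface Pricing { input: number; output: number }

const PRICES: Record<string, Pricing> = {
  "amazon.nova-lite-v1:0": { input: 0.06, output: 0.24 },
  "amazon.nova-pro-v1:0": { input: 0.80, output: 3.20 },
};

function estimateEvalCost(opts: {
  numPrompts: number;
  numMetrics: number;
  targetModel: string;
  judgeModel: string;
  avgInputTokens?: number;   // per prompt during response collection (assumed default)
  avgOutputTokens?: number;  // per prompt during response collection (assumed default)
}): { collection: number; judge: number; total: number } {
  const { numPrompts, numMetrics } = opts;
  const target = PRICES[opts.targetModel];
  const judge = PRICES[opts.judgeModel];
  const inTok = opts.avgInputTokens ?? 500;
  const outTok = opts.avgOutputTokens ?? 200;

  // Response collection: prompts x tokens x target-model price
  const collection =
    (numPrompts * inTok * target.input + numPrompts * outTok * target.output) / 1_000_000;

  // Evaluation job: each prompt is judged once per metric (~1,500 in / ~200 out tokens)
  const judgeCost =
    (numPrompts * numMetrics * 1500 * judge.input +
      numPrompts * numMetrics * 200 * judge.output) / 1_000_000;

  return { collection, judge: judgeCost, total: collection + judgeCost };
}

console.log(estimateEvalCost({
  numPrompts: 30,
  numMetrics: 10,
  targetModel: "amazon.nova-lite-v1:0",
  judgeModel: "amazon.nova-pro-v1:0",
}));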

Prerequisites

# AWS CLI 2.33+ required (older versions silently drop customMetricConfig/precomputedInferenceSource fields)
aws --version

# Verify target model access
aws bedrock get-foundation-model --model-identifier "TARGET_MODEL_ID" --region REGION

# Verify evaluator model access
aws bedrock get-foundation-model --model-identifier "EVALUATOR_MODEL_ID" --region REGION

Good evaluator model choices: amazon.nova-pro-v1:0, anthropic.claude-3-sonnet, anthropic.claude-3-haiku. The evaluator should be at least as capable as your target model.


Step 1: Design Test Scenarios

List the application’s functional areas (e.g., greeting, booking-flow, error-handling). Each category should have 2-4 scenarios covering the happy path and edge cases.

Scenario JSON format:

[
  {
    "id": "greeting-known-user",
    "category": "greeting",
    "context": { "userId": "user-123" },
    "turns": ["hello"]
  },
  {
    "id": "multi-step-flow",
    "category": "core-flow",
    "context": { "userId": "user-456" },
    "turns": [
      "hello",
      "I need help with X",
      "yes, proceed with that",
      "thanks"
    ]
  }
]

The context field holds any session/user data your app needs. Each turn in the array is one user message; the collection step handles the multi-turn conversation loop.

Edge case coverage dimensions:

  • Happy path: standard usage that should work perfectly
  • Missing information: user omits required fields
  • Unavailable resources: requested item doesn’t exist
  • Out-of-scope requests: user asks something the app shouldn’t handle
  • Error recovery: bad input, invalid data
  • Tone stress tests: complaints, frustration

Recommended count: 13-20 scenarios producing 30-50 JSONL entries (multi-turn scenarios produce one entry per turn).
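
If you want to keep scenario files consistent across runs, a small type plus a sanity check catches malformed entries before you spend tokens on collection. An illustrative TypeScript sketch; the Scenario shape simply mirrors the JSON format above:

import { readFileSync } from "node:fs";

// Mirrors the scenario JSON format above.
interface Scenario {
  id: string;
  category: string;
  context: Record<string, unknown>;
  turns: string[];
}

// Load and sanity-check scenarios.json before collecting responses.
function loadScenarios(path: string): Scenario[] {
  const scenarios: Scenario[] = JSON.parse(readFileSync(path, "utf-8"));
  const ids = new Set<string>();
  for (const s of scenarios) {
    if (!s.id || !s.category || !Array.isArray(s.turns) || s.turns.length === 0) {
      throw new Error(`Malformed scenario: ${JSON.stringify(s)}`);
    }
    if (ids.has(s.id)) throw new Error(`Duplicate scenario id: ${s.id}`);
    ids.add(s.id);
  }
  return scenarios;
}

const scenarios = loadScenarios("scenarios.json");
console.log(`${scenarios.length} scenarios, ${scenarios.reduce((n, s) => n + s.turns.length, 0)} total turns`);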


Step 2: Collect Responses

Collect responses from your application however it runs. The goal is to produce a JSONL dataset file where each line contains the prompt, the model’s response, and metadata.

Example pattern: Converse API with tool-calling loop (TypeScript)

This applies when your application uses Bedrock with tool calling:

import {
  BedrockRuntimeClient,
  ConverseCommand,
  type Message,
  type SystemContentBlock,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

async function converseLoop(
  messages: Message[],
  systemPrompt: SystemContentBlock[],
  tools: any[]
): Promise<string> {
  const MAX_TOOL_ROUNDS = 10;

  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const response = await client.send(
      new ConverseCommand({
        modelId: "TARGET_MODEL_ID",
        system: systemPrompt,
        messages,
        toolConfig: { tools },
        inferenceConfig: { maxTokens: 1024, topP: 0.9, temperature: 0.7 },
      })
    );

    const assistantContent = response.output?.message?.content as any[];
    if (!assistantContent) return "[No response from model]";

    messages.push({ role: "assistant", content: assistantContent });

    const toolUseBlocks = assistantContent.filter(
      (block: any) => block.toolUse != null
    );

    if (toolUseBlocks.length === 0) {
      return assistantContent
        .filter((block: any) => block.text != null)
        .map((block: any) => block.text as string)
        .join("\n") || "[Empty response]";
    }

    const toolResultBlocks: any[] = [];
    for (const block of toolUseBlocks) {
      const { toolUseId, name, input } = block.toolUse;
      const result = await executeTool(name, input);
      toolResultBlocks.push({
        toolResult: { toolUseId, content: [{ json: result }] },
      });
    }

    messages.push({ role: "user", content: toolResultBlocks } as Message);
  }

  return "[Max tool rounds exceeded]";
}

Multi-turn handling: Maintain the messages array across turns and build the dataset prompt field with conversation history:

const entries: Array<Record<string, unknown>> = []; // dataset rows, written out as JSONL in the next step
const messages: Message[] = [];
const conversationHistory: { role: string; text: string }[] = [];

for (let i = 0; i < scenario.turns.length; i++) {
  const userTurn = scenario.turns[i];
  messages.push({ role: "user", content: [{ text: userTurn }] });

  const assistantText = await converseLoop(messages, systemPrompt, tools);

  conversationHistory.push({ role: "user", text: userTurn });
  conversationHistory.push({ role: "assistant", text: assistantText });

  let prompt: string;
  if (i === 0) {
    prompt = userTurn;
  } else {
    prompt = conversationHistory
      .map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.text}`)
      .join("\n");
  }

  entries.push({
    prompt,
    category: scenario.category,
    referenceResponse: "",
    modelResponses: [
      { response: assistantText, modelIdentifier: "my-app-v1" },
    ],
  });
}

Dataset JSONL Format

Each line must have this structure:

{
  "prompt": "User question or multi-turn history",
  "referenceResponse": "",
  "modelResponses": [
    {
      "response": "The model's actual output text",
      "modelIdentifier": "my-app-v1"
    }
  ]
}
| Field | Required | Notes |
|-------|----------|-------|
| prompt | Yes | User input. For multi-turn, concatenate: User: ...\nAssistant: ...\nUser: ... |
| referenceResponse | No | Expected/ideal response. Can be empty string. Needed for Builtin.Correctness and Builtin.Completeness to work properly. Maps to {{ground_truth}} template variable |
| modelResponses | Yes | Array with exactly one entry for pre-computed inference |
| modelResponses[0].response | Yes | The model’s actual output text |
| modelResponses[0].modelIdentifier | Yes | Any string label. Must match inferenceSourceIdentifier in inference-config.json |

Constraints: One model response per prompt. One unique modelIdentifier per job. Max 1000 prompts per job.
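
For reference, the same structure as a TypeScript type, matching what the collection code above produces, plus a quick pre-upload check against the constraints listed (the category field is optional metadata carried through for later aggregation):

// Shape of one JSONL line, as produced by the collection code above.
interface ModelResponse {
  response: string;          // the model's actual output text
  modelIdentifier: string;   // must match inferenceSourceIdentifier in inference-config.json
}

interface DatasetEntry {
  prompt: string;              // user input, or concatenated multi-turn history
  referenceResponse?: string;  // optional; improves Correctness/Completeness scoring
  category?: string;           // optional metadata used later for per-category aggregation
  modelResponses: ModelResponse[]; // exactly one entry for pre-computed inference
}

// Quick pre-upload check: one response per line, one consistent identifier, max 1000 lines.
function validateDataset(entries: DatasetEntry[]): void {
  if (entries.length > 1000) throw new Error("Max 1000 prompts per job");
  const ids = new Set(entries.map((e) => e.modelResponses[0].modelIdentifier));
  if (ids.size !== 1) throw new Error(`Expected one modelIdentifier, found: ${[...ids].join(", ")}`);
  for (const e of entries) {
    if (!e.prompt || e.modelResponses.length !== 1) {
      throw new Error(`Malformed entry: ${JSON.stringify(e).slice(0, 120)}`);
    }
  }
}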

Write JSONL:

const jsonl = entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
writeFileSync("datasets/collected-responses.jsonl", jsonl, "utf-8");

Step 3: Design Metrics

Built-In Metrics

Bedrock provides 11 built-in metrics requiring no configuration beyond listing them by name:

| Metric Name | What It Measures |
|-------------|------------------|
| Builtin.Correctness | Is the factual content accurate? (works best with referenceResponse) |
| Builtin.Completeness | Does the response fully cover the request? (works best with referenceResponse) |
| Builtin.Faithfulness | Is the response faithful to the provided context/source? |
| Builtin.Helpfulness | Is the response useful, actionable, and cooperative? |
| Builtin.Coherence | Is the response logically structured and easy to follow? |
| Builtin.Relevance | Does the response address the actual question? |
| Builtin.FollowingInstructions | Does the response follow explicit instructions in the prompt? |
| Builtin.ProfessionalStyleAndTone | Are spelling, grammar, and tone appropriate? |
| Builtin.Harmfulness | Does the response contain harmful content? |
| Builtin.Stereotyping | Does the response contain stereotypes or bias? |
| Builtin.Refusal | Does the response appropriately refuse harmful requests? |

Score interpretation: 1.0 = best, 0.0 = worst, null = N/A (judge could not evaluate).

Note: referenceResponse is needed for Builtin.Correctness and Builtin.Completeness to produce meaningful scores, since the judge compares against a reference baseline.

When to Use Custom Metrics

Use custom metrics to check domain-specific behaviors the built-in metrics don’t cover. If you find yourself thinking “this scored well on Helpfulness but violated a critical business rule” — that’s a custom metric.

Technique: Extract rules from your system prompt. Every rule in your system prompt is a candidate metric:

System prompt says:                          Candidate metric:
────────────────────────────────────────────────────────────────
"Keep responses to 2-3 sentences max"     → response_brevity
"Always greet returning users by name"    → personalized_greeting
"Never proceed without user confirmation" → confirmation_check
"Ask for missing details, don't assume"   → missing_info_followup

Custom Metric JSON Anatomy

{
  "customMetricDefinition": {
    "metricName": "my_metric_name",
    "instructions": "You are evaluating ... \n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}
| Field | Details |
|-------|---------|
| metricName | Snake_case identifier. Must appear in BOTH customMetrics array AND metricNames array |
| instructions | Full prompt sent to the judge. Must include {{prompt}} and {{prediction}} template variables. Can also use {{ground_truth}} (maps to referenceResponse). Input variables must come last in the prompt. |
| ratingScale | Array of rating levels. Each has a definition (label, max 5 words / 100 chars) and value with either floatValue or stringValue |

Official constraints:

  • Max 10 custom metrics per job
  • Instructions max 5000 characters
  • Rating definition max 5 words / 100 characters
  • Input variables ({{prompt}}, {{prediction}}, {{ground_truth}}) must come last in the instruction text

Complete Custom Metric Example

A metric that checks whether the assistant follows a domain-specific rule, with N/A handling for irrelevant prompts:

{
  "customMetricDefinition": {
    "metricName": "confirmation_check",
    "instructions": "You are evaluating an assistant application. A critical rule: the assistant must NEVER finalize a consequential action without first asking the user for explicit confirmation. Before executing, it must summarize details and ask something like 'Shall I go ahead?'.\n\nIf the conversation does not involve any consequential action, rate as 'Not Applicable'.\n\n- Not Applicable: No consequential action in this response\n- Poor: Proceeds with action without asking for confirmation\n- Good: Asks for confirmation before finalizing the action\n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
    "ratingScale": [
      { "definition": "N/A", "value": { "floatValue": -1 } },
      { "definition": "Poor", "value": { "floatValue": 0 } },
      { "definition": "Good", "value": { "floatValue": 1 } }
    ]
  }
}

When the judge selects N/A (floatValue: -1), Bedrock records "result": null. Your parser must handle null — treat as N/A and exclude from averages.

Rating Scale Design

  • 3-4 levels for quality scales (Poor/Acceptable/Good/Excellent)
  • 2 levels for binary checks (Poor/Good)
  • Add “N/A” level with -1 for conditional metrics that only apply to certain prompt types
  • Rating values can use floatValue (numeric) or stringValue (text)

Tips for Writing Metric Instructions

  • Be explicit about what “good” and “bad” look like — include examples of phrases or behaviors
  • For conditional metrics, describe the N/A condition clearly so the judge doesn’t score 0 when it should skip
  • Keep instructions under ~500 words to fit within context alongside prompt and response
  • Test with a few examples before running a full eval job
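
Before wiring custom metrics into the job config, it can be worth checking them against the official constraints above. A minimal TypeScript sketch that validates shape only and makes no AWS calls:

interface RatingLevel {
  definition: string;
  value: { floatValue?: number; stringValue?: string };
}

interface CustomMetric {
  customMetricDefinition: {
    metricName: string;
    instructions: string;
    ratingScale: RatingLevel[];
  };
}

// Checks the documented constraints: <=10 metrics, instructions <=5000 chars,
// rating labels <=5 words / 100 chars, and required template variables present.
function validateCustomMetrics(metrics: CustomMetric[]): string[] {
  const problems: string[] = [];
  if (metrics.length > 10) problems.push("More than 10 custom metrics");
  for (const m of metrics) {
    const { metricName, instructions, ratingScale } = m.customMetricDefinition;
    if (instructions.length > 5000) problems.push(`${metricName}: instructions exceed 5000 chars`);
    if (!instructions.includes("{{prompt}}") || !instructions.includes("{{prediction}}")) {
      problems.push(`${metricName}: missing {{prompt}} or {{prediction}}`);
    }
    for (const level of ratingScale) {
      if (level.definition.length > 100 || level.definition.split(/\s+/).length > 5) {
        problems.push(`${metricName}: rating label "${level.definition}" too long`);
      }
    }
  }
  return problems;
}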

Step 4: AWS Infrastructure

S3 Bucket

REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="my-eval-${ACCOUNT_ID}-${REGION}"

# us-east-1 does not accept LocationConstraint
if [ "${REGION}" = "us-east-1" ]; then
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}"
else
  aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}" \
    --create-bucket-configuration LocationConstraint="${REGION}"
fi

Upload the dataset:

aws s3 cp datasets/collected-responses.jsonl \
  "s3://${BUCKET_NAME}/datasets/collected-responses.jsonl"

IAM Role

Trust policy (must include aws:SourceAccount condition — Bedrock rejects the role without it):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "bedrock.amazonaws.com" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "aws:SourceAccount": "YOUR_ACCOUNT_ID"
        }
      }
    }
  ]
}

Permissions policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3DatasetRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR_BUCKET",
        "arn:aws:s3:::YOUR_BUCKET/datasets/*"
      ]
    },
    {
      "Sid": "S3ResultsWrite",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": ["arn:aws:s3:::YOUR_BUCKET/results/*"]
    },
    {
      "Sid": "BedrockModelInvoke",
      "Effect": "Allow",
      "Action": ["bedrock:InvokeModel"],
      "Resource": [
        "arn:aws:bedrock:REGION::foundation-model/EVALUATOR_MODEL_ID"
      ]
    }
  ]
}

Replace YOUR_BUCKET, REGION, and EVALUATOR_MODEL_ID with actual values.

Create the role:

ROLE_NAME="BedrockEvalRole"

ROLE_ARN=$(aws iam create-role \
  --role-name "${ROLE_NAME}" \
  --assume-role-policy-document file://trust-policy.json \
  --description "Allows Bedrock to run evaluation jobs" \
  --query "Role.Arn" --output text)

aws iam put-role-policy \
  --role-name "${ROLE_NAME}" \
  --policy-name "BedrockEvalPolicy" \
  --policy-document file://permissions-policy.json

Step 5: Configure and Run Eval Job

eval-config.json

{
  "automated": {
    "datasetMetricConfigs": [
      {
        "taskType": "General",
        "dataset": {
          "name": "my-eval-dataset",
          "datasetLocation": {
            "s3Uri": "s3://YOUR_BUCKET/datasets/collected-responses.jsonl"
          }
        },
        "metricNames": [
          "Builtin.Helpfulness",
          "Builtin.FollowingInstructions",
          "Builtin.ProfessionalStyleAndTone",
          "Builtin.Relevance",
          "Builtin.Completeness",
          "Builtin.Correctness",
          "my_custom_metric_1",
          "my_custom_metric_2"
        ]
      }
    ],
    "evaluatorModelConfig": {
      "bedrockEvaluatorModels": [
        { "modelIdentifier": "EVALUATOR_MODEL_ID" }
      ]
    },
    "customMetricConfig": {
      "customMetrics": [
        {
          "customMetricDefinition": {
            "metricName": "my_custom_metric_1",
            "instructions": "... {{prompt}} ... {{prediction}} ...",
            "ratingScale": [
              { "definition": "Poor", "value": { "floatValue": 0 } },
              { "definition": "Good", "value": { "floatValue": 1 } }
            ]
          }
        }
      ],
      "evaluatorModelConfig": {
        "bedrockEvaluatorModels": [
          { "modelIdentifier": "EVALUATOR_MODEL_ID" }
        ]
      }
    }
  }
}

Critical structure notes:

  1. taskType must be "General" (not “Generation” or any other value)
  2. Custom metric names must appear in both metricNames array AND customMetrics array
  3. evaluatorModelConfig appears twice: once at the top level (for built-in metrics) and once inside customMetricConfig (for custom metrics) — both must specify the same evaluator model
  4. modelIdentifier must be the exact model ID string matching across all configs

inference-config.json

For pre-computed inference, this tells Bedrock that responses are already collected:

{
  "models": [
    {
      "precomputedInferenceSource": {
        "inferenceSourceIdentifier": "my-app-v1"
      }
    }
  ]
}

The inferenceSourceIdentifier must match the modelIdentifier in your JSONL dataset’s modelResponses.
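
Because several identifiers have to line up across three files (the dataset JSONL, eval-config.json, and inference-config.json), a small pre-flight check can save a failed job. A hedged TypeScript sketch; the file names are the ones used in this guide:

import { readFileSync } from "node:fs";

// Cross-checks eval-config.json, inference-config.json, and the dataset JSONL
// for the mismatches called out in the notes above.
function preflight(evalConfigPath: string, inferenceConfigPath: string, datasetPath: string): string[] {
  const problems: string[] = [];
  const evalConfig = JSON.parse(readFileSync(evalConfigPath, "utf-8"));
  const inferenceConfig = JSON.parse(readFileSync(inferenceConfigPath, "utf-8"));
  const dataset = readFileSync(datasetPath, "utf-8").trim().split("\n").map((l) => JSON.parse(l));

  const metricConfig = evalConfig.automated.datasetMetricConfigs[0];
  if (metricConfig.taskType !== "General") problems.push(`taskType is "${metricConfig.taskType}", expected "General"`);

  // Every custom metric must appear in BOTH metricNames and customMetrics.
  const metricNames: string[] = metricConfig.metricNames;
  const customNames: string[] = (evalConfig.automated.customMetricConfig?.customMetrics ?? [])
    .map((m: any) => m.customMetricDefinition.metricName);
  for (const name of customNames) {
    if (!metricNames.includes(name)) problems.push(`Custom metric "${name}" missing from metricNames`);
  }
  for (const name of metricNames.filter((n) => !n.startsWith("Builtin."))) {
    if (!customNames.includes(name)) problems.push(`"${name}" listed in metricNames but not defined in customMetrics`);
  }

  // inferenceSourceIdentifier must match the dataset's modelIdentifier exactly.
  const sourceId = inferenceConfig.models[0].precomputedInferenceSource.inferenceSourceIdentifier;
  const datasetIds = new Set(dataset.map((e) => e.modelResponses[0].modelIdentifier));
  if (datasetIds.size !== 1 || !datasetIds.has(sourceId)) {
    problems.push(`inferenceSourceIdentifier "${sourceId}" does not match dataset identifiers: ${[...datasetIds].join(", ")}`);
  }
  return problems;
}

console.log(preflight("eval-config.json", "inference-config.json", "datasets/collected-responses.jsonl"));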

Running the Job

aws bedrock create-evaluation-job \
  --job-name "my-eval-$(date +%Y%m%d-%H%M)" \
  --role-arn "${ROLE_ARN}" \
  --evaluation-config file://eval-config.json \
  --inference-config file://inference-config.json \
  --output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
  --region us-east-1

CLI notes:

  • Required params: --job-name, --role-arn, --evaluation-config, --inference-config, --output-data-config
  • Optional: --application-type (e.g., ModelEvaluation)
  • --job-name constraint: [a-z0-9](-*[a-z0-9]){0,62} — lowercase + hyphens only, max 63 chars. Must be unique (use timestamps).
  • --evaluation-config and --inference-config are document types — must use file:// or inline JSON, no shorthand syntax
  • --output-data-config is a structure — supports both inline JSON and shorthand (s3Uri=string)

Monitoring

# List evaluation jobs (with optional filters)
aws bedrock list-evaluation-jobs --region us-east-1
aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1
aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1

# Get details for a specific job
aws bedrock get-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1

# Cancel a running job
aws bedrock stop-evaluation-job \
  --job-identifier "JOB_ARN" \
  --region us-east-1

Job statuses: InProgress, Completed, Failed, Stopping, Stopped, Deleting

Jobs typically take 5-15 minutes for 30-50 entry datasets. If a job fails, check failureMessages in the job details.
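
If you prefer to poll from code rather than the CLI, the equivalent calls live in the @aws-sdk/client-bedrock control-plane client (not bedrock-runtime). A minimal polling sketch, assuming that package is installed:

import { BedrockClient, GetEvaluationJobCommand } from "@aws-sdk/client-bedrock";

const bedrock = new BedrockClient({ region: "us-east-1" });

// Poll a job until it leaves InProgress, then report the terminal status.
async function waitForJob(jobArn: string, intervalMs = 60_000): Promise<void> {
  for (;;) {
    const job = await bedrock.send(new GetEvaluationJobCommand({ jobIdentifier: jobArn }));
    console.log(`${new Date().toISOString()} status=${job.status}`);
    if (job.status !== "InProgress") {
      if (job.status === "Failed") console.error("Failure messages:", job.failureMessages);
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

waitForJob("JOB_ARN").catch(console.error);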


Step 6: Parse Results

S3 Output Directory Structure

Bedrock writes results to a deeply nested path:

s3://YOUR_BUCKET/results/
  └── <job-name>/
      └── <job-name>/
          ├── amazon-bedrock-evaluations-permission-check   ← empty sentinel
          └── <random-id>/
              ├── custom_metrics/                            ← metric definitions (NOT results)
              └── models/
                  └── <model-identifier>/
                      └── taskTypes/General/datasets/<dataset-name>/
                          └── <uuid>_output.jsonl            ← actual results

The job name appears twice in the path. The random ID changes every run. Use aws s3 sync — do not construct paths manually.

Download Results

aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1
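
After the sync, locate the results file by searching the local tree for _output.jsonl rather than reconstructing the nested path. A small TypeScript sketch using Node's recursive directory listing (Node 18.17+):

import { readdirSync } from "node:fs";
import { join } from "node:path";

// Find the actual results file(s) under the synced results directory.
function findOutputFiles(resultsDir: string): string[] {
  return (readdirSync(resultsDir, { recursive: true }) as string[])
    .filter((p) => p.endsWith("_output.jsonl"))
    .map((p) => join(resultsDir, p));
}

console.log(findOutputFiles("./results/<job-name>"));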

Result JSONL Format

Each line:

{
  "automatedEvaluationResult": {
    "scores": [
      {
        "metricName": "Builtin.Helpfulness",
        "result": 0.6667,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "The response provides useful information..."
          }
        ]
      },
      {
        "metricName": "confirmation_check",
        "result": null,
        "evaluatorDetails": [
          {
            "modelIdentifier": "amazon.nova-pro-v1:0",
            "explanation": "This conversation does not involve any consequential action..."
          }
        ]
      }
    ]
  },
  "inputRecord": {
    "prompt": "hello",
    "referenceResponse": "",
    "modelResponses": [
      { "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
    ]
  }
}

  • result is a number (score) or null (N/A)
  • evaluatorDetails[0].explanation contains the judge’s written reasoning

Parsing and Aggregation

import { readFileSync } from "node:fs";

interface PromptResult {
  prompt: string;
  category: string;
  modelResponse: string;
  scores: Record<string, {
    score: string;        // numeric score as a string, or "N/A"
    reasoning?: string;   // judge's written explanation
    rawScore?: number;    // numeric score; undefined when N/A
  }>;
}

// Path is illustrative: point this at the _output.jsonl found after the s3 sync.
const lines = readFileSync("results/output.jsonl", "utf-8").trim().split("\n");
const results: PromptResult[] = [];

for (const line of lines) {
  const entry = JSON.parse(line);
  const scores: PromptResult["scores"] = {};
  for (const s of entry.automatedEvaluationResult.scores) {
    scores[s.metricName] = {
      score: s.result === null ? "N/A" : String(s.result),
      reasoning: s.evaluatorDetails?.[0]?.explanation,
      rawScore: typeof s.result === "number" ? s.result : undefined,
    };
  }
  results.push({
    prompt: entry.inputRecord.prompt,
    // category is carried through only if it was included in the dataset entry
    category: entry.inputRecord.category ?? "uncategorized",
    modelResponse: entry.inputRecord.modelResponses[0].response,
    scores,
  });
}

Aggregation approach:

  1. Overall averages per metric — exclude N/A entries
  2. Per-category breakdown — group by category field, compute averages within each
  3. Low-score alerts — flag entries below threshold (built-in < 0.5, custom <= 0)

Low-score alert format:

[Builtin.Relevance] score=0.50 | "hello..."
  Reason: The response does not directly address the greeting...

[confirmation_check] score=0.00 | "User: proceed with X..."
  Reason: The assistant executed the action without asking for confirmation...
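
A sketch of the aggregation and low-score flagging described above, operating on the PromptResult records built in the parsing step; the thresholds mirror the ones listed and are meant to be tuned:

// Aggregate parsed results: per-metric averages (excluding N/A) plus low-score alerts.
function aggregate(results: PromptResult[]): void {
  const byMetric: Record<string, number[]> = {};
  const alerts: string[] = [];

  for (const r of results) {
    for (const [metric, s] of Object.entries(r.scores)) {
      if (s.rawScore === undefined) continue; // N/A: exclude from averages entirely
      (byMetric[metric] ??= []).push(s.rawScore);

      // Thresholds from the list above: built-in < 0.5, custom <= 0
      const isLow = metric.startsWith("Builtin.") ? s.rawScore < 0.5 : s.rawScore <= 0;
      if (isLow) {
        alerts.push(
          `[${metric}] score=${s.rawScore.toFixed(2)} | "${r.prompt.slice(0, 40)}..."\n  Reason: ${s.reasoning ?? "(no explanation)"}`
        );
      }
    }
  }

  for (const [metric, scores] of Object.entries(byMetric)) {
    const avg = scores.reduce((a, b) => a + b, 0) / scores.length;
    console.log(`${metric}: avg=${avg.toFixed(2)} (n=${scores.length})`);
  }
  console.log(alerts.join("\n\n"));
}

The per-category breakdown is the same averaging run after grouping results by the category field.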

Step 7: Eval-Fix-Reeval Loop

Common Fixes

| Finding | Fix |
|---------|-----|
| Low brevity scores | Add hard constraint: “Respond in no more than 3 sentences.” |
| Low confirmation_check | Add: “Before executing, summarize details and ask for confirmation.” |
| Low missing_info_followup | Add: “If any required field is missing, ask for it. Do not assume.” |
| Low tone on negative outcomes | Add empathy instructions for bad-news scenarios |
| Low Completeness on simple prompts | Metric/data issue — add referenceResponse or filter from Completeness |

Metric Refinement

  • High N/A rates (>60%) — metric too narrowly scoped. Split dataset or adjust scope.
  • All-high scores — instructions too lenient. Add specific failure criteria.
  • Inconsistent scoring — instructions ambiguous. Add concrete examples per rating level.

Run Comparison

Run 1 (baseline):    response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes):  response_brevity avg=0.85, custom_tone avg=0.90

Track scores over time. The pipeline’s value comes from repeated measurement.
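
A small helper for that comparison, assuming per-metric averages have already been computed for each run (for example with the aggregation sketch above):

// Compare per-metric averages between two runs and print the deltas.
function compareRuns(baseline: Record<string, number>, current: Record<string, number>): void {
  for (const metric of Object.keys(baseline)) {
    const before = baseline[metric];
    const after = current[metric];
    if (after === undefined) continue;
    const delta = after - before;
    console.log(`${metric}: ${before.toFixed(2)} -> ${after.toFixed(2)} (${delta >= 0 ? "+" : ""}${delta.toFixed(2)})`);
  }
}

compareRuns(
  { response_brevity: 0.42, custom_tone: 0.80 },
  { response_brevity: 0.85, custom_tone: 0.90 },
);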


Gotchas

  1. taskType must be "General" — not “Generation” or any other value. The job fails silently with other values.

  2. Custom metric names in BOTH places — must appear in metricNames array AND customMetrics array. Missing from metricNames = silently ignored. Missing from customMetrics = job fails.

  3. null result means N/A, not 0 — when the judge determines a metric doesn’t apply, Bedrock records null:

    // WRONG — treats N/A as 0
    const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length;
    
    // RIGHT — excludes N/A from average
    const numericScores = scores.filter((s): s is number => s !== null);
    const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length;
    
  4. evaluatorModelConfig appears twice — once at top level (built-in metrics), once inside customMetricConfig (custom metrics). Omitting either causes those metrics to fail.

  5. modelIdentifier must match exactly — the string in JSONL modelResponses must be character-for-character identical to inferenceSourceIdentifier in inference-config.json. Mismatch = model mapping error.

  6. AWS CLI 2.33+ required — older versions silently drop customMetricConfig and precomputedInferenceSource. Job creation succeeds but the job fails. Always check aws --version.

  7. Job names: lowercase + hyphens, max 63 chars — pattern: [a-z0-9](-*[a-z0-9]){0,62}. Must be unique across all jobs. Use timestamps: --job-name "my-eval-$(date +%Y%m%d-%H%M)".

  8. S3 output is deeply nested — <prefix>/<job-name>/<job-name>/<random-id>/models/.... Use aws s3 sync and search for _output.jsonl. Do not construct paths manually.

  9. referenceResponse improves Correctness/Completeness — empty string is valid, but providing reference responses gives the judge a baseline for comparison.

  10. <thinking> tag leakage (model-specific) — some models (e.g., Amazon Nova Lite) may leak <thinking>...</thinking> blocks into responses. If present, strip before writing JSONL:

    const clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim();
    
  11. us-east-1 S3 bucket creation — do NOT pass LocationConstraint for us-east-1. Other regions require it.


Cost Estimation

Formula:

Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price

Example: 30 prompts, 10 metrics, Nova Pro judge:

  • Response collection (Nova Lite): ~$0.02
  • Evaluation job (Nova Pro): ~$0.58
  • Total per run: ~$0.61

Scaling: Cost is linear with prompts and metrics; scaling the 30-prompt example above to 100 prompts x 10 metrics comes to roughly $2/run. Judge cost dominates at ~95%. Adding 1 custom metric adds ~$0.06/run (30 prompts, Nova Pro).


References