aws-bedrock-evals
npx skills add https://github.com/antstackio/skills --skill aws-bedrock-evals
Agent 安装分布
Skill 文档
AWS Bedrock Evaluation Jobs
Overview
Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the “judge”) to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
Pre-computed Inference vs Live Inference
| Mode | How it works | Use when |
|---|---|---|
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |
Use pre-computed inference when your application involves tool use, agent loops, multi-turn state, or external orchestration.
Pipeline
Design Scenarios â Collect Responses â Upload to S3 â Run Eval Job â Parse Results â Act on Findings
| | | | | |
scenarios.json Your app's API s3://bucket/ create- s3 sync + Fix prompt,
(multi-turn) â dataset JSONL datasets/ evaluation-job parse JSONL retune metrics
Agent Behavior: Gather Inputs and Show Cost Estimate
Before generating any configs, scripts, or artifacts, you MUST gather the following from the user:
- AWS Region â Which region to use (default:
us-east-1). Affects model availability and pricing. - Target model â The model their application uses (e.g.,
amazon.nova-lite-v1:0,anthropic.claude-3-haiku). - Evaluator (judge) model â The model to score responses (e.g.,
amazon.nova-pro-v1:0). Should be at least as capable as the target. - Application type â Brief description of what the app does. Used to design test scenarios and derive custom metrics.
- Number of test scenarios â How many they plan to test (recommend 13-20 for first run).
- Estimated JSONL entries â Derived from scenarios x avg turns per scenario.
- Number of metrics â Total (built-in + custom). Recommend starting with 6 built-in + 3-5 custom.
- S3 bucket â Existing bucket name or confirm creation of a new one.
- IAM role â Existing role ARN or confirm creation of a new one.
Cost Estimate
After gathering inputs, you MUST display a cost estimate before proceeding:
## Estimated Cost Summary
| Item | Details | Est. Cost |
|------|---------|-----------|
| Response collection | {N} prompts x ~{T} tokens x {target_model_price} | ${X.XX} |
| Evaluation job | {N} prompts x {M} metrics x ~1,700 tokens x {judge_model_price} | ${X.XX} |
| S3 storage | < 1 MB | < $0.01 |
| **Total per run** | | **~${X.XX}** |
Scaling: Each additional run costs ~${X.XX}. Adding 1 custom metric adds ~${Y.YY}/run.
Cost formulas:
- Response collection:
num_prompts x avg_input_tokens x input_price + num_prompts x avg_output_tokens x output_price - Evaluation job:
num_prompts x num_metrics x ~1,500 input_tokens x judge_input_price + num_prompts x num_metrics x ~200 output_tokens x judge_output_price
Model pricing reference:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| amazon.nova-lite-v1:0 | $0.06 | $0.24 |
| amazon.nova-pro-v1:0 | $0.80 | $3.20 |
| anthropic.claude-3-haiku | $0.25 | $1.25 |
| anthropic.claude-3-sonnet | $3.00 | $15.00 |
Prerequisites
# AWS CLI 2.33+ required (older versions silently drop customMetricConfig/precomputedInferenceSource fields)
aws --version
# Verify target model access
aws bedrock get-foundation-model --model-identifier "TARGET_MODEL_ID" --region REGION
# Verify evaluator model access
aws bedrock get-foundation-model --model-identifier "EVALUATOR_MODEL_ID" --region REGION
Good evaluator model choices: amazon.nova-pro-v1:0, anthropic.claude-3-sonnet, anthropic.claude-3-haiku. The evaluator should be at least as capable as your target model.
Step 1: Design Test Scenarios
List the application’s functional areas (e.g., greeting, booking-flow, error-handling, etc.). Each category should have 2-4 scenarios covering happy path and edge cases.
Scenario JSON format:
[
{
"id": "greeting-known-user",
"category": "greeting",
"context": { "userId": "user-123" },
"turns": ["hello"]
},
{
"id": "multi-step-flow",
"category": "core-flow",
"context": { "userId": "user-456" },
"turns": [
"hello",
"I need help with X",
"yes, proceed with that",
"thanks"
]
}
]
The context field holds any session/user data your app needs. Each turn in the array is one user message; the collection step handles the multi-turn conversation loop.
Edge case coverage dimensions:
- Happy path: standard usage that should work perfectly
- Missing information: user omits required fields
- Unavailable resources: requested item doesn’t exist
- Out-of-scope requests: user asks something the app shouldn’t handle
- Error recovery: bad input, invalid data
- Tone stress tests: complaints, frustration
Recommended count: 13-20 scenarios producing 30-50 JSONL entries (multi-turn scenarios produce one entry per turn).
Step 2: Collect Responses
Collect responses from your application however it runs. The goal is to produce a JSONL dataset file where each line contains the prompt, the model’s response, and metadata.
Example pattern: Converse API with tool-calling loop (TypeScript)
This applies when your application uses Bedrock with tool calling:
import {
BedrockRuntimeClient,
ConverseCommand,
type Message,
type SystemContentBlock,
} from "@aws-sdk/client-bedrock-runtime";
const client = new BedrockRuntimeClient({ region: "us-east-1" });
async function converseLoop(
messages: Message[],
systemPrompt: SystemContentBlock[],
tools: any[]
): Promise<string> {
const MAX_TOOL_ROUNDS = 10;
for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
const response = await client.send(
new ConverseCommand({
modelId: "TARGET_MODEL_ID",
system: systemPrompt,
messages,
toolConfig: { tools },
inferenceConfig: { maxTokens: 1024, topP: 0.9, temperature: 0.7 },
})
);
const assistantContent = response.output?.message?.content as any[];
if (!assistantContent) return "[No response from model]";
messages.push({ role: "assistant", content: assistantContent });
const toolUseBlocks = assistantContent.filter(
(block: any) => block.toolUse != null
);
if (toolUseBlocks.length === 0) {
return assistantContent
.filter((block: any) => block.text != null)
.map((block: any) => block.text as string)
.join("\n") || "[Empty response]";
}
const toolResultBlocks: any[] = [];
for (const block of toolUseBlocks) {
const { toolUseId, name, input } = block.toolUse;
const result = await executeTool(name, input);
toolResultBlocks.push({
toolResult: { toolUseId, content: [{ json: result }] },
});
}
messages.push({ role: "user", content: toolResultBlocks } as Message);
}
return "[Max tool rounds exceeded]";
}
Multi-turn handling: Maintain the messages array across turns and build the dataset prompt field with conversation history:
const messages: Message[] = [];
const conversationHistory: { role: string; text: string }[] = [];
for (let i = 0; i < scenario.turns.length; i++) {
const userTurn = scenario.turns[i];
messages.push({ role: "user", content: [{ text: userTurn }] });
const assistantText = await converseLoop(messages, systemPrompt, tools);
conversationHistory.push({ role: "user", text: userTurn });
conversationHistory.push({ role: "assistant", text: assistantText });
let prompt: string;
if (i === 0) {
prompt = userTurn;
} else {
prompt = conversationHistory
.map((m) => `${m.role === "user" ? "User" : "Assistant"}: ${m.text}`)
.join("\n");
}
entries.push({
prompt,
category: scenario.category,
referenceResponse: "",
modelResponses: [
{ response: assistantText, modelIdentifier: "my-app-v1" },
],
});
}
Dataset JSONL Format
Each line must have this structure:
{
"prompt": "User question or multi-turn history",
"referenceResponse": "",
"modelResponses": [
{
"response": "The model's actual output text",
"modelIdentifier": "my-app-v1"
}
]
}
| Field | Required | Notes |
|---|---|---|
prompt |
Yes | User input. For multi-turn, concatenate: User: ...\nAssistant: ...\nUser: ... |
referenceResponse |
No | Expected/ideal response. Can be empty string. Needed for Builtin.Correctness and Builtin.Completeness to work properly. Maps to {{ground_truth}} template variable |
modelResponses |
Yes | Array with exactly one entry for pre-computed inference |
modelResponses[0].response |
Yes | The model’s actual output text |
modelResponses[0].modelIdentifier |
Yes | Any string label. Must match inferenceSourceIdentifier in inference-config.json |
Constraints: One model response per prompt. One unique modelIdentifier per job. Max 1000 prompts per job.
Write JSONL:
const jsonl = entries.map((e) => JSON.stringify(e)).join("\n") + "\n";
writeFileSync("datasets/collected-responses.jsonl", jsonl, "utf-8");
Step 3: Design Metrics
Built-In Metrics
Bedrock provides 11 built-in metrics requiring no configuration beyond listing them by name:
| Metric Name | What It Measures |
|---|---|
Builtin.Correctness |
Is the factual content accurate? (works best with referenceResponse) |
Builtin.Completeness |
Does the response fully cover the request? (works best with referenceResponse) |
Builtin.Faithfulness |
Is the response faithful to the provided context/source? |
Builtin.Helpfulness |
Is the response useful, actionable, and cooperative? |
Builtin.Coherence |
Is the response logically structured and easy to follow? |
Builtin.Relevance |
Does the response address the actual question? |
Builtin.FollowingInstructions |
Does the response follow explicit instructions in the prompt? |
Builtin.ProfessionalStyleAndTone |
Is spelling, grammar, and tone appropriate? |
Builtin.Harmfulness |
Does the response contain harmful content? |
Builtin.Stereotyping |
Does the response contain stereotypes or bias? |
Builtin.Refusal |
Does the response appropriately refuse harmful requests? |
Score interpretation: 1.0 = best, 0.0 = worst, null = N/A (judge could not evaluate).
Note: referenceResponse is needed for Builtin.Correctness and Builtin.Completeness to produce meaningful scores, since the judge compares against a reference baseline.
When to Use Custom Metrics
Use custom metrics to check domain-specific behaviors the built-in metrics don’t cover. If you find yourself thinking “this scored well on Helpfulness but violated a critical business rule” â that’s a custom metric.
Technique: Extract rules from your system prompt. Every rule in your system prompt is a candidate metric:
System prompt says: Candidate metric:
ââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
"Keep responses to 2-3 sentences max" â response_brevity
"Always greet returning users by name" â personalized_greeting
"Never proceed without user confirmation" â confirmation_check
"Ask for missing details, don't assume" â missing_info_followup
Custom Metric JSON Anatomy
{
"customMetricDefinition": {
"metricName": "my_metric_name",
"instructions": "You are evaluating ... \n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
"ratingScale": [
{ "definition": "Poor", "value": { "floatValue": 0 } },
{ "definition": "Good", "value": { "floatValue": 1 } }
]
}
}
| Field | Details |
|---|---|
metricName |
Snake_case identifier. Must appear in BOTH customMetrics array AND metricNames array |
instructions |
Full prompt sent to the judge. Must include {{prompt}} and {{prediction}} template variables. Can also use {{ground_truth}} (maps to referenceResponse). Input variables must come last in the prompt. |
ratingScale |
Array of rating levels. Each has a definition (label, max 5 words / 100 chars) and value with either floatValue or stringValue |
Official constraints:
- Max 10 custom metrics per job
- Instructions max 5000 characters
- Rating
definitionmax 5 words / 100 characters - Input variables (
{{prompt}},{{prediction}},{{ground_truth}}) must come last in the instruction text
Complete Custom Metric Example
A metric that checks whether the assistant follows a domain-specific rule, with N/A handling for irrelevant prompts:
{
"customMetricDefinition": {
"metricName": "confirmation_check",
"instructions": "You are evaluating an assistant application. A critical rule: the assistant must NEVER finalize a consequential action without first asking the user for explicit confirmation. Before executing, it must summarize details and ask something like 'Shall I go ahead?'.\n\nIf the conversation does not involve any consequential action, rate as 'Not Applicable'.\n\n- Not Applicable: No consequential action in this response\n- Poor: Proceeds with action without asking for confirmation\n- Good: Asks for confirmation before finalizing the action\n\nPrompt: {{prompt}}\nResponse: {{prediction}}",
"ratingScale": [
{ "definition": "N/A", "value": { "floatValue": -1 } },
{ "definition": "Poor", "value": { "floatValue": 0 } },
{ "definition": "Good", "value": { "floatValue": 1 } }
]
}
}
When the judge selects N/A (floatValue: -1), Bedrock records "result": null. Your parser must handle null â treat as N/A and exclude from averages.
Rating Scale Design
- 3-4 levels for quality scales (Poor/Acceptable/Good/Excellent)
- 2 levels for binary checks (Poor/Good)
- Add “N/A” level with
-1for conditional metrics that only apply to certain prompt types - Rating values can use
floatValue(numeric) orstringValue(text)
Tips for Writing Metric Instructions
- Be explicit about what “good” and “bad” look like â include examples of phrases or behaviors
- For conditional metrics, describe the N/A condition clearly so the judge doesn’t score 0 when it should skip
- Keep instructions under ~500 words to fit within context alongside prompt and response
- Test with a few examples before running a full eval job
Step 4: AWS Infrastructure
S3 Bucket
REGION="us-east-1"
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
BUCKET_NAME="my-eval-${ACCOUNT_ID}-${REGION}"
# us-east-1 does not accept LocationConstraint
if [ "${REGION}" = "us-east-1" ]; then
aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}"
else
aws s3api create-bucket --bucket "${BUCKET_NAME}" --region "${REGION}" \
--create-bucket-configuration LocationConstraint="${REGION}"
fi
Upload the dataset:
aws s3 cp datasets/collected-responses.jsonl \
"s3://${BUCKET_NAME}/datasets/collected-responses.jsonl"
IAM Role
Trust policy (must include aws:SourceAccount condition â Bedrock rejects the role without it):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": { "Service": "bedrock.amazonaws.com" },
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"aws:SourceAccount": "YOUR_ACCOUNT_ID"
}
}
}
]
}
Permissions policy:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "S3DatasetRead",
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:ListBucket"],
"Resource": [
"arn:aws:s3:::YOUR_BUCKET",
"arn:aws:s3:::YOUR_BUCKET/datasets/*"
]
},
{
"Sid": "S3ResultsWrite",
"Effect": "Allow",
"Action": ["s3:PutObject", "s3:GetObject"],
"Resource": ["arn:aws:s3:::YOUR_BUCKET/results/*"]
},
{
"Sid": "BedrockModelInvoke",
"Effect": "Allow",
"Action": ["bedrock:InvokeModel"],
"Resource": [
"arn:aws:bedrock:REGION::foundation-model/EVALUATOR_MODEL_ID"
]
}
]
}
Replace YOUR_BUCKET, REGION, and EVALUATOR_MODEL_ID with actual values.
Create the role:
ROLE_NAME="BedrockEvalRole"
ROLE_ARN=$(aws iam create-role \
--role-name "${ROLE_NAME}" \
--assume-role-policy-document file://trust-policy.json \
--description "Allows Bedrock to run evaluation jobs" \
--query "Role.Arn" --output text)
aws iam put-role-policy \
--role-name "${ROLE_NAME}" \
--policy-name "BedrockEvalPolicy" \
--policy-document file://permissions-policy.json
Step 5: Configure and Run Eval Job
eval-config.json
{
"automated": {
"datasetMetricConfigs": [
{
"taskType": "General",
"dataset": {
"name": "my-eval-dataset",
"datasetLocation": {
"s3Uri": "s3://YOUR_BUCKET/datasets/collected-responses.jsonl"
}
},
"metricNames": [
"Builtin.Helpfulness",
"Builtin.FollowingInstructions",
"Builtin.ProfessionalStyleAndTone",
"Builtin.Relevance",
"Builtin.Completeness",
"Builtin.Correctness",
"my_custom_metric_1",
"my_custom_metric_2"
]
}
],
"evaluatorModelConfig": {
"bedrockEvaluatorModels": [
{ "modelIdentifier": "EVALUATOR_MODEL_ID" }
]
},
"customMetricConfig": {
"customMetrics": [
{
"customMetricDefinition": {
"metricName": "my_custom_metric_1",
"instructions": "... {{prompt}} ... {{prediction}} ...",
"ratingScale": [
{ "definition": "Poor", "value": { "floatValue": 0 } },
{ "definition": "Good", "value": { "floatValue": 1 } }
]
}
}
],
"evaluatorModelConfig": {
"bedrockEvaluatorModels": [
{ "modelIdentifier": "EVALUATOR_MODEL_ID" }
]
}
}
}
}
Critical structure notes:
taskTypemust be"General"(not “Generation” or any other value)- Custom metric names must appear in both
metricNamesarray ANDcustomMetricsarray evaluatorModelConfigappears twice: once at the top level (for built-in metrics) and once insidecustomMetricConfig(for custom metrics) â both must specify the same evaluator modelmodelIdentifiermust be the exact model ID string matching across all configs
inference-config.json
For pre-computed inference, this tells Bedrock that responses are already collected:
{
"models": [
{
"precomputedInferenceSource": {
"inferenceSourceIdentifier": "my-app-v1"
}
}
]
}
The inferenceSourceIdentifier must match the modelIdentifier in your JSONL dataset’s modelResponses.
Running the Job
aws bedrock create-evaluation-job \
--job-name "my-eval-$(date +%Y%m%d-%H%M)" \
--role-arn "${ROLE_ARN}" \
--evaluation-config file://eval-config.json \
--inference-config file://inference-config.json \
--output-data-config '{"s3Uri": "s3://YOUR_BUCKET/results/"}' \
--region us-east-1
CLI notes:
- Required params:
--job-name,--role-arn,--evaluation-config,--inference-config,--output-data-config - Optional:
--application-type(e.g.,ModelEvaluation) --job-nameconstraint:[a-z0-9](-*[a-z0-9]){0,62}â lowercase + hyphens only, max 63 chars. Must be unique (use timestamps).--evaluation-configand--inference-configare document types â must usefile://or inline JSON, no shorthand syntax--output-data-configis a structure â supports both inline JSON and shorthand (s3Uri=string)
Monitoring
# List evaluation jobs (with optional filters)
aws bedrock list-evaluation-jobs --region us-east-1
aws bedrock list-evaluation-jobs --status-equals Completed --region us-east-1
aws bedrock list-evaluation-jobs --name-contains "my-eval" --region us-east-1
# Get details for a specific job
aws bedrock get-evaluation-job \
--job-identifier "JOB_ARN" \
--region us-east-1
# Cancel a running job
aws bedrock stop-evaluation-job \
--job-identifier "JOB_ARN" \
--region us-east-1
Job statuses: InProgress, Completed, Failed, Stopping, Stopped, Deleting
Jobs typically take 5-15 minutes for 30-50 entry datasets. If a job fails, check failureMessages in the job details.
Step 6: Parse Results
S3 Output Directory Structure
Bedrock writes results to a deeply nested path:
s3://YOUR_BUCKET/results/
âââ <job-name>/
âââ <job-name>/
âââ amazon-bedrock-evaluations-permission-check â empty sentinel
âââ <random-id>/
âââ custom_metrics/ â metric definitions (NOT results)
âââ models/
âââ <model-identifier>/
âââ taskTypes/General/datasets/<dataset-name>/
âââ <uuid>_output.jsonl â actual results
The job name is repeated twice. The random ID changes every run. Use aws s3 sync â do not construct paths manually.
Download Results
aws s3 sync "s3://YOUR_BUCKET/results/<job-name>" "./results/<job-name>" --region us-east-1
Result JSONL Format
Each line:
{
"automatedEvaluationResult": {
"scores": [
{
"metricName": "Builtin.Helpfulness",
"result": 0.6667,
"evaluatorDetails": [
{
"modelIdentifier": "amazon.nova-pro-v1:0",
"explanation": "The response provides useful information..."
}
]
},
{
"metricName": "confirmation_check",
"result": null,
"evaluatorDetails": [
{
"modelIdentifier": "amazon.nova-pro-v1:0",
"explanation": "This conversation does not involve any consequential action..."
}
]
}
]
},
"inputRecord": {
"prompt": "hello",
"referenceResponse": "",
"modelResponses": [
{ "response": "Hello! How may I assist you?", "modelIdentifier": "my-app-v1" }
]
}
}
resultis a number (score) ornull(N/A)evaluatorDetails[0].explanationcontains the judge’s written reasoning
Parsing and Aggregation
interface PromptResult {
prompt: string;
category: string;
modelResponse: string;
scores: Record<string, {
score: string;
reasoning?: string;
rawScore?: number;
}>;
}
for (const s of entry.automatedEvaluationResult.scores) {
scores[s.metricName] = {
score: s.result === null ? "N/A" : String(s.result),
reasoning: s.evaluatorDetails?.[0]?.explanation,
rawScore: typeof s.result === "number" ? s.result : undefined,
};
}
Aggregation approach:
- Overall averages per metric â exclude N/A entries
- Per-category breakdown â group by category field, compute averages within each
- Low-score alerts â flag entries below threshold (built-in < 0.5, custom <= 0)
Low-score alert format:
[Builtin.Relevance] score=0.50 | "hello..."
Reason: The response does not directly address the greeting...
[confirmation_check] score=0.00 | "User: proceed with X..."
Reason: The assistant executed the action without asking for confirmation...
Step 7: Eval-Fix-Reeval Loop
Common Fixes
| Finding | Fix |
|---|---|
| Low brevity scores | Add hard constraint: “Respond in no more than 3 sentences.” |
| Low confirmation_check | Add: “Before executing, summarize details and ask for confirmation.” |
| Low missing_info_followup | Add: “If any required field is missing, ask for it. Do not assume.” |
| Low tone on negative outcomes | Add empathy instructions for bad-news scenarios |
| Low Completeness on simple prompts | Metric/data issue â add referenceResponse or filter from Completeness |
Metric Refinement
- High N/A rates (>60%) â metric too narrowly scoped. Split dataset or adjust scope.
- All-high scores â instructions too lenient. Add specific failure criteria.
- Inconsistent scoring â instructions ambiguous. Add concrete examples per rating level.
Run Comparison
Run 1 (baseline): response_brevity avg=0.42, custom_tone avg=0.80
Run 2 (post-fixes): response_brevity avg=0.85, custom_tone avg=0.90
Track scores over time. The pipeline’s value comes from repeated measurement.
Gotchas
-
taskTypemust be"General"â not “Generation” or any other value. The job fails silently with other values. -
Custom metric names in BOTH places â must appear in
metricNamesarray ANDcustomMetricsarray. Missing frommetricNames= silently ignored. Missing fromcustomMetrics= job fails. -
nullresult means N/A, not 0 â when the judge determines a metric doesn’t apply, Bedrock recordsnull:// WRONG â treats N/A as 0 const avg = scores.reduce((a, b) => a + (b ?? 0), 0) / scores.length; // RIGHT â excludes N/A from average const numericScores = scores.filter((s): s is number => s !== null); const avg = numericScores.reduce((a, b) => a + b, 0) / numericScores.length; -
evaluatorModelConfigappears twice â once at top level (built-in metrics), once insidecustomMetricConfig(custom metrics). Omitting either causes those metrics to fail. -
modelIdentifiermust match exactly â the string in JSONLmodelResponsesmust be character-for-character identical toinferenceSourceIdentifierin inference-config.json. Mismatch = model mapping error. -
AWS CLI 2.33+ required â older versions silently drop
customMetricConfigandprecomputedInferenceSource. Job creation succeeds but the job fails. Always checkaws --version. -
Job names: lowercase + hyphens, max 63 chars â pattern:
[a-z0-9](-*[a-z0-9]){0,62}. Must be unique across all jobs. Use timestamps:--job-name "my-eval-$(date +%Y%m%d-%H%M)". -
S3 output is deeply nested â
<prefix>/<job-name>/<job-name>/<random-id>/models/.... Useaws s3 syncand search for_output.jsonl. Do not construct paths manually. -
referenceResponseimproves Correctness/Completeness â empty string is valid, but providing reference responses gives the judge a baseline for comparison. -
<thinking>tag leakage (model-specific) â some models (e.g., Amazon Nova Lite) may leak<thinking>...</thinking>blocks into responses. If present, strip before writing JSONL:const clean = raw.replace(/<thinking>[\s\S]*?<\/thinking>/g, "").trim(); -
us-east-1 S3 bucket creation â do NOT pass
LocationConstraintforus-east-1. Other regions require it.
Cost Estimation
Formula:
Total = response_collection_cost + judge_cost
Judge cost = num_prompts x num_metrics x (~1,500 input + ~200 output tokens) x judge_price
Example: 30 prompts, 10 metrics, Nova Pro judge:
- Response collection (Nova Lite): ~$0.02
- Evaluation job (Nova Pro): ~$0.58
- Total per run: ~$0.61
Scaling: Cost is linear with prompts and metrics. 100 prompts x 10 metrics â $5. Judge cost dominates at ~95%. Adding 1 custom metric adds ~$0.06/run (30 prompts, Nova Pro).
References
- Model Evaluation Metrics â all 11 built-in metrics
- Custom Metrics Prompt Formats â
metricName, template variables, constraints - Prompt Datasets for Judge Evaluation â dataset JSONL format
- CreateEvaluationJob API Reference â full API spec
- AWS CLI create-evaluation-job â CLI command reference
- Amazon Bedrock Pricing â model pricing