evalsense
npx skills add https://github.com/mohitjoshi14/evalsense --skill evalsense
Create, run, and validate an evalsense eval for the recently built LLM feature. No shipping until every assertion passes.
When to use
Use this skill after building or modifying an LLM-powered feature to validate it meets statistical quality thresholds before shipping. Invoke it by saying “run the quality gate”, “eval this feature”, or “use evalsense”.
Step 1: Understand the Feature
Read recently changed files to identify what the feature outputs and what “correct” looks like. If unclear, ask the user one focused question before continuing.
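For instance, the feature may reduce to a single async function with a clear output contract. The function below is purely illustrative (it is not part of evalsense, and the name, input, and label set are assumptions); your feature's shape will differ:

```javascript
// Hypothetical feature under test -- an LLM-backed sentiment classifier,
// stubbed here so the example runs without a model call.
async function classifySentiment(text) {
  // A real feature would call an LLM here; this stub keys off one word.
  return text.includes("love") ? "positive" : "negative";
}

// "Correct" for this feature means: the output is one of a fixed label set.
classifySentiment("I love this product").then((label) => console.log(label));
```

Pinning down the output contract like this tells you which metric family applies later (classification vs. regression vs. scores).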
Step 2: Ensure evalsense Is Available
npx evalsense docs
If that fails, install it with npm install --save-dev evalsense, then re-run. The docs command prints the full assertion API; read it before writing the eval file.
Step 3: Create the Eval File
Create <feature>.eval.js next to the feature or in tests/. Use the API from npx evalsense docs to choose the right metrics and matchers for the feature’s output type.
import { describe, evalTest, expectStats } from "evalsense";

describe("<Feature Name>", () => {
  evalTest("<what is being tested>", async () => {
    // Every record MUST have an `id` field. Minimum 10 records.
    // Cover: typical inputs, edge cases, adversarial, empty/null.
    const groundTruth = [{ id: "1", input: "...", expected_field: "<label or value>" }];

    const predictions = await Promise.all(
      groundTruth.map(async (record) => ({
        id: record.id,
        predicted_field: await yourFeature(record.input),
      }))
    );

    // Choose metrics based on output type (see npx evalsense docs for full API):
    // Classification → .accuracy, .precision("class"), .recall("class"), .f1
    // Regression → .mae, .rmse, .r2
    // Scores → .percentageAbove(threshold)
    // No ground truth → LLM-as-judge via evalsense/metrics/opinionated
    expectStats(predictions, groundTruth).field("predicted_field").accuracy.toBeAtLeast(0.85);
  });
});
Step 4: Run and Decide
Check available output flags, then run with JSON output for reliable parsing:
npx evalsense run --help
npx evalsense run -r json -o report.json
Read report.json. If all assertions pass, tell the user the feature is cleared to ship. If any fail, report which threshold was not met and suggest next steps (improve the model, adjust prompts, expand the ground-truth data). Do not clear the feature until every assertion passes.
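The pass/fail decision can be scripted rather than eyeballed. The sketch below assumes a report shape with an `assertions` array of `{ name, passed, actual }` entries; that schema is an assumption, so inspect your actual report.json, as evalsense's real output format may differ:

```javascript
// Sketch of summarizing an evalsense JSON report. The `report` object
// stands in for JSON.parse(fs.readFileSync("report.json", "utf8")); its
// schema is an assumed example, not evalsense's documented format.
const report = {
  assertions: [
    { name: "accuracy >= 0.85", passed: true, actual: 0.9 },
    { name: "f1 >= 0.8", passed: false, actual: 0.72 },
  ],
};

const failures = report.assertions.filter((a) => !a.passed);
if (failures.length === 0) {
  console.log("All assertions passed. Cleared to ship.");
} else {
  for (const f of failures) {
    console.log(`FAILED: ${f.name} (actual: ${f.actual})`);
  }
}
```

Surfacing the failing assertion names directly gives the user the exact thresholds to target when improving the model, adjusting prompts, or expanding data.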