evalsense

📁 mohitjoshi14/evalsense 📅 8 days ago
2
总安装量
2
周安装量
#74008
全站排名
安装命令
npx skills add https://github.com/mohitjoshi14/evalsense --skill evalsense

Agent 安装分布

opencode 2
gemini-cli 2
claude-code 2
github-copilot 2
codex 2
kimi-cli 2

Skill 文档

evalsense

Create, run, and validate an evalsense eval for the recently built LLM feature. No shipping until every assertion passes.

When to use

Use this skill after building or modifying an LLM-powered feature to validate it meets statistical quality thresholds before shipping. Invoke it by saying “run the quality gate”, “eval this feature”, or “use evalsense”.

Step 1 — Understand the Feature

Read recently changed files to identify what the feature outputs and what “correct” looks like. If unclear, ask the user one focused question before continuing.

Step 2 — Ensure evalsense is Available

npx evalsense docs

If that fails, install it: npm install --save-dev evalsense, then re-run. The docs command prints the full assertion API — read it before writing the eval file.

Step 3 — Create the Eval File

Create <feature>.eval.js next to the feature or in tests/. Use the API from npx evalsense docs to choose the right metrics and matchers for the feature’s output type.

import { describe, evalTest, expectStats } from "evalsense";

describe("<Feature Name>", () => {
  evalTest("<what is being tested>", async () => {
    // Every record MUST have an `id` field. Minimum 10 records.
    // Cover: typical inputs, edge cases, adversarial, empty/null.
    const groundTruth = [{ id: "1", input: "...", expected_field: "<label or value>" }];

    const predictions = await Promise.all(
      groundTruth.map(async (record) => ({
        id: record.id,
        predicted_field: await yourFeature(record.input),
      }))
    );

    // Choose metrics based on output type (see npx evalsense docs for full API):
    //   Classification → .accuracy, .precision("class"), .recall("class"), .f1
    //   Regression     → .mae, .rmse, .r2
    //   Scores         → .percentageAbove(threshold)
    //   No ground truth → LLM-as-judge via evalsense/metrics/opinionated
    expectStats(predictions, groundTruth).field("predicted_field").accuracy.toBeAtLeast(0.85);
  });
});

Step 4 — Run and Decide

Check available output flags, then run with JSON output for reliable parsing:

npx evalsense run --help
npx evalsense run -r json -o report.json

Read report.json. If all assertions pass, tell the user the feature is cleared to ship. If any fail, report which threshold was not met and suggest next steps (improve model, adjust prompts, expand data). Do not clear the feature until every assertion passes.