Databricks Skills Testing Framework (skill-test)

Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.

Install the skill with:

npx skills add https://github.com/databricks-solutions/ai-dev-kit --skill skill-test
Quick References
- Scorers – Available scorers and quality gates
- YAML Schemas – Manifest and ground truth formats
- Python API – Programmatic usage examples
- Workflows – Detailed example workflows
- Trace Evaluation – Session trace analysis
/skill-test Command
The /skill-test command provides an interactive CLI for testing Databricks skills with real execution on Databricks.
Basic Usage
/skill-test <skill-name> [subcommand]
Subcommands
| Subcommand | Description |
|---|---|
| run | Run evaluation against ground truth (default) |
| regression | Compare current results against baseline |
| init | Initialize test scaffolding for a new skill |
| add | Interactive: prompt -> invoke skill -> test -> save |
| add --trace | Add test case with trace evaluation |
| review | Review pending candidates interactively |
| review --batch | Batch approve all pending candidates |
| baseline | Save current results as regression baseline |
| mlflow | Run full MLflow evaluation with LLM judges |
| trace-eval | Evaluate traces against skill expectations |
| list-traces | List available traces (MLflow or local) |
| scorers | List configured scorers for a skill |
| scorers update | Add/remove scorers or update default guidelines |
| sync | Sync YAML to Unity Catalog (Phase 2) |
Quick Examples
/skill-test spark-declarative-pipelines run
/skill-test spark-declarative-pipelines add --trace
/skill-test spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init
See Workflows for detailed examples of each subcommand.
Execution Instructions
Environment Setup
uv pip install -e .test/
Environment variables for Databricks MLflow:
- DATABRICKS_CONFIG_PROFILE – Databricks CLI profile (default: "DEFAULT")
- MLFLOW_TRACKING_URI – Set to "databricks" for Databricks MLflow
- MLFLOW_EXPERIMENT_NAME – Experiment path (e.g., "/Users/{user}/skill-test")
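The same configuration can be applied programmatically. A minimal sketch, assuming you set it up in Python rather than via environment variables; the profile name and experiment path below are example values, not required ones:

```python
import os

import mlflow

# Example values only: adjust the profile and experiment path to your workspace.
os.environ.setdefault("DATABRICKS_CONFIG_PROFILE", "DEFAULT")
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/you@example.com/skill-test")
```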
Running Scripts
All subcommands have corresponding scripts in .test/scripts/:
uv run python .test/scripts/{subcommand}.py {skill_name} [options]
| Subcommand | Script |
|---|---|
| run | run_eval.py |
| regression | regression.py |
| init | init_skill.py |
| add | add.py |
| review | review.py |
| baseline | baseline.py |
| mlflow | mlflow_eval.py |
| scorers | scorers.py |
| scorers update | scorers_update.py |
| sync | sync.py |
| trace-eval | trace_eval.py |
| list-traces | list_traces.py |
| _routing mlflow | routing_eval.py |
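For example, the run subcommand for spark-declarative-pipelines maps to:
uv run python .test/scripts/run_eval.py spark-declarative-pipelines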
Use --help on any script for available options.
Command Handler
When /skill-test is invoked, parse arguments and execute the appropriate command.
Argument Parsing
- args[0] = skill_name (required)
- args[1] = subcommand (optional, default: "run")
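A minimal sketch of this parsing step (illustrative only; the actual handler lives in the skill_test package):

```python
def parse_skill_test_args(args: list[str]) -> tuple[str, str]:
    """Parse /skill-test arguments: skill name is required, subcommand defaults to 'run'."""
    if not args:
        raise ValueError("skill_name is required: /skill-test <skill-name> [subcommand]")
    skill_name = args[0]
    subcommand = args[1] if len(args) > 1 else "run"
    return skill_name, subcommand
```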
Subcommand Routing
| Subcommand | Action |
|---|---|
| run | Execute run(skill_name, ctx) and display results |
| regression | Execute regression(skill_name, ctx) and display comparison |
| init | Execute init(skill_name, ctx) to create scaffolding |
| add | Prompt for test input, invoke skill, run interactive() |
| review | Execute review(skill_name, ctx) to review pending candidates |
| baseline | Execute baseline(skill_name, ctx) to save as regression baseline |
| mlflow | Execute mlflow_eval(skill_name, ctx) with MLflow logging |
| scorers | Execute scorers(skill_name, ctx) to list configured scorers |
| scorers update | Execute scorers_update(skill_name, ctx, ...) to modify scorers |
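A sketch of the dispatch step. The command functions are the ones listed in the table above; the skill_test.cli import path is an assumption for this example (see Python API for the actual entry points):

```python
# Illustrative only: the import path is assumed; `add` is omitted because it
# requires prompting for test input and calling interactive() separately.
from skill_test import cli


def dispatch(skill_name: str, subcommand: str, ctx) -> None:
    handlers = {
        "run": cli.run,
        "regression": cli.regression,
        "init": cli.init,
        "review": cli.review,
        "baseline": cli.baseline,
        "mlflow": cli.mlflow_eval,
        "scorers": cli.scorers,
    }
    if subcommand == "scorers update":
        cli.scorers_update(skill_name, ctx)
    elif subcommand in handlers:
        handlers[subcommand](skill_name, ctx)
    else:
        raise ValueError(f"Unknown subcommand: {subcommand}")
```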
init Behavior
When running /skill-test <skill-name> init:
- Read the skill's SKILL.md to understand its purpose
- Create manifest.yaml with appropriate scorers and trace_expectations
- Create empty ground_truth.yaml and candidates.yaml templates
- Recommend test prompts based on documentation examples
Follow up with /skill-test <skill-name> add using the recommended prompts.
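For a new skill, the typical sequence is:
/skill-test my-new-skill init
/skill-test my-new-skill add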
Context Setup
Create CLIContext with MCP tools before calling any command. See Python API for details.
File Locations
Important: All test files are stored at the repository root level, not relative to this skill’s directory.
| File Type | Path |
|---|---|
| Ground truth | {repo_root}/.test/skills/{skill-name}/ground_truth.yaml |
| Candidates | {repo_root}/.test/skills/{skill-name}/candidates.yaml |
| Manifest | {repo_root}/.test/skills/{skill-name}/manifest.yaml |
| Routing tests | {repo_root}/.test/skills/_routing/ground_truth.yaml |
| Baselines | {repo_root}/.test/baselines/{skill-name}/baseline.yaml |
For example, to test spark-declarative-pipelines in this repository:
/Users/.../ai-dev-kit/.test/skills/spark-declarative-pipelines/ground_truth.yaml
Not relative to the skill definition:
/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/... # WRONG
Directory Structure
.test/                        # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml            # Package config (pip install -e ".test/")
├── README.md                 # Contributor documentation
├── SKILL.md                  # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh     # Sync script
├── scripts/                  # Wrapper scripts
│   ├── _common.py            # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py         # Trace evaluation
│   ├── list_traces.py        # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/           # Python package
│       ├── cli/              # CLI commands module
│       ├── fixtures/         # Test fixture setup
│       ├── scorers/          # Evaluation scorers
│       ├── grp/              # Generate-Review-Promote pipeline
│       └── runners/          # Evaluation runners
├── skills/                   # Per-skill test definitions
│   ├── _routing/             # Routing test cases
│   └── {skill-name}/         # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                    # Unit tests
├── references/               # Documentation references
└── baselines/                # Regression baselines