Databricks Skills Testing Framework (skill-test)

Offline YAML-first evaluation with human-in-the-loop review and interactive skill improvement.

Install the skill with:

npx skills add https://github.com/databricks-solutions/ai-dev-kit --skill skill-test
Quick References
- Scorers – Available scorers and quality gates
- YAML Schemas – Manifest and ground truth formats
- Python API – Programmatic usage examples
- Workflows – Detailed example workflows
- Trace Evaluation – Session trace analysis
/skill-test Command
The /skill-test command provides an interactive CLI for testing Databricks skills with real execution on Databricks.
Basic Usage
/skill-test <skill-name> [subcommand]
Subcommands
| Subcommand | Description |
|---|---|
| run | Run evaluation against ground truth (default) |
| regression | Compare current results against baseline |
| init | Initialize test scaffolding for a new skill |
| add | Interactive: prompt -> invoke skill -> test -> save |
| add --trace | Add test case with trace evaluation |
| review | Review pending candidates interactively |
| review --batch | Batch approve all pending candidates |
| baseline | Save current results as regression baseline |
| mlflow | Run full MLflow evaluation with LLM judges |
| trace-eval | Evaluate traces against skill expectations |
| list-traces | List available traces (MLflow or local) |
| scorers | List configured scorers for a skill |
| scorers update | Add/remove scorers or update default guidelines |
| sync | Sync YAML to Unity Catalog (Phase 2) |
Quick Examples
/skill-test spark-declarative-pipelines run
/skill-test spark-declarative-pipelines add --trace
/skill-test spark-declarative-pipelines review --batch --filter-success
/skill-test my-new-skill init
See Workflows for detailed examples of each subcommand.
Execution Instructions
Environment Setup
uv pip install -e .test/
Environment variables for Databricks MLflow:
- DATABRICKS_CONFIG_PROFILE – Databricks CLI profile (default: "DEFAULT")
- MLFLOW_TRACKING_URI – Set to "databricks" for Databricks MLflow
- MLFLOW_EXPERIMENT_NAME – Experiment path (e.g., "/Users/{user}/skill-test")
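The same configuration can be applied programmatically. A minimal sketch, assuming you set it up in Python rather than via environment variables; the profile name and experiment path below are example values, not required ones:

```python
import os

import mlflow

# Example values only: adjust the profile and experiment path to your workspace.
os.environ.setdefault("DATABRICKS_CONFIG_PROFILE", "DEFAULT")
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Users/you@example.com/skill-test")
```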
Running Scripts
All subcommands have corresponding scripts in .test/scripts/:
uv run python .test/scripts/{subcommand}.py {skill_name} [options]
| Subcommand | Script |
|---|---|
| run | run_eval.py |
| regression | regression.py |
| init | init_skill.py |
| add | add.py |
| review | review.py |
| baseline | baseline.py |
| mlflow | mlflow_eval.py |
| scorers | scorers.py |
| scorers update | scorers_update.py |
| sync | sync.py |
| trace-eval | trace_eval.py |
| list-traces | list_traces.py |
| _routing mlflow | routing_eval.py |
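For example, the run subcommand for spark-declarative-pipelines maps to:
uv run python .test/scripts/run_eval.py spark-declarative-pipelines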
Use --help on any script for available options.
Command Handler
When /skill-test is invoked, parse arguments and execute the appropriate command.
Argument Parsing
- args[0] = skill_name (required)
- args[1] = subcommand (optional, default: "run")
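A minimal sketch of this parsing step (illustrative only; the actual handler lives in the skill_test package):

```python
def parse_skill_test_args(args: list[str]) -> tuple[str, str]:
    """Parse /skill-test arguments: skill name is required, subcommand defaults to 'run'."""
    if not args:
        raise ValueError("skill_name is required: /skill-test <skill-name> [subcommand]")
    skill_name = args[0]
    subcommand = args[1] if len(args) > 1 else "run"
    return skill_name, subcommand
```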
Subcommand Routing
| Subcommand | Action |
|---|---|
| run | Execute run(skill_name, ctx) and display results |
| regression | Execute regression(skill_name, ctx) and display comparison |
| init | Execute init(skill_name, ctx) to create scaffolding |
| add | Prompt for test input, invoke skill, run interactive() |
| review | Execute review(skill_name, ctx) to review pending candidates |
| baseline | Execute baseline(skill_name, ctx) to save as regression baseline |
| mlflow | Execute mlflow_eval(skill_name, ctx) with MLflow logging |
| scorers | Execute scorers(skill_name, ctx) to list configured scorers |
| scorers update | Execute scorers_update(skill_name, ctx, ...) to modify scorers |
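A sketch of the dispatch step. The command functions are the ones listed in the table above; the skill_test.cli import path is an assumption for this example (see Python API for the actual entry points):

```python
# Illustrative only: the import path is assumed; `add` is omitted because it
# requires prompting for test input and calling interactive() separately.
from skill_test import cli


def dispatch(skill_name: str, subcommand: str, ctx) -> None:
    handlers = {
        "run": cli.run,
        "regression": cli.regression,
        "init": cli.init,
        "review": cli.review,
        "baseline": cli.baseline,
        "mlflow": cli.mlflow_eval,
        "scorers": cli.scorers,
    }
    if subcommand == "scorers update":
        cli.scorers_update(skill_name, ctx)
    elif subcommand in handlers:
        handlers[subcommand](skill_name, ctx)
    else:
        raise ValueError(f"Unknown subcommand: {subcommand}")
```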
init Behavior
When running /skill-test <skill-name> init:
- Read the skill's SKILL.md to understand its purpose
- Create manifest.yaml with appropriate scorers and trace_expectations
- Create empty ground_truth.yaml and candidates.yaml templates
- Recommend test prompts based on documentation examples
Follow up with /skill-test <skill-name> add using the recommended prompts.
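For a new skill, the typical sequence is:
/skill-test my-new-skill init
/skill-test my-new-skill add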
Context Setup
Create CLIContext with MCP tools before calling any command. See Python API for details.
File Locations
Important: All test files are stored at the repository root level, not relative to this skill’s directory.
| File Type | Path |
|---|---|
| Ground truth | {repo_root}/.test/skills/{skill-name}/ground_truth.yaml |
| Candidates | {repo_root}/.test/skills/{skill-name}/candidates.yaml |
| Manifest | {repo_root}/.test/skills/{skill-name}/manifest.yaml |
| Routing tests | {repo_root}/.test/skills/_routing/ground_truth.yaml |
| Baselines | {repo_root}/.test/baselines/{skill-name}/baseline.yaml |
For example, to test spark-declarative-pipelines in this repository:
/Users/.../ai-dev-kit/.test/skills/spark-declarative-pipelines/ground_truth.yaml
Not relative to the skill definition:
/Users/.../ai-dev-kit/.claude/skills/skill-test/skills/... # WRONG
Directory Structure
.test/                        # At REPOSITORY ROOT (not skill directory)
├── pyproject.toml            # Package config (pip install -e ".test/")
├── README.md                 # Contributor documentation
├── SKILL.md                  # Source of truth (synced to .claude/skills/)
├── install_skill_test.sh     # Sync script
├── scripts/                  # Wrapper scripts
│   ├── _common.py            # Shared utilities
│   ├── run_eval.py
│   ├── regression.py
│   ├── init_skill.py
│   ├── add.py
│   ├── baseline.py
│   ├── mlflow_eval.py
│   ├── routing_eval.py
│   ├── trace_eval.py         # Trace evaluation
│   ├── list_traces.py        # List available traces
│   ├── scorers.py
│   ├── scorers_update.py
│   └── sync.py
├── src/
│   └── skill_test/           # Python package
│       ├── cli/              # CLI commands module
│       ├── fixtures/         # Test fixture setup
│       ├── scorers/          # Evaluation scorers
│       ├── grp/              # Generate-Review-Promote pipeline
│       └── runners/          # Evaluation runners
├── skills/                   # Per-skill test definitions
│   ├── _routing/             # Routing test cases
│   └── {skill-name}/         # Skill-specific tests
│       ├── ground_truth.yaml
│       ├── candidates.yaml
│       └── manifest.yaml
├── tests/                    # Unit tests
├── references/               # Documentation references
└── baselines/                # Regression baselines