mlflow-evaluation

Repository: databricks-solutions/ai-dev-kit
Install command
npx skills add https://github.com/databricks-solutions/ai-dev-kit --skill mlflow-evaluation

Skill Documentation

MLflow 3 GenAI Evaluation

Before Writing Any Code

  1. Read GOTCHAS.md – 15+ common mistakes that cause failures
  2. Read CRITICAL-interfaces.md – Exact API signatures and data schemas

End-to-End Workflows

Follow these workflows based on your goal. Each step indicates which reference files to read.

Workflow 1: First-Time Evaluation Setup

For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Understand what to evaluate | user-journeys.md (Journey 0: Strategy) |
| 2 | Learn API patterns | GOTCHAS.md + CRITICAL-interfaces.md |
| 3 | Build initial dataset | patterns-datasets.md (Patterns 1-4) |
| 4 | Choose/create scorers | patterns-scorers.md + CRITICAL-interfaces.md (built-in list) |
| 5 | Run evaluation | patterns-evaluation.md (Patterns 1-3) |
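
A minimal sketch of steps 3-5, assuming an MLflow 3 environment is already configured; the dataset contents, the agent call (`my_agent.invoke`), and the scorer choices are illustrative, not taken from the reference files:

```python
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery

# Illustrative dataset: every record nests its fields under "inputs";
# ground truth goes under "expectations".
eval_data = [
    {
        "inputs": {"query": "How do I create an MLflow experiment?"},
        "expectations": {"expected_response": "Use mlflow.create_experiment(...)"},
    },
]

# predict_fn receives the contents of "inputs" as unpacked keyword
# arguments, so its parameter names must match the "inputs" keys.
def predict_fn(query: str) -> str:
    return my_agent.invoke(query)  # hypothetical agent call

results = mlflow.genai.evaluate(
    data=eval_data,
    predict_fn=predict_fn,
    scorers=[Correctness(), RelevanceToQuery()],
)
print(results.metrics)
```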

Workflow 2: Production Trace -> Evaluation Dataset

For building evaluation datasets from production traces.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Search and filter traces | patterns-trace-analysis.md (MCP tools section) |
| 2 | Analyze trace quality | patterns-trace-analysis.md (Patterns 1-7) |
| 3 | Tag traces for inclusion | patterns-datasets.md (Patterns 16-17) |
| 4 | Build dataset from traces | patterns-datasets.md (Patterns 6-7) |
| 5 | Add expectations/ground truth | patterns-datasets.md (Pattern 2) |
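
A rough sketch of steps 1 and 4, assuming `mlflow.search_traces` returns a pandas DataFrame with `request` and `response` columns (column names and filter syntax vary across MLflow versions, so verify against your environment) and that candidate traces were tagged `eval_candidate`:

```python
import mlflow

# Step 1: pull candidate traces. The tag name "eval_candidate" is an
# illustrative convention, not a built-in; filter key syntax may
# differ by MLflow version.
traces = mlflow.search_traces(
    filter_string="tags.eval_candidate = 'true'",
    max_results=50,
)

# Step 4: turn each captured request/response pair into an evaluation
# record. Production outputs are only draft ground truth -- review
# them before trusting them as "expectations".
eval_data = [
    {
        "inputs": {"query": row["request"]},
        "expectations": {"expected_response": row["response"]},
    }
    for _, row in traces.iterrows()
]
```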

Workflow 3: Performance Optimization

For debugging slow or expensive agent execution.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Profile latency by span | patterns-trace-analysis.md (Patterns 4-6) |
| 2 | Analyze token usage | patterns-trace-analysis.md (Pattern 9) |
| 3 | Detect context issues | patterns-context-optimization.md (Section 5) |
| 4 | Apply optimizations | patterns-context-optimization.md (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | patterns-evaluation.md (Patterns 6-7) |
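
For step 1, a sketch of span-level latency profiling using MLflow's trace object (the trace ID is a placeholder):

```python
import mlflow
from collections import defaultdict

trace = mlflow.get_trace("<trace-id>")  # placeholder ID

# MLflow spans carry start/end timestamps in nanoseconds; aggregate
# wall-clock time per span name to find hot spots. Note that parent
# spans include their children's time, so nested spans are counted twice.
latency_ms = defaultdict(float)
for span in trace.data.spans:
    latency_ms[span.name] += (span.end_time_ns - span.start_time_ns) / 1e6

for name, ms in sorted(latency_ms.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ms:.1f} ms")
```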

Workflow 4: Regression Detection

For comparing agent versions and finding regressions.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Establish baseline | patterns-evaluation.md (Pattern 4: named runs) |
| 2 | Run current version | patterns-evaluation.md (Pattern 1) |
| 3 | Compare metrics | patterns-evaluation.md (Patterns 6-7) |
| 4 | Analyze failing traces | patterns-trace-analysis.md (Pattern 7) |
| 5 | Debug specific failures | patterns-trace-analysis.md (Patterns 8-9) |
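
A sketch of steps 1-3, assuming `eval_data` and the two agent callables (`agent_v1`, `agent_v2`) already exist:

```python
import mlflow
from mlflow.genai.scorers import Correctness

def run_eval(predict_fn):
    return mlflow.genai.evaluate(
        data=eval_data, predict_fn=predict_fn, scorers=[Correctness()]
    )

baseline = run_eval(agent_v1)    # step 1: baseline version
candidate = run_eval(agent_v2)   # step 2: current version

# Step 3: compare aggregate metrics; a drop on any scorer flags a
# regression worth tracing back to individual failing rows.
for metric, new_value in candidate.metrics.items():
    print(f"{metric}: baseline={baseline.metrics.get(metric)} candidate={new_value}")
```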

Workflow 5: Custom Scorer Development

For creating project-specific evaluation metrics.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Understand scorer interface | CRITICAL-interfaces.md (Scorer section) |
| 2 | Choose scorer pattern | patterns-scorers.md (Patterns 4-11) |
| 3 | For multi-agent scorers | patterns-scorers.md (Patterns 13-16) |
| 4 | Test with evaluation | patterns-evaluation.md (Pattern 1) |
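
As an illustration of the scorer interface, a minimal custom scorer (the conciseness rule and its threshold are invented for the example):

```python
from mlflow.genai.scorers import scorer

# The evaluation harness calls the decorated function with keyword
# arguments; declare only the ones you need (inputs, outputs,
# expectations, trace).
@scorer
def response_is_concise(outputs) -> bool:
    # Invented project rule: answers over 500 characters fail.
    return len(str(outputs)) <= 500
```

Pass the decorated function directly, e.g. `scorers=[response_is_concise]`, alongside any built-in scorers.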

Reference Files Quick Lookup

| Reference | Purpose | When to Read |
|-----------|---------|--------------|
| GOTCHAS.md | Common mistakes | Always read first, before writing code |
| CRITICAL-interfaces.md | API signatures, schemas | When writing any evaluation code |
| patterns-evaluation.md | Running evals, comparing runs | When executing evaluations |
| patterns-scorers.md | Custom scorer creation | When built-in scorers aren't enough |
| patterns-datasets.md | Dataset building | When preparing evaluation data |
| patterns-trace-analysis.md | Trace debugging | When analyzing agent behavior |
| patterns-context-optimization.md | Token/latency fixes | When the agent is slow or expensive |
| user-journeys.md | High-level workflows | When starting a new evaluation project |

Critical API Facts

  • Use: mlflow.genai.evaluate() (NOT mlflow.evaluate())
  • Data format: {"inputs": {"query": "..."}} (nested structure required)
  • predict_fn: receives the fields of inputs as unpacked keyword arguments (**kwargs), not as a single dict

See GOTCHAS.md for the complete list.
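
Putting those three facts together, a minimal end-to-end sketch (the echo predict_fn and the Safety scorer are placeholders):

```python
import mlflow
from mlflow.genai.scorers import Safety

# Nested structure required: the keys under "inputs" become the
# keyword arguments of predict_fn.
data = [{"inputs": {"query": "hello"}}]

def predict_fn(query: str) -> str:   # invoked as predict_fn(query="hello")
    return f"echo: {query}"

# Note the namespace: mlflow.genai.evaluate, not mlflow.evaluate.
mlflow.genai.evaluate(data=data, predict_fn=predict_fn, scorers=[Safety()])
```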