data-review
npx skills add https://github.com/wcygan/dotfiles --skill data-review
Data Review Skill
A multi-agent data platform review system that audits pipelines, warehouses, and analytics infrastructure, then produces a prioritized health report with actionable recommendations.
Prerequisites
None required. Works with any data platform that provides:
- Pipeline configuration files (Airflow DAGs, dbt models, etc.)
- Query logs or analytics code
- Infrastructure-as-code (Terraform, CloudFormation)
- Data quality metrics or monitoring dashboards
Inputs
The user provides:
- Platform description (required) – Overview of the data platform architecture
- OR: Path to configuration directory (e.g., dbt/, airflow/dags/, terraform/)
- Key data flows (optional) – Critical pipelines to prioritize, e.g. “user events → warehouse → BI dashboards”
- Known pain points (optional) – Current issues, slow queries, quality problems
- Compliance requirements (optional) – GDPR, HIPAA, SOC2, etc.
- Context file (optional) – Markdown file with platform details, constraints, team size
If the user doesn’t provide optional inputs, use reasonable defaults and note assumptions.
Agent Roster
Each agent has a specialized domain. All agents read the same platform artifacts in parallel.
| # | Agent | Focus | Reference |
|---|---|---|---|
| 1 | Data Engineer | Pipeline reliability, orchestration, dependencies, error handling, monitoring | agents/data-engineer.md |
| 2 | Data Scientist | Analytics quality, model pipelines, feature engineering, reproducibility | agents/data-scientist.md |
| 3 | Performance Analyst | Query optimization, indexing, partitioning, compute costs, bottlenecks | agents/performance-analyst.md |
| 4 | Security Auditor | Data governance, access controls, PII handling, compliance, lineage | agents/security-auditor.md |
| 5 | Synthesizer | Reads all agent reports → produces a prioritized action plan with trade-offs | Built-in coordinator role |
Workflow
Phase 1: Discovery
- Confirm the platform description or config path
- Identify platform type (dbt, Airflow, Databricks, Snowflake, custom, etc.)
- Scan for key artifacts:
- Pipeline definitions (DAGs, models, workflows)
- Query files (SQL, notebooks)
- Infrastructure code (Terraform, YAML configs)
- Data quality tests or schema definitions
- Monitoring/alerting configurations
- Build a platform inventory listing:
- Pipeline count and types
- Data sources and destinations
- Compute/storage components
- Orchestration tools
- Save discovery results to workspace/discovery.md
This discovery output is shared with all review agents as context.
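The scan-and-inventory step above can be sketched as a small script. This is a minimal Python sketch; the artifact categories and glob patterns are illustrative assumptions, not part of the skill, and should be adjusted to the platform's actual layout:

```python
"""Sketch of the Phase 1 artifact scan (patterns are assumptions)."""
from pathlib import Path

# Hypothetical artifact categories mapped to typical file patterns.
ARTIFACT_PATTERNS = {
    "pipelines": ["**/dags/**/*.py", "**/models/**/*.sql"],
    "queries": ["**/*.sql", "**/*.ipynb"],
    "infrastructure": ["**/*.tf", "**/*.yml", "**/*.yaml"],
}

def build_inventory(root: str) -> dict[str, list[str]]:
    """Return {category: [relative paths]} for the platform inventory."""
    base = Path(root)
    inventory: dict[str, list[str]] = {}
    for category, patterns in ARTIFACT_PATTERNS.items():
        hits = {p for pattern in patterns for p in base.glob(pattern)}
        inventory[category] = sorted(str(p.relative_to(base)) for p in hits)
    return inventory

def write_discovery(inventory: dict[str, list[str]],
                    out: str = "workspace/discovery.md") -> None:
    """Render the inventory as the discovery markdown file."""
    lines = ["# Platform Inventory", ""]
    for category, files in inventory.items():
        lines.append(f"## {category} ({len(files)} files)")
        lines.extend(f"- {f}" for f in files)
        lines.append("")
    Path(out).parent.mkdir(parents=True, exist_ok=True)
    Path(out).write_text("\n".join(lines))
```

The point is only that discovery is a mechanical file census; the agent performing it can use whatever scanning mechanism its runtime provides.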
Phase 2: Parallel Review (Sub-agents)
Spawn agents 1-4 in parallel using the Task tool. Each agent receives:
- The platform description and discovery file
- Their specific agent instructions (from agents/*.md)
- The review checklists and scoring rubric (REFERENCE.md)
- An output file path for their findings
Each agent:
- Reads relevant platform artifacts (configs, code, schemas)
- Applies their domain-specific audit checklist
- Scores each dimension (1-5 scale)
- Documents findings with file/line references
- Writes prioritized recommendations to their output file
Agent output files:
- workspace/agents/data-engineer.md
- workspace/agents/data-scientist.md
- workspace/agents/performance-analyst.md
- workspace/agents/security-auditor.md
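The Phase 2 fan-out is performed by the Task tool, but its shape is an ordinary parallel map-and-collect. A rough Python analogy, where `run_agent` is a stand-in for "spawn an agent and wait for its report":

```python
"""Illustrative only: the real fan-out uses the agent runtime's Task tool.
This sketch shows the same shape -- four independent reviews running in
parallel against shared context, each producing its own report."""
from concurrent.futures import ThreadPoolExecutor

AGENTS = ["data-engineer", "data-scientist",
          "performance-analyst", "security-auditor"]

def run_agent(name: str, discovery: str) -> tuple[str, str]:
    # Placeholder for "spawn agent, wait for its report".
    report = f"# {name} findings\n(context: {len(discovery)} chars)"
    return name, report

def fan_out(discovery: str) -> dict[str, str]:
    """Run all agents in parallel; return {agent name: report text}."""
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = [pool.submit(run_agent, a, discovery) for a in AGENTS]
        return dict(f.result() for f in futures)
```

Because the agents only read shared context and write to separate output files, they have no ordering dependency and can safely run concurrently.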
Phase 3: Architecture Debate
After all agents complete their audits, spawn a debate session:
- Create a debate prompt with all agent findings
- Agents discuss conflicting recommendations (e.g., performance vs. cost)
- Identify trade-offs and prioritization criteria
- Build consensus on critical vs. nice-to-have improvements
- Save debate transcript to workspace/debate.md
Phase 4: Synthesis
The coordinator (you) acts as the Synthesizer:
- Read all 4 agent reports and debate transcript
- Deduplicate overlapping findings
- Categorize by severity (Critical / High / Medium / Low)
- Rank by impact-vs-effort for small teams
- Produce the final health report
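The deduplicate-categorize-rank steps can be made concrete with a small triage helper. A sketch only; the 1-5 impact/effort scale and the quadrant cutoffs are assumptions chosen for illustration, not defined by the skill:

```python
"""Sketch of the synthesis triage (scales and cutoffs are assumptions)."""

SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

def triage(findings: list[dict]) -> list[dict]:
    """Sort findings: highest severity first, then best impact-per-effort."""
    return sorted(
        findings,
        key=lambda f: (SEVERITY_ORDER[f["severity"]],
                       -(f["impact"] / f["effort"])),
    )

def quadrant(finding: dict) -> str:
    """Bucket a finding for the report (cutoffs are arbitrary midpoints)."""
    high_impact = finding["impact"] >= 3
    low_effort = finding["effort"] <= 2
    if high_impact and low_effort:
        return "quick win"
    if high_impact:
        return "strategic improvement"
    return "backlog"
```

Ranking by impact-per-effort within a severity tier is one reasonable heuristic for small teams; the coordinator may weigh constraints (team size, compliance deadlines) differently.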
Phase 5: Output
Generate the final deliverable using the report template in REFERENCE.md.
Save to workspace/data-platform-health-report.md and present to the user.
Output Structure
workspace/
├── discovery.md                       # Platform inventory from Phase 1
├── agents/
│   ├── data-engineer.md
│   ├── data-scientist.md
│   ├── performance-analyst.md
│   └── security-auditor.md
├── debate.md                          # Architecture trade-offs discussion
└── data-platform-health-report.md     # Final synthesized report
Coordinator Responsibilities
- Run the discovery phase to build platform inventory
- Spawn review agents in parallel with Task tool
- Ensure each agent has access to the discovery file and reference materials
- Collect all agent reports
- Facilitate the architecture debate (spawn debate agents or synthesize manually)
- Produce the final health report with scoring and prioritized recommendations
- Present the report to the user with an executive summary
Customization
The user can customize the review by:
- Skipping agents: “Skip data science review, focus on infrastructure and performance”
- Focusing on specific pipelines: “Only review the user_events ETL and downstream models”
- Prioritizing dimensions: “I care most about compliance, less about performance”
- Adding comparisons: “Compare our approach to industry best practices for event streaming”
- Specifying constraints: “We’re a 2-person team, recommend low-maintenance solutions only”
- Setting compliance scope: “Audit for GDPR compliance specifically”
Adapt the agent roster and instructions accordingly.
Scoring System
Each agent rates their domain on a 1-5 scale:
- 5 (Excellent): Industry best practices, fully automated, no issues
- 4 (Good): Minor improvements possible, well-maintained
- 3 (Adequate): Functional but needs attention, some technical debt
- 2 (Poor): Significant issues, requires immediate action
- 1 (Critical): Broken or severely compromised, blocking business value
The final report includes:
- Overall platform health score (average across domains)
- Per-domain scores with justification
- Critical findings (score ≤ 2)
- Quick wins (high impact, low effort)
- Strategic improvements (high impact, high effort)
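The report-level rollup described above can be computed directly. A minimal sketch; the domain names here are placeholders:

```python
"""Sketch of the health-score rollup (domain names are placeholders)."""

def platform_health(domain_scores: dict[str, int]) -> dict:
    """Roll per-domain 1-5 scores into the report-level summary."""
    overall = sum(domain_scores.values()) / len(domain_scores)
    return {
        "overall": round(overall, 1),
        # Critical findings threshold from the rubric: score <= 2.
        "critical_domains": sorted(d for d, s in domain_scores.items()
                                   if s <= 2),
    }
```

A plain average treats all domains equally, matching the scoring system above; a weighted average would be a natural extension when one dimension (e.g. compliance) is prioritized.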
Common Review Scenarios
Scenario 1: New Team Inheriting a Data Platform
/data-review "Inherited a Snowflake + dbt + Airflow stack. Need to understand health and risks."
Scenario 2: Pre-Migration Assessment
/data-review "Planning to migrate from on-prem Postgres to BigQuery. Audit current state."
Scenario 3: Performance Investigation
/data-review "Dashboard queries taking 2+ minutes. Focus on query optimization and indexing."
Scenario 4: Compliance Audit
/data-review "Need HIPAA compliance audit of our analytics platform. Check PII handling and access controls."
Scenario 5: Cost Optimization
/data-review "Warehouse costs doubled this quarter. Identify waste and optimization opportunities."
Agent Invocation Pattern
# Example internal workflow
use Task tool to spawn:
- data-engineer with discovery.md + REFERENCE.md → workspace/agents/data-engineer.md
- data-scientist with discovery.md + REFERENCE.md → workspace/agents/data-scientist.md
- performance-analyst with discovery.md + REFERENCE.md → workspace/agents/performance-analyst.md
- security-auditor with discovery.md + REFERENCE.md → workspace/agents/security-auditor.md
# Wait for all agents to complete
# Spawn debate session (optional)
use Task tool to spawn debate with all agent findings
# Synthesize final report
Read all outputs + debate transcript
Apply report template from REFERENCE.md
Generate data-platform-health-report.md
References
- REFERENCE.md – Audit checklists, scoring rubric, common issues and fixes
- agents/data-engineer.md – Pipeline reliability and orchestration focus
- agents/data-scientist.md – Analytics quality and reproducibility focus
- agents/performance-analyst.md – Query optimization and cost efficiency focus
- agents/security-auditor.md – Governance, compliance, and access control focus