quality-detect-regressions
npx skills add https://github.com/dawiddutoit/custom-claude --skill quality-detect-regressions
Skill Documentation
detect-quality-regressions
Purpose
Compare current quality metrics against a stored baseline to detect regressions in tests, coverage, type errors, linting, and dead code. Enforces quality standards by blocking task completion when metrics degrade beyond tolerance thresholds.
When to Use
MANDATORY invocation scenarios:
- After completing a task (before marking complete)
- Before creating a commit or pull request
- Before merging to main branch
- When user asks “is this done?” or “check quality”
User trigger phrases:
- “detect regressions”
- “check against baseline”
- “validate quality hasn’t degraded”
- “compare to baseline”
Quick Start
Basic usage after completing a task:
1. Complete your code changes
2. Run tests manually to verify they pass
3. Invoke this skill: "Detect regressions against baseline_feature_2025-10-16"
4. If PASS: Mark task complete
5. If FAIL: Fix regressions, re-run detection
This skill automatically:
- Loads baseline from memory
- Runs ./scripts/check_all.sh
- Compares 5 metrics (tests, coverage, type errors, linting, dead code)
- Detects regressions with tolerance rules
- Returns PASS/FAIL with actionable delta report
Table of Contents
Core Sections
- Instructions – Complete workflow for regression detection
- Step 1: Load Baseline from Memory – Retrieve and validate baseline metrics
- Step 2: Run Current Quality Checks – Execute check_all.sh and capture metrics
- Step 3: Compare Metrics (Regression Detection) – Apply comparison rules
- Step 4: Generate Delta Report – Create metric comparison report
- Step 5: Return Result – PASS/FAIL decision logic
- When to Invoke – Triggering conditions (after tasks, before commits, status checks)
- Examples – Real-world scenarios
- Example 1: No Regressions (PASS) – Successful validation scenario
- Example 2: Regression Detected (FAIL) – Handling quality degradation
- Edge Cases – Special situations (missing baseline, check failures, pre-existing issues)
Advanced Topics
- Integration Points – Coordination with other skills and agents
- Anti-Patterns to Avoid – Common mistakes and correct approaches
- Success Criteria – Validation checklist
- Supporting Files – References and examples
- Requirements – Environment, memory schema, tools
Instructions
Step 1: Load Baseline from Memory
Query memory for baseline:
Use mcp__memory__find_memories_by_name to retrieve the baseline:
baseline_names = ["baseline_<feature>_<date>"]
# Example: ["baseline_auth_2025-10-16"]
Validate baseline exists:
- If not found: Return ⚠️ WARNING - No baseline found; suggest capturing a baseline first
- If found: Parse baseline metrics
Extract baseline metrics:
Parse the baseline entity’s observations to extract:
- Tests: X passed, Y failed, Z skipped
- Coverage: X%
- Type errors: X errors
- Linting errors: X errors
- Dead code: X%
Example baseline observations:
- Tests: 145 passed, 0 failed, 3 skipped
- Coverage: 87%
- Type errors: 0
- Linting errors: 0
- Dead code: 1.2%
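For illustration, the observation parsing can be sketched in Python. This is a minimal sketch, assuming the observations follow the exact wording above; `parse_baseline` and the metric keys are hypothetical names, not part of the skill's actual implementation.

```python
import re

# Hypothetical helper: turn baseline observation strings (as stored by
# capture-quality-baseline) into a numeric metrics dict. The regexes
# assume the observation wording shown in the example above.
def parse_baseline(observations: list[str]) -> dict[str, float]:
    patterns = {
        "tests_passed": r"Tests: (\d+) passed",
        "coverage": r"Coverage: ([\d.]+)%",
        "type_errors": r"Type errors: (\d+)",
        "linting_errors": r"Linting errors: (\d+)",
        "dead_code": r"Dead code: ([\d.]+)%",
    }
    metrics: dict[str, float] = {}
    for obs in observations:
        for name, pattern in patterns.items():
            if match := re.match(pattern, obs):
                metrics[name] = float(match.group(1))
    return metrics
```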
Step 2: Run Current Quality Checks
Execute quality checks:
cd /Users/dawiddutoit/projects/play/project-watch-mcp
./scripts/check_all.sh
Capture output:
- Save stdout and stderr
- Parse same 5 metrics as baseline
- Handle script failures (return FAIL if checks can’t run)
Parse current metrics:
Extract from check_all.sh output:
- Tests: Look for “X passed” in pytest output
- Coverage: Look for “TOTAL” line with percentage
- Type errors: Count errors in pyright output
- Linting errors: Count violations in ruff output
- Dead code: Parse vulture output for percentage
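The capture-and-parse step might look like the following sketch. The regex patterns are assumptions based on typical pytest/coverage/pyright/ruff output; your check_all.sh format may differ, so treat them as starting points rather than the skill's actual parser.

```python
import re
import subprocess

# Minimal sketch: run the quality script, then extract the five metrics
# from its combined output. Returns None when the script itself fails
# (edge case 2: unable to validate -> caller returns FAIL).
def run_current_checks(project_root: str) -> dict[str, float] | None:
    result = subprocess.run(
        ["./scripts/check_all.sh"], cwd=project_root,
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None
    out = result.stdout + result.stderr
    patterns = {
        "tests_passed": r"(\d+) passed",                 # pytest summary
        "coverage": r"TOTAL.*?(\d+)%",                   # coverage TOTAL line
        "type_errors": r"(\d+) errors?, \d+ warnings?",  # pyright summary (assumed)
        "linting_errors": r"Found (\d+) errors?",        # ruff summary (assumed)
        "dead_code": r"([\d.]+)% dead code",             # vulture wrapper (assumed)
    }
    metrics: dict[str, float] = {}
    for name, pattern in patterns.items():
        if match := re.search(pattern, out):
            metrics[name] = float(match.group(1))
    return metrics
```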
Step 3: Compare Metrics (Regression Detection)
Apply comparison rules:
| Metric | Rule | Tolerance | Regression If |
|---|---|---|---|
| Tests passed | Must be >= baseline | None | current < baseline |
| Coverage | Must be >= baseline - 1% | 1% | current < baseline - 1% |
| Type errors | Must be <= baseline | None | current > baseline |
| Linting | Must be <= baseline | None | current > baseline |
| Dead code | Must be <= baseline + 2% | 2% | current > baseline + 2% |
For each metric:
- Calculate change: current - baseline
- Check if regression: Apply rule from table
- Mark status: improved, stable, or regressed
- Calculate severity: critical, high, medium, or low
Regression severity:
- Critical: Type errors increased (breaks type safety)
- High: Tests decreased or coverage dropped >2%
- Medium: Linting errors increased
- Low: Dead code increased slightly (within tolerance)
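The comparison rules table maps directly to code. Here is a sketch of that logic, assuming the metrics dicts produced in Steps 1 and 2; `compare` and the rule tuples are illustrative, not the skill's definitive implementation.

```python
# Encodes the rules table: coverage may drop up to 1 point and dead code
# may grow up to 2 points; every other metric must hold steady or improve.
def compare(baseline: dict[str, float],
            current: dict[str, float]) -> dict[str, dict]:
    rules = {
        # metric: (higher_is_better, tolerance)
        "tests_passed": (True, 0.0),
        "coverage": (True, 1.0),
        "type_errors": (False, 0.0),
        "linting_errors": (False, 0.0),
        "dead_code": (False, 2.0),
    }
    report: dict[str, dict] = {}
    for metric, (higher_is_better, tol) in rules.items():
        base, cur = baseline[metric], current[metric]
        if higher_is_better:
            regressed = cur < base - tol
            improved = cur > base
        else:
            regressed = cur > base + tol
            improved = cur < base
        status = "regressed" if regressed else "improved" if improved else "stable"
        report[metric] = {"baseline": base, "current": cur,
                          "change": cur - base, "status": status}
    return report
```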
Step 4: Generate Delta Report
Create comparison for each metric:
metric_name:
baseline: <value>
current: <value>
change: <+/- difference>
status: improved | stable | regressed
severity: critical | high | medium | low (if regressed)
Identify regressions:
Filter metrics where status == regressed and create regression list:
regressions:
- metric: tests
baseline: 152
current: 150
change: -2
severity: high
action: "Investigate test_user_service.py, test_auth_service.py"
Identify improvements:
Filter metrics where status == improved for positive feedback.
Step 5: Return Result
Decision logic:
IF any metric has status == regressed:
RETURN FAIL with regression list
ELSE:
RETURN PASS with improvements
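In Python, the Step 4 filtering plus the Step 5 decision might look like this sketch. The flat severity mapping is a simplification of the severity rules above (e.g. it does not distinguish coverage drops within vs. beyond 2%); `decide` is a hypothetical name.

```python
# Simplified severity mapping, per the severity list in Step 3 (assumed).
SEVERITY = {
    "type_errors": "critical",
    "tests_passed": "high",
    "coverage": "high",
    "linting_errors": "medium",
    "dead_code": "low",
}

# Filter regressed metrics from the delta report produced by compare(),
# attach severity, and return the PASS/FAIL verdict.
def decide(report: dict[str, dict]) -> tuple[str, list[dict]]:
    regressions = [
        {"metric": name, "severity": SEVERITY[name], **entry}
        for name, entry in report.items()
        if entry["status"] == "regressed"
    ]
    return ("FAIL" if regressions else "PASS", regressions)
```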
PASS result format:
✅ PASS - No regressions detected
Delta Report:
- Tests: +3 passed (148 total) 📈
- Coverage: +2% (89% total) 📈
- Type errors: No change (0) ✅
- Linting: No change (0) ✅
- Dead code: -0.1% (1.1% total) 📉
All metrics maintained or improved. Safe to mark task complete.
FAIL result format:
🔴 FAIL - 4 regressions detected
Regressions:
1. Tests: -2 passed (150 vs 152)
   → 2 tests removed or now failing
   → Action: Investigate test_user_service.py, test_auth_service.py
2. Coverage: -4% (85% vs 89%)
   → Coverage dropped below tolerance (88%)
   → Action: Add tests for newly refactored code
3. Type errors: +2 new errors (5 vs 3)
   → New type errors introduced
   → Action: Run pyright --verbose, fix errors
4. Linting: +2 new errors
   → New linting violations
   → Action: Run ruff check --fix, review changes
⛔ BLOCKED - Do not mark the task complete until regressions are fixed.
Fix order:
1. Fix linting (ruff check --fix)
2. Fix type errors (pyright)
3. Re-run tests (investigate failures)
4. Add coverage for new code
5. Re-run regression detection
When to Invoke
After completing a task (@implementer, @unit-tester, @integration-tester):
- Code changes complete
- Tests written and pass locally
- → Invoke detect-quality-regressions before marking task complete
- If PASS: Mark complete, move to next task
- If FAIL: Fix regressions, re-run detection
Before committing/merging:
- User requests commit
- → Invoke detect-quality-regressions to validate
- If PASS: Proceed with commit
- If FAIL: Block commit, report regressions
When checking status (@statuser):
- User asks “What’s the status?”
- → Invoke detect-quality-regressions to get current quality state
- Report status with delta
Examples
Example 1: No Regressions (PASS)
Context: Completed Task 2.1 (add user authentication)
Execution:
1. Load baseline: baseline_auth_2025-10-16
   → Tests: 145 passed, 0 failed
   → Coverage: 87%
   → Type errors: 0
   → Linting: 0
   → Dead code: 1.2%
2. Run checks: ./scripts/check_all.sh
   → Tests: 148 passed (+3), 0 failed
   → Coverage: 89% (+2%)
   → Type errors: 0 (no change)
   → Linting: 0 (no change)
   → Dead code: 1.1% (-0.1%)
3. Compare:
   ✅ Tests: 148 >= 145 (PASS)
   ✅ Coverage: 89% >= 86% (87% - 1%) (PASS)
   ✅ Type errors: 0 <= 0 (PASS)
   ✅ Linting: 0 <= 0 (PASS)
   ✅ Dead code: 1.1% <= 3.2% (1.2% + 2%) (PASS)
4. Result: ✅ PASS
Example 2: Regression Detected (FAIL)
Context: Completed Task 3.2 (refactor service layer)
Execution:
1. Load baseline: baseline_service_result_2025-10-16
   → Tests: 152 passed
   → Coverage: 89%
   → Type errors: 3
   → Linting: 0
2. Run checks:
   → Tests: 150 passed (-2) ❌
   → Coverage: 85% (-4%) ❌
   → Type errors: 5 (+2) ❌
   → Linting: 2 (+2) ❌
3. Compare:
   ❌ Tests: 150 < 152 (REGRESSION)
   ❌ Coverage: 85% < 88% (89% - 1%) (REGRESSION)
   ❌ Type errors: 5 > 3 (REGRESSION)
   ❌ Linting: 2 > 0 (REGRESSION)
4. Result: 🔴 FAIL - 4 regressions detected
See references/regression-fixes.md for detailed fix strategies
Edge Cases
1. Baseline Not Found
- Search memory, no baseline exists
- Return: ⚠️ WARNING - No baseline found
- Action: Suggest running capture-quality-baseline first
2. Quality Checks Fail to Run
- Script error, tool missing, etc.
- Return: 🔴 FAIL - Unable to validate quality
- Block: Don’t allow work to proceed without validation
3. Pre-Existing Issues in Baseline
- Baseline has 3 documented type errors
- Current also has 3 type errors (same errors)
- Result: ✅ PASS (no NEW errors)
- Note: “3 pre-existing errors maintained”
4. Tests Pass But Coverage Drops
- All tests pass (no failures)
- Coverage dropped 5% (regression)
- Result: 🔴 FAIL (coverage regression)
- Action: Add tests for uncovered code
Integration Points
With capture-quality-baseline skill:
- This skill loads the baseline that capture-quality-baseline created
- Use same baseline naming convention:
baseline_<feature>_<date>
With run-quality-gates skill:
- run-quality-gates ensures quality checks pass (Definition of Done)
- detect-quality-regressions compares against baseline (regression detection)
- Both skills complement each other
With @implementer:
- Primary integration – runs after every task
- Blocks task completion if regressions detected
With @unit-tester / @integration-tester:
- Runs after writing tests to validate quality didn’t degrade
With manage-todo skill:
- Task state depends on regression detection result
- Can’t mark complete with regressions
Anti-Patterns to Avoid
❌ DON'T: Skip regression detection (always run before marking a task complete)
❌ DON'T: Ignore regressions ("I'll fix later")
❌ DON'T: Commit without running detection
❌ DON'T: Mark a task complete with regressions
❌ DON'T: Use the wrong baseline (old or different feature)

✅ DO: Run detection after every task
✅ DO: Block on regressions (fail fast)
✅ DO: Fix regressions immediately
✅ DO: Use the correct baseline for the feature
✅ DO: Document pre-existing issues
Success Criteria
- ✅ Baseline loaded successfully
- ✅ Quality checks execute
- ✅ All 5 metrics compared correctly
- ✅ Regressions detected accurately
- ✅ Clear PASS/FAIL result
- ✅ Actionable delta report
Supporting Files
- references/comparison-rules.md – Detailed metric comparison rules and tolerances
- references/regression-fixes.md – Fix strategies for each metric type
Requirements
Environment:
- Project must be using ./scripts/check_all.sh
- Memory baseline must exist (captured via capture-quality-baseline skill)
- Quality tools must be installed: pyright, ruff, pytest, vulture
Memory Schema:
- Entity type: “quality_baseline”
- Entity name: "baseline_<feature>_<date>"
- Observations: List of metric values (see Step 1)
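For illustration, a stored baseline entity might look like the following. The field names assume the standard memory MCP server entity shape (name, entityType, observations); the exact schema depends on your memory server.

```yaml
name: baseline_auth_2025-10-16
entityType: quality_baseline
observations:
  - "Tests: 145 passed, 0 failed, 3 skipped"
  - "Coverage: 87%"
  - "Type errors: 0"
  - "Linting errors: 0"
  - "Dead code: 1.2%"
```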