sw:increment-quality-judge-v2
npx skills add https://github.com/anton-abyzov/specweave --skill sw:increment-quality-judge-v2
Agent 安装分布
Skill 文档
Increment Quality Judge v2.0
LLM-as-Judge Pattern Implementation
AI-powered quality assessment using the LLM-as-Judge pattern – an established AI/ML evaluation technique where an LLM evaluates outputs with chain-of-thought reasoning, BMAD-pattern risk scoring, and formal quality gate decisions (PASS/CONCERNS/FAIL).
LLM-as-Judge: What It Is
LLM-as-Judge (LaaJ) is a recognized pattern in AI/ML evaluation where a large language model assesses quality using structured reasoning.
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
â LLM-as-Judge Pattern â
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ¤
â Input: spec.md, plan.md, tasks.md â
â â
â Process: â
â âââââââââââââââââââââââââââââââââââââââââââââââââââââââ â
â â <thinking> â â
â â 1. Read and understand the specification â â
â â 2. Evaluate against 7 quality dimensions â â
â â 3. Identify risks (PÃI scoring) â â
â â 4. Form evidence-based verdict â â
â â </thinking> â â
â âââââââââââââââââââââââââââââââââââââââââââââââââââââââ â
â â
â Output: Structured verdict with: â
â ⢠Dimension scores (0-100) â
â ⢠Risk assessment (CRITICAL/HIGH/MEDIUM/LOW) â
â ⢠Quality gate decision (PASS/CONCERNS/FAIL) â
â ⢠Actionable recommendations â
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
Why LLM-as-Judge works:
- Consistency: Uniform evaluation criteria without human fatigue
- Reasoning: Chain-of-thought explains WHY something is an issue
- Scalability: Evaluates in seconds vs hours of manual review
- Industry standard: Used by OpenAI, Anthropic, Google for AI evals
References:
- “Judging LLM-as-a-Judge” (NeurIPS 2023)
- LMSYS Chatbot Arena evaluation methodology
- AlpacaEval, MT-Bench frameworks
IMPORTANT: This is a SKILL (Not an Agent)
DO NOT try to spawn this as an agent via Task tool.
This is a skill that auto-activates when you discuss quality assessment. To run quality assessment:
# Use the CLI command directly
specweave qa 0001 --pre
# Or use the slash command
/sw:qa 0001
The skill provides guidance and documentation. The CLI handles execution.
Why no agent? Having both a skill and agent with the same name (increment-quality-judge-v2) caused Claude to incorrectly construct agent type names. The skill-only approach eliminates this confusion.
What’s New in v2.0
- Risk Assessment Dimension – Probability à Impact scoring (0-10 scale, BMAD pattern)
- Quality Gate Decisions – Formal PASS/CONCERNS/FAIL with thresholds
- NFR Checking – Non-functional requirements (performance, security, scalability)
- Enhanced Output – Blockers, concerns, recommendations with actionable mitigations
- 7 Dimensions – Added “Risk” to the existing 6 dimensions
Purpose
Provide comprehensive quality assessment that goes beyond structural validation to evaluate:
- â Specification quality (6 dimensions)
- â Risk levels (BMAD PÃI scoring) – NEW!
- â Quality gate readiness (PASS/CONCERNS/FAIL) – NEW!
When to Use
Auto-activates for:
/qa {increment-id}command/qa {increment-id} --pre(pre-implementation check)/qa {increment-id} --gate(quality gate check)- Natural language: “assess quality of increment 0001”
Keywords:
- validate quality, quality check, assess spec
- evaluate increment, spec review, quality score
- risk assessment, qa check, quality gate
- PASS/CONCERNS/FAIL
Evaluation Dimensions (7 total, was 6)
dimensions:
clarity:
weight: 0.18 # was 0.20
criteria:
- "Is the problem statement clear?"
- "Are objectives well-defined?"
- "Is terminology consistent?"
testability:
weight: 0.22 # was 0.25
criteria:
- "Are acceptance criteria testable?"
- "Can success be measured objectively?"
- "Are edge cases identifiable?"
completeness:
weight: 0.18 # was 0.20
criteria:
- "Are all requirements addressed?"
- "Is error handling specified?"
- "Are non-functional requirements included?"
feasibility:
weight: 0.13 # was 0.15
criteria:
- "Is the architecture scalable?"
- "Are technical constraints realistic?"
- "Is timeline achievable?"
maintainability:
weight: 0.09 # was 0.10
criteria:
- "Is design modular?"
- "Are extension points identified?"
- "Is technical debt addressed?"
edge_cases:
weight: 0.09 # was 0.10
criteria:
- "Are failure scenarios covered?"
- "Are performance limits specified?"
- "Are security considerations included?"
# NEW: Risk Assessment (BMAD pattern)
risk:
weight: 0.11 # NEW!
criteria:
- "Are security risks identified and mitigated?"
- "Are technical risks (scalability, performance) addressed?"
- "Are implementation risks (complexity, dependencies) managed?"
- "Are operational risks (monitoring, support) considered?"
Risk Assessment (BMAD Pattern) – NEW!
Risk Scoring Formula
Risk Score = Probability à Impact
Probability (0.0-1.0):
- 0.0-0.3: Low (unlikely to occur)
- 0.4-0.6: Medium (may occur)
- 0.7-1.0: High (likely to occur)
Impact (1-10):
- 1-3: Minor (cosmetic, no user impact)
- 4-6: Moderate (some impact, workaround exists)
- 7-9: Major (significant impact, no workaround)
- 10: Critical (system failure, data loss, security breach)
Final Score (0.0-10.0):
- 9.0-10.0: CRITICAL risk (FAIL quality gate)
- 6.0-8.9: HIGH risk (CONCERNS quality gate)
- 3.0-5.9: MEDIUM risk (PASS with monitoring)
- 0.0-2.9: LOW risk (PASS)
Risk Categories
-
Security Risks
- OWASP Top 10 vulnerabilities
- Data exposure, authentication, authorization
- Cryptographic failures
-
Technical Risks
- Architecture complexity, scalability bottlenecks
- Performance issues, technical debt
-
Implementation Risks
- Tight timeline, external dependencies
- Technical complexity
-
Operational Risks
- Lack of monitoring, difficult to maintain
- Poor documentation
Risk Assessment Prompt
You are evaluating SOFTWARE RISKS for an increment using BMAD's Probability à Impact scoring.
Read increment files:
- .specweave/increments/{id}/spec.md
- .specweave/increments/{id}/plan.md
For EACH risk you identify:
1. **Calculate PROBABILITY** (0.0-1.0)
- Based on spec clarity, past experience, complexity
- Low: 0.2, Medium: 0.5, High: 0.8
2. **Calculate IMPACT** (1-10)
- 10 = Critical (security breach, data loss, system failure)
- 7-9 = Major (significant user impact, no workaround)
- 4-6 = Moderate (some impact, workaround exists)
- 1-3 = Minor (cosmetic, no user impact)
3. **Calculate RISK SCORE** = Probability à Impact
4. **Provide MITIGATION** strategy
5. **Link to ACCEPTANCE CRITERIA** (if applicable)
Output format (JSON):
{
"risks": [
{
"id": "RISK-001",
"category": "security",
"title": "Password storage not specified",
"description": "Spec doesn't mention password hashing algorithm",
"probability": 0.9,
"impact": 10,
"score": 9.0,
"severity": "CRITICAL",
"mitigation": "Use bcrypt or Argon2, never plain text",
"location": "spec.md, Authentication section",
"acceptance_criteria": "AC-US1-01"
}
],
"overall_risk_score": 7.5,
"dimension_score": 0.35
}
Quality Gate Decisions – NEW!
Decision Logic
enum QualityGateDecision {
PASS = "PASS", // Ready for production
CONCERNS = "CONCERNS", // Issues found, should address
FAIL = "FAIL" // Blockers, must fix
}
Thresholds (BMAD pattern):
FAIL if any:
- Risk score ⥠9.0 (CRITICAL)
- Test coverage < 60%
- Spec quality < 50
- Critical security vulnerabilities ⥠1
CONCERNS if any:
- Risk score 6.0-8.9 (HIGH)
- Test coverage < 80%
- Spec quality < 70
- High security vulnerabilities ⥠1
PASS otherwise
Output Example
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
QA ASSESSMENT: Increment 0008-user-authentication
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
Overall Score: 82/100 (GOOD) â
Dimension Scores:
Clarity: 90/100 ââ
Testability: 75/100 â ï¸
Completeness: 88/100 â
Feasibility: 85/100 â
Maintainability: 80/100 â
Edge Cases: 70/100 â ï¸
Risk Assessment: 65/100 â ï¸ (7.2/10 risk score)
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
RISKS IDENTIFIED (3)
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
ð´ RISK-001: CRITICAL (9.0/10)
Category: Security
Title: Password storage implementation
Description: Spec doesn't specify password hashing
Probability: 0.9 (High) Ã Impact: 10 (Critical)
Location: spec.md, Authentication section
Mitigation: Use bcrypt/Argon2, never plain text
AC: AC-US1-01
ð¡ RISK-002: HIGH (6.0/10)
Category: Security
Title: Rate limiting not specified
Description: No brute-force protection mentioned
Probability: 0.6 (Medium) Ã Impact: 10 (Critical)
Location: spec.md, Security section
Mitigation: Add 5 failed attempts â 15 min lockout
AC: AC-US1-03
ð¢ RISK-003: LOW (2.4/10)
Category: Technical
Title: Session storage scalability
Description: Plan uses in-memory sessions
Probability: 0.4 (Medium) Ã Impact: 6 (Moderate)
Location: plan.md, Architecture section
Mitigation: Use Redis for session store
Overall Risk Score: 7.2/10 (MEDIUM-HIGH)
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
QUALITY GATE DECISION
âââââââââââââââââââââââââââââââââââââââââââââââââââââââââ
ð¡ CONCERNS (Not Ready for Production)
Blockers (MUST FIX):
1. ð´ CRITICAL RISK: Password storage (Risk â¥9)
â Add task: "Implement bcrypt password hashing"
Concerns (SHOULD FIX):
2. ð¡ HIGH RISK: Rate limiting not specified (Risk â¥6)
â Update spec.md: Add rate limiting section
â Add E2E test for rate limiting
3. â ï¸ Testability: 75/100 (target: 80+)
â Make acceptance criteria more measurable
Recommendations (NICE TO FIX):
4. Edge cases: 70/100
â Add error handling scenarios
5. Session scalability
â Consider Redis for session store
Decision: Address 1 blocker before proceeding
Would you like to:
[E] Export blockers to tasks.md
[U] Update spec.md with fixes (experimental)
[C] Continue without changes
Workflow Integration
Quick Mode (Default)
User: /sw:qa 0001
Step 1: Rule-based validation (120 checks) - FREE, FAST
âââ If FAILED â Stop, show errors
âââ If PASSED â Continue
Step 2: AI Quality Assessment (Quick)
âââ Spec quality (6 dimensions)
âââ Risk assessment (BMAD PÃI)
âââ Quality gate decision (PASS/CONCERNS/FAIL)
Output: Enhanced report with risks and gate decision
Pre-Implementation Mode
User: /sw:qa 0001 --pre
Checks:
â
Spec quality (clarity, testability, completeness)
â
Risk assessment (identify issues early)
â
Architecture review (plan.md soundness)
â
Test strategy (test plan in tasks.md)
Gate decision before implementation starts
Quality Gate Mode
User: /sw:qa 0001 --gate
Comprehensive checks:
â
All pre-implementation checks
â
Test coverage (AC-ID coverage, gaps)
â
E2E test coverage
â
Documentation completeness
Final gate decision before closing increment
Enhanced Scoring Algorithm
Step 1: Dimension Evaluation (7 dimensions)
For each dimension (including NEW risk dimension), use Chain-of-Thought prompting:
<thinking>
1. Read spec.md thoroughly
2. For risk dimension specifically:
- Identify all risks (security, technical, implementation, operational)
- For each risk: calculate P, I, Score
- Group by category
- Calculate overall risk score
3. For other dimensions: evaluate criteria as before
4. Score 0.00-1.00
5. Identify issues
6. Provide suggestions
</thinking>
Score: 0.XX
Step 2: Weighted Overall Score (NEW weights)
overall_score =
(clarity * 0.18) +
(testability * 0.22) +
(completeness * 0.18) +
(feasibility * 0.13) +
(maintainability * 0.09) +
(edge_cases * 0.09) +
(risk * 0.11) // NEW!
Step 3: Quality Gate Decision
gate_decision = decide({
spec_quality: overall_score,
risk_score: risk_assessment.overall_risk_score,
test_coverage: test_coverage.percentage, // if available
security_audit: security_audit // if available
})
Token Usage
Estimated per increment (Quick mode):
- Small spec (<100 lines):
2,500 tokens ($0.025) - Medium spec (100-250 lines):
3,500 tokens ($0.035) - Large spec (>250 lines):
5,000 tokens ($0.050)
Cost increase from v1.0: +25% (added risk assessment dimension)
Optimization:
- Only evaluate spec.md + plan.md for risks
- Cache risk patterns for 5 min
- Skip risk assessment if spec < 50 lines (too small to assess)
Configuration
{
"qa": {
"qualityGateThresholds": {
"fail": {
"riskScore": 9.0,
"testCoverage": 60,
"specQuality": 50,
"criticalVulnerabilities": 1
},
"concerns": {
"riskScore": 6.0,
"testCoverage": 80,
"specQuality": 70,
"highVulnerabilities": 1
}
},
"dimensions": {
"risk": {
"enabled": true,
"weight": 0.11
}
}
}
}
Migration from v1.0
v1.0 (6 dimensions):
- Clarity, Testability, Completeness, Feasibility, Maintainability, Edge Cases
v2.0 (7 dimensions, NEW: Risk):
- All v1.0 dimensions + Risk Assessment
- Weights adjusted to accommodate new dimension
- Quality gate decisions added
- BMAD risk scoring added
Backward Compatibility:
- v1.0 skills still work (auto-upgrade to v2.0 if risk assessment enabled)
- Existing scores rescaled to new weights automatically
- Can disable risk assessment in config to revert to v1.0 behavior
Best Practices
- Run early and often: Use
--premode before implementation - Fix blockers immediately: Don’t proceed if FAIL
- Address concerns before release: CONCERNS = should fix
- Use risk scores to prioritize: Fix CRITICAL risks first
- Export to tasks.md: Convert blockers/concerns to actionable tasks
Limitations
What quality-judge v2.0 CAN’T do:
- â Understand domain-specific compliance (HIPAA, PCI-DSS)
- â Verify technical feasibility with actual codebase
- â Replace human expertise and security audits
- â Predict actual probability without historical data
What quality-judge v2.0 CAN do:
- â Catch vague or ambiguous language
- â Identify missing security considerations (OWASP-based)
- â Spot untestable acceptance criteria
- â Suggest industry best practices
- â Flag missing edge cases
- â Assess risks systematically (BMAD pattern) – NEW!
- â Provide formal quality gate decisions – NEW!
Summary
increment-quality-judge v2.0 adds comprehensive risk assessment and quality gate decisions:
â Risk assessment (BMAD PÃI scoring, 0-10 scale) â Quality gate decisions (PASS/CONCERNS/FAIL with thresholds) â 7 dimensions (added “Risk” to existing 6) â NFR checking (performance, security, scalability) â Enhanced output (blockers, concerns, recommendations) â Chain-of-thought (LLM-as-Judge 2025 best practices) â Backward compatible (can disable risk assessment)
Use it when: You want comprehensive quality assessment with risk scoring and formal gate decisions before implementation or release.
Skip it when: Quick iteration, tight token budget, or simple features where rule-based validation suffices.
Version: 2.0.0 Related: /sw:qa command, QAOrchestrator agent
Project-Specific Learnings
Before starting work, check for project-specific learnings:
# Check if skill memory exists for this skill
cat .specweave/skill-memories/increment-quality-judge-v2.md 2>/dev/null || echo "No project learnings yet"
Project learnings are automatically captured by the reflection system when corrections or patterns are identified during development. These learnings help you understand project-specific conventions and past decisions.