Install command:
```bash
npx skills add https://github.com/akaszubski/autonomous-dev --skill 'Quality Scoring'
```
Quality Scoring
Multi-dimensional assessment for training data quality.
When It Activates
Quality assessment, data scoring, multi-dimensional evaluation, IFD scoring, factuality checks, reasoning validation, training data prep
Core Concepts
Quality Scorers (6 Types)
Scoring approaches, from fastest to most comprehensive (a two-pass combination is sketched after this list):
- FastIFD – Instruction-following difficulty (10-20x faster than LLM-based scoring)
- Quality – LLM-based quality (Qwen3-30B, 0.85 ex/s)
- MultiDimensional – 5-dimension composite
- LLMQuality – Multi-backend (MLX/OpenRouter)
- Ensemble – Cross-model ensemble
- Tulu3 – Multi-dimensional reference (training_metrics.py)
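The speed spread suggests a two-pass pipeline: let FastIFD prune the bulk of the data, then spend LLM-based scoring (~0.85 ex/s) only on survivors. A minimal sketch, assuming the training_metrics functions listed under Library Integration; the exact signature of score_quality() is an assumption.

```python
from training_metrics import calculate_ifd_score, score_quality

def two_pass_score(examples, ifd_floor=0.3, quality_floor=8.0):
    """Cheap IFD pre-filter, then expensive LLM scoring on survivors."""
    survivors = [
        ex for ex in examples
        if calculate_ifd_score(
            instruction=ex["instruction"],
            response=ex["response"],
        ) >= ifd_floor
    ]
    # The slow LLM-based pass (~0.85 ex/s) sees only the reduced set
    return [
        ex for ex in survivors
        if score_quality(ex["instruction"], ex["response"]) >= quality_floor
    ]
```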
Quality Dimensions (6 Metrics)
- IFD Score (0.0-1.0) – Instruction-following difficulty
- Factuality (0.0-1.0) – Hallucination detection
- Reasoning (0.0-1.0) – Step-by-step logic quality
- Diversity (0.0-1.0) – Dataset-level diversity
- Domain (0.0-1.0) – Domain-specific relevance
- LLM Quality (1-10) – Tulu3 comprehensive score
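When a single number is needed (as in the MultiDimensional scorer), the six dimensions above have to be folded together. A minimal sketch with equal weights and a linear rescale of the 1-10 LLM score; both choices are illustrative assumptions, and quality-dimensions.md defines the real semantics.

```python
from dataclasses import dataclass

@dataclass
class DimensionScores:
    ifd: float          # 0.0-1.0, instruction-following difficulty
    factuality: float   # 0.0-1.0, hallucination check
    reasoning: float    # 0.0-1.0, step-by-step logic
    diversity: float    # 0.0-1.0, dataset-level
    domain: float       # 0.0-1.0, domain relevance
    llm_quality: float  # 1-10, Tulu3-style comprehensive score

def composite(d: DimensionScores) -> float:
    """Equal-weight average on a common 0.0-1.0 scale."""
    llm_unit = (d.llm_quality - 1.0) / 9.0  # map 1-10 onto 0-1
    parts = [d.ifd, d.factuality, d.reasoning, d.diversity, d.domain, llm_unit]
    return sum(parts) / len(parts)
```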
Training Thresholds
| Type | Quality | IFD | Use Case |
|---|---|---|---|
| SFT | ≥8.0 | ≥0.3 | Base training |
| DPO chosen | ≥9.0 | ≥0.5 | High quality only |
| DPO rejected | ≤6.0 | any | Low quality |
| RLVR | ≥9.0 | ≥0.5 | Verified solutions |
| Calibration | ≥8.0 | ≥0.4 | Uncertainty examples |
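Encoded as data, the table becomes a reusable gate. A sketch; the quality/ifd field names on scored records are assumptions. DPO rejected is the one inverse case (quality ≤6.0, any IFD) and is handled by the dedicated DPO validation below.

```python
# Lower-bound floors from the table above
THRESHOLDS = {
    "sft":         {"quality": 8.0, "ifd": 0.3},
    "dpo_chosen":  {"quality": 9.0, "ifd": 0.5},
    "rlvr":        {"quality": 9.0, "ifd": 0.5},
    "calibration": {"quality": 8.0, "ifd": 0.4},
}

def passes(example: dict, training_type: str) -> bool:
    """True if a scored example meets both floors for its target use."""
    floors = THRESHOLDS[training_type]
    return (example["quality"] >= floors["quality"]
            and example["ifd"] >= floors["ifd"])
```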
Quick Reference
| Concept | Details | Reference |
|---|---|---|
| Scorers | 6 types (FastIFD to Ensemble) | quality-scorers.md |
| Dimensions | 6 metrics (IFD to LLM Quality) | quality-dimensions.md |
| Thresholds | By training type (SFT, DPO, RLVR) | training-thresholds.md |
| Library | Integration functions | training_metrics.py |
IFD Score Calculation
```python
from training_metrics import calculate_ifd_score

# IFD = PPL(response|instruction) / PPL(response)
ifd_score = calculate_ifd_score(
    instruction="Explain quantum computing",
    response="Quantum computing uses qubits...",
)
# Higher score = more challenging (the instruction helps less)
```
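To make the ratio concrete, here is a worked toy computation. The perplexity helper and the log-prob values are purely illustrative, not part of training_metrics; in practice both perplexities come from the same language model.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """PPL = exp(-mean token log-probability)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

ppl_conditioned = perplexity([-0.9, -0.7, -1.1])    # response given instruction, ~2.46
ppl_unconditioned = perplexity([-2.1, -1.8, -2.4])  # response alone, ~8.17

ifd = ppl_conditioned / ppl_unconditioned
# ~0.30: the instruction helps a lot, so this example sits right at
# the SFT floor of 0.3
```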
DPO Pair Validation
```python
from training_metrics import validate_dpo_pairs

# Validate the chosen/rejected quality gap
is_valid = validate_dpo_pairs(
    chosen_score=9.2,   # high quality
    rejected_score=5.8  # low quality
)
# Ensures quality gap ≥0.15
```
REQUIRED: DPO Multi-Dimensional Scoring
Every DPO pair MUST have multi-dimensional quality scores before training.
This is a hard requirement: DPO training on unscored pairs learns shortcuts (e.g., “longer = better”) instead of genuine preference signal.
Required output fields per pair (a minimal margin filter follows the list):
- chosen_score (float): Composite quality score for the chosen response
- rejected_score (float): Composite quality score for the rejected response
- margin (float): chosen_score - rejected_score (must be ≥3.0)
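A sketch of that filter over scored pairs; the JSONL layout is an assumption, but the three field names follow the required outputs above.

```python
import json
from pathlib import Path

def filter_by_margin(path: Path, min_margin: float = 3.0):
    """Yield only pairs whose chosen/rejected gap meets the margin."""
    for line in path.read_text().splitlines():
        pair = json.loads(line)
        pair["margin"] = pair["chosen_score"] - pair["rejected_score"]
        if pair["margin"] >= min_margin:
            yield pair

kept = list(filter_by_margin(Path("dpo_pairs.jsonl")))
```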
Length bias audit (MUST run before DPO training):
```python
from pathlib import Path
from training_metrics import validate_dpo_pairs

metrics = validate_dpo_pairs(dpo_path=Path("dpo_pairs.jsonl"))

# Check length bias
longer_chosen = sum(1 for p in metrics.pairs if len(p.chosen) > len(p.rejected))
length_bias = longer_chosen / metrics.total_pairs
if length_bias > 0.70:
    raise ValueError(
        f"DPO length bias {length_bias:.0%} > 70% threshold.\n"
        f"Model will learn 'longer = better' shortcut.\n"
        f"Fix: Score by quality dimensions, not length."
    )

# Check quality scores are present
missing = sum(1 for p in metrics.pairs if p.chosen_score is None)
if missing > 0:
    raise ValueError(f"{missing} pairs missing quality scores; run scoring first")
```
Scoring workflow:
- Generate DPO pairs (dpo-rlvr-generation skill)
- Score all pairs with multi-dimensional scorer (this skill)
- Filter by quality margin ≥3.0
- Audit length bias ≤70%
- Only then proceed to training
RLVR Verifiability
```python
from training_metrics import assess_rlvr_verifiability

# Assess reasoning trace verifiability
verifiable = assess_rlvr_verifiability(
    reasoning_trace="Step 1: ...\nStep 2: ...",
    domain="math"
)
# Math/coding: 90%+ verifiable required
```
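The per-trace call implies a dataset-level gate before training. A sketch, assuming assess_rlvr_verifiability returns a truthy verdict and that each JSONL record carries a reasoning_trace field.

```python
import json
from pathlib import Path
from training_metrics import assess_rlvr_verifiability

def verifiable_fraction(path: Path, domain: str) -> float:
    """Fraction of traces the assessor marks verifiable."""
    traces = [json.loads(line) for line in path.read_text().splitlines()]
    verified = sum(
        assess_rlvr_verifiability(reasoning_trace=t["reasoning_trace"], domain=domain)
        for t in traces
    )
    return verified / len(traces)

fraction = verifiable_fraction(Path("rlvr_traces.jsonl"), domain="math")
assert fraction >= 0.9, "math/coding RLVR data requires 90%+ verifiable traces"
```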
Progressive Disclosure
Detailed guides: See docs/*.md
- docs/quality-scorers.md – 6 scorer implementations
- docs/quality-dimensions.md – 6 dimension definitions
- docs/training-thresholds.md – Thresholds, CLI, distributed performance
Security Considerations
Input Validation (CWE-20)
- Validate score ranges (0.0-1.0 or 1-10)
- Sanitize data inputs before scoring
- Check threshold values before application
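A minimal guard for the first bullet, assuming callers declare which scale a score lives on (the two ranges mirror the Quality Dimensions list):

```python
def validate_score(value: float, scale: str = "unit") -> float:
    """Reject scores outside their declared range (CWE-20)."""
    # "unit" covers the 0.0-1.0 dimensions; anything else uses the 1-10 scale
    low, high = (0.0, 1.0) if scale == "unit" else (1.0, 10.0)
    if not low <= value <= high:
        raise ValueError(f"score {value} outside [{low}, {high}]")
    return value
```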
Path Traversal (CWE-22)
- Sanitize file paths for data loading
- Whitelist directories for training data
- Validate output paths for scored datasets
Security Patterns (training_metrics.py)
```python
import json
from pathlib import Path

ALLOWED_DIR = Path("/allowed/data").resolve()

def safe_load_data(data_path: str) -> dict:
    """Load data with path validation."""
    # Resolve symlinks and '..' components, then require the result to
    # live under the allowed directory; is_relative_to avoids the
    # string-prefix trap where '/allowed/data-evil' would pass startswith
    path = Path(data_path).resolve()
    if not path.is_relative_to(ALLOWED_DIR):
        raise ValueError(f"Path outside allowed directory: {path}")
    # Load safely
    return json.loads(path.read_text())
```
Distributed Performance
Single Machine Performance
- M4 Max: ~0.85 ex/s (Qwen3-30B)
- M3 Ultra: ~0.85 ex/s (Qwen3-30B)
Parallel Processing
- Combined throughput: ~1.7 ex/s (50/50 split)
- Scaling: Linear with machine count
- Bottleneck: Model inference, not I/O
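Because inference dominates, a naive shard split is enough to realize the linear scaling; each machine then runs the scorer on its own shard. A sketch, with the shard naming scheme as an assumption.

```python
from pathlib import Path

def split_for_machines(src: Path, n_machines: int = 2) -> list[Path]:
    """Round-robin a JSONL dataset into one balanced shard per machine."""
    lines = src.read_text().splitlines()
    shards = []
    for i in range(n_machines):
        shard = src.with_name(f"{src.stem}_shard{i}.jsonl")
        shard.write_text("\n".join(lines[i::n_machines]))
        shards.append(shard)
    return shards
```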
CLI Commands
```bash
# Score dataset with FastIFD
python -m training_metrics score \
  --input data/train.jsonl \
  --output data/scored.jsonl \
  --scorer fastifd \
  --threshold 0.3

# Multi-dimensional scoring
python -m training_metrics score \
  --input data/train.jsonl \
  --output data/scored.jsonl \
  --scorer multidim \
  --quality-threshold 8.0 \
  --ifd-threshold 0.5

# DPO pair filtering
python -m training_metrics filter_dpo \
  --input data/dpo_pairs.jsonl \
  --output data/filtered_pairs.jsonl \
  --chosen-threshold 9.0 \
  --rejected-threshold 6.0

# RLVR verifiability check
python -m training_metrics assess_rlvr \
  --input data/rlvr_traces.jsonl \
  --output data/verified.jsonl \
  --domain math \
  --threshold 0.9
```
Related Skills
- data-distillation – IFD methodology and KenLM filtering
- preference-data-quality – DPO and RLVR metrics
- python-standards – Code quality standards
Library Integration
Primary library: training_metrics.py
Key functions:
- calculate_ifd_score() – IFD calculation
- validate_dpo_pairs() – DPO pair validation
- assess_rlvr_verifiability() – RLVR assessment
- score_quality() – Multi-dimensional scoring
- ensemble_score() – Cross-model ensemble
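ensemble_score() is the one function not demonstrated earlier. The keyword signature below is an assumption modeled on calculate_ifd_score; see docs/quality-scorers.md for the actual interface.

```python
from training_metrics import ensemble_score

# Cross-model ensemble: aggregates quality judgments across backends
score = ensemble_score(
    instruction="Explain quantum computing",
    response="Quantum computing uses qubits...",
)
```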
Key Takeaways
- 6 scorers – FastIFD (fast) to Ensemble (comprehensive)
- 6 dimensions – IFD, Factuality, Reasoning, Diversity, Domain, LLM Quality
- Training thresholds – SFT ≥8.0, DPO chosen ≥9.0, RLVR ≥9.0
- IFD score – PPL(response|instruction) / PPL(response), higher = harder
- Security – CWE-20 (input validation), CWE-22 (path traversal)
- Distributed – ~1.7 ex/s with 2 machines (linear scaling)
- CLI commands – training_metrics module for all operations
- Integration – Use training_metrics library functions
- DPO pairs – Chosen ≥9.0, Rejected ≤6.0, gap ≥0.15
- RLVR – Math/coding 90%+ verifiable, general 80%+
- DPO scoring REQUIRED – Every pair must have chosen_score, rejected_score, margin before training
- Length bias audit – ≤70% of pairs where chosen is longer (prevents “longer = better” shortcut)