validation-scripts
npx skills add https://github.com/vanman2024/ai-dev-marketplace --skill validation-scripts
Agent 安装分布
Skill 文档
ML Training Validation Scripts
Purpose: Production-ready validation and testing utilities for ML training workflows. Ensures data quality, model integrity, pipeline correctness, and dependency availability before and during training.
Activation Triggers:
- Validating training datasets before fine-tuning
- Checking model checkpoints and outputs
- Testing end-to-end training pipelines
- Verifying system dependencies and GPU availability
- Debugging training failures or data issues
- Ensuring data format compliance
- Validating model configurations
- Testing inference pipelines
Key Resources:
scripts/validate-data.sh– Comprehensive dataset validationscripts/validate-model.sh– Model checkpoint and config validationscripts/test-pipeline.sh– End-to-end pipeline testingscripts/check-dependencies.sh– System and package dependency checkingtemplates/test-config.yaml– Pipeline test configuration templatetemplates/validation-schema.json– Data validation schema templateexamples/data-validation-example.md– Dataset validation workflowexamples/pipeline-testing-example.md– Complete pipeline testing guide
Quick Start
Validate Training Data
bash scripts/validate-data.sh \
--data-path ./data/train.jsonl \
--format jsonl \
--schema templates/validation-schema.json
Validate Model Checkpoint
bash scripts/validate-model.sh \
--model-path ./checkpoints/epoch-3 \
--framework pytorch \
--check-weights
Test Complete Pipeline
bash scripts/test-pipeline.sh \
--config templates/test-config.yaml \
--data ./data/sample.jsonl \
--verbose
Check System Dependencies
bash scripts/check-dependencies.sh \
--framework pytorch \
--gpu-required \
--min-vram 16
Validation Scripts
1. Data Validation (validate-data.sh)
Purpose: Validate training datasets for format compliance, data quality, and schema conformance.
Usage:
bash scripts/validate-data.sh [OPTIONS]
Options:
--data-path PATH– Path to dataset file or directory (required)--format FORMAT– Data format: jsonl, csv, parquet, arrow (default: jsonl)--schema PATH– Path to validation schema JSON file--sample-size N– Number of samples to validate (default: all)--check-duplicates– Check for duplicate entries--check-null– Check for null/missing values--check-length– Validate text length constraints--check-tokens– Validate tokenization compatibility--tokenizer MODEL– Tokenizer to use for token validation (default: gpt2)--max-length N– Maximum sequence length (default: 2048)--output REPORT– Output validation report path (default: validation-report.json)
Validation Checks:
- Format Compliance: Verify file format and structure
- Schema Validation: Check against JSON schema if provided
- Data Types: Validate field types (string, int, float, etc.)
- Required Fields: Ensure all required fields are present
- Value Ranges: Check numeric values within expected ranges
- Text Quality: Detect empty strings, excessive whitespace, encoding issues
- Duplicates: Identify duplicate entries (exact or near-duplicate)
- Token Counts: Verify sequences fit within model context length
- Label Distribution: Check class balance for classification tasks
- Missing Values: Detect and report null/missing values
Example Output:
{
"status": "PASS",
"total_samples": 10000,
"valid_samples": 9987,
"invalid_samples": 13,
"validation_errors": [
{
"sample_id": 42,
"field": "text",
"error": "Exceeds max token length: 2150 > 2048"
},
{
"sample_id": 156,
"field": "label",
"error": "Invalid label value: 'unknwn' (typo)"
}
],
"statistics": {
"avg_text_length": 487,
"avg_token_count": 128,
"label_distribution": {
"positive": 4892,
"negative": 4895,
"neutral": 213
}
},
"recommendations": [
"Remove or fix 13 invalid samples before training",
"Label distribution is imbalanced - consider class weighting"
]
}
Exit Codes:
0– All validations passed1– Validation errors found2– Script error (invalid arguments, file not found)
2. Model Validation (validate-model.sh)
Purpose: Validate model checkpoints, configurations, and inference readiness.
Usage:
bash scripts/validate-model.sh [OPTIONS]
Options:
--model-path PATH– Path to model checkpoint directory (required)--framework FRAMEWORK– Framework: pytorch, tensorflow, jax (default: pytorch)--check-weights– Verify model weights are loadable--check-config– Validate model configuration file--check-tokenizer– Verify tokenizer files are present--check-inference– Test inference with sample input--sample-input TEXT– Sample text for inference test--expected-output TEXT– Expected output for verification--output REPORT– Output validation report path
Validation Checks:
- File Structure: Verify required files are present
- Config Validation: Check model_config.json for correctness
- Weight Integrity: Load and verify model weights
- Tokenizer Files: Ensure tokenizer files exist and are loadable
- Model Architecture: Validate architecture matches config
- Memory Requirements: Estimate GPU/CPU memory needed
- Inference Test: Run sample inference if requested
- Quantization: Verify quantized models load correctly
- LoRA/PEFT: Validate adapter weights and configuration
Example Output:
{
"status": "PASS",
"model_path": "./checkpoints/llama-7b-finetuned",
"framework": "pytorch",
"checks": {
"file_structure": "PASS",
"config": "PASS",
"weights": "PASS",
"tokenizer": "PASS",
"inference": "PASS"
},
"model_info": {
"architecture": "LlamaForCausalLM",
"parameters": "7.2B",
"precision": "float16",
"lora_enabled": true,
"lora_rank": 16
},
"memory_estimate": {
"model_size_gb": 13.5,
"inference_vram_gb": 16.2,
"training_vram_gb": 24.8
},
"inference_test": {
"input": "Hello, world!",
"output": "Hello, world! How can I help you today?",
"latency_ms": 142
}
}
3. Pipeline Testing (test-pipeline.sh)
Purpose: Test end-to-end ML training and inference pipelines with sample data.
Usage:
bash scripts/test-pipeline.sh [OPTIONS]
Options:
--config PATH– Pipeline test configuration file (required)--data PATH– Sample data for testing--steps STEPS– Comma-separated steps to test (default: all)--verbose– Enable detailed output--output REPORT– Output test report path--fail-fast– Stop on first failure--cleanup– Clean up temporary files after test
Pipeline Steps:
- data_loading: Test data loading and preprocessing
- tokenization: Verify tokenization pipeline
- model_loading: Load model and verify initialization
- training_step: Run single training step
- validation_step: Run single validation step
- checkpoint_save: Test checkpoint saving
- checkpoint_load: Test checkpoint loading
- inference: Test inference pipeline
- metrics: Verify metrics calculation
Test Configuration (test-config.yaml):
pipeline:
name: llama-7b-finetuning
framework: pytorch
data:
train_path: ./data/sample-train.jsonl
val_path: ./data/sample-val.jsonl
format: jsonl
model:
base_model: meta-llama/Llama-2-7b-hf
checkpoint_path: ./checkpoints/test
load_in_8bit: false
lora:
enabled: true
r: 16
alpha: 32
training:
batch_size: 1
gradient_accumulation: 1
learning_rate: 2e-4
max_steps: 5
testing:
sample_size: 10
timeout_seconds: 300
expected_loss_range: [0.5, 3.0]
Example Output:
Pipeline Test Report
====================
Pipeline: llama-7b-finetuning
Started: 2025-11-01 12:34:56
Duration: 127 seconds
Test Results:
â data_loading (2.3s) - Loaded 10 samples successfully
â tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens
â model_loading (8.7s) - Model loaded, 7.2B parameters
â training_step (15.4s) - Training step completed, loss: 1.847
â validation_step (12.1s) - Validation step completed, loss: 1.923
â checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test
â checkpoint_load (6.8s) - Checkpoint loaded successfully
â inference (2.9s) - Inference completed, latency: 142ms
â metrics (0.4s) - Metrics calculated correctly
Overall: PASS (8/8 tests passed)
Performance Metrics:
- Total time: 127s
- GPU memory used: 15.2 GB
- CPU memory used: 8.4 GB
Recommendations:
- Pipeline is ready for full training
- Consider increasing batch_size to improve throughput
4. Dependency Checking (check-dependencies.sh)
Purpose: Verify all required system dependencies, packages, and GPU availability.
Usage:
bash scripts/check-dependencies.sh [OPTIONS]
Options:
--framework FRAMEWORK– ML framework: pytorch, tensorflow, jax (default: pytorch)--gpu-required– Require GPU availability--min-vram GB– Minimum GPU VRAM required (GB)--cuda-version VERSION– Required CUDA version--packages FILE– Path to requirements.txt or packages list--platform PLATFORM– Platform: modal, lambda, runpod, local (default: local)--fix– Attempt to install missing packages--output REPORT– Output dependency report path
Dependency Checks:
- Python Version: Verify compatible Python version
- ML Framework: Check PyTorch/TensorFlow/JAX installation
- CUDA/cuDNN: Verify CUDA toolkit and cuDNN
- GPU Availability: Detect and validate GPUs
- VRAM: Check GPU memory capacity
- System Packages: Verify system-level dependencies
- Python Packages: Check required pip packages
- Platform Tools: Validate platform-specific tools (Modal, Lambda CLI)
- Storage: Check available disk space
- Network: Test internet connectivity for model downloads
Example Output:
{
"status": "PASS",
"platform": "modal",
"checks": {
"python": {
"status": "PASS",
"version": "3.10.12",
"required": ">=3.9"
},
"pytorch": {
"status": "PASS",
"version": "2.1.0",
"cuda_available": true,
"cuda_version": "12.1"
},
"gpu": {
"status": "PASS",
"count": 1,
"type": "NVIDIA A100",
"vram_gb": 40,
"driver_version": "535.129.03"
},
"packages": {
"status": "PASS",
"installed": 42,
"missing": 0,
"outdated": 3
},
"storage": {
"status": "PASS",
"available_gb": 128,
"required_gb": 50
}
},
"recommendations": [
"Update transformers to latest version (4.36.0 -> 4.37.2)",
"Consider upgrading to PyTorch 2.2.0 for better performance"
]
}
Templates
Validation Schema Template (templates/validation-schema.json)
Defines expected data structure and validation rules for training datasets.
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"required": ["text", "label"],
"properties": {
"text": {
"type": "string",
"minLength": 10,
"maxLength": 5000,
"description": "Input text for training"
},
"label": {
"type": "string",
"enum": ["positive", "negative", "neutral"],
"description": "Classification label"
},
"metadata": {
"type": "object",
"properties": {
"source": {
"type": "string"
},
"timestamp": {
"type": "string",
"format": "date-time"
}
}
}
},
"validation_rules": {
"max_token_length": 2048,
"tokenizer": "meta-llama/Llama-2-7b-hf",
"check_duplicates": true,
"min_label_count": 100,
"max_label_imbalance_ratio": 10.0
}
}
Customization:
- Modify
requiredfields for your use case - Add custom properties and validation rules
- Set appropriate length constraints
- Define label enums for classification
- Configure tokenizer and max length
Test Configuration Template (templates/test-config.yaml)
Defines pipeline testing parameters and expected behaviors.
pipeline:
name: test-pipeline
framework: pytorch
platform: modal # modal, lambda, runpod, local
data:
train_path: ./data/sample-train.jsonl
val_path: ./data/sample-val.jsonl
format: jsonl # jsonl, csv, parquet
sample_size: 10 # Number of samples to use for testing
model:
base_model: meta-llama/Llama-2-7b-hf
checkpoint_path: ./checkpoints/test
quantization: null # null, 8bit, 4bit
lora:
enabled: true
r: 16
alpha: 32
dropout: 0.05
target_modules: ["q_proj", "v_proj"]
training:
batch_size: 1
gradient_accumulation: 1
learning_rate: 2e-4
max_steps: 5
warmup_steps: 0
eval_steps: 2
testing:
sample_size: 10
timeout_seconds: 300
expected_loss_range: [0.5, 3.0]
fail_fast: true
cleanup: true
validation:
metrics: ["loss", "perplexity"]
check_gradients: true
check_memory_leak: true
gpu:
required: true
min_vram_gb: 16
allow_cpu_fallback: false
Integration with ML Training Workflow
Pre-Training Validation
# 1. Check system dependencies
bash scripts/check-dependencies.sh \
--framework pytorch \
--gpu-required \
--min-vram 16
# 2. Validate training data
bash scripts/validate-data.sh \
--data-path ./data/train.jsonl \
--schema templates/validation-schema.json \
--check-duplicates \
--check-tokens
# 3. Test pipeline with sample data
bash scripts/test-pipeline.sh \
--config templates/test-config.yaml \
--verbose
Post-Training Validation
# 1. Validate model checkpoint
bash scripts/validate-model.sh \
--model-path ./checkpoints/final \
--framework pytorch \
--check-weights \
--check-inference
# 2. Test inference pipeline
bash scripts/test-pipeline.sh \
--config templates/test-config.yaml \
--steps inference,metrics
CI/CD Integration
# .github/workflows/validate-training.yml
- name: Validate Training Data
run: |
bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
--data-path ./data/train.jsonl \
--schema ./validation-schema.json
- name: Test Training Pipeline
run: |
bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
--config ./test-config.yaml \
--fail-fast
Error Handling and Debugging
Common Validation Failures
Data Validation Failures:
- Token length exceeded â Truncate or filter long sequences
- Missing required fields â Fix data preprocessing
- Duplicate entries â Deduplicate dataset
- Invalid labels â Clean label data
Model Validation Failures:
- Missing config files â Ensure complete checkpoint
- Weight loading errors â Check framework compatibility
- Tokenizer mismatch â Verify tokenizer version
Pipeline Test Failures:
- Out of memory â Reduce batch size, enable gradient checkpointing
- CUDA errors â Check GPU drivers, CUDA version
- Slow performance â Profile and optimize data loading
Debug Mode
All scripts support verbose debugging:
bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug
Outputs:
- Detailed progress logs
- Stack traces for errors
- Performance metrics
- Memory usage statistics
Best Practices
- Always validate before training – Catch data issues early
- Use schema validation – Enforce data quality standards
- Test with sample data first – Verify pipeline before full training
- Check dependencies on new platforms – Avoid runtime failures
- Validate checkpoints – Ensure model integrity before deployment
- Monitor validation metrics – Track data quality over time
- Automate validation in CI/CD – Prevent bad data from entering pipeline
- Keep validation schemas updated – Reflect current data requirements
Performance Tips
- Use
--sample-sizefor quick validation of large datasets - Run dependency checks once per environment, cache results
- Parallelize validation with GNU parallel for multi-file datasets
- Use
--fail-fastin CI/CD to save time - Cache tokenizers to avoid re-downloading
Troubleshooting
Script Not Found:
# Ensure you're in the correct directory
cd /path/to/ml-training/skills/validation-scripts
bash scripts/validate-data.sh --help
Permission Denied:
# Make scripts executable
chmod +x scripts/*.sh
Missing Dependencies:
# Install required tools
pip install jsonschema pandas pyarrow
sudo apt-get install jq bc
Plugin: ml-training Version: 1.0.0 Supported Frameworks: PyTorch, TensorFlow, JAX Platforms: Modal, Lambda Labs, RunPod, Local