validation-scripts

📁 vanman2024/ai-dev-marketplace 📅 2 days ago
Install command
npx skills add https://github.com/vanman2024/ai-dev-marketplace --skill validation-scripts

Skill Documentation

ML Training Validation Scripts

Purpose: Production-ready validation and testing utilities for ML training workflows. The scripts check data quality, model integrity, pipeline correctness, and dependency availability before and during training.

Activation Triggers:

  • Validating training datasets before fine-tuning
  • Checking model checkpoints and outputs
  • Testing end-to-end training pipelines
  • Verifying system dependencies and GPU availability
  • Debugging training failures or data issues
  • Ensuring data format compliance
  • Validating model configurations
  • Testing inference pipelines

Key Resources:

  • scripts/validate-data.sh – Comprehensive dataset validation
  • scripts/validate-model.sh – Model checkpoint and config validation
  • scripts/test-pipeline.sh – End-to-end pipeline testing
  • scripts/check-dependencies.sh – System and package dependency checking
  • templates/test-config.yaml – Pipeline test configuration template
  • templates/validation-schema.json – Data validation schema template
  • examples/data-validation-example.md – Dataset validation workflow
  • examples/pipeline-testing-example.md – Complete pipeline testing guide

Quick Start

Validate Training Data

bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --format jsonl \
  --schema templates/validation-schema.json

Validate Model Checkpoint

bash scripts/validate-model.sh \
  --model-path ./checkpoints/epoch-3 \
  --framework pytorch \
  --check-weights

Test Complete Pipeline

bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --data ./data/sample.jsonl \
  --verbose

Check System Dependencies

bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16

Validation Scripts

1. Data Validation (validate-data.sh)

Purpose: Validate training datasets for format compliance, data quality, and schema conformance.

Usage:

bash scripts/validate-data.sh [OPTIONS]

Options:

  • --data-path PATH – Path to dataset file or directory (required)
  • --format FORMAT – Data format: jsonl, csv, parquet, arrow (default: jsonl)
  • --schema PATH – Path to validation schema JSON file
  • --sample-size N – Number of samples to validate (default: all)
  • --check-duplicates – Check for duplicate entries
  • --check-null – Check for null/missing values
  • --check-length – Validate text length constraints
  • --check-tokens – Validate tokenization compatibility
  • --tokenizer MODEL – Tokenizer to use for token validation (default: gpt2)
  • --max-length N – Maximum sequence length (default: 2048)
  • --output REPORT – Output validation report path (default: validation-report.json)

Validation Checks:

  • Format Compliance: Verify file format and structure
  • Schema Validation: Check against JSON schema if provided
  • Data Types: Validate field types (string, int, float, etc.)
  • Required Fields: Ensure all required fields are present
  • Value Ranges: Check numeric values within expected ranges
  • Text Quality: Detect empty strings, excessive whitespace, encoding issues
  • Duplicates: Identify duplicate entries (exact or near-duplicate)
  • Token Counts: Verify sequences fit within model context length
  • Label Distribution: Check class balance for classification tasks
  • Missing Values: Detect and report null/missing values
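
A few of the checks listed above can be sketched in plain Python. This is a minimal illustration, not the logic of validate-data.sh itself: the function name validate_jsonl_lines and the constants are hypothetical, and the rules simply mirror the text/label fields used elsewhere in this document.

```python
import json
from collections import Counter

# Hypothetical rules, chosen to mirror the text/label fields in this document.
REQUIRED_FIELDS = ("text", "label")
ALLOWED_LABELS = {"positive", "negative", "neutral"}
MIN_TEXT_LENGTH = 10


def validate_jsonl_lines(lines):
    """Return (errors, label_counts) for an iterable of JSONL lines."""
    errors = []
    seen_texts = set()
    label_counts = Counter()

    def report(sample_id, field, message):
        errors.append({"sample_id": sample_id, "field": field, "error": message})

    for sample_id, line in enumerate(lines):
        # Format compliance: every line must be a JSON object.
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            report(sample_id, None, f"Invalid JSON: {exc.msg}")
            continue
        if not isinstance(record, dict):
            report(sample_id, None, "Record is not a JSON object")
            continue

        # Required fields and missing values.
        missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
        for field in missing:
            report(sample_id, field, "Missing or empty required field")
        if missing:
            continue

        text, label = record["text"], record["label"]

        # Text quality and exact duplicates.
        if not isinstance(text, str) or len(text.strip()) < MIN_TEXT_LENGTH:
            report(sample_id, "text", f"Text shorter than {MIN_TEXT_LENGTH} characters")
        elif text in seen_texts:
            report(sample_id, "text", "Exact duplicate of an earlier sample")
        if isinstance(text, str):
            seen_texts.add(text)

        # Label values and label distribution.
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            report(sample_id, "label", f"Invalid label value: {label!r}")
        else:
            label_counts[label] += 1

    return errors, label_counts
```

Calling validate_jsonl_lines(open("data/train.jsonl", encoding="utf-8")) would return a list of error records shaped like the validation_errors entries in the example report, plus the label counts needed for a distribution check.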

Example Output:

{
  "status": "FAIL",
  "total_samples": 10000,
  "valid_samples": 9987,
  "invalid_samples": 13,
  "validation_errors": [
    {
      "sample_id": 42,
      "field": "text",
      "error": "Exceeds max token length: 2150 > 2048"
    },
    {
      "sample_id": 156,
      "field": "label",
      "error": "Invalid label value: 'unknwn' (typo)"
    }
  ],
  "statistics": {
    "avg_text_length": 487,
    "avg_token_count": 128,
    "label_distribution": {
      "positive": 4892,
      "negative": 4895,
      "neutral": 213
    }
  },
  "recommendations": [
    "Remove or fix 13 invalid samples before training",
    "Label distribution is imbalanced - consider class weighting"
  ]
}

Exit Codes:

  • 0 – All validations passed
  • 1 – Validation errors found
  • 2 – Script error (invalid arguments, file not found)
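
When the script is called from Python tooling, the exit codes above can be translated into readable outcomes. The sketch below assumes only the three documented codes; run_and_classify is a hypothetical helper name.

```python
import subprocess

# Exit-code meanings as documented for validate-data.sh.
EXIT_MEANINGS = {
    0: "all validations passed",
    1: "validation errors found",
    2: "script error (invalid arguments, file not found)",
}


def run_and_classify(command):
    """Run a command and return its (exit_code, meaning) pair."""
    # check=False: a non-zero exit is an expected outcome, not an exception.
    completed = subprocess.run(command, check=False)
    meaning = EXIT_MEANINGS.get(completed.returncode, "undocumented exit code")
    return completed.returncode, meaning
```

For example, run_and_classify(["bash", "scripts/validate-data.sh", "--data-path", "./data/train.jsonl"]) lets a caller distinguish bad data (code 1) from a misconfigured invocation (code 2) and react differently to each.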

2. Model Validation (validate-model.sh)

Purpose: Validate model checkpoints, configurations, and inference readiness.

Usage:

bash scripts/validate-model.sh [OPTIONS]

Options:

  • --model-path PATH – Path to model checkpoint directory (required)
  • --framework FRAMEWORK – Framework: pytorch, tensorflow, jax (default: pytorch)
  • --check-weights – Verify model weights are loadable
  • --check-config – Validate model configuration file
  • --check-tokenizer – Verify tokenizer files are present
  • --check-inference – Test inference with sample input
  • --sample-input TEXT – Sample text for inference test
  • --expected-output TEXT – Expected output for verification
  • --output REPORT – Output validation report path

Validation Checks:

  • File Structure: Verify required files are present
  • Config Validation: Check model_config.json for correctness
  • Weight Integrity: Load and verify model weights
  • Tokenizer Files: Ensure tokenizer files exist and are loadable
  • Model Architecture: Validate architecture matches config
  • Memory Requirements: Estimate GPU/CPU memory needed
  • Inference Test: Run sample inference if requested
  • Quantization: Verify quantized models load correctly
  • LoRA/PEFT: Validate adapter weights and configuration

Example Output:

{
  "status": "PASS",
  "model_path": "./checkpoints/llama-7b-finetuned",
  "framework": "pytorch",
  "checks": {
    "file_structure": "PASS",
    "config": "PASS",
    "weights": "PASS",
    "tokenizer": "PASS",
    "inference": "PASS"
  },
  "model_info": {
    "architecture": "LlamaForCausalLM",
    "parameters": "7.2B",
    "precision": "float16",
    "lora_enabled": true,
    "lora_rank": 16
  },
  "memory_estimate": {
    "model_size_gb": 13.5,
    "inference_vram_gb": 16.2,
    "training_vram_gb": 24.8
  },
  "inference_test": {
    "input": "Hello, world!",
    "output": "Hello, world! How can I help you today?",
    "latency_ms": 142
  }
}
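
The source does not say how the memory_estimate figures are derived. A common weights-only lower bound is the parameter count multiplied by the bytes per parameter; the sketch below shows that arithmetic under this assumption. The function name and the precision table are illustrative.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weights_size_gib(num_parameters, precision="float16"):
    """Weights-only size in GiB: parameter count times bytes per parameter."""
    return num_parameters * BYTES_PER_PARAM[precision] / 2**30


# 7.2B parameters in float16 come to roughly 13.4 GiB of weights.
print(round(weights_size_gib(7.2e9, "float16"), 1))
```

This lands close to the model_size_gb value in the example report. Inference and training need more than the weights alone, since activations, gradients, and optimizer state also occupy memory, which fits the larger inference_vram_gb and training_vram_gb values.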

3. Pipeline Testing (test-pipeline.sh)

Purpose: Test end-to-end ML training and inference pipelines with sample data.

Usage:

bash scripts/test-pipeline.sh [OPTIONS]

Options:

  • --config PATH – Pipeline test configuration file (required)
  • --data PATH – Sample data for testing
  • --steps STEPS – Comma-separated steps to test (default: all)
  • --verbose – Enable detailed output
  • --output REPORT – Output test report path
  • --fail-fast – Stop on first failure
  • --cleanup – Clean up temporary files after test

Pipeline Steps:

  • data_loading: Test data loading and preprocessing
  • tokenization: Verify tokenization pipeline
  • model_loading: Load model and verify initialization
  • training_step: Run single training step
  • validation_step: Run single validation step
  • checkpoint_save: Test checkpoint saving
  • checkpoint_load: Test checkpoint loading
  • inference: Test inference pipeline
  • metrics: Verify metrics calculation
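
A --steps value has to be mapped onto the step names listed above. One way to do this is sketched below; whether test-pipeline.sh reorders or rejects steps in this way is an assumption, and parse_steps is a hypothetical name.

```python
# Step names in the order they are documented.
PIPELINE_STEPS = (
    "data_loading", "tokenization", "model_loading",
    "training_step", "validation_step",
    "checkpoint_save", "checkpoint_load",
    "inference", "metrics",
)


def parse_steps(value="all"):
    """Turn a --steps value into an ordered list of known step names."""
    if value.strip().lower() == "all":
        return list(PIPELINE_STEPS)
    requested = [step.strip() for step in value.split(",") if step.strip()]
    unknown = [step for step in requested if step not in PIPELINE_STEPS]
    if unknown:
        raise ValueError(f"Unknown pipeline steps: {', '.join(unknown)}")
    # Run steps in documented order regardless of how they were listed.
    return [step for step in PIPELINE_STEPS if step in requested]
```

Rejecting unknown names up front means a typo such as "inferense" fails immediately rather than silently skipping a step.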

Test Configuration (test-config.yaml):

pipeline:
  name: llama-7b-finetuning
  framework: pytorch

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  load_in_8bit: false
  lora:
    enabled: true
    r: 16
    alpha: 32

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
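
The expected_loss_range setting suggests a simple sanity check on the loss from the test steps. A minimal sketch, assuming the range is inclusive and that a non-finite loss always counts as a failure (neither is stated in the source):

```python
import math


def check_loss(loss, expected_range=(0.5, 3.0)):
    """Return "PASS" if loss is finite and inside the inclusive range, else "FAIL"."""
    low, high = expected_range
    if not math.isfinite(loss):
        # NaN or infinite loss usually signals a diverged or broken step.
        return "FAIL"
    return "PASS" if low <= loss <= high else "FAIL"
```

A loss far below the range on a handful of samples can be as suspicious as one far above it, for instance when labels leak into the inputs, which is why both bounds are checked.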

Example Output:

Pipeline Test Report
====================

Pipeline: llama-7b-finetuning
Started: 2025-11-01 12:34:56
Duration: 127 seconds

Test Results:
✓ data_loading (2.3s) - Loaded 10 samples successfully
✓ tokenization (1.1s) - Tokenized all samples, avg length: 128 tokens
✓ model_loading (8.7s) - Model loaded, 7.2B parameters
✓ training_step (15.4s) - Training step completed, loss: 1.847
✓ validation_step (12.1s) - Validation step completed, loss: 1.923
✓ checkpoint_save (3.2s) - Checkpoint saved to ./checkpoints/test
✓ checkpoint_load (6.8s) - Checkpoint loaded successfully
✓ inference (2.9s) - Inference completed, latency: 142ms
✓ metrics (0.4s) - Metrics calculated correctly

Overall: PASS (9/9 tests passed)

Performance Metrics:
- Total time: 127s
- GPU memory used: 15.2 GB
- CPU memory used: 8.4 GB

Recommendations:
- Pipeline is ready for full training
- Consider increasing batch_size to improve throughput

4. Dependency Checking (check-dependencies.sh)

Purpose: Verify all required system dependencies, packages, and GPU availability.

Usage:

bash scripts/check-dependencies.sh [OPTIONS]

Options:

  • --framework FRAMEWORK – ML framework: pytorch, tensorflow, jax (default: pytorch)
  • --gpu-required – Require GPU availability
  • --min-vram GB – Minimum GPU VRAM required (GB)
  • --cuda-version VERSION – Required CUDA version
  • --packages FILE – Path to requirements.txt or packages list
  • --platform PLATFORM – Platform: modal, lambda, runpod, local (default: local)
  • --fix – Attempt to install missing packages
  • --output REPORT – Output dependency report path

Dependency Checks:

  • Python Version: Verify compatible Python version
  • ML Framework: Check PyTorch/TensorFlow/JAX installation
  • CUDA/cuDNN: Verify CUDA toolkit and cuDNN
  • GPU Availability: Detect and validate GPUs
  • VRAM: Check GPU memory capacity
  • System Packages: Verify system-level dependencies
  • Python Packages: Check required pip packages
  • Platform Tools: Validate platform-specific tools (Modal, Lambda CLI)
  • Storage: Check available disk space
  • Network: Test internet connectivity for model downloads
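
Three of these checks (Python version, Python packages, storage) need nothing beyond the standard library. The sketch below is illustrative only: check_environment and its defaults are hypothetical, and GPU, CUDA, and network checks are left out because they depend on the installed framework and platform.

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 9), required_gb=50, packages=("torch",), path="."):
    """Collect a few dependency checks into a report-like dict."""
    free_gb = shutil.disk_usage(path).free / 1e9
    # find_spec returns None for top-level packages that are not installed.
    missing = [name for name in packages if importlib.util.find_spec(name) is None]
    checks = {
        "python": {
            "status": "PASS" if sys.version_info[:2] >= min_python else "FAIL",
            "version": ".".join(str(part) for part in sys.version_info[:3]),
        },
        "packages": {
            "status": "PASS" if not missing else "FAIL",
            "missing": missing,
        },
        "storage": {
            "status": "PASS" if free_gb >= required_gb else "FAIL",
            "available_gb": round(free_gb, 1),
        },
    }
    overall = "PASS" if all(c["status"] == "PASS" for c in checks.values()) else "FAIL"
    return {"status": overall, "checks": checks}
```

The returned dict follows the general shape of the example output below, so it could be written out with json.dumps for comparison against a real report.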

Example Output:

{
  "status": "PASS",
  "platform": "modal",
  "checks": {
    "python": {
      "status": "PASS",
      "version": "3.10.12",
      "required": ">=3.9"
    },
    "pytorch": {
      "status": "PASS",
      "version": "2.1.0",
      "cuda_available": true,
      "cuda_version": "12.1"
    },
    "gpu": {
      "status": "PASS",
      "count": 1,
      "type": "NVIDIA A100",
      "vram_gb": 40,
      "driver_version": "535.129.03"
    },
    "packages": {
      "status": "PASS",
      "installed": 42,
      "missing": 0,
      "outdated": 3
    },
    "storage": {
      "status": "PASS",
      "available_gb": 128,
      "required_gb": 50
    }
  },
  "recommendations": [
    "Update transformers to latest version (4.36.0 -> 4.37.2)",
    "Consider upgrading to PyTorch 2.2.0 for better performance"
  ]
}

Templates

Validation Schema Template (templates/validation-schema.json)

Defines expected data structure and validation rules for training datasets.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["text", "label"],
  "properties": {
    "text": {
      "type": "string",
      "minLength": 10,
      "maxLength": 5000,
      "description": "Input text for training"
    },
    "label": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"],
      "description": "Classification label"
    },
    "metadata": {
      "type": "object",
      "properties": {
        "source": {
          "type": "string"
        },
        "timestamp": {
          "type": "string",
          "format": "date-time"
        }
      }
    }
  },
  "validation_rules": {
    "max_token_length": 2048,
    "tokenizer": "meta-llama/Llama-2-7b-hf",
    "check_duplicates": true,
    "min_label_count": 100,
    "max_label_imbalance_ratio": 10.0
  }
}

Customization:

  • Modify required fields for your use case
  • Add custom properties and validation rules
  • Set appropriate length constraints
  • Define label enums for classification
  • Configure tokenizer and max length
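
The validation_rules block is not a standard JSON Schema keyword, so it is presumably interpreted by the script rather than by a schema validator. As a rough sketch of how min_label_count and max_label_imbalance_ratio could be applied, assuming the ratio means largest class count divided by smallest (the source does not define it):

```python
def check_label_balance(label_counts, min_label_count=100, max_imbalance_ratio=10.0):
    """Return warnings for rare labels and for a skewed label distribution."""
    if not label_counts:
        return ["No labels found"]
    warnings = []
    for label, count in label_counts.items():
        if count < min_label_count:
            warnings.append(
                f"Label {label!r} has only {count} samples (minimum {min_label_count})"
            )
    smallest = min(label_counts.values())
    largest = max(label_counts.values())
    # Assumed definition: largest class count over smallest class count.
    ratio = largest / smallest if smallest else float("inf")
    if ratio > max_imbalance_ratio:
        warnings.append(f"Label imbalance ratio {ratio:.1f} exceeds {max_imbalance_ratio}")
    return warnings


# Label distribution taken from the example validation report.
print(check_label_balance({"positive": 4892, "negative": 4895, "neutral": 213}))
```

With the distribution from the example report, every label clears the minimum count, but the ratio of about 23 exceeds the limit of 10.0, which matches the class-weighting recommendation in that report.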

Test Configuration Template (templates/test-config.yaml)

Defines pipeline testing parameters and expected behaviors.

pipeline:
  name: test-pipeline
  framework: pytorch
  platform: modal  # modal, lambda, runpod, local

data:
  train_path: ./data/sample-train.jsonl
  val_path: ./data/sample-val.jsonl
  format: jsonl  # jsonl, csv, parquet
  sample_size: 10  # Number of samples to use for testing

model:
  base_model: meta-llama/Llama-2-7b-hf
  checkpoint_path: ./checkpoints/test
  quantization: null  # null, 8bit, 4bit
  lora:
    enabled: true
    r: 16
    alpha: 32
    dropout: 0.05
    target_modules: ["q_proj", "v_proj"]

training:
  batch_size: 1
  gradient_accumulation: 1
  learning_rate: 2e-4
  max_steps: 5
  warmup_steps: 0
  eval_steps: 2

testing:
  sample_size: 10
  timeout_seconds: 300
  expected_loss_range: [0.5, 3.0]
  fail_fast: true
  cleanup: true

validation:
  metrics: ["loss", "perplexity"]
  check_gradients: true
  check_memory_leak: true

gpu:
  required: true
  min_vram_gb: 16
  allow_cpu_fallback: false

Integration with ML Training Workflow

Pre-Training Validation

# 1. Check system dependencies
bash scripts/check-dependencies.sh \
  --framework pytorch \
  --gpu-required \
  --min-vram 16

# 2. Validate training data
bash scripts/validate-data.sh \
  --data-path ./data/train.jsonl \
  --schema templates/validation-schema.json \
  --check-duplicates \
  --check-tokens

# 3. Test pipeline with sample data
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --verbose

Post-Training Validation

# 1. Validate model checkpoint
bash scripts/validate-model.sh \
  --model-path ./checkpoints/final \
  --framework pytorch \
  --check-weights \
  --check-inference

# 2. Test inference pipeline
bash scripts/test-pipeline.sh \
  --config templates/test-config.yaml \
  --steps inference,metrics

CI/CD Integration

# .github/workflows/validate-training.yml
- name: Validate Training Data
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/validate-data.sh \
      --data-path ./data/train.jsonl \
      --schema ./validation-schema.json

- name: Test Training Pipeline
  run: |
    bash plugins/ml-training/skills/validation-scripts/scripts/test-pipeline.sh \
      --config ./test-config.yaml \
      --fail-fast

Error Handling and Debugging

Common Validation Failures

Data Validation Failures:

  • Token length exceeded → Truncate or filter long sequences
  • Missing required fields → Fix data preprocessing
  • Duplicate entries → Deduplicate dataset
  • Invalid labels → Clean label data

Model Validation Failures:

  • Missing config files → Ensure complete checkpoint
  • Weight loading errors → Check framework compatibility
  • Tokenizer mismatch → Verify tokenizer version

Pipeline Test Failures:

  • Out of memory → Reduce batch size, enable gradient checkpointing
  • CUDA errors → Check GPU drivers, CUDA version
  • Slow performance → Profile and optimize data loading
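
For the first data failure above (token length exceeded), filtering can be as simple as partitioning records by token count. The sketch below uses whitespace splitting as a stand-in tokenizer, which is an assumption for illustration; a real run should pass in a counter based on the tokenizer used for training.

```python
def split_by_token_limit(records, max_length=2048, count_tokens=None):
    """Partition records into (kept, too_long) by token count of their "text" field."""
    if count_tokens is None:
        # Stand-in tokenizer: whitespace splitting, not a model tokenizer.
        def count_tokens(text):
            return len(text.split())

    kept, too_long = [], []
    for record in records:
        target = kept if count_tokens(record["text"]) <= max_length else too_long
        target.append(record)
    return kept, too_long
```

Keeping the rejected records in a separate list, rather than discarding them, makes it easy to decide later whether to truncate them or drop them.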

Debug Mode

All scripts support verbose debugging:

bash scripts/validate-data.sh --data-path ./data/train.jsonl --verbose --debug

Outputs:

  • Detailed progress logs
  • Stack traces for errors
  • Performance metrics
  • Memory usage statistics

Best Practices

  1. Always validate before training – Catch data issues early
  2. Use schema validation – Enforce data quality standards
  3. Test with sample data first – Verify pipeline before full training
  4. Check dependencies on new platforms – Avoid runtime failures
  5. Validate checkpoints – Ensure model integrity before deployment
  6. Monitor validation metrics – Track data quality over time
  7. Automate validation in CI/CD – Prevent bad data from entering pipeline
  8. Keep validation schemas updated – Reflect current data requirements

Performance Tips

  • Use --sample-size for quick validation of large datasets
  • Run dependency checks once per environment and cache the results
  • Parallelize validation with GNU parallel for multi-file datasets
  • Use --fail-fast in CI/CD to save time
  • Cache tokenizers to avoid re-downloading
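
The tip above names GNU parallel. Where a Python driver is more convenient, a thread pool that launches one validation process per file achieves a similar effect. This swaps in a different tool and is only a sketch: validate_files and validate_data_command are hypothetical helpers, and the per-file --output path is an assumption made so that parallel runs do not overwrite the default validation-report.json.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def validate_files(paths, build_command, max_workers=4):
    """Run one validation command per file concurrently; return {path: exit_code}."""
    def run(path):
        completed = subprocess.run(build_command(path), check=False)
        return path, completed.returncode

    # Threads are sufficient here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run, paths))


def validate_data_command(path):
    """Build a per-file invocation from the documented options (hypothetical layout)."""
    return [
        "bash", "scripts/validate-data.sh",
        "--data-path", path,
        "--format", "jsonl",
        "--output", f"{path}.report.json",
    ]
```

Calling validate_files(["data/part-0.jsonl", "data/part-1.jsonl"], validate_data_command) returns the exit code for each shard, so a caller can fail the run if any value is non-zero.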

Troubleshooting

Script Not Found:

# Ensure you're in the correct directory
cd /path/to/ml-training/skills/validation-scripts
bash scripts/validate-data.sh --help

Permission Denied:

# Make scripts executable
chmod +x scripts/*.sh

Missing Dependencies:

# Install required tools
pip install jsonschema pandas pyarrow
sudo apt-get install jq bc

Plugin: ml-training
Version: 1.0.0
Supported Frameworks: PyTorch, TensorFlow, JAX
Platforms: Modal, Lambda Labs, RunPod, Local