ml-research
npx skills add https://github.com/pranav-karra-3301/skills --skill ml-research
ML Research Skill
Overview
This skill provides comprehensive support for ML/AI research experiments, finetuning, and training. It helps with:
- System Detection: GPU, CUDA, memory, framework versions
- Project Setup: Cookiecutter-style ML project structure with proper documentation
- Experiment Tracking: W&B, MLflow, TensorBoard configuration
- Best Practices: Reproducibility, data handling, training loops
- Debugging: Common mistakes, memory issues, convergence problems
- Visualization: Publication-quality plots, colorblind-friendly palettes
Core Workflow
Phase 1: System Detection
Before any ML work, detect the compute environment:
1. GPU Detection
# Check for NVIDIA GPU
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader 2>/dev/null || echo "No NVIDIA GPU detected"
# Check CUDA version
nvcc --version 2>/dev/null | grep "release" || echo "CUDA not found"
# Check cuDNN (if accessible)
cat /usr/local/cuda/include/cudnn_version.h 2>/dev/null | grep CUDNN_MAJOR -A 2 || echo "cuDNN version not directly accessible"
2. Python Environment
# Python version
python3 --version
# Virtual environment detection
echo $VIRTUAL_ENV $CONDA_DEFAULT_ENV
# Check for common ML frameworks
python3 -c "import torch; print(f'PyTorch {torch.__version__}, CUDA: {torch.cuda.is_available()}')" 2>/dev/null
python3 -c "import tensorflow as tf; print(f'TensorFlow {tf.__version__}')" 2>/dev/null
python3 -c "import jax; print(f'JAX {jax.__version__}')" 2>/dev/null
3. Memory Detection
# System RAM
free -h 2>/dev/null || sysctl hw.memsize 2>/dev/null
# GPU memory (via torch if available)
python3 -c "import torch; print(f'GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')" 2>/dev/null
4. ML Tools Detection
# Check for experiment tracking
python3 -c "import wandb; print(f'W&B {wandb.__version__}')" 2>/dev/null
python3 -c "import mlflow; print(f'MLflow {mlflow.__version__}')" 2>/dev/null
# Check for config management
python3 -c "import hydra; print(f'Hydra {hydra.__version__}')" 2>/dev/null
Run comprehensive detection:
python3 scripts/detect_system.py
See references/gpu-management.md for compute optimization.
Phase 2: Task Understanding
Identify the ML task type:
| Task Type | Key Indicators | Critical Checks |
|---|---|---|
| Training from scratch | New model, random init | Data size, compute budget |
| Finetuning | Pretrained model, adaptation | Base model selection, LR schedule |
| Evaluation | Metrics, benchmarking | No data leakage, proper splits |
| Inference | Deployment, serving | Batch size, latency requirements |
Model Domain Detection:
- Vision: Images, CNNs, ViT, augmentation
- NLP/LLM: Text, transformers, tokenization
- Multimodal: Vision + language, CLIP-style
- Tabular: Structured data, gradient boosting often better
- RL: Environment, policy, reward
Scale Assessment:
- Local: Single GPU, fits in memory
- Multi-GPU: DataParallel or DDP
- Distributed: Multiple nodes, FSDP/DeepSpeed
- Cloud: Managed services (SageMaker, Vertex AI)
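For the Multi-GPU tier, a minimal single-node DDP sketch (assuming launch via torchrun; MyModel is a placeholder for your model class):

# Minimal single-node DDP sketch, assuming launch via:
#   torchrun --nproc_per_node=NUM_GPUS train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")   # torchrun sets the rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().to(local_rank)          # MyModel is a placeholder
model = DDP(model, device_ids=[local_rank])
# Pair with a DistributedSampler so each rank sees a distinct data shard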
Phase 3: Gap Analysis
Check the project against ML best practices:
Reproducibility Checklist
- Random seeds set (Python, NumPy, PyTorch/TF)
- Deterministic mode enabled (if needed)
- Environment captured (requirements.txt, conda env)
- Data versioning (DVC, direct hashing)
- Code versioning (git commit tracked in experiments)
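For example, a minimal sketch of capturing seed, git commit, and environment at run start (field names are illustrative):

# Sketch: snapshot reproducibility state at the start of a run
import subprocess
import sys
import torch

def capture_run_metadata(seed: int) -> dict:
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    return {
        "seed": seed,
        "git_commit": commit,
        "python": sys.version.split()[0],
        "torch": torch.__version__,
    }

# Log this dict to W&B/MLflow alongside the config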
Data Handling Checklist
- Train/val/test splits defined before any preprocessing
- No data leakage between splits
- Data validation (schema, distributions)
- Preprocessing pipeline reproducible
- Augmentations only on training data
Training Checklist
- Experiment tracking configured (W&B, MLflow)
- Logging (not just print statements)
- Checkpointing enabled
- Early stopping configured
- Metrics defined and tracked
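A minimal early-stopping helper, as one possible shape (adapt to your trainer):

# Stop when validation loss hasn't improved for `patience` epochs
class EarlyStopping:
    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience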
Documentation Checklist
- CLAUDE.md exists with project context
- AGENTS.md for multi-agent work
- README with setup instructions
- Experiment notes/logs
Run validation:
python3 scripts/validate_experiment.py
See references/common-mistakes.md for detailed issues.
Phase 4: Project Setup / Code Generation
Recommended Project Structure
When setting up an ML project, discuss with the user what structure fits their needs. A typical ML research project includes:
my_experiment/
├── src/
│   ├── __init__.py
│   ├── data/
│   │   ├── __init__.py
│   │   ├── dataset.py            # Dataset classes
│   │   └── preprocessing.py      # Data transforms
│   ├── models/
│   │   ├── __init__.py
│   │   └── model.py              # Model definitions
│   ├── training/
│   │   ├── __init__.py
│   │   ├── trainer.py            # Training loop
│   │   └── losses.py             # Custom losses
│   └── utils/
│       ├── __init__.py
│       ├── logging.py            # Logging utilities
│       └── reproducibility.py    # Seed setting
├── configs/
│   ├── config.yaml               # Main Hydra config
│   ├── model/
│   │   └── default.yaml
│   ├── data/
│   │   └── default.yaml
│   └── training/
│       └── default.yaml
├── scripts/
│   ├── train.py                  # Main training entry
│   ├── evaluate.py               # Evaluation script
│   └── inference.py              # Inference script
├── tests/
│   ├── __init__.py
│   ├── test_data.py
│   ├── test_model.py
│   └── conftest.py               # pytest fixtures
├── notebooks/
│   └── exploration.ipynb
├── data/                         # .gitignored
├── outputs/                      # .gitignored
├── experiments/                  # .gitignored
├── CLAUDE.md
├── AGENTS.md
├── README.md
├── pyproject.toml
├── .gitignore
└── .env.example
See references/project-structure.md for templates.
Configuration Generation
Hydra Config Template:
# configs/config.yaml
defaults:
  - model: default
  - data: default
  - training: default
  - _self_

seed: 42
experiment_name: ${now:%Y-%m-%d_%H-%M-%S}

hydra:
  run:
    dir: outputs/${experiment_name}
  sweep:
    dir: outputs/sweeps/${experiment_name}
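A minimal scripts/train.py entry point consuming this config could look like (a sketch; wire in your own model and data builders):

# scripts/train.py — minimal Hydra entry point for the config above
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="../configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    print(OmegaConf.to_yaml(cfg))  # resolved config, including defaults
    # set_seed(cfg.seed); build model/data from cfg; run training

if __name__ == "__main__":
    main()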
Experiment Tracking Setup
W&B Integration:
import wandb
wandb.init(
    project="my-project",
    config=dict(cfg),          # Hydra config
    tags=["experiment-type"],
    notes="Description of this run",
)
# Log metrics
wandb.log({"loss": loss, "accuracy": acc})
# Log artifacts
wandb.save("model.pt")
MLflow Integration:
import mlflow
mlflow.set_tracking_uri("sqlite:///mlflow.db") # or remote URI
mlflow.set_experiment("my-experiment")
with mlflow.start_run():
    mlflow.log_params(dict(cfg))
    mlflow.log_metrics({"loss": loss, "accuracy": acc})
    mlflow.log_artifact("model.pt")
See references/experiment-tracking.md for detailed setup.
Phase 5: Validation & Verification
Before training, run sanity checks:
1. Data Pipeline Verification
# Check data loader
batch = next(iter(train_loader))
print(f"Batch shape: {batch['x'].shape}")
print(f"Labels: {batch['y'].unique()}")
# Verify no leakage
train_ids = set(train_dataset.ids)
val_ids = set(val_dataset.ids)
assert len(train_ids & val_ids) == 0, "Data leakage detected!"
2. Model Sanity Checks
# Verify forward pass
model.eval()
with torch.no_grad():
    output = model(batch['x'].to(device))
print(f"Output shape: {output.shape}")
# Check gradient flow
model.train()
output = model(batch['x'].to(device))
loss = criterion(output, batch['y'].to(device))
loss.backward()
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"WARNING: No gradient for {name}")
3. Reproducibility Test
# Run twice with same seed, verify identical results
results1 = train_one_step(seed=42)
results2 = train_one_step(seed=42)
assert torch.allclose(results1, results2), "Not reproducible!"
4. Memory Estimation
def estimate_memory(model, batch_size, input_size):
    """Very rough training-memory estimate in GB."""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    grad_bytes = param_bytes                # gradients are the same size as params
    optimizer_bytes = param_bytes * 2       # Adam keeps 2 extra states per param
    activation_bytes = batch_size * input_size * 4  # crude estimate, fp32 activations
    total_gb = (param_bytes + grad_bytes + optimizer_bytes + activation_bytes) / 1e9
    return total_gb
See references/testing.md for comprehensive test examples.
Language-Specific Setup
PyTorch Project
Essential imports:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
import numpy as np
import random
def set_seed(seed: int = 42):
    """Set all seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # For complete determinism (may slow down training)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
Training loop template:
def train_epoch(model, loader, optimizer, criterion, device, scaler=None):
    model.train()
    total_loss = 0
    for batch in loader:
        x, y = batch['x'].to(device), batch['y'].to(device)
        optimizer.zero_grad()
        if scaler:  # mixed precision training
            with torch.cuda.amp.autocast():
                output = model(x)
                loss = criterion(output, y)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            output = model(x)
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)
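To enable the mixed-precision path, construct a GradScaler and pass it in (train_loader here stands for your DataLoader):

# Enable the mixed-precision branch of train_epoch
scaler = torch.cuda.amp.GradScaler()
avg_loss = train_epoch(model, train_loader, optimizer, criterion, device, scaler=scaler)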
TensorFlow/Keras Project
Essential setup:
import tensorflow as tf
import numpy as np
import random
def set_seed(seed: int = 42):
    """Set all seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

# Enable mixed precision
tf.keras.mixed_precision.set_global_policy('mixed_float16')
JAX/Flax Project
Essential setup:
import jax
import jax.numpy as jnp
from flax import linen as nn
from flax.training import train_state
# JAX uses explicit PRNG
key = jax.random.PRNGKey(42)
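Keys are split, never reused; a quick illustration:

# Each subkey gives an independent stream of randomness
key, init_key, dropout_key = jax.random.split(key, 3)
x = jax.random.normal(init_key, (8, 32))  # e.g., a dummy input for model.init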
Finetuning Best Practices
See references/finetuning.md for comprehensive guide.
Quick Reference
| Aspect | Training from Scratch | Finetuning |
|---|---|---|
| Learning Rate | 1e-3 to 1e-4 | 1e-5 to 1e-6 (roughly 100x smaller) |
| Epochs | Many (100+) | Few (3-10) |
| Data Size | Large | Can be small |
| Compute | High | Lower |
LLM Finetuning
LoRA Setup (PEFT):
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                 # rank
    lora_alpha=32,        # scaling
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
QLoRA (4-bit):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
GPU Management
See references/gpu-management.md for detailed optimization.
Quick Memory Estimation
def estimate_training_memory_gb(
    model_params_billions: float,
    batch_size: int,
    seq_length: int = 512,
    precision: str = "fp16",  # fp32, fp16, bf16
) -> float:
    """Rough estimate of GPU memory for training."""
    bytes_per_param = 2 if precision in ["fp16", "bf16"] else 4
    # Model parameters
    param_memory = model_params_billions * 1e9 * bytes_per_param
    # Gradients (same size as params)
    grad_memory = param_memory
    # Optimizer states (Adam: 2x params in fp32)
    optimizer_memory = model_params_billions * 1e9 * 4 * 2
    # Activations (very rough; assumes a hidden size of ~4096)
    activation_memory = batch_size * seq_length * 4096 * bytes_per_param
    total_bytes = param_memory + grad_memory + optimizer_memory + activation_memory
    return total_bytes / 1e9
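As a worked example under these assumptions, a full finetune of a 7B model in bf16 at batch size 4 and sequence length 2048 lands around 84 GB:

# ~84 GB: 14 (params, bf16) + 14 (grads) + 56 (Adam states, fp32) + ~0.07 (activations)
print(estimate_training_memory_gb(7, batch_size=4, seq_length=2048, precision="bf16"))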
Batch Size Tuning
def find_optimal_batch_size(model, sample_input, device, start_batch=1):
    """Double the batch size until OOM, then return the last size that fit."""
    batch_size = start_batch
    while True:
        try:
            # Create batch by repeating the sample along the batch dimension
            batch = sample_input.repeat(batch_size, *([1] * (sample_input.dim() - 1)))
            batch = batch.to(device)
            # Forward + backward
            model.zero_grad()
            output = model(batch)
            loss = output.mean()
            loss.backward()
            # Success, try larger
            print(f"Batch size {batch_size}: OK")
            batch_size *= 2
            torch.cuda.empty_cache()
        except RuntimeError as e:
            if "out of memory" in str(e):
                torch.cuda.empty_cache()
                optimal = batch_size // 2
                print(f"OOM at {batch_size}, optimal: {optimal}")
                return optimal
            raise
Visualization
See references/visualization.md for complete guide.
Publication-Quality Matplotlib
import matplotlib.pyplot as plt

# Publication settings
plt.rcParams.update({
    'font.size': 12,
    'font.family': 'serif',
    'axes.labelsize': 14,
    'axes.titlesize': 16,
    'xtick.labelsize': 12,
    'ytick.labelsize': 12,
    'legend.fontsize': 11,
    'figure.figsize': (8, 6),
    'figure.dpi': 150,
    'savefig.dpi': 300,
    'savefig.bbox': 'tight',
})
Colorblind-Friendly Palettes
# Recommended palettes
COLORBLIND_SAFE = ['#0077BB', '#33BBEE', '#009988', '#EE7733', '#CC3311', '#EE3377']
# Or use built-in
plt.style.use('seaborn-v0_8-colorblind')
# Or: cmap = plt.cm.viridis / plt.cm.cividis
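To apply the palette globally, one option is matplotlib's cycler:

from cycler import cycler

# Make every new axes cycle through the colorblind-safe palette above
plt.rcParams['axes.prop_cycle'] = cycler(color=COLORBLIND_SAFE)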
Standard ML Plots
def plot_training_curves(history, save_path=None):
    """Plot training and validation curves."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    # Loss
    axes[0].plot(history['train_loss'], label='Train')
    axes[0].plot(history['val_loss'], label='Validation')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Loss')
    axes[0].legend()
    axes[0].set_title('Loss Curves')
    # Metric
    axes[1].plot(history['train_acc'], label='Train')
    axes[1].plot(history['val_acc'], label='Validation')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Accuracy')
    axes[1].legend()
    axes[1].set_title('Accuracy Curves')
    plt.tight_layout()
    if save_path:
        plt.savefig(save_path, dpi=300, bbox_inches='tight')
    return fig
Common Tasks
“Create a new ML project”
- Run system detection (Phase 1)
- Ask about project type (vision, NLP, tabular, etc.)
- Discuss project structure with user (see references/project-structure.md)
- Create necessary directories and files based on user’s needs
- Set up experiment tracking based on preference
- Generate CLAUDE.md and AGENTS.md with project-specific context
“Debug why my model isn’t converging”
- Check learning rate (too high? too low?)
- Check data normalization
- Check for NaN/Inf in gradients (see the sketch after this list)
- Verify loss function matches task
- Check batch size isn’t too small
- Look for data issues (labels correct? balanced?)
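For the NaN/Inf check above, a quick sketch:

# Quick check for non-finite gradients, run right after loss.backward()
import torch

def check_gradients(model: torch.nn.Module) -> None:
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"Non-finite gradient in {name}")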
“Set up W&B for my project”
- Check if W&B is installed (`pip install wandb`)
- Verify login status (`wandb login`)
- Generate initialization code
- Add logging to training loop
- Set up sweep configuration if requested (sketch below)
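A minimal sweep setup sketch (the parameter names and train_fn are illustrative):

import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-3},
        "batch_size": {"values": [16, 32, 64]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="my-project")
wandb.agent(sweep_id, function=train_fn, count=20)  # train_fn: your training entry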
“Optimize GPU memory usage”
- Run memory estimation
- Enable mixed precision (AMP)
- Add gradient accumulation (see the sketch after this list)
- Enable gradient checkpointing if needed
- Consider batch size reduction
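A sketch of gradient accumulation (effective batch size = batch_size × accum_steps):

# Step the optimizer every `accum_steps` micro-batches
accum_steps = 4
optimizer.zero_grad()
for i, batch in enumerate(train_loader):
    x, y = batch['x'].to(device), batch['y'].to(device)
    loss = criterion(model(x), y) / accum_steps  # scale so gradients average correctly
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()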
“Create publication figures”
- Set publication rcParams
- Use colorblind-friendly palette
- Export as PDF (300+ DPI)
- Verify text is readable at intended size
“Review my experiment for common mistakes”
- Run `scripts/validate_experiment.py`
- Check reproducibility setup
- Verify no data leakage
- Check logging and checkpointing
- Review hyperparameters against literature
When to Alert the User
Always notify for these issues:
- GPU unavailable: "No GPU detected. Training will run on CPU and be significantly slower."
- Memory constraints: "Model (~X GB) may not fit in GPU (~Y GB). Consider reducing batch size or enabling gradient checkpointing."
- Missing reproducibility: "No random seeds set. Results will not be reproducible. Add seed setting?"
- Data leakage detected: "WARNING: Test data IDs found in training set. This will invalidate your results."
- Framework conflicts: "Both PyTorch and TensorFlow detected. Which is the primary framework for this project?"
- Deterministic mode tradeoffs: "Enabling deterministic mode for reproducibility. Note: This may slow training by ~10%."
- Checkpoint accumulation: "Found 15GB of checkpoints in outputs/. Clean up old checkpoints?"
- Missing experiment tracking: "No experiment tracking detected. Set up W&B or MLflow for proper logging?"
Quality Checklist
Before considering ML setup complete:
- System detection run and documented
- Reproducibility seeds set
- Data splits verified (no leakage)
- Experiment tracking configured
- Checkpointing enabled
- Logging (not print) used throughout
- CLAUDE.md documents current experiment
- .gitignore excludes data, checkpoints, outputs
- Tests exist for data pipeline and model
- Memory usage estimated and feasible
See references/checklists.md for complete checklists.