hugging-face-model-trainer
npx skills add https://github.com/alejandro-ao/hf-skills --skill hugging-face-model-trainer
Skill Documentation
TRL Training on Hugging Face Jobs
Overview
Train language models using TRL (Transformer Reinforcement Learning) on fully managed Hugging Face infrastructure. No local GPU setup required; models train on cloud GPUs and results are automatically saved to the Hugging Face Hub.
TRL provides multiple training methods:
- SFT (Supervised Fine-Tuning) – Standard instruction tuning
- DPO (Direct Preference Optimization) – Alignment from preference data
- GRPO (Group Relative Policy Optimization) – Online RL training
- Reward Modeling – Train reward models for RLHF
For detailed TRL method documentation:
hf_doc_search("your query", product="trl")
hf_doc_fetch("https://huggingface.co/docs/trl/sft_trainer") # SFT
hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer") # DPO
# etc.
See also: references/training_methods.md for method overviews and selection guidance
When to Use This Skill
Use this skill when users want to:
- Fine-tune language models on cloud GPUs without local infrastructure
- Train with TRL methods (SFT, DPO, GRPO, etc.)
- Run training jobs on Hugging Face Jobs infrastructure
- Convert trained models to GGUF for local deployment (Ollama, LM Studio, llama.cpp)
- Ensure trained models are permanently saved to the Hub
- Use modern workflows with optimized defaults
When to Use Unsloth
Use Unsloth (references/unsloth.md) instead of standard TRL when:
- Limited GPU memory – Unsloth uses ~60% less VRAM
- Speed matters – Unsloth is ~2x faster
- Training large models (>13B) – memory efficiency is critical
- Training Vision-Language Models (VLMs) – Unsloth has FastVisionModel support
See references/unsloth.md for complete Unsloth documentation and scripts/unsloth_sft_example.py for a production-ready training script.
Key Directives
When assisting with training jobs:
- ALWAYS use hf_jobs() MCP tool – Submit jobs using hf_jobs("uv", {...}). If hf_jobs() is unavailable, do not switch to an HF CLI fallback; guide the user to install/authenticate the Hugging Face MCP server first.
- Run auth and permissions preflight before expensive jobs – Check authentication and permissions first. Verify token scopes include Jobs write permission to avoid 403 failures after script preparation.
- Always include Trackio – Every training script should include Trackio for real-time monitoring. Use example scripts in scripts/ as templates.
- Provide job details after submission – After submitting, provide job ID, monitoring URL, estimated time, and note that the user can request status checks later.
- Use example scripts as templates – Reference scripts/train_sft_example.py, scripts/train_dpo_example.py, etc. as starting points.
Local Script Dependencies
To run scripts locally (like estimate_cost.py), install dependencies:
pip install -r requirements.txt
Prerequisites Checklist
Before starting any training job, verify:
✅ Account & Authentication
- Hugging Face Account with Pro, Team, or Enterprise plan (Jobs require a paid plan)
- Authenticated login: Check with hf_whoami()
- HF_TOKEN for Hub Push ⚠️ CRITICAL – Training environment is ephemeral; must push to Hub or ALL training results are lost
- Token must have write permissions
- MUST pass secrets={"HF_TOKEN": "$HF_TOKEN"} in the job config to make the token available (the $HF_TOKEN syntax references your actual token value)
- Before long jobs, run a quick MCP probe to verify job.write permission:
hf_whoami()
hf_jobs("uv", {"script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py", "script_args": ["--dataset", "trl-lib/Capybara", "--split", "train"], "flavor": "cpu-basic", "timeout": "15m"})
- If you get 403 ... missing permissions: job.write, rotate/re-login with a token that includes Jobs write permission
✅ MCP Server Availability
- This skill requires the Hugging Face MCP server for job submission.
- If hf_jobs() is missing, install and authenticate the official Hugging Face MCP server, then retry.
- Claude Code install (OAuth login flow): claude mcp add hf-mcp-server -t http https://huggingface.co/mcp?login – then restart claude and complete authentication in the prompted flow.
- Claude Code install (token header): claude mcp add hf-mcp-server -t http https://huggingface.co/mcp -H "Authorization: Bearer <YOUR_HF_TOKEN>"
- Verify connectivity after setup: hf_whoami() and hf_jobs("ps")
✅ Dataset Requirements
- Dataset must exist on the Hub or be loadable via datasets.load_dataset()
- Format must match the training method (SFT: "messages"/text/prompt-completion; DPO: chosen/rejected; GRPO: prompt-only) – see the record examples below
- ALWAYS validate unknown datasets before GPU training to prevent format failures (see Dataset Validation section below)
- Size appropriate for hardware (Demo: 50-100 examples on t4-small; Production: 1K-10K+ on a10g-large/a100-large)
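For reference, minimal single-record shapes for each method (values illustrative; field names follow TRL's standard dataset formats):
# SFT – conversational format
{"messages": [{"role": "user", "content": "What is 2+2?"},
              {"role": "assistant", "content": "4"}]}
# DPO – preference format
{"prompt": "What is 2+2?", "chosen": "4", "rejected": "5"}
# GRPO – prompt-only format
{"prompt": "What is 2+2?"}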
⚠️ Critical Settings
- Timeout must exceed expected end-to-end job time – Default 30min is TOO SHORT for most training. Include setup overhead (dependency install, dataset loading/tokenization, checkpoint upload, final Hub push, Trackio sync). For first runs on a new dataset/hardware combo, use a larger buffer.
- Hub push must be enabled – Config: push_to_hub=True, hub_model_id="username/model-name"; Job: secrets={"HF_TOKEN": "$HF_TOKEN"}
Asynchronous Job Guidelines
⚠️ IMPORTANT: Training jobs run asynchronously and can take hours
Action Required
When user requests training:
- Run preflight checks first (auth + permissions probe)
- Create the training script with Trackio included (use scripts/train_sft_example.py as a template)
- Submit immediately using hf_jobs("uv", {...}) with inline script content
- Report submission with job ID, monitoring URL, and estimated time
- Wait for the user to request status checks – don't poll automatically
Ground Rules
- Jobs run in background – Submission returns immediately; training continues independently
- Initial logs delayed – Can take 30-60 seconds for logs to appear
- User checks status – Wait for user to request status updates
- Avoid polling – Check logs only on user request; provide monitoring links instead
After Submission
Provide to user:
- ✅ Job ID and monitoring URL
- ✅ Expected completion time
- ✅ Trackio dashboard URL
- ✅ Note that user can request status checks later
Example Response:
✅ Job submitted successfully!
Job ID: abc123xyz
Monitor: https://huggingface.co/jobs/username/abc123xyz
Expected time: ~2 hours
Estimated cost: ~$10
The job is running in the background. Ask me to check status/logs when ready!
Quick Start: Three Approaches
💡 Tip for Demos: For quick demos on smaller GPUs (t4-small), omit eval_dataset and eval_strategy to save ~40% memory. You’ll still see training loss and learning progress.
Sequence Length Configuration
TRL config classes use max_length (not max_seq_length) to control tokenized sequence length:
# ✅ CORRECT - If you need to set sequence length
SFTConfig(max_length=512) # Truncate sequences to 512 tokens
DPOConfig(max_length=2048) # Longer context (2048 tokens)
# ❌ WRONG - This parameter doesn't exist
SFTConfig(max_seq_length=512) # TypeError!
Default behavior: max_length=1024 (truncates from right). This works well for most training.
When to override:
- Longer context: Set higher (e.g., max_length=2048)
- Memory constraints: Set lower (e.g., max_length=512)
- Vision models: Set max_length=None (prevents cutting image tokens)
Usually you don’t need to set this parameter at all – the examples below use the sensible default.
Approach 1: UV Scripts (Recommended – Default Choice)
UV scripts use PEP 723 inline dependencies for clean, self-contained training. This is the primary approach for Claude Code.
hf_jobs("uv", {
"script": """
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0", "trackio"]
# ///
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
import trackio
dataset = load_dataset("trl-lib/Capybara", split="train")
# Create train/eval split for monitoring
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
trainer = SFTTrainer(
model="Qwen/Qwen2.5-0.5B",
train_dataset=dataset_split["train"],
eval_dataset=dataset_split["test"],
peft_config=LoraConfig(r=16, lora_alpha=32),
args=SFTConfig(
output_dir="my-model",
push_to_hub=True,
hub_model_id="username/my-model",
num_train_epochs=3,
eval_strategy="steps",
eval_steps=50,
report_to="trackio",
project="meaningful_prject_name", # project name for the training name (trackio)
run_name="meaningful_run_name", # descriptive name for the specific training run (trackio)
)
)
trainer.train()
trainer.push_to_hub()
""",
"flavor": "a10g-large",
"timeout": "2h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
Benefits: Direct MCP tool usage, clean code, dependencies declared inline (PEP 723), no file saving required, full control
When to use: Default choice for all training tasks in Claude Code, custom training logic, any scenario requiring hf_jobs()
Working with Scripts
⚠️ Important: The script parameter accepts either inline code (as shown above) OR a URL. Local file paths do NOT work.
Why local paths don’t work: Jobs run in isolated Docker containers without access to your local filesystem. Scripts must be:
- Inline code (recommended for custom training)
- Publicly accessible URLs
- Private repo URLs (with HF_TOKEN)
Common mistakes:
# ❌ These will all fail
hf_jobs("uv", {"script": "train.py"})
hf_jobs("uv", {"script": "./scripts/train.py"})
hf_jobs("uv", {"script": "/path/to/train.py"})
Correct approaches:
# ✅ Inline code (recommended)
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<your code>"})
# ✅ From Hugging Face Hub
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/train.py"})
# ✅ From GitHub
hf_jobs("uv", {"script": "https://raw.githubusercontent.com/user/repo/main/train.py"})
# ✅ From Gist
hf_jobs("uv", {"script": "https://gist.githubusercontent.com/user/id/raw/train.py"})
To use local scripts: Upload to HF Hub first:
huggingface-cli repo create my-training-scripts --type model
huggingface-cli upload my-training-scripts ./train.py train.py
# Use: https://huggingface.co/USERNAME/my-training-scripts/resolve/main/train.py
Approach 2: TRL Maintained Scripts (Official Examples)
TRL provides battle-tested scripts for all methods. Can be run from URLs:
hf_jobs("uv", {
"script": "https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py",
"script_args": [
"--model_name_or_path", "Qwen/Qwen2.5-0.5B",
"--dataset_name", "trl-lib/Capybara",
"--output_dir", "my-model",
"--push_to_hub",
"--hub_model_id", "username/my-model"
],
"flavor": "a10g-large",
"timeout": "2h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
Benefits: No code to write, maintained by the TRL team, production-tested
When to use: Standard TRL training, quick experiments, no custom code needed
Available: Scripts are available from https://github.com/huggingface/trl/tree/main/examples/scripts
Finding More UV Scripts on Hub
The uv-scripts organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:
# Discover available UV script collections
dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})
# Explore a specific collection
hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
Popular collections: ocr, classification, synthetic-data, vllm, dataset-creation
Approach 3: TRL Jobs Package (Simplified Training)
The trl-jobs package provides optimized defaults and one-liner training.
# Install
pip install trl-jobs
# Train with SFT (simplest possible)
trl-jobs sft \
--model_name Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/Capybara
Benefits: Pre-configured settings, automatic Trackio integration, automatic Hub push, one-line commands
When to use: User working directly in a terminal (not Claude Code context), quick local experimentation
Repository: https://github.com/huggingface/trl-jobs
⚠️ In Claude Code context, use the hf_jobs() MCP tool (Approach 1). If unavailable, install/authenticate the Hugging Face MCP server first.
Hardware Selection
| Model Size | Recommended Hardware | Cost (approx/hr) | Use Case |
|---|---|---|---|
| <1B params | t4-small | ~$0.75 | Demos, quick tests only (without eval steps) |
| 1-3B params | t4-medium, l4x1 | ~$1.50-2.50 | Development |
| 3-7B params | a10g-small, a10g-large | ~$3.50-5.00 | Production training |
| 7-13B params | a10g-large, a100-large | ~$5-10 | Large models (use LoRA) |
| 13B+ params | a100-large, a10g-largex2 | ~$10-20 | Very large (use LoRA) |
GPU Flavors: cpu-basic/upgrade/performance/xl, t4-small/medium, l4x1/x4, a10g-small/large/largex2/largex4, a100-large, h100/h100x8
Guidelines:
- Use LoRA/PEFT for models >7B to reduce memory (see the sketch below)
- Multi-GPU automatically handled by TRL/Accelerate
- Start with smaller hardware for testing
See: references/hardware_guide.md for detailed specifications
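As a sketch, a typical LoRA configuration for a 7B+ model (rank, alpha, and target modules are illustrative and depend on the architecture):
from peft import LoraConfig
# Adapters on attention + MLP projections keep the trainable parameter count small
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
# Pass as peft_config=peft_config to SFTTrainer / DPOTrainer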
Critical: Saving Results to Hub
⚠️ EPHEMERAL ENVIRONMENT – MUST PUSH TO HUB
The Jobs environment is temporary. All files are deleted when the job ends. If the model isn’t pushed to Hub, ALL TRAINING IS LOST.
Required Configuration
In training script/config:
SFTConfig(
push_to_hub=True,
hub_model_id="username/model-name", # MUST specify
hub_strategy="every_save", # Optional: push checkpoints
)
In job submission:
{
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # Enables authentication
}
Verification Checklist
Before submitting:
- push_to_hub=True set in config
- hub_model_id includes username/repo-name
- secrets parameter includes HF_TOKEN
- User has write access to target repo
See: references/hub_saving.md for detailed troubleshooting
Timeout Management
⚠️ DEFAULT: 30 MINUTES – TOO SHORT FOR TRAINING
Setting Timeouts
{
"timeout": "2h" # 2 hours (formats: "90m", "2h", "1.5h", or seconds as integer)
}
Timeout Guidelines
| Scenario | Recommended | Notes |
|---|---|---|
| Quick demo (50-100 examples) | 30-60 min | Include dependency install and setup |
| Development training | 2-3 hours | Small datasets with preprocessing |
| Budget run on t4-small (5k-10k examples) | 3-4 hours | Include tokenization + upload overhead |
| Production (3-7B model) | 4-6 hours | Full datasets |
| Large model with LoRA | 3-6 hours | Depends on dataset |
Use at least a 50% buffer for first runs on a new dataset/hardware configuration. After measuring real runtime for that setup, reduce to a 30% buffer if stable.
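Worked example (numbers illustrative): a run measured at ~80 minutes gets a first-run timeout of about 80 × 1.5 = 120 minutes ("timeout": "2h"); once that setup has proven stable, 80 × 1.3 ≈ 105 minutes ("timeout": "105m") is enough.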
On timeout: Job killed immediately, all unsaved progress lost, must restart from beginning
Cost Estimation
Offer to estimate cost when planning jobs with known parameters. Use scripts/estimate_cost.py:
uv run scripts/estimate_cost.py \
--model meta-llama/Llama-2-7b-hf \
--dataset trl-lib/Capybara \
--hardware a10g-large \
--dataset-size 16000 \
--epochs 3
Output includes estimated time, cost, recommended timeout (with buffer), and optimization suggestions.
When to offer: User planning a job, asks about cost/time, choosing hardware, job will run >1 hour or cost >$5
Example Training Scripts
Production-ready templates with all best practices:
Use these scripts as starting points:
- scripts/train_sft_example.py – Complete SFT training with Trackio, LoRA, checkpoints
- scripts/train_dpo_example.py – DPO training for preference learning
- scripts/train_grpo_example.py – GRPO training for online RL
These scripts demonstrate proper Hub saving, Trackio integration, checkpoint management, and optimized parameters. Pass their content inline to hf_jobs() or use as templates for custom scripts.
Monitoring and Tracking
Trackio provides real-time metrics visualization. See references/trackio_guide.md for complete setup guide.
Key points:
- Add trackio to dependencies
- Initialize Trackio explicitly with trackio.init(project=..., name=..., space_id=...)
- Configure trainer with report_to="trackio" and run_name="meaningful_name"
- Call trackio.finish() after training/evaluation
Trackio Configuration Defaults
Use sensible defaults unless user specifies otherwise. When generating training scripts with Trackio:
Default Configuration:
- Space ID: {username}/trackio (use "trackio" as the default space name)
- Run naming: Unless otherwise specified, name the run in a way the user will recognize (e.g., descriptive of the task, model, or purpose)
- Config: Keep minimal – only include hyperparameters and model/dataset info
- Project name: Use a project name to associate runs with a particular project
Canonical setup snippet:
import trackio
trackio.init(
project="my-project",
name="my-run", # `trackio.init` uses `name`, not `run_name`
space_id="username/trackio",
config={"model": "Qwen/Qwen2.5-0.5B", "dataset": "my-dataset"},
)
# Trainer config still uses run_name:
SFTConfig(report_to="trackio", project="my-project", run_name="my-run")
User overrides: If user requests specific trackio configuration (custom space, run naming, grouping, or additional config), apply their preferences instead of defaults.
This is useful for managing multiple jobs with the same configuration or keeping training scripts portable.
See references/trackio_guide.md for complete documentation including grouping runs for experiments.
Check Job Status
# List all jobs
hf_jobs("ps")
# Inspect specific job
hf_jobs("inspect", {"job_id": "your-job-id"})
# View logs
hf_jobs("logs", {"job_id": "your-job-id"})
Remember: Wait for user to request status checks. Avoid polling repeatedly.
Evaluation Jobs (Held-Out)
Use scripts/eval_sft_adapter_example.py to evaluate a fine-tuned adapter on a held-out split and report eval_loss and perplexity.
Important: SFTTrainer expects train_dataset during initialization. For eval-only workflows, pass the same held-out dataset to both train_dataset and eval_dataset, then call trainer.evaluate().
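A minimal sketch of that pattern (repo names are placeholders; assumes the adapter and tokenizer were pushed to the Hub):
import math
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM  # loads base model + adapter together
from transformers import AutoTokenizer
from trl import SFTTrainer, SFTConfig
eval_ds = load_dataset("username/my-dataset", split="test")
model = AutoPeftModelForCausalLM.from_pretrained("username/my-finetuned-adapter")
tokenizer = AutoTokenizer.from_pretrained("username/my-finetuned-adapter")
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=eval_ds,   # required by SFTTrainer even for eval-only runs
    eval_dataset=eval_ds,
    args=SFTConfig(output_dir="eval-only", per_device_eval_batch_size=4, report_to="none"),
)
metrics = trainer.evaluate()
print(metrics["eval_loss"], math.exp(metrics["eval_loss"]))  # eval loss and perplexity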
Dataset Validation
Validate dataset format BEFORE launching GPU training to prevent the #1 cause of training failures: format mismatches.
Why Validate
- 50%+ of training failures are due to dataset format issues
- DPO especially strict: requires exact column names (prompt, chosen, rejected)
- Failed GPU jobs waste $1-10 and 30-60 minutes
- Validation on CPU costs ~$0.01 and takes <1 minute
When to Validate
ALWAYS validate for:
- Unknown or custom datasets
- DPO training (CRITICAL – 90% of datasets need mapping)
- Any dataset not explicitly TRL-compatible
Skip validation for known TRL datasets:
trl-lib/ultrachat_200k, trl-lib/Capybara, HuggingFaceH4/ultrachat_200k, etc.
Usage
hf_jobs("uv", {
"script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
"script_args": ["--dataset", "username/dataset-name", "--split", "train"]
})
The script is fast and will usually complete synchronously.
Reading Results
The output shows compatibility for each training method:
- ✅ READY – Dataset is compatible, use directly
- ⚠️ NEEDS MAPPING – Compatible but needs preprocessing (mapping code provided)
- ❌ INCOMPATIBLE – Cannot be used for this method
When mapping is needed, the output includes a “MAPPING CODE” section with copy-paste ready Python code.
Example Workflow
# 1. Inspect dataset (costs ~$0.01, <1 min on CPU)
hf_jobs("uv", {
"script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
"script_args": ["--dataset", "argilla/distilabel-math-preference-dpo", "--split", "train"]
})
# 2. Check output markers:
# ✅ READY → proceed with training
# ⚠️ NEEDS MAPPING → apply mapping code below
# ❌ INCOMPATIBLE → choose different method/dataset
# 3. If mapping needed, apply before training:
def format_for_dpo(example):
return {
'prompt': example['instruction'],
'chosen': example['chosen_response'],
'rejected': example['rejected_response'],
}
dataset = dataset.map(format_for_dpo, remove_columns=dataset.column_names)
# 4. Launch training job with confidence
Common Scenario: DPO Format Mismatch
Most DPO datasets use non-standard column names. Example:
Dataset has: instruction, chosen_response, rejected_response
DPO expects: prompt, chosen, rejected
The validator detects this and provides exact mapping code to fix it.
Converting Models to GGUF
After training, convert models to GGUF format for use with llama.cpp, Ollama, LM Studio, and other local inference tools.
What is GGUF:
- Optimized for CPU/GPU inference with llama.cpp
- Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
- Compatible with Ollama, LM Studio, Jan, GPT4All, llama.cpp
- Typically 2-8GB for 7B models (vs 14GB unquantized)
When to convert:
- Running models locally with Ollama or LM Studio
- Reducing model size with quantization
- Deploying to edge devices
- Sharing models for local-first use
See: references/gguf_conversion.md for complete conversion guide, including production-ready conversion script, quantization options, hardware requirements, usage examples, and troubleshooting.
Quick conversion:
hf_jobs("uv", {
"script": "<see references/gguf_conversion.md for complete script>",
"flavor": "a10g-large",
"timeout": "45m",
"secrets": {"HF_TOKEN": "$HF_TOKEN"},
"env": {
"ADAPTER_MODEL": "username/my-finetuned-model",
"BASE_MODEL": "Qwen/Qwen2.5-0.5B",
"OUTPUT_REPO": "username/my-model-gguf"
}
})
Common Training Patterns
See references/training_patterns.md for detailed examples including:
- Quick demo (5-10 minutes)
- Production with checkpoints
- Multi-GPU training
- DPO training (preference learning) – see the sketch after this list
- GRPO training (online RL)
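As one illustration, a minimal DPO job sketch (dataset, model, and username are placeholders and assume TRL's string-model shortcut; scripts/train_dpo_example.py is the full template):
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0", "trackio"]
# ///
from datasets import load_dataset
from peft import LoraConfig
from trl import DPOTrainer, DPOConfig
# Preference dataset already in prompt/chosen/rejected format
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32),
    args=DPOConfig(
        output_dir="my-dpo-model",
        push_to_hub=True,
        hub_model_id="username/my-dpo-model",
        num_train_epochs=1,
        report_to="trackio",
        run_name="dpo-ultrafeedback-qwen0.5b",
    ),
)
trainer.train()
trainer.push_to_hub()
""",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})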
Common Failure Modes
Out of Memory (OOM)
Fix (try in order; see the config sketch below):
- Reduce batch size: per_device_train_batch_size=1, increase gradient_accumulation_steps=8. Effective batch size is per_device_train_batch_size x gradient_accumulation_steps; for best performance keep the effective batch size close to 128.
- Enable gradient_checkpointing=True
- Upgrade hardware: t4-small → l4x1, a10g-small → a10g-large, etc.
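A minimal sketch of these memory-saving settings together (values illustrative; combine with LoRA for larger models):
from trl import SFTConfig
args = SFTConfig(
    output_dir="my-model",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size = 1 x 8 per device
    gradient_checkpointing=True,     # trades extra compute for lower activation memory
    max_length=512,                  # shorter sequences also reduce activation memory
    push_to_hub=True,
    hub_model_id="username/my-model",
)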
Dataset Misformatted
Fix:
- Validate first with dataset inspector: hf_jobs("uv", {"script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py", "script_args": ["--dataset", "name", "--split", "train"]})
- Check output for compatibility markers (✅ READY, ⚠️ NEEDS MAPPING, ❌ INCOMPATIBLE)
- Apply mapping code from inspector output if needed
Job Timeout
Fix:
- Check logs for actual runtime: hf_jobs("logs", {"job_id": "..."})
- Increase timeout with buffer: "timeout": "3h" (add 50% for the first run, 30% for repeated stable runs)
- Or reduce training: lower num_train_epochs, use a smaller dataset, set max_steps
- Save checkpoints: save_strategy="steps", save_steps=500, hub_strategy="every_save" (see the sketch below)
Note: Default 30min is insufficient for real training. Minimum 1-2 hours.
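A minimal sketch of checkpointing so progress survives a timeout (values illustrative):
from trl import SFTConfig
args = SFTConfig(
    output_dir="my-model",
    push_to_hub=True,
    hub_model_id="username/my-model",
    save_strategy="steps",
    save_steps=500,               # write a checkpoint every 500 steps
    hub_strategy="every_save",    # push each checkpoint to the Hub as it is saved
)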
Jobs Permission Errors (403)
Fix:
- Check identity: hf_whoami()
- Run a quick CPU probe job to verify permissions before the GPU job: hf_jobs("uv", {"script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py", "script_args": ["--dataset", "trl-lib/Capybara", "--split", "train"], "flavor": "cpu-basic", "timeout": "15m"})
- If the error includes missing permissions: job.write, create/update the token with Jobs write scope and re-login
- Retry submission only after the preflight succeeds
Hub Push Failures
Fix:
- Add to job: secrets={"HF_TOKEN": "$HF_TOKEN"}
- Add to config: push_to_hub=True, hub_model_id="username/model-name"
- Verify auth: mcp__huggingface__hf_whoami()
- Check token has write permissions and the repo exists (or set hub_private_repo=True)
Missing Dependencies
Fix: Add to PEP 723 header:
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0", "trackio", "missing-package"]
# ///
Troubleshooting
Common issues:
- Job times out → Increase timeout, reduce epochs/dataset, use smaller model/LoRA
- Model not saved to Hub → Check push_to_hub=True, hub_model_id, secrets=HF_TOKEN
- Out of Memory (OOM) → Reduce batch size, increase gradient accumulation, enable LoRA, use larger GPU
- Dataset format error → Validate with dataset inspector (see Dataset Validation section)
- Import/module errors → Add PEP 723 header with dependencies, verify format
- Authentication errors → Check mcp__huggingface__hf_whoami(), token permissions, secrets parameter
- Jobs permission errors (403) → Run MCP preflight (hf_whoami() + CPU probe), verify token includes job.write
See: references/troubleshooting.md for complete troubleshooting guide
Resources
References (In This Skill)
- references/training_methods.md – Overview of SFT, DPO, GRPO, KTO, PPO, Reward Modeling
- references/training_patterns.md – Common training patterns and examples
- references/unsloth.md – Unsloth for fast VLM training (~2x speed, 60% less VRAM)
- references/gguf_conversion.md – Complete GGUF conversion guide
- references/trackio_guide.md – Trackio monitoring setup
- references/hardware_guide.md – Hardware specs and selection
- references/hub_saving.md – Hub authentication troubleshooting
- references/troubleshooting.md – Common issues and solutions
Scripts (In This Skill)
- scripts/train_sft_example.py – Production SFT template
- scripts/train_dpo_example.py – Production DPO template
- scripts/train_grpo_example.py – Production GRPO template
- scripts/unsloth_sft_example.py – Unsloth text LLM training template (faster, less VRAM)
- scripts/eval_sft_adapter_example.py – Evaluate SFT adapter on held-out split (eval loss + perplexity)
- scripts/estimate_cost.py – Estimate time and cost (offer when appropriate)
- scripts/convert_to_gguf.py – Complete GGUF conversion script
External Scripts
- Dataset Inspector – Validate dataset format before training (use via hf_jobs)
External Links
- TRL Documentation
- TRL Jobs Training Guide
- TRL Jobs Package
- HF Jobs Documentation
- TRL Example Scripts
- UV Scripts Guide
- UV Scripts Organization
Key Takeaways
- Use MCP only for Jobs submission – if hf_jobs() is unavailable, install/authenticate the Hugging Face MCP server before proceeding
- Jobs are asynchronous – Don't wait/poll; let the user check when ready
- Run auth + permission preflight first – verify the token and job.write before launching GPU jobs
- Always set timeout – Default 30 min is insufficient; include setup + upload overhead and use a larger first-run buffer
- Always enable Hub push – Environment is ephemeral; without push, all results are lost
- Include Trackio – Call trackio.init(..., name=..., space_id=...), configure the trainer with report_to="trackio", and finish with trackio.finish()
- Offer cost estimation – When parameters are known, use scripts/estimate_cost.py
- Use UV scripts (Approach 1) – Default to hf_jobs("uv", {...}) with inline scripts; use TRL maintained scripts for standard training; avoid bash trl-jobs commands in Claude Code
- Use hf_doc_fetch/hf_doc_search for the latest TRL documentation
- Validate dataset format before training with dataset inspector (see Dataset Validation section)
- Choose appropriate hardware for model size; use LoRA for models >7B