ai-fine-tuning

📁 lebsral/dspy-programming-not-prompting-lms-skills 📅 5 days ago
1
总安装量
1
周安装量
#52238
全站排名
安装命令
npx skills add https://github.com/lebsral/dspy-programming-not-prompting-lms-skills --skill ai-fine-tuning

Agent 安装分布

replit 1
opencode 1
cursor 1
github-copilot 1
claude-code 1

Skill 文档

Fine-Tune Models on Your Data

Guide the user through deciding whether to fine-tune, preparing data, running fine-tuning with DSPy, distilling to cheaper models, and deploying. Fine-tuning is powerful but expensive — always confirm prerequisites first.

Should you fine-tune?

Before writing any code, walk through these questions with the user:

  1. Have you optimized prompts first? If not, use /ai-improving-accuracy — prompt optimization is 10x cheaper and often sufficient.
  2. Do you have 500+ labeled examples? Fine-tuning with less data usually overfits. Collect more data first.
  3. Is your baseline accuracy above 50%? If your prompt-optimized program is below 50%, your task definition or data has problems. Fix those first.
  4. What’s the goal — quality or cost?
    • Quality: You’ve maxed out prompt optimization and need more accuracy
    • Cost: You want a small cheap model to match an expensive one

When to fine-tune

  • You’ve already optimized prompts with MIPROv2 and hit a ceiling
  • You have 500+ labeled examples (1000+ is better)
  • Your baseline is >50% and you need to push higher
  • You want to distill an expensive model into a cheaper one (10-50x cost savings)
  • Your domain has specialized vocabulary or patterns the base model doesn’t know
  • You need faster inference (smaller fine-tuned models are faster)

When NOT to fine-tune

  • You haven’t tried prompt optimization yet — start with /ai-improving-accuracy
  • You have fewer than 500 examples — need more data? Use /ai-generating-data to bootstrap synthetic examples, or use BootstrapFewShot or MIPROv2 instead
  • Your baseline is below 50% — your data or task definition needs work
  • You’re still iterating on what the task is — fine-tuning locks you in
  • You don’t have a clear metric — you can’t evaluate fine-tuning without one
  • Your use case changes frequently — fine-tuned models don’t adapt to new instructions easily

Prerequisites checklist

Before starting, confirm:

  • Data: 500+ labeled examples (1000+ recommended), split 80/10/10 (train/dev/test)
  • Baseline: Prompt-optimized program with measured accuracy (use /ai-improving-accuracy)
  • Metric: Clear, automated metric that scores predictions
  • Compute: API access (OpenAI fine-tuning API) or local GPUs (for open-source models)
  • Budget: OpenAI fine-tuning costs ~$0.008/1K tokens for GPT-4o-mini; local needs 1+ GPU

Step 1: Prepare your data and baseline

Build a strong baseline first

Always compare fine-tuning against a prompt-optimized baseline:

import dspy

lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=lm)

# Define your program
class Classify(dspy.Signature):
    """Classify the support ticket."""
    text: str = dspy.InputField()
    category: str = dspy.OutputField()

program = dspy.ChainOfThought(Classify)

# Prepare data
import json
with open("labeled_data.json") as f:
    data = json.load(f)

examples = [dspy.Example(text=x["text"], category=x["category"]).with_inputs("text") for x in data]

# Split: 80% train, 10% dev, 10% test
n = len(examples)
trainset = examples[:int(n * 0.8)]
devset = examples[int(n * 0.8):int(n * 0.9)]
testset = examples[int(n * 0.9):]

# Measure baseline
def metric(example, prediction, trace=None):
    return prediction.category.lower() == example.category.lower()

from dspy.evaluate import Evaluate
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4, display_progress=True)
baseline_score = evaluator(program)
print(f"Baseline: {baseline_score:.1f}%")

Optimize prompts first (your comparison point)

optimizer = dspy.MIPROv2(metric=metric, auto="medium")
prompt_optimized = optimizer.compile(program, trainset=trainset)
prompt_score = evaluator(prompt_optimized)
print(f"Prompt-optimized: {prompt_score:.1f}%")

If prompt optimization gets you to your quality goal, stop here. Fine-tuning is only worth it if you need to go further.

Step 2: BootstrapFinetune (core fine-tuning)

The main fine-tuning workflow in DSPy. It bootstraps successful reasoning traces from your training data, filters them by your metric, and fine-tunes the model weights.

optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
finetuned = optimizer.compile(program, trainset=trainset)

# Evaluate the fine-tuned model
finetuned_score = evaluator(finetuned)
print(f"Baseline:         {baseline_score:.1f}%")
print(f"Prompt-optimized: {prompt_score:.1f}%")
print(f"Fine-tuned:       {finetuned_score:.1f}%")

How it works

  1. Bootstrap traces: Runs your program on each training example, keeping traces where the metric passes
  2. Filter by metric: Only successful traces become training data
  3. Fine-tune weights: Sends traces to the model provider’s fine-tuning API
  4. Return optimized program: The program now uses the fine-tuned model

Requirements

  • A fine-tunable model (OpenAI gpt-4o-mini, gpt-4o; or local open-source models)
  • 500+ training examples (more traces bootstrapped = better fine-tuning)
  • A metric that reliably identifies good outputs

Step 3: Model distillation (expensive to cheap)

Train a small, cheap model to mimic an expensive model. This is the biggest cost saver — 10-50x reduction with 85-95% quality retention.

Teacher-student pattern

# Step 1: Teacher — expensive model, high quality
teacher_lm = dspy.LM("openai/gpt-4o")
dspy.configure(lm=teacher_lm)

# Build and optimize the teacher
teacher = dspy.ChainOfThought(Classify)
optimizer = dspy.MIPROv2(metric=metric, auto="medium")
teacher_optimized = optimizer.compile(teacher, trainset=trainset)

teacher_score = evaluator(teacher_optimized)
print(f"Teacher (GPT-4o): {teacher_score:.1f}%")

# Step 2: Student — fine-tune cheap model on teacher's outputs
student_lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=student_lm)

student = dspy.ChainOfThought(Classify)
ft_optimizer = dspy.BootstrapFinetune(metric=metric, num_threads=24)
student_finetuned = ft_optimizer.compile(student, trainset=trainset, teacher=teacher_optimized)

student_score = evaluator(student_finetuned)
print(f"Student (GPT-4o-mini, fine-tuned): {student_score:.1f}%")

Typical results

Model Quality Cost per 1M tokens
GPT-4o (teacher) 85% ~$5.00
GPT-4o-mini (no tuning) 70% ~$0.15
GPT-4o-mini (fine-tuned) 81% ~$0.15

The fine-tuned student costs 33x less and retains ~95% of teacher quality.

Step 4: BetterTogether (maximum quality)

BetterTogether alternates between prompt optimization and weight optimization, getting more out of both. Based on the BetterTogether paper (arXiv 2407.10930v2), this approach yields 5-78% gains over either technique alone.

optimizer = dspy.BetterTogether(
    metric=metric,
    prompt_optimizer=dspy.MIPROv2,
    weight_optimizer=dspy.BootstrapFinetune,
)
best = optimizer.compile(program, trainset=trainset)

best_score = evaluator(best)
print(f"Prompt-only:    {prompt_score:.1f}%")
print(f"Fine-tune-only: {finetuned_score:.1f}%")
print(f"BetterTogether: {best_score:.1f}%")

How it works

  1. Round 1: Optimize prompts (instructions + few-shot examples)
  2. Round 2: Fine-tune weights using the optimized prompts
  3. Round 3: Re-optimize prompts for the fine-tuned model
  4. Each round builds on the previous, creating synergy between prompt and weight optimization

When to use BetterTogether

  • You want the absolute best quality and have the compute budget
  • Fine-tuning alone didn’t close the gap to your quality target
  • You have 500+ examples and a reliable metric

Step 5: Evaluate and deploy

Thorough evaluation

Always evaluate on the held-out test set (not dev set):

test_evaluator = Evaluate(devset=testset, metric=metric, num_threads=4, display_progress=True)

print(f"Test set results:")
print(f"  Baseline:         {test_evaluator(program):.1f}%")
print(f"  Prompt-optimized: {test_evaluator(prompt_optimized):.1f}%")
print(f"  Fine-tuned:       {test_evaluator(finetuned):.1f}%")

Save and load for production

# Save
finetuned.save("finetuned_program.json")

# Load later
from my_module import MyProgram
production = MyProgram()
production.load("finetuned_program.json")
result = production(text="New support ticket...")

When fine-tuning goes wrong

Can’t bootstrap enough traces

If the base model fails on most training examples, there aren’t enough successful traces to fine-tune on.

Fixes:

  • Use a stronger model for bootstrapping (GPT-4o instead of GPT-4o-mini)
  • Relax your metric during bootstrapping (accept partial credit)
  • Simplify your task (break multi-step into single steps)

Model overfits (high train accuracy, low test accuracy)

Fixes:

  • Add more training data
  • Reduce fine-tuning epochs (if provider allows)
  • Use a larger base model (less prone to overfitting)
  • Simplify your output format

Fine-tuning didn’t improve over prompt optimization

Fixes:

  • Check that bootstrapping produced enough successful traces (need 200+)
  • Try BetterTogether instead of BootstrapFinetune alone
  • Verify your metric actually correlates with quality
  • Try a different base model

Infrastructure choices

OpenAI API (easiest)

Works with gpt-4o-mini and gpt-4o. DSPy handles the fine-tuning API calls automatically:

lm = dspy.LM("openai/gpt-4o-mini")  # fine-tunable via API
  • Pros: No GPU needed, simple setup, fast
  • Cons: Data sent to OpenAI, ongoing per-token costs, limited model choices

Local fine-tuning (own your model)

For open-source models (Llama, Mistral, etc.) using LoRA/QLoRA:

lm = dspy.LM("together_ai/meta-llama/Llama-3-70b-chat-hf")
  • Pros: Data stays private, no per-token costs after training, full control
  • Cons: Needs GPU(s), more setup, slower iteration

Cloud GPU platforms

AWS SageMaker, Google Cloud, Lambda Labs, or Together AI for training:

  • Pros: Scalable, no hardware to manage
  • Cons: Costs vary, setup per platform

Additional resources

  • For worked examples (classification, distillation, BetterTogether), see examples.md
  • Use /ai-improving-accuracy to build a strong baseline before fine-tuning
  • Use /ai-cutting-costs for other cost reduction strategies beyond distillation
  • Use /ai-fixing-errors if fine-tuning or evaluation errors occur