ai-llm

📁 vasilyu1983/ai-agents-public 📅 Jan 23, 2026
27 total installs
27 weekly installs
#7454 site-wide rank
Install command
npx skills add https://github.com/vasilyu1983/ai-agents-public --skill ai-llm

Agent install distribution

claude-code 19
cursor 16
gemini-cli 15
opencode 15
github-copilot 14
codex 14

Skill documentation

LLM Development & Engineering — Complete Reference

Build, evaluate, and deploy LLM systems with modern production standards.

This skill covers the full LLM lifecycle:

  • Development: Strategy selection, dataset design, instruction tuning, PEFT/LoRA fine-tuning
  • Evaluation: Automated testing, LLM-as-judge, metrics, rollout gates
  • Deployment: Serving handoff, latency/cost budgeting, reliability patterns (see ai-llm-inference)
  • Operations: Quality monitoring, change management, incident response (see ai-mlops)
  • Safety: Threat modeling, data governance, layered mitigations (NIST AI RMF: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf)

Modern Best Practices (2026):

  • Treat the model as a component with contracts, budgets, and rollback plans (not “magic”).
  • Separate core concepts (tokenization, context, training vs adaptation) from implementation choices (providers, SDKs).
  • Gate upgrades with repeatable evals and staged rollout; avoid blind model swaps.
  • Cost-aware engineering: Measure cost per successful outcome, not just cost per token; design tiering/caching early (see the cost-per-outcome sketch after this list).
  • Security-by-design: Threat model prompt injection, data leakage, and tool abuse; treat guardrails as production code.
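
A minimal sketch of the cost-per-outcome metric from the list above; the request-log fields (`cost_usd`, `success`) are illustrative assumptions, not a standard schema.

```python
from typing import Iterable, Mapping

def cost_per_successful_outcome(requests: Iterable[Mapping]) -> float:
    """Cost per successful outcome over a batch of request logs.

    Each record is assumed (illustrative schema) to carry:
      - "cost_usd": total provider cost for the request
      - "success": whether the request produced an accepted outcome
    """
    total_cost = 0.0
    successes = 0
    for r in requests:
        total_cost += r["cost_usd"]
        successes += 1 if r["success"] else 0
    if successes == 0:
        return float("inf")  # no successful outcomes: cost per success is unbounded
    return total_cost / successes

# Example: a cheaper model with a low success rate can cost more per outcome.
cheap = [{"cost_usd": 0.002, "success": i % 7 == 0} for i in range(100)]   # ~15% success
premium = [{"cost_usd": 0.01, "success": True} for _ in range(100)]        # 100% success
print(cost_per_successful_outcome(cheap))    # ~0.013 per successful outcome
print(cost_per_successful_outcome(premium))  # 0.01 per successful outcome
```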

For detailed patterns: See Resources and Templates sections below.


Quick Reference

| Task | Tool/Framework | Command/Pattern | When to Use |
|------|----------------|-----------------|-------------|
| Choose architecture | Prompt vs RAG vs fine-tune | Start simple; add retrieval/adaptation only if needed | New products and migrations |
| Model selection | Scoring matrix | Quality/latency/cost/privacy/license weighting | Provider changes and procurement |
| Cost optimization | Tiered models + caching | Cascade routing, prompt caching, budget guardrails | Cost-sensitive production |
| Fine-tuning ROI | ROI calculator | Break-even analysis, TCO comparison | Investment decisions |
| Prompt contracts | Structured output + constraints | JSON schema, max tokens, refusal rules | Reliability and integration |
| RAG integration | Hybrid retrieval + grounding | Retrieve → rerank → pack → cite → verify | Fresh/large corpora, traceability |
| Fine-tuning | PEFT/LoRA (when justified) | Small targeted datasets + regression suite | Stable domains, repeated tasks |
| Evaluation | Offline + online | Golden sets + A/B + canary + monitoring | Prevent regressions and drift |

Decision Tree: LLM System Architecture

Building an LLM application: [Architecture Selection]
    ├─ Need current knowledge?
    │   ├─ Simple Q&A? → Basic RAG (page-level chunking + hybrid retrieval)
    │   └─ Complex retrieval? → Advanced RAG (reranking + contextual retrieval)
    │
    ├─ Need tool use / actions?
    │   ├─ Single task? → Simple agent (ReAct pattern)
    │   └─ Multi-step workflow? → Multi-agent (LangGraph, CrewAI)
    │
    ├─ Static behavior sufficient?
    │   ├─ Quick MVP? → Prompt engineering (CI/CD integrated)
    │   └─ Production quality? → Fine-tuning (PEFT/LoRA)
    │
    └─ Best results?
        └─ Hybrid (RAG + Fine-tuning + Agents) → Comprehensive solution

See Decision Matrices for detailed selection criteria.


Cost-Quality Decision Framework

LLM spend is driven by usage-based inference (tokens/requests) plus supporting infra and engineering. Model selection is a cost-quality-latency-risk tradeoff.

Model Tier Strategy

| Tier | Typical profile | Use For |
|------|-----------------|---------|
| Value | Small/fast models | High-volume, simple tasks |
| Balanced | General-purpose models | Most production workloads |
| Premium | Frontier/large models | Hardest tasks, low volume |

Cost Optimization Levers

  1. Model tiering: Route simple requests to cheaper models (often large savings at scale; see the routing sketch after this list)
  2. Prompt caching: Reuse stable prefixes/context (provider-specific discounts and constraints)
  3. Prompt optimization: Compress examples and instructions (typically meaningful token reduction)
  4. Output limits: Set appropriate max_tokens (prevents runaway costs)
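
A minimal routing sketch for the tiering lever above. The tier table, model names, and thresholds are illustrative placeholders; production routers typically combine a task classifier, confidence signals, and per-tenant budgets.

```python
# Illustrative tier table; model names and limits are placeholders, not real offerings.
TIERS = {
    "value":    {"model": "small-fast-model",  "max_output_tokens": 256},
    "balanced": {"model": "general-model",     "max_output_tokens": 1024},
    "premium":  {"model": "frontier-model",    "max_output_tokens": 4096},
}

def route(task_type: str, input_tokens: int) -> dict:
    """Pick a model tier from coarse request features (heuristic only)."""
    if task_type in {"classification", "extraction"} and input_tokens < 2_000:
        tier = "value"
    elif task_type in {"reasoning", "code-generation"} or input_tokens > 20_000:
        tier = "premium"
    else:
        tier = "balanced"
    return {"tier": tier, **TIERS[tier]}

print(route("extraction", 800))   # -> value tier
print(route("reasoning", 5_000))  # -> premium tier
```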

When to Fine-Tune (ROI-Based)

Fine-tuning pays off when:

  • Volume justifies it: >10k requests/month provides meaningful cost savings
  • Domain is stable: Requirements unchanged for >6 months
  • Data exists: >1,000 quality training examples available
  • Break-even achievable: <12 months to recover investment

See Cost Economics for TCO modeling and Fine-Tuning ROI Calculator for investment analysis.
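
A minimal break-even sketch matching the criteria above; all dollar figures are illustrative, not real provider pricing.

```python
def fine_tune_break_even_months(
    upfront_cost: float,        # data prep + training + eval engineering
    base_monthly_cost: float,   # serving the prompted/base-model approach
    ft_monthly_cost: float,     # serving the fine-tuned model (incl. hosting)
) -> float:
    """Months to recover a fine-tuning investment from monthly savings."""
    monthly_savings = base_monthly_cost - ft_monthly_cost
    if monthly_savings <= 0:
        return float("inf")  # never breaks even on cost alone
    return upfront_cost / monthly_savings

# Illustrative numbers only: $30k upfront, $8k/month today vs $4.5k/month after fine-tuning.
print(fine_tune_break_even_months(30_000, 8_000, 4_500))  # ~8.6 months (< 12-month gate)
```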


Core Concepts (Vendor-Agnostic)

  • Model classes: encoder-only, decoder-only, encoder-decoder, multimodal; choose based on task and latency.
  • Tokenization & limits: context window, max output, and prompt/template overhead drive both cost and tail latency.
  • Adaptation options: prompting → retrieval → adapters (LoRA) → full fine-tune; choose by stability and ROI (LoRA: https://arxiv.org/abs/2106.09685); see the adapter sketch after this list.
  • Evaluation: metrics must map to user value; report uncertainty and slice performance, not only global averages.
  • Governance: data retention, residency, licensing, and auditability are product requirements (EU AI Act: https://eur-lex.europa.eu/eli/reg/2024/1689/oj; NIST GenAI Profile: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf).
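
A minimal PEFT/LoRA adapter setup for the adaptation ladder above, assuming a HuggingFace causal LM. The base model name and hyperparameters are illustrative starting points, not recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-3.1-8B"  # placeholder; choose per license, latency, and task needs

model = AutoModelForCausalLM.from_pretrained(base_model)

# Low-rank adapters on attention projections; rank/alpha/dropout are starting points, not tuned values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of base parameters are trainable
```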

Implementation Practices (Tooling Examples)

  • Use a provider abstraction (gateway/router) to enable fallbacks and staged upgrades (see the sketch after this list).
  • Instrument requests with tokens, latency, and error classes (OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/).
  • Maintain prompt/model registries with versioning, changelogs, and rollback criteria.
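
A sketch of the fallback-plus-instrumentation idea above. `call_provider` and the model names are hypothetical placeholders for your SDK or gateway client, and the recorded fields only loosely mirror the OpenTelemetry GenAI attributes.

```python
import time

class ProviderError(Exception):
    """Any transport/provider failure we are willing to fall back on."""

def call_provider(model: str, prompt: str, max_tokens: int) -> dict:
    """Hypothetical provider call; replace with your SDK or gateway client."""
    raise NotImplementedError

def generate_with_fallback(prompt: str, metrics: list[dict]) -> dict:
    """Try models in preference order; record tokens, latency, and error class per attempt."""
    # Order and names are illustrative; pin exact model versions in production.
    for model in ("primary-model-v1", "fallback-small-model-v1"):
        start = time.monotonic()
        try:
            response = call_provider(model, prompt, max_tokens=512)
            metrics.append({
                "model": model,
                "latency_s": time.monotonic() - start,
                "input_tokens": response.get("input_tokens"),
                "output_tokens": response.get("output_tokens"),
                "error": None,
            })
            return response
        except ProviderError as exc:
            metrics.append({
                "model": model,
                "latency_s": time.monotonic() - start,
                "error": type(exc).__name__,
            })
    raise RuntimeError("all providers failed; serve a degraded mode (cached answer / 'unable to answer')")
```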

Do / Avoid

Do

  • Do pin model + prompt versions in production, and re-run evals before any change.
  • Do enforce budgets at the boundary: max tokens, max tools, max retries, max cost (see the sketch after this list).
  • Do plan for degraded modes (smaller model, cached answers, “unable to answer”).
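
A minimal sketch of boundary budget enforcement from the Do item above; the default limits are illustrative, not recommended values.

```python
from dataclasses import dataclass

@dataclass
class RequestBudget:
    max_output_tokens: int = 1024
    max_tool_calls: int = 5
    max_retries: int = 2
    max_cost_usd: float = 0.25

    def check(self, output_tokens: int, tool_calls: int, retries: int, cost_usd: float) -> None:
        """Raise as soon as any per-request budget is exceeded."""
        if output_tokens > self.max_output_tokens:
            raise RuntimeError("output token budget exceeded")
        if tool_calls > self.max_tool_calls:
            raise RuntimeError("tool-call budget exceeded")
        if retries > self.max_retries:
            raise RuntimeError("retry budget exceeded")
        if cost_usd > self.max_cost_usd:
            raise RuntimeError("cost budget exceeded")

budget = RequestBudget()
budget.check(output_tokens=300, tool_calls=2, retries=0, cost_usd=0.03)  # within budget
```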

Avoid

  • Avoid model sprawl (unowned variants with no eval coverage).
  • Avoid blind upgrades based on anecdotal quality; require measured impact.
  • Avoid training on production logs without consent, governance, and leakage controls.

When to Use This Skill

Claude should invoke this skill when the user asks about:

  • LLM preflight/project checklists, production best practices, or data pipelines
  • Building or deploying RAG, agentic, or prompt-based LLM apps
  • Prompt design, chain-of-thought (CoT), ReAct, or template patterns
  • Troubleshooting LLM hallucination, bias, retrieval issues, or production failures
  • Evaluating LLMs: benchmarks, multi-metric eval, or rollout/monitoring
  • LLMOps: deployment, rollback, scaling, resource optimization
  • Technology stack selection (models, vector DBs, frameworks)
  • Production deployment strategies and operational patterns

Scope Boundaries (Use These Skills for Depth)


Resources (Best Practices & Operational Patterns)

Comprehensive operational guides with checklists, patterns, and decision frameworks:

Core Operational Patterns

  • Cost Economics & Decision Frameworks – Cost modeling, unit economics, TCO analysis

    • Pricing/discount assumptions (verify against current provider docs)
    • Cost-quality tradeoff framework and decision matrix
    • Total Cost of Ownership (TCO) calculation
    • Fine-tuning ROI framework and break-even analysis
    • Prompt caching economics
    • Cost monitoring and budget guardrails
  • Project Planning Patterns – Stack selection, FTI pipeline, performance budgeting

    • AI engineering stack selection matrix
    • Feature/Training/Inference (FTI) pipeline blueprint
    • Performance budgeting and goodput gates
    • Progressive complexity (prompt → RAG → fine-tune → hybrid)
  • Production Checklists – Pre-deployment validation and operational checklists

    • LLM lifecycle checklist (modern production standards)
    • Data & training, RAG pipeline, deployment & serving
    • Safety/guardrails, evaluation, agentic systems
    • Reliability & data infrastructure (DDIA-grade)
    • Weekly production tasks
  • Common Design Patterns – Copy-paste ready implementation examples

    • Chain-of-Thought (CoT) prompting
    • ReAct (Reason + Act) pattern
    • RAG pipeline (minimal to advanced)
    • Agentic planning loop
    • Self-reflection and multi-agent collaboration
  • Decision Matrices – Quick reference tables for selection

    • RAG type decision matrix (naive → advanced → modular)
    • Production evaluation table with targets and actions
    • Model selection matrix (tier-based, vendor-agnostic)
    • Vector database, embedding model, framework selection
    • Deployment strategy matrix
  • Anti-Patterns – Common mistakes and prevention strategies

    • Data leakage, prompt dilution, RAG context overload
    • Agentic runaway, over-engineering, ignoring evaluation
    • Hard-coded prompts, missing observability
    • Detection methods and prevention code examples

Domain-Specific Patterns

Note: Each resource file includes preflight/validation checklists, copy-paste reference tables, inline templates, anti-patterns, and decision matrices.


Templates (Copy-Paste Ready)

Production templates by use case and technology:

Selection & Governance

RAG Pipelines

  • Basic RAG – Simple retrieval-augmented generation (see the sketch below)
  • Advanced RAG – Hybrid retrieval, reranking, contextual embeddings
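
A minimal retrieve → pack → generate sketch of the basic RAG shape (not the template's actual contents); `embed`, `vector_search`, and `generate` are hypothetical stand-ins for your embedding model, vector store, and LLM client.

```python
def embed(text: str) -> list[float]:
    """Hypothetical embedding call (e.g. a sentence-transformer or provider embedding API)."""
    raise NotImplementedError

def vector_search(query_vector: list[float], top_k: int) -> list[dict]:
    """Hypothetical vector-store query returning [{'text': ..., 'source': ...}, ...]."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical LLM completion call."""
    raise NotImplementedError

def basic_rag_answer(question: str, top_k: int = 5) -> str:
    # 1. Retrieve: embed the question and fetch the nearest chunks.
    chunks = vector_search(embed(question), top_k=top_k)
    # 2. Pack: build a grounded prompt with cited sources.
    context = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    prompt = (
        "Answer using only the context below. Cite sources in brackets; "
        "if the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 3. Generate.
    return generate(prompt)
```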

Prompt Engineering

Agentic Workflows

Data Pipelines

Deployment

Evaluation


Shared Utilities (Centralized patterns — extract, don’t duplicate)


Trend Awareness Protocol

IMPORTANT: For “best/latest” recommendations, verify recency using current sources (official docs/release notes/benchmarks). If you can’t browse, state assumptions and ask for timeframe + constraints.

Trigger Conditions

  • “What’s the best LLM model for [use case]?”
  • “What should I use for [RAG/fine-tuning/agents]?”
  • “What’s the latest in LLM development?”
  • “Current best practices for [prompting/evaluation/deployment]?”
  • “Is [model/framework] still relevant in 2026?”
  • “[Model A] vs [Model B]?” or “[Framework A] vs [Framework B]?”
  • “Best vector database for [use case]?”
  • “What agent framework should I use?”

Minimal Verification Checklist

  1. Confirm user constraints: latency, cost, privacy/compliance, deployment target, and toolchain.
  2. Check at least 2 authoritative sources from data/sources.json (provider docs, release notes, pricing/quotas, deprecations).
  3. Prefer stable guidance (tradeoffs + decision criteria) over “one best model/framework”.

What to Report

After searching, provide:

  • Current landscape: What models/frameworks are popular NOW (not 6 months ago)
  • Emerging trends: New models, frameworks, or techniques gaining traction
  • Deprecated/declining: Models/frameworks losing relevance or support
  • Recommendation: Based on fresh data, not just static knowledge

Example Topics (verify with fresh sources)

  • Latest frontier models (GPT-4.5, Claude 4, Gemini 2.x, Llama 4)
  • Agent frameworks (LangGraph, CrewAI, AutoGen, Semantic Kernel)
  • Vector databases (Pinecone, Qdrant, Weaviate, pgvector)
  • RAG techniques (contextual retrieval, agentic RAG, graph RAG)
  • Inference engines (vLLM, TensorRT-LLM, SGLang)
  • Evaluation frameworks (RAGAS, DeepEval, Braintrust)

Related Skills

This skill integrates with complementary Claude Code skills:

Core Dependencies

  • ai-rag – Retrieval pipelines: chunking, hybrid search, reranking, evaluation
  • ai-prompt-engineering – Systematic prompt design, evaluation, testing, and optimization
  • ai-agents – Agent architectures, tool use, multi-agent systems, autonomous workflows

Production & Operations

  • ai-llm-inference – Production serving, quantization, batching, GPU optimization
  • ai-mlops – Deployment, monitoring, incident response, security, and governance

External Resources

See data/sources.json for 50+ curated authoritative sources:

  • Official LLM platform docs – OpenAI, Anthropic, Gemini, Mistral, Azure OpenAI, AWS Bedrock
  • Open-source models and frameworks – HuggingFace Transformers, open-weight models, PEFT/LoRA, distributed training/inference stacks
  • RAG frameworks and vector DBs – LlamaIndex, LangChain 1.2+, LangGraph, LangGraph Studio v2, Haystack, Pinecone, Qdrant, Chroma
  • Agent frameworks (examples) – LangGraph, Semantic Kernel, AutoGen, CrewAI
  • RAG innovations (examples) – Graph-based retrieval, hybrid retrieval, online evaluation loops
  • Prompt engineering – Anthropic Prompt Library, Prompt Engineering Guide, CoT/ReAct patterns
  • Evaluation and monitoring – OpenAI Evals, HELM, Anthropic Evals, LangSmith, W&B, Arize Phoenix
  • Production deployment – Model gateways/routers, self-hosted serving, managed endpoints

Usage

For New Projects

  1. Start with Production Checklists – Validate all pre-deployment requirements
  2. Use Decision Matrices – Select technology stack
  3. Reference Project Planning Patterns – Design FTI pipeline
  4. Implement with Common Design Patterns – Copy-paste code examples
  5. Avoid Anti-Patterns – Learn from common mistakes

For Troubleshooting

  1. Check Anti-Patterns – Identify failure modes and mitigations
  2. Use Decision Matrices – Evaluate if architecture fits use case
  3. Reference Common Design Patterns – Verify implementation correctness

For Ongoing Operations

  1. Follow Production Checklists – Weekly operational tasks
  2. Integrate Evaluation Patterns – Continuous quality monitoring
  3. Apply LLMOps Best Practices – Deployment and rollback procedures

Navigation Summary

Quick Decisions: Decision Matrices
Pre-Deployment: Production Checklists
Planning: Project Planning Patterns
Implementation: Common Design Patterns
Troubleshooting: Anti-Patterns

Domain Depth: LLMOps | Evaluation | Prompts | Agents | RAG

Templates: assets/ – Copy-paste ready production code

Sources: data/sources.json – Authoritative documentation links