llm-testing
Total installs: 12
Weekly installs: 12
Overall rank: #26805
Install command
npx skills add https://github.com/yonatangross/orchestkit --skill llm-testing
Agent install distribution
- claude-code: 9
- opencode: 7
- antigravity: 6
- codex: 6
- gemini-cli: 6
- windsurf: 5
Skill documentation
LLM Testing Patterns
Test AI applications with deterministic patterns using DeepEval and RAGAS.
Quick Reference
Mock LLM Responses
```python
import pytest
from unittest.mock import AsyncMock, patch

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    # synthesize_findings and sample_findings come from the application under test.
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None
```
DeepEval Quality Testing
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_answer_quality():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        retrieval_context=["Paris is the capital of France."],
    )
    metrics = [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ]
    assert_test(test_case, metrics)
```
Timeout Testing
```python
import asyncio
import pytest

@pytest.mark.asyncio
async def test_respects_timeout():
    with pytest.raises(asyncio.TimeoutError):
        async with asyncio.timeout(0.1):  # asyncio.timeout requires Python 3.11+
            await slow_llm_call()
```
Quality Metrics
| Metric | Threshold | Purpose |
|---|---|---|
| Answer Relevancy | ≥ 0.7 | Response addresses question |
| Faithfulness | ≥ 0.8 | Output matches context |
| Hallucination | ≤ 0.3 | No fabricated facts |
| Context Precision | ≥ 0.7 | Retrieved contexts relevant |
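The metrics above can also be scored with RAGAS. Below is a minimal sketch assuming the classic `ragas.evaluate()` API over a Hugging Face `Dataset`; exact imports and column names vary by RAGAS version. Like the DeepEval metrics, these call an LLM judge, so they belong in evaluation runs rather than mocked unit tests.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness, context_precision

# Single-sample evaluation dataset; columns follow the classic RAGAS schema.
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris"],
})

result = evaluate(dataset, metrics=[answer_relevancy, faithfulness, context_precision])
scores = result.to_pandas()  # one row per sample, one column per metric
assert scores["faithfulness"].iloc[0] >= 0.8
```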
Anti-Patterns (FORBIDDEN)
```python
# ❌ NEVER test against live LLM APIs in CI
response = await openai.chat.completions.create(...)

# ❌ NEVER use random seeds (non-deterministic)
model.generate(seed=random.randint(0, 100))

# ❌ NEVER skip timeout handling
await llm_call()  # No timeout!

# ✅ ALWAYS mock LLM in unit tests
with patch("app.llm", mock_llm):
    result = await function_under_test()

# ✅ ALWAYS use VCR.py for integration tests
@pytest.mark.vcr()
async def test_llm_integration():
    ...
```
Key Decisions
| Decision | Recommendation |
|---|---|
| Mock vs VCR | VCR for integration, mock for unit |
| Timeout | Always test with < 1s timeout |
| Schema validation | Test both valid and invalid |
| Edge cases | Test all null/empty paths |
| Quality metrics | Use multiple dimensions (3-5) |
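The edge-case and schema-validation rows above are the ones most often skipped. Below is a minimal sketch of parametrized null/empty-path tests, reusing the hypothetical `synthesize_findings` function and `mock_llm` fixture from the mocking example; the asserted contract is illustrative and should be adjusted to your function's actual behavior.

```python
import pytest
from unittest.mock import patch

@pytest.mark.asyncio
@pytest.mark.parametrize(
    "findings",
    [
        [],              # empty input
        None,            # missing input
        [{"text": ""}],  # present but empty field
    ],
)
async def test_synthesize_handles_degenerate_input(findings, mock_llm):
    # Hypothetical contract: degenerate input still yields a result dict
    # with a "summary" key rather than raising.
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(findings)
        assert "summary" in result
```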
Detailed Documentation
| Resource | Description |
|---|---|
| references/deepeval-ragas-api.md | DeepEval & RAGAS API reference |
| examples/test-patterns.md | Complete test examples |
| checklists/llm-test-checklist.md | Setup and review checklists |
| scripts/llm-test-template.py | Starter test template |
Related Skills
- vcr-http-recording – Record LLM responses
- llm-evaluation – Quality assessment
- unit-testing – Test fundamentals
Capability Details
llm-response-mocking
Keywords: mock LLM, fake response, stub LLM, mock AI
Solves:
- Mock LLM responses in tests
- Create deterministic AI test fixtures
- Avoid live API calls in CI
async-timeout-testing
Keywords: timeout, async test, wait for, polling
Solves:
- Test async LLM operations
- Handle timeout scenarios
- Implement polling assertions
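The timeout example earlier covers hard deadlines; for the polling case, a minimal sketch of waiting for an async condition with a bounded retry loop (the `wait_for` helper and `job_store` fixture are illustrative):

```python
import asyncio
import pytest

async def wait_for(predicate, timeout=2.0, interval=0.05):
    """Poll a predicate until it returns truthy or the deadline passes."""
    async with asyncio.timeout(timeout):  # requires Python 3.11+
        while not predicate():
            await asyncio.sleep(interval)

@pytest.mark.asyncio
async def test_background_summary_eventually_ready(job_store):
    # job_store is a hypothetical fixture exposing the async job's state.
    await wait_for(lambda: job_store.is_done("job-42"))
    assert job_store.result("job-42")["summary"]
```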
structured-output-validation
Keywords: structured output, JSON validation, schema validation, output format
Solves:
- Validate structured LLM output
- Test JSON schema compliance
- Assert output structure
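A minimal sketch of schema validation with Pydantic v2 (`model_validate`), asserting both a valid and an invalid payload; the `FindingSummary` model is illustrative:

```python
import pytest
from pydantic import BaseModel, ValidationError

class FindingSummary(BaseModel):
    summary: str
    confidence: float

def test_valid_output_parses():
    payload = {"summary": "Two findings merged.", "confidence": 0.9}
    parsed = FindingSummary.model_validate(payload)
    assert 0.0 <= parsed.confidence <= 1.0

def test_invalid_output_rejected():
    # Missing required field and wrong type should both fail validation.
    with pytest.raises(ValidationError):
        FindingSummary.model_validate({"confidence": "high"})
```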
deepeval-assertions
Keywords: DeepEval, assert_test, LLMTestCase, metric assertion
Solves:
- Use DeepEval for LLM assertions
- Implement metric-based tests
- Configure quality thresholds
golden-dataset-testing
Keywords: golden dataset, golden test, reference output, expected output
Solves:
- Test against golden datasets
- Compare with reference outputs
- Implement regression testing
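A minimal sketch of a golden-dataset regression test: load recorded input/expected pairs from a JSON file and parametrize over them. The file path, case fields, and the `synthesize_findings`/`mock_llm` names are illustrative, reused from the examples above:

```python
import json
import pathlib
import pytest
from unittest.mock import patch

# Hypothetical golden file: a list of {"id", "input", "llm_response", "expected_summary"}.
GOLDEN = json.loads(pathlib.Path("tests/golden/synthesis.json").read_text())

@pytest.mark.asyncio
@pytest.mark.parametrize("case", GOLDEN, ids=[c["id"] for c in GOLDEN])
async def test_against_golden_dataset(case, mock_llm):
    mock_llm.return_value = case["llm_response"]  # replay the recorded LLM output
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(case["input"])
    assert result["summary"] == case["expected_summary"]
```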
vcr-recording
Keywords: VCR, cassette, record, replay, HTTP recording
Solves:
- Record LLM API responses
- Replay recordings in tests
- Create deterministic test suites
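A minimal sketch assuming the pytest-recording plugin (which provides `@pytest.mark.vcr` and the `vcr_config` fixture); the config keeps credentials out of cassettes and makes CI replay-only. The `call_real_llm` name is illustrative:

```python
import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        "filter_headers": ["authorization", "x-api-key"],  # never store credentials
        "record_mode": "none",  # CI only replays; record locally with "once"
    }

@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_llm_integration_replays_cassette():
    result = await call_real_llm("Summarize these findings")  # hypothetical client call
    assert result["summary"]
```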