llm-testing
Total installs: 12
Weekly installs: 12
Overall rank: #26805
Install command
npx skills add https://github.com/yonatangross/orchestkit --skill llm-testing
Agent install distribution
- claude-code: 9
- opencode: 7
- antigravity: 6
- codex: 6
- gemini-cli: 6
- windsurf: 5
Skill documentation
LLM Testing Patterns
Test AI applications with deterministic patterns using DeepEval and RAGAS.
Quick Reference
Mock LLM Responses
```python
import pytest
from unittest.mock import AsyncMock, patch

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    # synthesize_findings and sample_findings come from the application under test.
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None
```
DeepEval Quality Testing
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_answer_quality():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
        retrieval_context=["Paris is the capital of France."],
    )
    metrics = [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.8),
    ]
    assert_test(test_case, metrics)
```
Timeout Testing
```python
import asyncio
import pytest

@pytest.mark.asyncio
async def test_respects_timeout():
    with pytest.raises(asyncio.TimeoutError):
        async with asyncio.timeout(0.1):  # asyncio.timeout requires Python 3.11+
            await slow_llm_call()
```
Quality Metrics
| Metric | Threshold | Purpose |
|---|---|---|
| Answer Relevancy | ≥ 0.7 | Response addresses question |
| Faithfulness | ≥ 0.8 | Output matches context |
| Hallucination | ≤ 0.3 | No fabricated facts |
| Context Precision | ≥ 0.7 | Retrieved contexts relevant |
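The metrics above can also be scored with RAGAS. Below is a minimal sketch assuming the classic `ragas.evaluate()` API over a Hugging Face `Dataset`; exact imports and column names vary by RAGAS version. Like the DeepEval metrics, these call an LLM judge, so they belong in evaluation runs rather than mocked unit tests.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness, context_precision

# Single-sample evaluation dataset; columns follow the classic RAGAS schema.
dataset = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["The capital of France is Paris."],
    "contexts": [["Paris is the capital of France."]],
    "ground_truth": ["Paris"],
})

result = evaluate(dataset, metrics=[answer_relevancy, faithfulness, context_precision])
scores = result.to_pandas()  # one row per sample, one column per metric
assert scores["faithfulness"].iloc[0] >= 0.8
```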
Anti-Patterns (FORBIDDEN)
```python
# ❌ NEVER test against live LLM APIs in CI
response = await openai.chat.completions.create(...)

# ❌ NEVER use random seeds (non-deterministic)
model.generate(seed=random.randint(0, 100))

# ❌ NEVER skip timeout handling
await llm_call()  # No timeout!

# ✅ ALWAYS mock LLM in unit tests
with patch("app.llm", mock_llm):
    result = await function_under_test()

# ✅ ALWAYS use VCR.py for integration tests
@pytest.mark.vcr()
async def test_llm_integration():
    ...
```
Key Decisions
| Decision | Recommendation |
|---|---|
| Mock vs VCR | VCR for integration, mock for unit |
| Timeout | Always test with < 1s timeout |
| Schema validation | Test both valid and invalid |
| Edge cases | Test all null/empty paths |
| Quality metrics | Use multiple dimensions (3-5) |
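The edge-case and schema-validation rows above are the ones most often skipped. Below is a minimal sketch of parametrized null/empty-path tests, reusing the hypothetical `synthesize_findings` function and `mock_llm` fixture from the mocking example; the asserted contract is illustrative and should be adjusted to your function's actual behavior.

```python
import pytest
from unittest.mock import patch

@pytest.mark.asyncio
@pytest.mark.parametrize(
    "findings",
    [
        [],              # empty input
        None,            # missing input
        [{"text": ""}],  # present but empty field
    ],
)
async def test_synthesize_handles_degenerate_input(findings, mock_llm):
    # Hypothetical contract: degenerate input still yields a result dict
    # with a "summary" key rather than raising.
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(findings)
        assert "summary" in result
```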
Detailed Documentation
| Resource | Description |
|---|---|
| references/deepeval-ragas-api.md | DeepEval & RAGAS API reference |
| examples/test-patterns.md | Complete test examples |
| checklists/llm-test-checklist.md | Setup and review checklists |
| scripts/llm-test-template.py | Starter test template |
Related Skills
- vcr-http-recording – Record LLM responses
- llm-evaluation – Quality assessment
- unit-testing – Test fundamentals
Capability Details
llm-response-mocking
Keywords: mock LLM, fake response, stub LLM, mock AI
Solves:
- Mock LLM responses in tests
- Create deterministic AI test fixtures
- Avoid live API calls in CI
async-timeout-testing
Keywords: timeout, async test, wait for, polling
Solves:
- Test async LLM operations
- Handle timeout scenarios
- Implement polling assertions
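The timeout example earlier covers hard deadlines; for the polling case, a minimal sketch of waiting for an async condition with a bounded retry loop (the `wait_for` helper and `job_store` fixture are illustrative):

```python
import asyncio
import pytest

async def wait_for(predicate, timeout=2.0, interval=0.05):
    """Poll a predicate until it returns truthy or the deadline passes."""
    async with asyncio.timeout(timeout):  # requires Python 3.11+
        while not predicate():
            await asyncio.sleep(interval)

@pytest.mark.asyncio
async def test_background_summary_eventually_ready(job_store):
    # job_store is a hypothetical fixture exposing the async job's state.
    await wait_for(lambda: job_store.is_done("job-42"))
    assert job_store.result("job-42")["summary"]
```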
structured-output-validation
Keywords: structured output, JSON validation, schema validation, output format
Solves:
- Validate structured LLM output
- Test JSON schema compliance
- Assert output structure
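A minimal sketch of schema validation with Pydantic v2 (`model_validate`), asserting both a valid and an invalid payload; the `FindingSummary` model is illustrative:

```python
import pytest
from pydantic import BaseModel, ValidationError

class FindingSummary(BaseModel):
    summary: str
    confidence: float

def test_valid_output_parses():
    payload = {"summary": "Two findings merged.", "confidence": 0.9}
    parsed = FindingSummary.model_validate(payload)
    assert 0.0 <= parsed.confidence <= 1.0

def test_invalid_output_rejected():
    # Missing required field and wrong type should both fail validation.
    with pytest.raises(ValidationError):
        FindingSummary.model_validate({"confidence": "high"})
```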
deepeval-assertions
Keywords: DeepEval, assert_test, LLMTestCase, metric assertion
Solves:
- Use DeepEval for LLM assertions
- Implement metric-based tests
- Configure quality thresholds
golden-dataset-testing
Keywords: golden dataset, golden test, reference output, expected output
Solves:
- Test against golden datasets
- Compare with reference outputs
- Implement regression testing
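A minimal sketch of a golden-dataset regression test: load recorded input/expected pairs from a JSON file and parametrize over them. The file path, case fields, and the `synthesize_findings`/`mock_llm` names are illustrative, reused from the examples above:

```python
import json
import pathlib
import pytest
from unittest.mock import patch

# Hypothetical golden file: a list of {"id", "input", "llm_response", "expected_summary"}.
GOLDEN = json.loads(pathlib.Path("tests/golden/synthesis.json").read_text())

@pytest.mark.asyncio
@pytest.mark.parametrize("case", GOLDEN, ids=[c["id"] for c in GOLDEN])
async def test_against_golden_dataset(case, mock_llm):
    mock_llm.return_value = case["llm_response"]  # replay the recorded LLM output
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(case["input"])
    assert result["summary"] == case["expected_summary"]
```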
vcr-recording
Keywords: VCR, cassette, record, replay, HTTP recording
Solves:
- Record LLM API responses
- Replay recordings in tests
- Create deterministic test suites
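A minimal sketch assuming the pytest-recording plugin (which provides `@pytest.mark.vcr` and the `vcr_config` fixture); the config keeps credentials out of cassettes and makes CI replay-only. The `call_real_llm` name is illustrative:

```python
import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        "filter_headers": ["authorization", "x-api-key"],  # never store credentials
        "record_mode": "none",  # CI only replays; record locally with "once"
    }

@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_llm_integration_replays_cassette():
    result = await call_real_llm("Summarize these findings")  # hypothetical client call
    assert result["summary"]
```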