prompt-testing
Install command
npx skills add https://github.com/fusengine/agents --skill prompt-testing
Skill Documentation
Prompt Testing
A skill for testing, comparing, and measuring prompt performance.
Documentation
- metrics.md – Definitions of the performance metrics
- methodology.md – A/B testing protocol
Testing Workflow
1. DEFINE
   ├── Test objective
   ├── Metrics to measure
   └── Success criteria
2. PREPARE
   ├── Variants A and B
   ├── Test dataset
   └── Baseline (if one exists)
3. EXECUTE
   ├── Run on dataset
   ├── Collect results
   └── Document observations
4. ANALYZE
   ├── Calculate metrics
   ├── Compare variants
   └── Identify patterns
5. DECIDE
   ├── Recommendation
   ├── Statistical confidence
   └── Next iterations
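A minimal sketch of how the five phases might be wired together in Python. Everything here is illustrative: `run_prompt` stands in for whatever client actually calls the model, and the skill itself does not prescribe this harness.

```python
import json

def run_ab_test(prompt_a: str, prompt_b: str, dataset_path: str, run_prompt):
    """Sketch of the DEFINE-to-DECIDE loop; run_prompt(prompt, input) -> str
    is a caller-supplied stand-in for the actual model client."""
    # PREPARE: load the cases built for this test objective
    with open(dataset_path) as f:
        cases = json.load(f)["cases"]

    # EXECUTE: run both variants on every case and keep the raw outputs
    results = [{
        "id": case["id"],
        "expected": case["expected"],
        "a": run_prompt(prompt_a, case["input"]),
        "b": run_prompt(prompt_b, case["input"]),
    } for case in cases]

    # ANALYZE: naive exact-match accuracy per variant (non-empty dataset assumed)
    total = len(results)
    acc_a = sum(r["a"] == r["expected"] for r in results) / total
    acc_b = sum(r["b"] == r["expected"] for r in results) / total

    # DECIDE happens afterwards, against the criteria defined up front
    return {"accuracy_a": acc_a, "accuracy_b": acc_b, "n_cases": total}
```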
Performance Metrics
Quality
| Metric | Description | Calculation |
|---|---|---|
| Accuracy | Correct responses | Correct / Total |
| Compliance | Format adherence | Compliant / Total |
| Consistency | Response stability | 1 − Variance |
| Relevance | Meeting the need | Average score (1-5) |
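Once each case has been graded, these quality metrics reduce to simple arithmetic. A sketch with hypothetical helper names, assuming boolean grades and per-run scores normalized to [0, 1]; the robustness ratios further down follow the same Passed / Total pattern.

```python
from statistics import pvariance

def accuracy(correct: int, total: int) -> float:
    # Correct / Total
    return correct / total

def compliance(compliant: int, total: int) -> float:
    # Compliant / Total
    return compliant / total

def consistency(run_scores: list[float]) -> float:
    # 1 − Variance, with per-run scores normalized to [0, 1]
    # so the result stays in a sensible range
    return 1 - pvariance(run_scores)
```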
Efficiency
| Metric | Description | Calculation |
|---|---|---|
| Tokens Input | Prompt size | Token count |
| Tokens Output | Response size | Token count |
| Latency | Response time | ms |
| Cost | Price per request | Tokens × Price |
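Cost is the only derived value in this table. A short sketch; splitting the rate into input and output prices is our assumption, since pricing is model-specific and caller-supplied here.

```python
def request_cost(tokens_in: int, tokens_out: int,
                 rate_in_per_1k: float, rate_out_per_1k: float) -> float:
    # Tokens × Price, with separate per-1k-token rates for input and output
    return tokens_in / 1000 * rate_in_per_1k + tokens_out / 1000 * rate_out_per_1k
```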
Robustness
| Metric | Description | Calculation |
|---|---|---|
| Edge Cases | Edge case handling | Passed / Total |
| Jailbreak Resist | Bypass resistance | Blocked / Attempts |
| Error Recovery | Recovery after failures | Recovered / Errors |
Test Format
Test Dataset
{
"name": "Test Dataset v1",
"description": "Dataset for testing prompt XYZ",
"cases": [
{
"id": "case_001",
"type": "standard",
"input": "Test input",
"expected": "Expected output",
"tags": ["basic", "format"]
},
{
"id": "case_002",
"type": "edge_case",
"input": "Edge input",
"expected": "Expected behavior",
"tags": ["edge", "error"]
}
]
}
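A small loader for this format might look like the sketch below; the `case_type` filter is our addition, useful for breaking results down by case type as the report template does.

```python
import json

def load_cases(path: str, case_type: str | None = None) -> list[dict]:
    # Read the dataset file and optionally keep only one case type,
    # e.g. "standard" or "edge_case"
    with open(path) as f:
        dataset = json.load(f)
    cases = dataset["cases"]
    if case_type is not None:
        cases = [c for c in cases if c["type"] == case_type]
    return cases
```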
Test Report
# A/B Test Report: {{TEST_NAME}}
## Configuration
| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |
## Tested Variants
### Variant A (Baseline)
[Description or link to prompt A]
### Variant B (Challenger)
[Description or link to prompt B]
## Results
### Overall Scores
| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |
### Detail by Case Type
| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |
### Problematic Cases
| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |
## Analysis
### B's Strengths
- [Improvement 1]
- [Improvement 2]
### B's Weaknesses
- [Regression 1]
### Observations
[Qualitative insights]
## Recommendation
**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A
**Confidence**: High / Medium / Low
**Justification**:
[Explanation of recommendation]
## Next Steps
1. [Action 1]
2. [Action 2]
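The `{{PLACEHOLDER}}` markers make the report easy to fill programmatically. A minimal sketch (the helper is ours, not part of the skill); markers without a supplied value are deliberately left visible so gaps stand out in the draft.

```python
def render_report(template: str, values: dict[str, str]) -> str:
    # Substitute each {{KEY}} marker; unknown markers stay in the text
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", value)
    return template

# e.g. render_report(template_text, {"TEST_NAME": "Test v1", "MODEL": "..."})
```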
Commands
# Create a test
/prompt test create --name "Test v1" --dataset tests.json
# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json
# View results
/prompt test results --id test_001
# Compare two tests
/prompt test compare --tests test_001,test_002
Decision Criteria
When to adopt variant B?
IF:
- Accuracy B >= Accuracy A
- AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
- AND no regression on edge cases
THEN:
✅ Adopt B
ELSE IF:
- Accuracy improvement > 10%
- AND token regression < 20%
THEN:
⚠️ Consider B (acceptable trade-off)
ELSE:
❌ Keep A or iterate
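The same rule, transcribed as code. Two readings here are our assumptions: accuracies are fractions in [0, 1], and "token regression" means relative growth in token usage.

```python
def decide(acc_a: float, acc_b: float, tokens_a: int, tokens_b: int,
           edge_case_regression: bool) -> str:
    # Direct transcription of the decision criteria above (tokens_a > 0 assumed)
    if (acc_b >= acc_a
            and (tokens_b <= tokens_a * 1.1 or acc_b - acc_a > 0.05)
            and not edge_case_regression):
        return "✅ Adopt B"
    if acc_b - acc_a > 0.10 and (tokens_b - tokens_a) / tokens_a < 0.20:
        return "⚠️ Consider B (acceptable trade-off)"
    return "❌ Keep A or iterate"
```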
Best Practices
- Use at least 20 test cases for statistical significance (one way to check this is sketched below)
- Include edge cases (15-20% of the dataset)
- Run each case multiple times to check consistency
- Document hypotheses before testing
- Version the prompts being tested
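To put a number behind the report's "statistical confidence" field, one option (our choice; the skill does not prescribe a test) is a two-sided two-proportion z-test on the accuracy delta.

```python
import math

def two_proportion_p_value(correct_a: int, correct_b: int, n: int) -> float:
    # Two-sided z-test for a difference between two accuracy proportions,
    # each measured on the same n cases
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # degenerate case: identical results, no evidence of a difference
    z = (p_b - p_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under N(0, 1)
```

With only 20 cases this test is underpowered for small deltas, which is why the 20-case floor above is a minimum, not a target.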