prompt-testing

📁 fusengine/agents 📅 3 days ago
Total installs: 1
Weekly installs: 1
Site-wide rank: #42872

Install command
npx skills add https://github.com/fusengine/agents --skill prompt-testing

Agent install distribution

amp 1
opencode 1
kimi-cli 1
codex 1
gemini-cli 1

Skill documentation

Prompt Testing

A skill for testing, comparing, and measuring prompt performance.

Documentation

Testing Workflow

1. DEFINE
   └── Test objective
   └── Metrics to measure
   └── Success criteria

2. PREPARE
   └── Variants A and B
   └── Test dataset
   └── Baseline (if one exists)

3. EXECUTE
   └── Run on dataset
   └── Collect results
   └── Document observations

4. ANALYZE
   └── Calculate metrics
   └── Compare variants
   └── Identify patterns

5. DECIDE
   └── Recommendation
   └── Statistical confidence
   └── Next iterations
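The loop above maps naturally onto a small test harness. Below is a minimal sketch in Python; the TestCase/ABTest names and the run callable are illustrative assumptions, not part of the skill's interface.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class TestCase:
    id: str
    input: str
    expected: str
    type: str = "standard"

@dataclass
class ABTest:
    objective: str                        # 1. DEFINE: what is measured and why
    cases: list[TestCase]                 # 2. PREPARE: dataset shared by A and B
    results: dict = field(default_factory=dict)

    def execute(self, variants: dict[str, str], run: Callable[[str, str], str]):
        # 3. EXECUTE: run every case against every variant and collect outputs
        for name, prompt in variants.items():
            self.results[name] = [run(prompt, case.input) for case in self.cases]

    def analyze(self) -> dict[str, float]:
        # 4. ANALYZE: exact-match accuracy per variant (swap in your own scorer)
        return {
            name: sum(out == case.expected for out, case in zip(outs, self.cases)) / len(self.cases)
            for name, outs in self.results.items()
        }

Step 5 (DECIDE) can then apply the decision criteria listed further down.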

Performance Metrics

Quality

| Metric | Description | Calculation |
|--------|-------------|-------------|
| Accuracy | Correct responses | Correct / Total |
| Compliance | Format adherence | Compliant / Total |
| Consistency | Response stability | 1 - Variance |
| Relevance | Meeting the need | Average score (1-5) |
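For concreteness, the quality formulas can be computed from per-case judgments like this (a sketch; the boolean lists, normalized run scores, and 1-5 ratings are assumed inputs that you collect yourself):

from statistics import mean, pvariance

def accuracy(correct: list[bool]) -> float:
    # Correct / Total
    return sum(correct) / len(correct)

def compliance(compliant: list[bool]) -> float:
    # Compliant / Total
    return sum(compliant) / len(compliant)

def consistency(run_scores: list[float]) -> float:
    # 1 - Variance across repeated runs (scores normalized to [0, 1])
    return 1 - pvariance(run_scores)

def relevance(ratings: list[int]) -> float:
    # Average score on the 1-5 scale
    return mean(ratings)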

Efficiency

| Metric | Description | Calculation |
|--------|-------------|-------------|
| Input tokens | Prompt size | Token count |
| Output tokens | Response size | Token count |
| Latency | Response time | ms |
| Cost | Price per request | Tokens × Price |
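A sketch of the efficiency measurements, assuming separate per-token prices for input and output (pricing details vary by provider and are not defined by the skill):

import time

def request_cost(tokens_in: int, tokens_out: int,
                 price_in: float, price_out: float) -> float:
    # Tokens × Price, split into input and output prices per token
    return tokens_in * price_in + tokens_out * price_out

def timed(fn, *args, **kwargs):
    # Wrap a single model call and report latency in milliseconds
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_ms = (time.perf_counter() - start) * 1000
    return result, latency_ms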

Robustness

| Metric | Description | Calculation |
|--------|-------------|-------------|
| Edge Cases | Edge case handling | Passed / Total |
| Jailbreak Resistance | Resistance to bypass attempts | Blocked / Attempts |
| Error Recovery | Recovery after errors | Recovered / Errors |
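All three robustness metrics share the same passed-over-total shape, so a single helper covers them; the per-case result structure below is an assumption for illustration:

def pass_rate(outcomes: list[bool]) -> float:
    # Shared shape of the three robustness ratios:
    # Passed / Total, Blocked / Attempts, Recovered / Errors
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def robustness(results: list[dict]) -> dict[str, float]:
    # results: one dict per case, e.g. {"type": "edge_case", "passed": True}
    by_type: dict[str, list[bool]] = {}
    for r in results:
        by_type.setdefault(r["type"], []).append(r["passed"])
    return {case_type: pass_rate(flags) for case_type, flags in by_type.items()}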

Test Format

Test Dataset

{
  "name": "Test Dataset v1",
  "description": "Dataset for testing prompt XYZ",
  "cases": [
    {
      "id": "case_001",
      "type": "standard",
      "input": "Test input",
      "expected": "Expected output",
      "tags": ["basic", "format"]
    },
    {
      "id": "case_002",
      "type": "edge_case",
      "input": "Edge input",
      "expected": "Expected behavior",
      "tags": ["edge", "error"]
    }
  ]
}
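A small loader sketch that validates this structure before a run (field names taken from the example above; the function name is illustrative):

import json

REQUIRED_FIELDS = {"id", "type", "input", "expected"}

def load_dataset(path: str) -> list[dict]:
    # Read the dataset file and check every case has the required fields
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for case in data["cases"]:
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id', '?')} is missing {missing}")
    return data["cases"]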

Test Report

# A/B Test Report: {{TEST_NAME}}

## Configuration

| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |

## Tested Variants

### Variant A (Baseline)
[Description or link to prompt A]

### Variant B (Challenger)
[Description or link to prompt B]

## Results

### Overall Scores

| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |

### Detail by Case Type

| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |

### Problematic Cases

| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |

## Analysis

### B's Strengths
- [Improvement 1]
- [Improvement 2]

### B's Weaknesses
- [Regression 1]

### Observations
[Qualitative insights]

## Recommendation

**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A

**Confidence**: High / Medium / Low

**Justification**:
[Explanation of recommendation]

## Next Steps
1. [Action 1]
2. [Action 2]
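The {{PLACEHOLDER}} fields can be filled by plain string substitution once the metrics are computed; a sketch, with an assumed template file name and illustrative values:

def fill_report(template: str, values: dict) -> str:
    # Replace each {{KEY}} placeholder with its measured value
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", str(value))
    return template

# Example usage (file name and values are illustrative):
# report = fill_report(open("report_template.md").read(),
#                      {"TEST_NAME": "prompt A vs B", "DATASET": "tests.json", "N_CASES": 40})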

Commands

# Create a test
/prompt test create --name "Test v1" --dataset tests.json

# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json

# View results
/prompt test results --id test_001

# Compare two tests
/prompt test compare --tests test_001,test_002

Decision Criteria

When to adopt variant B?

IF:
  - Accuracy B >= Accuracy A
  AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
  AND no regression on edge cases
THEN:
  → Adopt B

ELSE IF:
  - Accuracy improvement > 10%
  AND token regression < 20%
THEN:
  → Consider B (acceptable trade-off)

ELSE:
  → Keep A or iterate
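The same rules expressed as a function (a sketch; accuracy improvements are read as percentage-point differences, which is one reading of the thresholds above):

def decide(acc_a: float, acc_b: float,
           tokens_a: float, tokens_b: float,
           edge_case_regression: bool) -> str:
    # Accuracies as fractions (0.0-1.0); improvements read as percentage points
    if (acc_b >= acc_a
            and (tokens_b <= tokens_a * 1.1 or acc_b - acc_a > 0.05)
            and not edge_case_regression):
        return "Adopt B"
    if acc_b - acc_a > 0.10 and (tokens_b - tokens_a) / tokens_a < 0.20:
        return "Consider B (acceptable trade-off)"
    return "Keep A or iterate"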

Best Practices

  1. Use at least 20 test cases for meaningful significance (see the significance sketch after this list)
  2. Include edge cases (15-20% of the dataset)
  3. Run each case multiple times to check consistency
  4. Document hypotheses before testing
  5. Version the prompts being tested
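For point 1, one way to attach a rough confidence level to an accuracy difference is a two-proportion z-test using only the standard library; this is a sketch under a normal approximation, which is exactly why the 20-case floor matters:

from math import erf, sqrt

def two_proportion_p_value(correct_a: int, correct_b: int, n_cases: int) -> float:
    # Two-sided test for a difference between variant accuracies measured on
    # the same n_cases; the approximation gets rough below ~20 cases
    p_a, p_b = correct_a / n_cases, correct_b / n_cases
    pooled = (correct_a + correct_b) / (2 * n_cases)
    se = sqrt(2 * pooled * (1 - pooled) / n_cases)
    if se == 0:
        return 1.0
    z = abs(p_a - p_b) / se
    # Normal CDF via erf: P(|Z| >= z) = 2 * (1 - Phi(z))
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))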