ai-error-analysis-and-eval-design

Total installs: 1
Weekly installs: 1
Site rank: #51765

Install command:
npx skills add https://github.com/samarv/shanon --skill ai-error-analysis-and-eval-design

Agent install distribution: amp (1), opencode (1), kimi-cli (1), codex (1), github-copilot (1), claude-code (1)

Skill Documentation
To build great AI products, you must transition from subjective “vibe checks” to systematic measurement. This process identifies exactly where an LLM is failing and creates a feedback loop for continuous improvement.
Phase 1: Open Coding (The “Benevolent Dictator” Phase)
Before automating, you must manually ground yourself in the data. Appoint one “Benevolent Dictator” (typically the Product Manager or domain expert) to define what “good” looks like.
- Sample the Data: Extract 50–100 “traces” (logs of full LLM interactions) from your observability tool (e.g., Braintrust, LangSmith, Phoenix).
- Note the Upstream Error: Read each trace. If something is wrong, write a brief, informal note (an “Open Code”) describing the first thing that went wrong.
- Rule: Don’t overthink it. Use specific language (e.g., “hallucinated virtual tour,” “didn’t confirm call transfer”) rather than just “bad.”
- Stop at Saturation: Continue until you stop learning new ways the system fails (Theoretical Saturation).
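The sampling and note-taking workflow above can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical JSONL export where each trace has a `trace_id` field; your observability tool's export format will differ.

```python
import csv
import json
import random

def sample_traces(log_path: str, n: int = 100, seed: int = 0) -> list[dict]:
    """Randomly sample n traces from a JSONL export of LLM interactions."""
    with open(log_path) as f:
        traces = [json.loads(line) for line in f]
    random.Random(seed).shuffle(traces)  # fixed seed so the sample is reproducible
    return traces[:n]

def save_open_codes(rows: list[dict], out_path: str) -> None:
    """Write (trace_id, open_code) pairs to a CSV, ready for Phase 2."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["trace_id", "open_code"])
        writer.writeheader()
        writer.writerows(rows)
```

The open codes themselves stay manual: you read each sampled trace and type the note yourself.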
Phase 2: Axial Coding (Categorization)
Synthesize your mess of notes into actionable categories using an LLM.
- Export Notes: Put your open codes into a CSV or spreadsheet.
- Synthesize Failure Modes: Use an LLM (Claude or ChatGPT) to group your notes into 5–7 “Axial Codes” (failure categories).
- Prompt Pattern: “Analyze these manual notes from AI traces and group them into actionable failure categories (Axial Codes). Each category should represent a specific product problem.”
- Map Back: Use a spreadsheet formula or LLM to categorize every trace into one of these buckets.
- Prioritize: Create a pivot table to count the frequency of each category. Focus your engineering efforts on the highest-frequency or highest-risk buckets.
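The prioritization step is a simple frequency count. Here is a sketch equivalent to the pivot table, assuming each labeled trace is a dict with an `axial_code` field (a name chosen here for illustration):

```python
from collections import Counter

def failure_frequencies(labeled_traces: list[dict]) -> list[tuple[str, int]]:
    """Count how often each axial code appears, most frequent first."""
    counts = Counter(row["axial_code"] for row in labeled_traces)
    return counts.most_common()
```

The top entries of the result are your highest-frequency buckets; weight them by risk before committing engineering effort.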
Phase 3: Build the “LLM as Judge”
For complex, subjective failures (like “human handoff quality”), create an automated evaluator.
- Write the Judge Prompt: Create a separate prompt for an LLM whose only job is to evaluate one specific failure mode.
- Enforce Binary Scoring: Require the judge to output only True or False.
- Note: Avoid 1–5 or 1–10 scales. They result in “weasel” metrics (e.g., a score of 3.7) that provide no clear direction for improvement.
- Define Rules: Include specific criteria from your “Benevolent Dictator” notes.
- Example: “Output True if the user explicitly asked for a human and the assistant responded with a tool call without acknowledging the request.”
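A judge like this can be a small function around any LLM completion call. The sketch below is model-agnostic: `call_llm` is a hypothetical stand-in for whatever prompt-to-string function your stack provides, and the prompt text paraphrases the example rule above.

```python
JUDGE_PROMPT = """You are evaluating one specific failure mode: unacknowledged human-handoff requests.
Rule: Output True if the user explicitly asked for a human and the assistant responded
with a tool call without acknowledging the request. Otherwise output False.
Respond with exactly one word: True or False.

Transcript:
{transcript}
"""

def judge_handoff(transcript: str, call_llm) -> bool:
    """Run the binary judge; call_llm is any completion function (prompt -> str)."""
    raw = call_llm(JUDGE_PROMPT.format(transcript=transcript)).strip()
    if raw not in ("True", "False"):
        # Enforce binary scoring: reject anything that isn't a clean True/False.
        raise ValueError(f"Judge returned non-binary output: {raw!r}")
    return raw == "True"
```

Raising on non-binary output (rather than guessing) keeps the metric honest: a judge that drifts into prose fails loudly instead of polluting your counts.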
Phase 4: Alignment & Validation
Never ship an eval until you know the judge matches human judgment.
- Create an Agreement Matrix: Compare the Judge’s True/False labels against your manual labels from Phase 1.
- Review Mismatches: Specifically look at:
- False Positives: Judge said error, Human said no error.
- False Negatives: Human said error, Judge said no error.
- Iterate: Refine the Judge’s prompt until it aligns with the “Benevolent Dictator” at least 80–90% of the time.
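The agreement matrix is a 2x2 confusion matrix over the two sets of binary labels. A minimal sketch:

```python
def agreement_matrix(human: list[bool], judge: list[bool]) -> dict:
    """Compare judge True/False labels against human labels from Phase 1."""
    cells = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for h, j in zip(human, judge):
        if h and j:
            cells["tp"] += 1
        elif not h and not j:
            cells["tn"] += 1
        elif j and not h:
            cells["fp"] += 1  # judge said error, human said no error
        else:
            cells["fn"] += 1  # human said error, judge missed it
    cells["agreement"] = (cells["tp"] + cells["tn"]) / len(human)
    return cells
```

Review the `fp` and `fn` traces by hand, tighten the judge prompt, and re-run until `agreement` clears your 80–90% bar.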
Examples
Example 1: Real Estate AI Assistant
- Context: AI is supposed to book apartment tours.
- Open Code: “AI told the user a virtual tour was available when the property only offers in-person tours.”
- Axial Code: “Capability Misrepresentation.”
- Judge Logic: “Check the ‘Property Context’ tool output. If ‘virtual_tour’ is False, but the LLM response contains ‘virtual tour,’ output True (Error).”
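This particular judge doesn't even need an LLM: because the rule compares structured tool output against the response text, it can be a deterministic check. The field name `virtual_tour` follows the example above; your tool schema may differ.

```python
def capability_misrepresentation(tool_output: dict, llm_response: str) -> bool:
    """True (error) if the property offers no virtual tour but the reply mentions one."""
    return (
        tool_output.get("virtual_tour") is False
        and "virtual tour" in llm_response.lower()
    )
```

Prefer deterministic judges like this wherever the failure mode is checkable from structured data; reserve LLM judges for genuinely subjective criteria.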
Example 2: Customer Support Handoff
- Context: AI should hand off to a human for sensitive issues.
- Open Code: “User said they were frustrated with a leak, AI just gave a generic maintenance link.”
- Axial Code: “Handoff Protocol Violation.”
- Judge Logic: “Search for sentiment indicating frustration or emergency. If found, did the AI offer a human transfer? If no, output True (Error).”
Common Pitfalls
- Likert Scales: Using 1–5 scales makes it impossible to know if a change in score is meaningful. Use binary True/False.
- Automating Too Early: Do not let an LLM do the initial “Open Coding.” It lacks the product context to know what “janky” looks like for your specific business.
- Committee Judging: Don’t use a committee to define “good.” Appoint one person with the best domain taste to be the final arbiter (The Benevolent Dictator).
- Chasing Generic Metrics: Don’t rely on generic evals like “hallucination score” or “cosine similarity.” They rarely correlate with product-specific success.