ai-error-analysis-and-eval-design

Total installs: 1
Weekly installs: 1
Site rank: #51765

Install command:
npx skills add https://github.com/samarv/shanon --skill ai-error-analysis-and-eval-design

Agent install distribution: amp (1), opencode (1), kimi-cli (1), codex (1), github-copilot (1), claude-code (1)

Skill Documentation
To build great AI products, you must transition from subjective “vibe checks” to systematic measurement. This process identifies exactly where an LLM is failing and creates a feedback loop for continuous improvement.
Phase 1: Open Coding (The “Benevolent Dictator” Phase)
Before automating, you must manually ground yourself in the data. Appoint one “Benevolent Dictator” (typically the Product Manager or domain expert) to define what “good” looks like.
- Sample the Data: Extract 50–100 “traces” (logs of full LLM interactions) from your observability tool (e.g., Braintrust, LangSmith, Phoenix).
- Note the Upstream Error: Read each trace. If something is wrong, write a brief, informal note (an “Open Code”) describing the first thing that went wrong.
- Rule: Don’t overthink it. Use specific language (e.g., “hallucinated virtual tour,” “didn’t confirm call transfer”) rather than just “bad.”
- Stop at Saturation: Continue until you stop learning new ways the system fails (Theoretical Saturation).
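The sampling and note-taking workflow above can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical JSONL export where each trace has a `trace_id` field; your observability tool's export format will differ.

```python
import csv
import json
import random

def sample_traces(log_path: str, n: int = 100, seed: int = 0) -> list[dict]:
    """Randomly sample n traces from a JSONL export of LLM interactions."""
    with open(log_path) as f:
        traces = [json.loads(line) for line in f]
    random.Random(seed).shuffle(traces)  # fixed seed so the sample is reproducible
    return traces[:n]

def save_open_codes(rows: list[dict], out_path: str) -> None:
    """Write (trace_id, open_code) pairs to a CSV, ready for Phase 2."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["trace_id", "open_code"])
        writer.writeheader()
        writer.writerows(rows)
```

The open codes themselves stay manual: you read each sampled trace and type the note yourself.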
Phase 2: Axial Coding (Categorization)
Synthesize your mess of notes into actionable categories using an LLM.
- Export Notes: Put your open codes into a CSV or spreadsheet.
- Synthesize Failure Modes: Use an LLM (Claude or ChatGPT) to group your notes into 5–7 “Axial Codes” (failure categories).
- Prompt Pattern: “Analyze these manual notes from AI traces and group them into actionable failure categories (Axial Codes). Each category should represent a specific product problem.”
- Map Back: Use a spreadsheet formula or LLM to categorize every trace into one of these buckets.
- Prioritize: Create a pivot table to count the frequency of each category. Focus your engineering efforts on the highest-frequency or highest-risk buckets.
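The prioritization step is a simple frequency count. Here is a sketch equivalent to the pivot table, assuming each labeled trace is a dict with an `axial_code` field (a name chosen here for illustration):

```python
from collections import Counter

def failure_frequencies(labeled_traces: list[dict]) -> list[tuple[str, int]]:
    """Count how often each axial code appears, most frequent first."""
    counts = Counter(row["axial_code"] for row in labeled_traces)
    return counts.most_common()
```

The top entries of the result are your highest-frequency buckets; weight them by risk before committing engineering effort.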
Phase 3: Build the “LLM as Judge”
For complex, subjective failures (like “human handoff quality”), create an automated evaluator.
- Write the Judge Prompt: Create a separate prompt for an LLM whose only job is to evaluate one specific failure mode.
- Enforce Binary Scoring: Require the judge to output only True or False.
- Note: Avoid 1–5 or 1–10 scales. They result in “weasel” metrics (e.g., a score of 3.7) that provide no clear direction for improvement.
- Define Rules: Include specific criteria from your “Benevolent Dictator” notes.
- Example: “Output True if the user explicitly asked for a human and the assistant responded with a tool call without acknowledging the request.”
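A judge like this can be a small function around any LLM completion call. The sketch below is model-agnostic: `call_llm` is a hypothetical stand-in for whatever prompt-to-string function your stack provides, and the prompt text paraphrases the example rule above.

```python
JUDGE_PROMPT = """You are evaluating one specific failure mode: unacknowledged human-handoff requests.
Rule: Output True if the user explicitly asked for a human and the assistant responded
with a tool call without acknowledging the request. Otherwise output False.
Respond with exactly one word: True or False.

Transcript:
{transcript}
"""

def judge_handoff(transcript: str, call_llm) -> bool:
    """Run the binary judge; call_llm is any completion function (prompt -> str)."""
    raw = call_llm(JUDGE_PROMPT.format(transcript=transcript)).strip()
    if raw not in ("True", "False"):
        # Enforce binary scoring: reject anything that isn't a clean True/False.
        raise ValueError(f"Judge returned non-binary output: {raw!r}")
    return raw == "True"
```

Raising on non-binary output (rather than guessing) keeps the metric honest: a judge that drifts into prose fails loudly instead of polluting your counts.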
Phase 4: Alignment & Validation
Never ship an eval until you know the judge matches human judgment.
- Create an Agreement Matrix: Compare the Judge’s True/False labels against your manual labels from Phase 1.
- Review Mismatches: Specifically look at:
- False Positives: Judge said error, Human said no error.
- False Negatives: Human said error, Judge said no error.
- Iterate: Refine the Judge’s prompt until it aligns with the “Benevolent Dictator” at least 80–90% of the time.
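The agreement matrix is a 2x2 confusion matrix over the two sets of binary labels. A minimal sketch:

```python
def agreement_matrix(human: list[bool], judge: list[bool]) -> dict:
    """Compare judge True/False labels against human labels from Phase 1."""
    cells = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for h, j in zip(human, judge):
        if h and j:
            cells["tp"] += 1
        elif not h and not j:
            cells["tn"] += 1
        elif j and not h:
            cells["fp"] += 1  # judge said error, human said no error
        else:
            cells["fn"] += 1  # human said error, judge missed it
    cells["agreement"] = (cells["tp"] + cells["tn"]) / len(human)
    return cells
```

Review the `fp` and `fn` traces by hand, tighten the judge prompt, and re-run until `agreement` clears your 80–90% bar.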
Examples
Example 1: Real Estate AI Assistant
- Context: AI is supposed to book apartment tours.
- Open Code: “AI told the user a virtual tour was available when the property only offers in-person tours.”
- Axial Code: “Capability Misrepresentation.”
- Judge Logic: “Check the ‘Property Context’ tool output. If ‘virtual_tour’ is False, but the LLM response contains ‘virtual tour,’ output True (Error).”
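This particular judge doesn't even need an LLM: because the rule compares structured tool output against the response text, it can be a deterministic check. The field name `virtual_tour` follows the example above; your tool schema may differ.

```python
def capability_misrepresentation(tool_output: dict, llm_response: str) -> bool:
    """True (error) if the property offers no virtual tour but the reply mentions one."""
    return (
        tool_output.get("virtual_tour") is False
        and "virtual tour" in llm_response.lower()
    )
```

Prefer deterministic judges like this wherever the failure mode is checkable from structured data; reserve LLM judges for genuinely subjective criteria.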
Example 2: Customer Support Handoff
- Context: AI should hand off to a human for sensitive issues.
- Open Code: “User said they were frustrated with a leak, AI just gave a generic maintenance link.”
- Axial Code: “Handoff Protocol Violation.”
- Judge Logic: “Search for sentiment indicating frustration or emergency. If found, did the AI offer a human transfer? If no, output True (Error).”
Common Pitfalls
- Likert Scales: Using 1–5 scales makes it impossible to know if a change in score is meaningful. Use binary True/False.
- Automating Too Early: Do not let an LLM do the initial “Open Coding.” It lacks the product context to know what “janky” looks like for your specific business.
- Committee Judging: Don’t use a committee to define “good.” Appoint one person with the best domain taste to be the final arbiter (The Benevolent Dictator).
- Chasing Generic Metrics: Don’t rely on generic evals like “hallucination score” or “cosine similarity.” They rarely correlate with product-specific success.