Install command: `npx skills add https://github.com/akaszubski/autonomous-dev --skill scientific-validation`
Scientific Validation Skill
Rigorous methodology for validating claims from any source – books, papers, theories, or intuition.
When This Skill Activates
- Testing claims from books, papers, or expert sources
- Validating rules, strategies, or hypotheses
- Running experiments or backtests
- Keywords: “validate”, “test hypothesis”, “experiment”, “backtest”, “prove”, “evidence”
Core Principle
Data is the arbiter. Sources can be wrong.
- Expert books can be wrong
- Only empirical validation decides what works
- Document negative results – they’re valuable
Phase Overview
| Phase | Name | Key Requirement |
|---|---|---|
| 0 | Claim Verification | Understand what source ACTUALLY claims |
| 1 | Claims Extraction | Document with source citations |
| 1.5 | Publication Bias Prevention | Document ALL claims before selecting |
| 2 | Pre-Registration | Hypothesis BEFORE seeing results |
| 2.3 | Power Analysis | Calculate required n (MANDATORY) |
| 3 | Bias Prevention | Look-ahead, survivorship, selection |
| 3.5 | Walk-Forward | Required for time series (MANDATORY) |
| 4 | Statistical Requirements | p-values, effect sizes, corrections |
| 4.7 | Bayesian Complement | Bayes Factors for ambiguous results |
| 5 | Multi-Source Validation | Test across 3+ contexts |
| 5.3 | Sensitivity Analysis | ±20% parameter stability (MANDATORY) |
| 5.5 | Adversarial Review | Invoke experiment-critic agent |
| 6 | Classification | VALIDATED / REJECTED / INSUFFICIENT |
| 7 | Documentation | Complete audit trail |
| 7.3 | Negative Results | Structured failure documentation |
See: workflow.md for detailed step-by-step instructions per phase.
Quick Reference
Claim Types
| Type | Testable? | Example |
|---|---|---|
| PERFORMANCE | YES | “A beats B on metric X” |
| METHODOLOGICAL | YES | “A enables capability X” |
| PHILOSOPHICAL | MAYBE | “X is important because Y” |
| BEHAVIORAL | HARD | “Humans do X in situation Y” |
Sample Size Requirements (80% Power)
| Effect Size | Cohen’s d | Required n |
|---|---|---|
| Small | 0.2 | 394 |
| Medium | 0.5 | 64 |
| Large | 0.8 | 26 |
See: code-examples.md#power-analysis for calculation code.
Classification Criteria
| Status | Criteria |
|---|---|
| VALIDATED | OOS meets all criteria + critic PROCEED |
| CONDITIONAL | OOS meets relaxed criteria (p < 0.10) |
| REJECTED | OOS fails OR negative effect |
| INSUFFICIENT | n < 15 in OOS |
| UNTESTABLE | Required data unavailable |
| INVALID | Circular validation detected |
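The classification table can be read as a decision function. The sketch below is illustrative only: the argument names and the strict p < 0.05 cut for VALIDATED are assumptions layered on top of the table, not part of the skill itself.

```python
def classify(n_oos, p_value, effect, critic_verdict,
             data_available=True, circular=False):
    """Map out-of-sample (OOS) results to a status per the table above.

    Assumed inputs (illustrative, not the skill's API):
    - n_oos: OOS sample size
    - p_value, effect: OOS test result and effect estimate
    - critic_verdict: experiment-critic output, e.g. "PROCEED"
    """
    if circular:
        return "INVALID"          # circular validation detected
    if not data_available:
        return "UNTESTABLE"       # required data unavailable
    if n_oos < 15:
        return "INSUFFICIENT"     # too few OOS observations
    if effect <= 0 or p_value >= 0.10:
        return "REJECTED"         # OOS fails or negative effect
    if p_value < 0.05 and critic_verdict == "PROCEED":
        return "VALIDATED"        # meets all criteria + critic sign-off
    return "CONDITIONAL"          # relaxed criteria only (p < 0.10)
```

For example, a result with n_oos = 100, p = 0.08, and a positive effect lands in CONDITIONAL rather than VALIDATED.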
Domain Effect Thresholds (Trading)
| Metric | Minimum | Strong | Exceptional |
|---|---|---|---|
| Sharpe Ratio | > 0.5 | > 1.0 | > 2.0 |
| Win Rate | > 55% | > 60% | > 70% |
| Profit Factor | > 1.2 | > 1.5 | > 2.0 |
See: code-examples.md#effect-thresholds for other domains.
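As an illustration of how the three trading metrics are computed from per-period returns, here is a minimal sketch. The function name and the annualisation factor (252 for daily data) are assumptions, not the skill's actual code.

```python
import numpy as np

def trading_metrics(returns, periods_per_year=252):
    """Compute Sharpe ratio, win rate, and profit factor from returns."""
    r = np.asarray(returns, dtype=float)
    # Annualised Sharpe: mean/std of per-period returns, scaled by sqrt(periods)
    sharpe = np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1)
    # Win rate: fraction of periods with a positive return
    win_rate = (r > 0).mean()
    # Profit factor: gross gains divided by gross losses
    gains, losses = r[r > 0].sum(), -r[r < 0].sum()
    profit_factor = gains / losses if losses > 0 else float("inf")
    return {"sharpe": sharpe, "win_rate": win_rate,
            "profit_factor": profit_factor}
```

Compare the resulting values against the threshold table above when classifying a trading claim.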
Bayes Factor Interpretation
| BF | Evidence |
|---|---|
| < 1 | Supports null |
| 1-3 | Anecdotal |
| 3-10 | Moderate |
| 10-30 | Strong |
| > 30 | Very strong |
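The interpretation table maps directly to a small helper. A sketch; treating each boundary as half-open (e.g. BF = 3 counts as "moderate") is an assumption the table itself does not pin down.

```python
def bf_evidence(bf):
    """Label a Bayes Factor (BF10) per the interpretation table above."""
    if bf < 1:
        return "supports null"
    if bf < 3:
        return "anecdotal"
    if bf < 10:
        return "moderate"
    if bf < 30:
        return "strong"
    return "very strong"
```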
Critical Rules
1. Pre-Registration
- Document hypothesis BEFORE seeing any results
- Define success criteria BEFORE testing
- No peeking at test data
2. Power Analysis (Phase 2.3)
```python
from statsmodels.stats.power import TTestIndPower

# Required n per group for a medium effect (d = 0.5) at 80% power,
# alpha = 0.05 -- solves to roughly 64 per group
n = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)
```
Rule: Underpowered studies cannot achieve VALIDATED status.
3. Walk-Forward for Time Series (Phase 3.5)
- Standard K-fold CV → INVALID (temporal leakage)
- Single train/test → CONDITIONAL at best
- Walk-forward → Can achieve VALIDATED
See: code-examples.md#walk-forward for implementation.
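A minimal walk-forward splitter might look like the sketch below (illustrative, not the referenced implementation; window sizes are parameters you would choose per dataset). Each test window lies strictly after its training window, which is what prevents temporal leakage.

```python
def walk_forward_splits(n, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) windows in temporal order.

    The training window always precedes the test window, so no
    future observations leak into model fitting.
    """
    step = step or test_size  # default: advance by one test window
    start = 0
    while start + train_size + test_size <= n:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size,
                          start + train_size + test_size))
        yield train, test
        start += step
```

For example, 10 observations with a 4-period training window and 2-period test window yield three folds, each testing strictly after its training data.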
4. Multiple Comparison Correction
```python
alpha_corrected = 0.05 / num_claims  # Bonferroni correction
```
For trading claims: require t-ratio > 3.0 (Harvey et al. standard).
5. Sensitivity Analysis (Phase 5.3)
Test ±20% parameter variation:
- All variations positive → Can achieve VALIDATED
- 1-2 sign flips → CONDITIONAL at best
- 3+ sign flips → REJECTED (fragile)
See: code-examples.md#sensitivity-analysis for implementation.
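The ±20% sweep can be sketched as follows (a hedged illustration, not the referenced implementation; the verdict strings mirror the rule above and `metric_fn` is any assumed function from parameters to a signed metric).

```python
import numpy as np

def sensitivity_check(metric_fn, base_params, variation=0.20):
    """Perturb each parameter by ±variation and count sign flips
    of the metric relative to the baseline run."""
    base = metric_fn(**base_params)
    flips = 0
    for name, value in base_params.items():
        for factor in (1 - variation, 1 + variation):
            perturbed = dict(base_params, **{name: value * factor})
            if np.sign(metric_fn(**perturbed)) != np.sign(base):
                flips += 1  # metric changed direction under perturbation
    if flips == 0:
        return flips, "can achieve VALIDATED"
    if flips <= 2:
        return flips, "CONDITIONAL at best"
    return flips, "REJECTED (fragile)"
```

A metric whose sign survives every ±20% perturbation passes; one that flips under small parameter changes is too fragile to validate.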
6. Adversarial Review (Phase 5.5)
Use Task tool:
subagent_type: "experiment-critic"
prompt: "Review experiment EXP-XXX"
MANDATORY before any classification.
Bias Prevention Checklist
| Bias | Prevention |
|---|---|
| Look-ahead | Process data sequentially, compare batch vs streaming |
| Survivorship | Track ALL attempts, not just completions |
| Selection | Report ALL experiments including failures |
| Data snooping | Strict train/test split, no tuning on test data |
| Publication | Document ALL claims before selecting which to test |
Pre-Experiment Checklist
- Claim extracted with source citation
- ALL claims documented (not just tested ones)
- Hypothesis documented BEFORE results
- Power analysis: required n calculated
- Success criteria defined
- Walk-forward configured (time series)
- Costs/constraints specified
Post-Experiment Checklist
- Sample size adequate per power analysis
- p-value AND effect size reported
- Bayesian analysis if ambiguous
- Sensitivity analysis passed
- Adversarial review completed
- Negative results documented if REJECTED
Red Flags
- 100% success rate → Possible bias
- OOS better than training → Possible leakage
- Result flips with ±20% params → Fragile
- Only tested “interesting” claims → Selection bias
Key Principles
- Hypothesis BEFORE data – No peeking
- Power analysis BEFORE experiment – Know required n
- Walk-forward for time series – Preserve temporal order
- Sensitivity analysis – Results must survive ±20% changes
- Adversarial self-critique – Challenge your methodology
- Document negative results – Failures are valuable
- Sources can be wrong – Even experts, even textbooks
Detailed Documentation
| Topic | File |
|---|---|
| Step-by-step workflow | workflow.md |
| Python code examples | code-examples.md |
| Markdown templates | templates.md |
| Adversarial review | ../../agents/experiment-critic.md |
Hard Rules
FORBIDDEN:
- Reporting results without confidence intervals or statistical significance
- Cherry-picking favorable metrics while ignoring unfavorable ones
- Claiming causation from correlation without controlled experiments
REQUIRED:
- All experiments MUST have a documented hypothesis before execution
- All results MUST include sample size, variance, and statistical test used
- Negative results MUST be reported with the same rigor as positive results
- Baselines MUST be established and compared against for every metric