q-exploratory-analysis
npx skills add https://github.com/tyrealq/q-skills --skill q-exploratory-analysis
Q-Exploratory-Analysis
Universal exploratory data analysis (EDA) for tabular datasets. Previews the dataset, interviews the user to confirm column measurement levels, and applies statistically appropriate analysis for each variable type, producing a structured TABLE/ folder and a holistic EXPLORATORY_SUMMARY.md with flagged insights.
IMPORTANT: This skill requires Bash execution. The pre-built `scripts/run_eda.py` must be copied from `${SKILL_DIR}/scripts/` – do NOT write a new script or inline Python. Run it (Phases 0-6) to produce the CSVs and the Excel report, then write `EXPLORATORY_SUMMARY.md` yourself (Post-Script step) by reading the CSVs, consulting `${SKILL_DIR}/references/summary_template.md`, and using the Write tool directly.

Plan Mode Guard: if in plan mode, write a brief plan – "Run q-exploratory-analysis skill: interview user for context and column types, deploy and execute run_eda.py, write EXPLORATORY_SUMMARY.md from generated CSVs." – then call ExitPlanMode immediately. Do NOT attempt Stages A/B, script deployment, or any analysis while plan mode is active.
0. Dependencies
pandas
numpy
scipy
openpyxl # required for .xlsx input and Phase 6 Excel report
Install: pip install pandas numpy scipy openpyxl
0b. Script Deployment
Agent execution instructions:
- Determine this SKILL.md file's directory path as `SKILL_DIR`.
- Script path = `${SKILL_DIR}/scripts/run_eda.py`.
- Reference path = `${SKILL_DIR}/references/summary_template.md`.
The skill ships a pre-built scripts/run_eda.py. Do NOT write a new script.
Deploy the script to the project before first run:
mkdir -p scripts
cp "${SKILL_DIR}/scripts/run_eda.py" scripts/run_eda.py
Windows equivalent:
mkdir scripts 2>nul & copy "%SKILL_DIR%\scripts\run_eda.py" scripts\run_eda.py

Or in PowerShell:

New-Item -ItemType Directory -Force scripts; Copy-Item "$env:SKILL_DIR/scripts/run_eda.py" scripts/run_eda.py
If the project already has scripts/run_eda.py, verify it matches the skill version
before running (the skill version is authoritative).
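For the verification step, a minimal Python sketch (it assumes `SKILL_DIR` has been exported as an environment variable, which this skill does not mandate – adapt the lookup to however you resolved `SKILL_DIR` above):

```python
import filecmp
import os
import shutil

# Resolve both copies; the environment-variable lookup is an assumption.
skill_script = os.path.join(os.environ["SKILL_DIR"], "scripts", "run_eda.py")
project_script = os.path.join("scripts", "run_eda.py")

# Byte-compare the files; re-copy if the project copy is missing or diverges,
# because the skill version is authoritative.
if not os.path.exists(project_script) or not filecmp.cmp(
    skill_script, project_script, shallow=False
):
    os.makedirs("scripts", exist_ok=True)
    shutil.copy(skill_script, project_script)
```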
1. Requirements Gathering (Interview)
Before invoking the script, conduct a two-stage interview:
Stage A: Context Questions
Ask 2 questions before loading data:
- Research questions – What are you exploring? (guides which comparisons matter)
- Temporal column – Is there a date/time column for trend analysis?
Stage B: Column Classification Review
After the context questions, auto-detect column types and present for confirmation:
- Load and preview – Run `run_eda.py --preview` to get `df.head()`, `df.dtypes`, `df.nunique()`. (Requires Bash – if in plan mode, see the Plan Mode Guard above.)
- Auto-classify – Apply the heuristic rules from Section 3 to generate suggested types; a sketch of these heuristics follows this list. Pay special attention to:
  - High-cardinality numeric integers (views, revenue, duration) → these should be Continuous, not ID
  - Low-cardinality integers (1-5, 1-7 scales) → ask whether Ordinal (scale) or Discrete (count)
  - Columns with >40% missing → flag for user awareness
- Present classification table – Show all columns organized by suggested type:

| # | Column | Sample Values | Unique | Suggested Type | Notes |
|---|---|---|---|---|---|
| 1 | id | 1, 2, 3 | 768 | ID | nunique > 95% of n, non-numeric |
| 2 | views | 100, 5000, 1M | 1885 | Continuous | high-cardinality integer metric |
| 3 | platform | YouTube, Twitch | 5 | Nominal (Group) | grouping variable |
| 4 | rating | 1, 2, 3, 4, 5 | 5 | Ordinal | low-cardinality integer – scale or count? |
| … | … | … | … | … | … |

- Ask for confirmation – Use AskUserQuestion to present the table and ask: "If these all look correct, select Confirm all. Otherwise, select Corrections needed."
- Record confirmed types and map to script arguments.
- Invoke the script immediately – Do NOT write inline Python. Map confirmed types to `--col_types` pairs and run `run_eda.py` per Section 2.
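The auto-classification heuristics from Section 3 can be summarized in code. This is a minimal sketch for orientation only – the shipped `run_eda.py` is authoritative and may differ in detail:

```python
import pandas as pd

LOW_CARD_MAX = 20  # Section 2: integers with <= 20 unique values are ambiguous

def suggest_type(s: pd.Series, n_rows: int) -> str:
    """Suggest a measurement level for one column (sketch of Section 3 rules)."""
    nu = s.nunique(dropna=True)
    if nu == 2:
        return "binary"  # exactly 2 unique values, any dtype
    if n_rows > 10 and nu > 0.95 * n_rows and not pd.api.types.is_numeric_dtype(s):
        return "id"  # high uniqueness + non-numeric dtype; needs n > 10 rows
    if pd.api.types.is_datetime64_any_dtype(s):
        return "temporal"
    if pd.api.types.is_float_dtype(s):
        return "continuous"
    if pd.api.types.is_integer_dtype(s):
        # Low-cardinality integers: ordinal scale vs. discrete count – ask the user.
        return "ordinal-or-discrete (ask user)" if nu <= LOW_CARD_MAX else "continuous"
    if s.dropna().astype(str).str.len().mean() > 50:
        return "text"  # object dtype, avg length > 50 chars
    return "nominal"

# Usage: build the suggestions behind the classification table above.
# df = pd.read_excel("data.xlsx")
# suggestions = {c: suggest_type(df[c], len(df)) for c in df.columns}
```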
2. Script Invocation
Standard invocation (built from interview):
python scripts/run_eda.py data.xlsx \
--col_types rating=ordinal views=continuous description=text record_id=id \
--group platform tier \
--output TABLE/
Windows note: Use `python` (not `python3`). If the system has both Python 2 and 3, use `py -3` or the full path (e.g., `C:\Python312\python.exe`).
Arguments:
| Argument | Required | Description |
|---|---|---|
| `data` | Yes | Path to .xlsx or .csv |
| `--col_types` | No | Confirmed types: col=type pairs. Valid: id, binary, nominal, ordinal, discrete, continuous, temporal, text. Unspecified columns auto-detected. |
| `--group` | No | Nominal columns for grouped analysis |
| `--output` | No | Output directory (default: TABLE/) |
| `--top_n` | No | Top-N for frequency tables (default: 10) |
| `--no_excel` | No | Skip Phase 6 (Excel report) |
| `--corr_deletion` | No | Missing-data strategy for correlations: pairwise (default) or listwise |
| `--interactive` | No | Prompt for ambiguous integers (standalone CLI only) |
Behavioral defaults:
- `--group` omitted: grouping defaults to all nominal columns (no cap).
- Cross-tabs: all nominal column pairs are generated (no cap).
- ID detection: requires n > 10 rows; otherwise high-uniqueness columns stay as their detected numeric/string type.
- LOW_CARD_MAX = 20: integers with <= 20 unique values are flagged as ambiguous (ordinal vs. discrete).
- `--corr_deletion` omitted: uses pairwise deletion (drops missing per pair, maximizing N). Use `--corr_deletion listwise` for consistent N across all pairs; the sketch below illustrates the difference.
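To make the pairwise/listwise distinction concrete, a small pandas sketch (the file name is illustrative):

```python
import pandas as pd

df = pd.read_csv("data.csv")      # illustrative file name
num = df.select_dtypes("number")  # continuous + discrete columns

# Pairwise deletion (default): each pair drops only its own missing rows,
# so each correlation may be based on a different N (maximized per pair).
pairwise = num.corr(method="pearson")

# Listwise deletion: drop any row with a missing value first,
# so every correlation is based on the same (smaller) N.
listwise = num.dropna().corr(method="pearson")
```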
3. Column-Type Coverage
Claude presents suggested types during the interview; the user confirms or corrects before the script runs. Types follow Stevens’ levels of measurement extended with Temporal, Text, and ID/key types.
| Level | Detection Rule | Analysis Applied |
|---|---|---|
| ID/key | nunique > 95% of n AND non-numeric dtype | Excluded from analysis; flagged in profile |
| Binary | Exactly 2 unique values (any dtype) | Count, proportion, 95% CI |
| Nominal | object/string dtype, avg length <= 50 chars | Frequency table, mode, top-N |
| Ordinal | User-confirmed low-cardinality integer | Ordered freq table + cumulative % + full quantitative metrics (M labeled quasi-interval) |
| Discrete | integer dtype, meaningful count/quantity | Full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts) |
| Continuous | float dtype, OR numeric with nunique > 95% of n (n > 10) | Full quantitative metrics + outlier flags + distribution shape |
| Temporal | datetime dtype or parseable date string | Monthly aggregated trend table (mean/median per period); gap detection not yet implemented |
| Text | object dtype, avg length > 50 chars | Top unigrams/bigrams, avg word count, vocabulary richness |
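For orientation, here is a sketch of how the "full quantitative metrics" set could be computed with pandas/scipy. The t-distribution CI is an assumption (the doc does not specify the interval method); the script's own implementation is authoritative:

```python
import numpy as np
import pandas as pd
from scipy import stats

def descriptives(s: pd.Series) -> dict:
    """Full quantitative metric set for one Discrete/Continuous column (sketch)."""
    s = s.dropna()
    n = len(s)
    sd = s.std(ddof=1)
    se = sd / np.sqrt(n)
    # 95% CI via the t-distribution (assumed method, not confirmed by the doc)
    ci_lo, ci_hi = stats.t.interval(0.95, df=n - 1, loc=s.mean(), scale=se)
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    # IQR outlier fences: mild at 1.5 x IQR, extreme at 3.0 x IQR
    mild = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
    extreme = ((s < q1 - 3.0 * iqr) | (s > q3 + 3.0 * iqr)).sum()
    return {
        "M": s.mean(), "Mdn": s.median(), "Mode": s.mode().iloc[0],
        "SD": sd, "Var": s.var(ddof=1), "Range": s.max() - s.min(),
        "IQR": iqr, "CV_pct": 100 * sd / s.mean(), "Q1": q1, "Q3": q3,
        "skewness": stats.skew(s), "kurtosis": stats.kurtosis(s),  # excess kurtosis
        "SE": se, "CI95_lower": ci_lo, "CI95_upper": ci_hi,
        "mild_outliers_IQR1.5": int(mild), "extreme_outliers_IQR3.0": int(extreme),
    }
```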
4. Pipeline
Phase 0: Data Loading & Column Classification
Loads the file, applies user-confirmed column types (passed via --col_types), auto-detects unambiguous types for remaining columns, and prints a schema summary: column name | detected type | nunique | missing%.
Phase 1: Dataset Profile
01_dataset_profile.csv – shape, column types, missing%, uniqueness.
Phase 2: Data Quality Assessment
02_data_quality.csv – missing counts/%, duplicate rows, constant columns, outlier counts per Continuous/Discrete column (IQR method).
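A compact sketch of the quality checks this phase performs (illustrative, not the script's code):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # illustrative file name

# Missing counts and percentages per column
quality = pd.DataFrame({
    "missing_count": df.isna().sum(),
    "missing_pct": (100 * df.isna().mean()).round(1),
})
# Duplicate rows and constant (single-valued) columns
duplicate_rows = int(df.duplicated().sum())
constant_cols = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
# Outlier counts per Continuous/Discrete column use the IQR fences
# (mild: 1.5 x IQR, extreme: 3.0 x IQR) shown in the Section 3 sketch above.
```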
Phase 3: Univariate Analysis
Measurement-level-appropriate analysis for each column:
- `03_nominal_frequencies.csv` – frequency + % for each Nominal column (top-N)
- `04_binary_summary.csv` – count, proportion, 95% CI for all Binary columns
- `05_ordinal_distribution.csv` – ordered frequency + cumulative %
- `06_ordinal_descriptives.csv` – full quantitative metrics (M quasi-interval, Mdn, SD, IQR, SE, 95% CI, skewness, kurtosis, outliers)
- `07_discrete_descriptives.csv` – full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts)
- `08_continuous_descriptives.csv` – full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts)
Phase 4: Bivariate & Multivariate Analysis
Measurement-appropriate pairing analysis:
- `09_pearson_correlation.csv` – Ratio-scale x Ratio-scale (Continuous + Discrete; r + p-value; pairwise deletion by default)
- `10_spearman_correlation.csv` – Ordinal x Ordinal (rho + p-value; pairwise deletion by default)
- `11_grouped_by_{groupvar}.csv` – Continuous/Discrete/Ordinal descriptives per Nominal group
- `12_crosstab_{nom1}_x_{nom2}.csv` – Nominal x Nominal contingency tables
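A sketch of the pairing logic (file and column names are illustrative):

```python
from itertools import combinations

import pandas as pd
from scipy import stats

df = pd.read_csv("data.csv")        # illustrative file name
ratio = df.select_dtypes("number")  # continuous + discrete (ratio-scale)

# Pearson r + p-value for every ratio-scale pair, with pairwise deletion
rows = []
for a, b in combinations(ratio.columns, 2):
    pair = ratio[[a, b]].dropna()   # drop missing per pair only
    r, p = stats.pearsonr(pair[a], pair[b])
    rows.append({"var1": a, "var2": b, "r": r, "p": p, "n": len(pair)})
pearson = pd.DataFrame(rows)

# Nominal x Nominal contingency table ("platform"/"tier" are hypothetical columns)
crosstab = pd.crosstab(df["platform"], df["tier"])
```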
Phase 5: Specialized Analysis
- `13_text_{colname}.csv` – word freq, top bigrams, avg word count, vocab size
- `14_temporal_trends.csv` – key metrics over time by period
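Illustrative sketches of both specialized analyses (the column names `description`, `date`, and `views` are hypothetical):

```python
from collections import Counter

import pandas as pd

df = pd.read_csv("data.csv")  # illustrative file name

# Text: unigram/bigram frequencies for one text column
tokens = df["description"].dropna().str.lower().str.findall(r"[a-z']+")
unigrams = Counter(w for doc in tokens for w in doc)
bigrams = Counter(b for doc in tokens for b in zip(doc, doc[1:]))

# Temporal: monthly aggregated trend table (mean/median per period)
monthly = (
    df.assign(date=pd.to_datetime(df["date"]))
      .set_index("date")
      .resample("MS")["views"]
      .agg(["mean", "median"])
)
```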
Phase 6: Excel Report
EXPLORATORY_REPORT.xlsx – APA 7th-formatted workbook (B&W, no color fills). Sheet 0 "Summary" contains dataset dimensions and quality highlights. Sheets 1-N contain one sheet per generated CSV (skipped if absent). All sheets use Calibri 11 pt, bold table-number rows, italic title rows, bold column headers with a bottom border, and data rows with top/bottom rules only. Skip with `--no_excel`.
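The formatting rules above map onto openpyxl roughly as follows (a sketch of the styling conventions, not the script's actual code):

```python
from openpyxl import Workbook
from openpyxl.styles import Border, Font, Side

wb = Workbook()
ws = wb.active
ws.title = "Summary"
ws.sheet_view.showGridLines = False  # B&W report: no gridlines

thin = Side(style="thin")
ws.append(["Table 1"])               # table-number row: bold
ws["A1"].font = Font(name="Calibri", size=11, bold=True)
ws.append(["Dataset Profile"])       # title row: italic
ws["A2"].font = Font(name="Calibri", size=11, italic=True)

ws.append(["Variable", "M", "SD"])   # column headers: bold with bottom border
for cell in ws[3]:
    cell.font = Font(name="Calibri", size=11, bold=True)
    cell.border = Border(bottom=thin)

wb.save("EXPLORATORY_REPORT.xlsx")
```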
Post-Script: Narrative Summary (Model-Authored)
IMPORTANT: This step is performed by Claude after the script finishes. EXPLORATORY_SUMMARY.md is
produced by Claude reading the generated CSVs and writing a formatted document with the Write tool.
Before writing, read ${SKILL_DIR}/references/summary_template.md.
This template shows the exact expected structure, table formats, and narrative style for
each section. Use it as the structural blueprint – populate it with actual data from the
generated CSVs. The template includes worked example rows for every table type.
Step 1 – Read the CSVs using the Read tool:
01_dataset_profile.csv, 02_data_quality.csv, 03_nominal_frequencies.csv,
04_binary_summary.csv, 05_ordinal_distribution.csv, 06_ordinal_descriptives.csv,
07_discrete_descriptives.csv, 08_continuous_descriptives.csv,
09_pearson_correlation.csv, 10_spearman_correlation.csv,
all 11_grouped_by_*.csv, all 12_crosstab_*.csv, all 13_text_*.csv,
14_temporal_trends.csv.
Step 2 – Write TABLE/EXPLORATORY_SUMMARY.md directly using the Write tool.
Content requirements: One section per measurement level plus infrastructure sections. Required structure:
- Dataset Overview (Source: `01_dataset_profile.csv`, `14_temporal_trends.csv` – temporal range and period count only)
- Data Quality (Source: `02_data_quality.csv`)
- Nominal Variables (Source: `03_nominal_frequencies.csv`)
- Binary Variables (Source: `04_binary_summary.csv`)
- Ordinal Variables (Source: `05_ordinal_distribution.csv`, `06_ordinal_descriptives.csv`)
- Discrete Variables (Source: `07_discrete_descriptives.csv`)
- Continuous Variables (Source: `08_continuous_descriptives.csv`)
- Bivariate Relationships (Source: `09_pearson_correlation.csv`, `10_spearman_correlation.csv`)
- Group Comparisons (Source: `11_grouped_by_*.csv`)
- Cross-Tabulations (Source: `12_crosstab_*.csv`)
- Text Analysis (Source: `13_text_*.csv`; omit section if no text columns)
- Temporal Trends (Source: `14_temporal_trends.csv`; omit section if no temporal column – full trend table and pattern interpretation here)
- Output Files – bullet list of all files in the output directory
Table formatting rules:
- Descriptive tables (ordinal `06`, discrete `07`, continuous `08`): Split into two tables:
  - Core table: variable, N_valid, missing_count, missing_pct, M (or M_quasi_interval), Mdn, Mode, SD, IQR, Skewness, Kurtosis
  - Detail table: variable, Variance, Range, Min, Max, Q1_25th, Q3_75th, CV_pct, SE, CI95_lower, CI95_upper, mild_outliers_IQR1.5, extreme_outliers_IQR3.0, P10_10th, P90_90th (discrete/continuous only)
- Narrow CSVs (frequencies `03`, binary `04`, ordinal distribution `05`, correlations `09`/`10`, crosstabs `12`): Use the same columns as the source CSV.
- Frequency tables: Include all rows from the CSV (the script already caps at top-N).
- Correlation tables: Include all pairs; bold rows where abs(r/rho) > 0.7.
- Cross-tabs: Include the full table from the CSV. If a crosstab has >10 columns, note the CSV as the authoritative source and provide a prose summary instead.
- Do not round further than the CSV already rounds.
General rules:
- End each section with 1-3 sentences of interpretation.
- Every content section heading must include a `Source:` annotation citing the CSV file(s) it draws from. Infrastructure sections (e.g., Output Files) are exempt.
- NEVER derive findings from ad-hoc Python. All findings must come from the generated CSVs.
- Section omission: If all source CSVs for a section are absent, omit the section entirely. If some source CSVs exist but others are absent, include the section and note "No [subtype] data available (file skipped by script)" for the missing part.
5. Output Directory Reference
TABLE/
├── 01_dataset_profile.csv
├── 02_data_quality.csv
├── 03_nominal_frequencies.csv
├── 04_binary_summary.csv
├── 05_ordinal_distribution.csv
├── 06_ordinal_descriptives.csv
├── 07_discrete_descriptives.csv
├── 08_continuous_descriptives.csv
├── 09_pearson_correlation.csv
├── 10_spearman_correlation.csv
├── 11_grouped_by_{groupvar}.csv (one per group variable)
├── 12_crosstab_{nom1}_x_{nom2}.csv (one per Nominal pair)
├── 13_text_{colname}.csv (one per text column)
├── 14_temporal_trends.csv
├── EXPLORATORY_REPORT.xlsx (omitted with --no_excel)
└── EXPLORATORY_SUMMARY.md
Files are omitted when no columns of the relevant type exist (e.g., no text columns means no 13_text_*.csv).
6. Interpreting the Summary
EXPLORATORY_SUMMARY.md uses a flagged-insight format rather than just reporting numbers:
- Issue flags – “Column X has 34% missing – review before analysis”
- Pattern highlights – “Strong positive correlation between views and likes (r=0.87, p<.001)”
- Distribution notes – “Duration is right-skewed (skewness=3.1) – median better represents typical values”
- Coverage summary – “768 records across 5 platforms; YouTube accounts for 62% of sample”
- Measurement caveats – Ordinal means labeled as quasi-interval; Spearman used instead of Pearson for ranked scales
Flagging thresholds:
| Flag | Threshold | Severity |
|---|---|---|
| Missing data | >20% of rows | review |
| Skewness | abs(skew) > 2 | distribution |
| Kurtosis | abs(kurt) > 7 | distribution |
| Correlation | abs(r/rho) > 0.7 | relationship |
| CV | >100% | variability |
| ID-like column | nunique > 95% of n | exclusion |
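These thresholds are simple enough to express directly; a sketch of how a flagging pass might apply them to a column's summary statistics (names are illustrative, not the script's internals):

```python
# Illustrative flagging pass over summary statistics (not the script's code)
THRESHOLDS = {
    "missing_pct": (lambda v: v > 20, "review"),
    "skewness":    (lambda v: abs(v) > 2, "distribution"),
    "kurtosis":    (lambda v: abs(v) > 7, "distribution"),
    "r":           (lambda v: abs(v) > 0.7, "relationship"),
    "CV_pct":      (lambda v: v > 100, "variability"),
}

def flags_for(metrics: dict) -> list[str]:
    """Return severity labels for every threshold a column's metrics exceed."""
    return [
        severity
        for key, (check, severity) in THRESHOLDS.items()
        if key in metrics and check(metrics[key])
    ]

# Usage: flags_for({"missing_pct": 34.0, "skewness": 3.1}) -> ["review", "distribution"]
```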
7. Design Principles
- Exploratory-first – No confirmatory statistics; builds the picture before hypothesis testing
- User-confirmed classification – Claude suggests measurement levels from heuristic rules; the user reviews and confirms before analysis runs
- Measurement-appropriate – Uses median/IQR for Ordinal, Pearson for Continuous, Spearman for Ordinal, cross-tabulation for Nominal
- Insight-flagging – Summary reports patterns and warnings, not just numbers
- Dual output – CSVs for validation and import into results tables; markdown for interpretation
- APA-compatible statistics – Full metric set (M, Mdn, SD, SE, 95% CI, skewness, kurtosis) ready for APA 7th reporting
8. Verification Checklist
- Column classification table presented to user; user confirmed or corrected types
- Confirmed types passed via `--col_types` and grouping columns via `--group`
- Each detected column type has at least one output file
- `EXPLORATORY_SUMMARY.md` written by Claude via the Write tool
- All 13 content sections present where applicable; sections for absent column types omitted per the two-tier rule
- Descriptive tables (ordinal/discrete/continuous) use the core + detail split-table format
- Every content section contains at least one populated markdown table and one narrative sentence. Infrastructure sections (e.g., Output Files) are exempt.
- Numbers in the summary match the source CSVs exactly
- `EXPLORATORY_REPORT.xlsx` present with "Summary" as the first sheet + one sheet per CSV (unless `--no_excel`)
- Excel workbook is B&W only (no color fills; no gridlines)
- Bivariate outputs use the correct method for each level pairing
- High-cardinality numeric columns classified as Continuous, not ID (check `01_dataset_profile.csv`)
- Pearson correlations include both continuous and discrete ratio-scale columns
- Every `EXPLORATORY_SUMMARY.md` section cites its source CSV file(s) in the heading
- No ad-hoc Python used to derive findings outside the script pipeline
- Script invoked with `python` (not `python3`) on Windows
Future Roadmap
| Enhancement | Rationale |
|---|---|
| Visualization suite | Box plots, histograms, heatmaps, violin plots per measurement level |
| Normality tests (Shapiro-Wilk for n<5000) | Formal distribution assessment for Continuous variables |
| Hypothesis generation section | Bridge EDA to confirmatory research questions |
| ydata-profiling HTML report | One-line automated profiling complement |
| Interval vs. Ratio distinction | Full Stevens’ framework for specific research contexts |