q-exploratory-analysis

📁 tyrealq/q-skills 📅 4 days ago

总安装量

周安装量

#33546

全站排名

安装命令

npx skills add https://github.com/tyrealq/q-skills --skill q-exploratory-analysis

Agent 安装分布

codex 7

mcpjam 6

iflow-cli 6

junie 6

windsurf 6

zencoder 6

Skill 文档

Q-Exploratory-Analysis

Universal exploratory data analysis (EDA) for tabular datasets. Previews the dataset, interviews the user to confirm column measurement levels, and applies statistically appropriate analysis for each variable type, producing a structured TABLE/ folder and a holistic EXPLORATORY_SUMMARY.md with flagged insights.

IMPORTANT: This skill requires Bash execution. The pre-built scripts/run_eda.py must be copied from ${SKILL_DIR}/scripts/ â do NOT write a new script or inline Python. Run it (Phases 0-6) for CSVs and the Excel report, then write EXPLORATORY_SUMMARY.md yourself (Post-Script step) by reading the CSVs, consulting ${SKILL_DIR}/references/summary_template.md, and using the Write tool directly.

If in plan mode: write a brief plan â “Run q-exploratory-analysis skill: interview user for context and column types, deploy and execute run_eda.py, write EXPLORATORY_SUMMARY.md from generated CSVs.” â then call ExitPlanMode immediately. Do NOT attempt Stages A/B, script deployment, or any analysis while plan mode is active.

0. Dependencies

pandas
numpy
scipy
openpyxl   # required for .xlsx input and Phase 6 Excel report

Install: pip install pandas numpy scipy openpyxl

0b. Script Deployment

Agent execution instructions:

Determine this SKILL.md file’s directory path as SKILL_DIR.
Script path = ${SKILL_DIR}/scripts/run_eda.py.
Reference path = ${SKILL_DIR}/references/summary_template.md.

The skill ships a pre-built scripts/run_eda.py. Do NOT write a new script.

Deploy the script to the project before first run:

mkdir -p scripts
cp "${SKILL_DIR}/scripts/run_eda.py" scripts/run_eda.py

Windows equivalent: mkdir scripts 2>nul & copy "%SKILL_DIR%\scripts\run_eda.py" scripts\run_eda.py Or in PowerShell: New-Item -ItemType Directory -Force scripts; Copy-Item "$env:SKILL_DIR/scripts/run_eda.py" scripts/run_eda.py

If the project already has scripts/run_eda.py, verify it matches the skill version before running (the skill version is authoritative).

1. Requirements Gathering (Interview)

Before invoking the script, conduct a two-stage interview:

Stage A: Context Questions

Ask 2 questions before loading data:

Research questions – What are you exploring? (guides which comparisons matter)
Temporal column – Is there a date/time column for trend analysis?

Stage B: Column Classification Review

After the context questions, auto-detect column types and present for confirmation:

Load and preview – Run run_eda.py --preview to get df.head(), df.dtypes, df.nunique(). (Requires Bash â if in plan mode, see Plan Mode Guard above.)
Auto-classify – Apply the heuristic rules from Section 3 to generate suggested types. Pay special attention to:
- High-cardinality numeric integers (views, revenue, duration) â these should be Continuous, not ID
- Low-cardinality integers (1-5, 1-7 scales) â ask whether Ordinal (scale) or Discrete (count)
- Columns with >40% missing â flag for user awareness

Present classification table – Show all columns organized by suggested type:

#	Column	Sample Values	Unique	Suggested Type	Notes
1	id	1, 2, 3	768	ID	nunique > 95% of n, non-numeric
2	views	100, 5000, 1M	1885	Continuous	high-cardinality integer metric
3	platform	YouTube, Twitch	5	Nominal (Group)	grouping variable
4	rating	1, 2, 3, 4, 5	5	Ordinal	low-cardinality integer – scale or count?
…	…	…	…	…	…

Ask for confirmation – Use AskUserQuestion to present the table and ask: “If these all look correct, select Confirm all. Otherwise, select Corrections needed.”
Record confirmed types and map to script arguments.
Invoke the script immediately â Do NOT write inline Python. Map confirmed types to --col_types pairs and run run_eda.py per Section 2.

2. Script Invocation

Standard invocation (built from interview):

python scripts/run_eda.py data.xlsx \
  --col_types rating=ordinal views=continuous description=text record_id=id \
  --group platform tier \
  --output TABLE/

Windows note: Use python (not python3). If the system has both Python 2 and 3, use py -3 or the full path (e.g., C:\Python312\python.exe).

Arguments:

Argument	Required	Description
`data`	Yes	Path to .xlsx or .csv
`--col_types`	No	Confirmed types: `col=type` pairs. Valid: id, binary, nominal, ordinal, discrete, continuous, temporal, text. Unspecified columns auto-detected.
`--group`	No	Nominal columns for grouped analysis
`--output`	No	Output directory (default: `TABLE/`)
`--top_n`	No	Top-N for frequency tables (default: 10)
`--no_excel`	No	Skip Phase 6 (Excel report)
`--corr_deletion`	No	Missing-data strategy for correlations: `pairwise` (default) or `listwise`
`--interactive`	No	Prompt for ambiguous integers (standalone CLI only)

Behavioral defaults:

--group omitted: grouping defaults to all nominal columns (no cap).
Cross-tabs: all nominal column pairs are generated (no cap).
ID detection: requires n > 10 rows; otherwise high-uniqueness columns stay as their detected numeric/string type.
LOW_CARD_MAX = 20: integers with <= 20 unique values are flagged as ambiguous (ordinal vs. discrete).
--corr_deletion omitted: uses pairwise deletion (drops missing per pair, maximizing N). Use --corr_deletion listwise for consistent N across all pairs.

3. Column-Type Coverage

Claude presents suggested types during the interview; the user confirms or corrects before the script runs. Types follow Stevens’ levels of measurement extended with Temporal, Text, and ID/key types.

Level	Detection Rule	Analysis Applied
ID/key	nunique > 95% of n AND non-numeric dtype	Excluded from analysis; flagged in profile
Binary	Exactly 2 unique values (any dtype)	Count, proportion, 95% CI
Nominal	object/string dtype, avg length <= 50 chars	Frequency table, mode, top-N
Ordinal	User-confirmed low-cardinality integer	Ordered freq table + cumulative % + full quantitative metrics (M labeled quasi-interval)
Discrete	integer dtype, meaningful count/quantity	Full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts)
Continuous	float dtype, OR numeric with nunique > 95% of n (n > 10)	Full quantitative metrics + outlier flags + distribution shape
Temporal	datetime dtype or parseable date string	Monthly aggregated trend table (mean/median per period); gap detection not yet implemented
Text	object dtype, avg length > 50 chars	Top unigrams/bigrams, avg word count, vocabulary richness

4. Pipeline

Phase 0: Data Loading & Column Classification

Loads the file, applies user-confirmed column types (passed via --col_types), auto-detects unambiguous types for remaining columns, and prints a schema summary: column name | detected type | nunique | missing%.

Phase 1: Dataset Profile

01_dataset_profile.csv – shape, column types, missing%, uniqueness.

Phase 2: Data Quality Assessment

02_data_quality.csv – missing counts/%, duplicate rows, constant columns, outlier counts per Continuous/Discrete column (IQR method).

Phase 3: Univariate Analysis

Measurement-level-appropriate analysis for each column:

03_nominal_frequencies.csv – frequency + % for each Nominal column (top-N)
04_binary_summary.csv – count, proportion, 95% CI for all Binary columns
05_ordinal_distribution.csv – ordered frequency + cumulative %
06_ordinal_descriptives.csv – full quantitative metrics (M quasi-interval, Mdn, SD, IQR, SE, 95% CI, skewness, kurtosis, outliers)
07_discrete_descriptives.csv – full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts)
08_continuous_descriptives.csv – full quantitative metrics (M, Mdn, Mode, SD, Var, Range, IQR, CV, Q1/Q3, skewness, kurtosis, SE, 95% CI, outlier counts)

Phase 4: Bivariate & Multivariate Analysis

Measurement-appropriate pairing analysis:

09_pearson_correlation.csv – Ratio-scale x Ratio-scale (Continuous + Discrete; r + p-value; pairwise deletion by default)
10_spearman_correlation.csv – Ordinal x Ordinal (rho + p-value; pairwise deletion by default)
11_grouped_by_{groupvar}.csv – Continuous/Discrete/Ordinal descriptives per Nominal group
12_crosstab_{nom1}_x_{nom2}.csv – Nominal x Nominal contingency tables

Phase 5: Specialized Analysis

13_text_{colname}.csv – word freq, top bigrams, avg word count, vocab size
14_temporal_trends.csv – key metrics over time by period

Phase 6: Excel Report

EXPLORATORY_REPORT.xlsx – APA-7th formatted workbook (B&W, no color fills). Sheet 0 “Summary” contains dataset dimensions and quality highlights. Sheets 1-N contain one sheet per generated CSV (skipped if absent). All sheets use Calibri 11 pt, table-number bold rows, title italic rows, bold column headers with bottom border, and data rows with top/bottom rules only. Skip with --no_excel.

Post-Script: Narrative Summary (Model-Authored)

IMPORTANT: This step is performed by Claude after the script finishes. EXPLORATORY_SUMMARY.md is produced by Claude reading the generated CSVs and writing a formatted document with the Write tool.

Before writing, read ${SKILL_DIR}/references/summary_template.md. This template shows the exact expected structure, table formats, and narrative style for each section. Use it as the structural blueprint â populate it with actual data from the generated CSVs. The template includes worked example rows for every table type.

Step 1 â Read the CSVs using the Read tool: 01_dataset_profile.csv, 02_data_quality.csv, 03_nominal_frequencies.csv, 04_binary_summary.csv, 05_ordinal_distribution.csv, 06_ordinal_descriptives.csv, 07_discrete_descriptives.csv, 08_continuous_descriptives.csv, 09_pearson_correlation.csv, 10_spearman_correlation.csv, all 11_grouped_by_*.csv, all 12_crosstab_*.csv, all 13_text_*.csv, 14_temporal_trends.csv.

Step 2 â Write TABLE/EXPLORATORY_SUMMARY.md directly using the Write tool.

Content requirements: One section per measurement level plus infrastructure sections. Required structure:

Dataset Overview (Source: 01_dataset_profile.csv, 14_temporal_trends.csv â temporal range and period count only)
Data Quality (Source: 02_data_quality.csv)
Nominal Variables (Source: 03_nominal_frequencies.csv)
Binary Variables (Source: 04_binary_summary.csv)
Ordinal Variables (Source: 05_ordinal_distribution.csv, 06_ordinal_descriptives.csv)
Discrete Variables (Source: 07_discrete_descriptives.csv)
Continuous Variables (Source: 08_continuous_descriptives.csv)
Bivariate Relationships (Source: 09_pearson_correlation.csv, 10_spearman_correlation.csv)
Group Comparisons (Source: 11_grouped_by_*.csv)
Cross-Tabulations (Source: 12_crosstab_*.csv)
Text Analysis (Source: 13_text_*.csv; omit section if no text columns)
Temporal Trends (Source: 14_temporal_trends.csv; omit section if no temporal column â full trend table and pattern interpretation here)
Output Files â bullet list of all files in the output directory

Table formatting rules:

Descriptive tables (ordinal 06, discrete 07, continuous 08): Split into two tables:
- Core table: variable, N_valid, missing_count, missing_pct, M (or M_quasi_interval), Mdn, Mode, SD, IQR, Skewness, Kurtosis
- Detail table: variable, Variance, Range, Min, Max, Q1_25th, Q3_75th, CV_pct, SE, CI95_lower, CI95_upper, mild_outliers_IQR1.5, extreme_outliers_IQR3.0, P10_10th, P90_90th (discrete/continuous only)
Narrow CSVs (frequencies 03, binary 04, ordinal distribution 05, correlations 09/10, crosstabs 12): Use the same columns as the source CSV.
Frequency tables: Include all rows from the CSV (script already caps at top-N).
Correlation tables: Include all pairs; bold rows where abs(r/rho) > 0.7.
Cross-tabs: Include full table from CSV. If a crosstab has >10 columns, note the CSV as the authoritative source and provide a prose summary instead.
Do not round further than the CSV already rounds.

General rules:

End each section with 1â3 sentences of interpretation.
Every content section heading must include a Source: annotation citing the CSV file(s) it draws from. Infrastructure sections (e.g., Output Files) are exempt.
NEVER derive findings from ad-hoc Python. All findings must come from the generated CSVs.
Section omission: If all source CSVs for a section are absent, omit the section entirely. If some source CSVs exist but others are absent, include the section and note “No [subtype] data available (file skipped by script)” for the missing part.

5. Output Directory Reference

TABLE/
âââ 01_dataset_profile.csv
âââ 02_data_quality.csv
âââ 03_nominal_frequencies.csv
âââ 04_binary_summary.csv
âââ 05_ordinal_distribution.csv
âââ 06_ordinal_descriptives.csv
âââ 07_discrete_descriptives.csv
âââ 08_continuous_descriptives.csv
âââ 09_pearson_correlation.csv
âââ 10_spearman_correlation.csv
âââ 11_grouped_by_{groupvar}.csv     (one per group variable)
âââ 12_crosstab_{nom1}_x_{nom2}.csv  (one per Nominal pair)
âââ 13_text_{colname}.csv            (one per text column)
âââ 14_temporal_trends.csv
âââ EXPLORATORY_REPORT.xlsx          (omitted with --no_excel)
âââ EXPLORATORY_SUMMARY.md

Files are omitted when no columns of the relevant type exist (e.g., no text columns means no 13_text_*.csv).

6. Interpreting the Summary

EXPLORATORY_SUMMARY.md uses a flagged-insight format rather than just reporting numbers:

Issue flags – “Column X has 34% missing – review before analysis”
Pattern highlights – “Strong positive correlation between views and likes (r=0.87, p<.001)”
Distribution notes – “Duration is right-skewed (skewness=3.1) – median better represents typical values”
Coverage summary – “768 records across 5 platforms; YouTube accounts for 62% of sample”
Measurement caveats – Ordinal means labeled as quasi-interval; Spearman used instead of Pearson for ranked scales

Flagging thresholds:

Flag	Threshold	Severity
Missing data	>20% of rows	review
Skewness	abs(skew) > 2	distribution
Kurtosis	abs(kurt) > 7	distribution
Correlation	abs(r/rho) > 0.7	relationship
CV	>100%	variability
ID-like column	nunique > 95% of n	exclusion

7. Design Principles

Exploratory-first – No confirmatory statistics; builds the picture before hypothesis testing
User-confirmed classification – Claude suggests measurement levels from heuristic rules; the user reviews and confirms before analysis runs
Measurement-appropriate – Uses median/IQR for Ordinal, Pearson for Continuous, Spearman for Ordinal, cross-tabulation for Nominal
Insight-flagging – Summary reports patterns and warnings, not just numbers
Dual output – CSVs for validation and import into results tables; markdown for interpretation
APA-compatible statistics – Full metric set (M, Mdn, SD, SE, 95% CI, skewness, kurtosis) ready for APA 7th reporting

8. Verification Checklist

Column classification table presented to user; user confirmed or corrected types
Confirmed types passed via --col_types and grouping columns via --group
Each detected column type has at least one output file
EXPLORATORY_SUMMARY.md written by Claude via Write tool
All 13 content sections present where applicable; sections for absent column types omitted per two-tier rule
Descriptive tables (ordinal/discrete/continuous) use core + detail split-table format
Every content section contains at least one populated markdown table and one narrative sentence. Infrastructure sections (e.g., Output Files) are exempt.
Numbers in the summary match the source CSVs exactly
EXPLORATORY_REPORT.xlsx present with “Summary” as first sheet + one sheet per CSV (unless --no_excel)
Excel workbook is B&W only (no color fills; no gridlines)
Bivariate outputs use correct method for each level pairing
High-cardinality numeric columns classified as Continuous, not ID (check 01_dataset_profile.csv)
Pearson correlations include both continuous and discrete ratio-scale columns
Every EXPLORATORY_SUMMARY.md section cites its source CSV file(s) in the heading
No ad-hoc Python used to derive findings outside the script pipeline
Script invoked with python (not python3) on Windows

Future Roadmap

Enhancement	Rationale
Visualization suite	Box plots, histograms, heatmaps, violin plots per measurement level
Normality tests (Shapiro-Wilk for n<5000)	Formal distribution assessment for Continuous variables
Hypothesis generation section	Bridge EDA to confirmatory research questions
`ydata-profiling` HTML report	One-line automated profiling complement
Interval vs. Ratio distinction	Full Stevens’ framework for specific research contexts

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台