data-analysis

📁 pjordan/claude-toolkit 📅 1 day ago
2
总安装量
1
周安装量
#69342
全站排名
安装命令
npx skills add https://github.com/pjordan/claude-toolkit --skill data-analysis

Agent 安装分布

amp 1
opencode 1
cursor 1
kimi-cli 1
codex 1
github-copilot 1

Skill 文档

Data Analysis Skill

Statistical analysis and data exploration workflows for effective data-driven insights.

Overview

This skill provides Claude with a systematic approach to data analysis, including exploratory data analysis (EDA), statistical testing, visualization recommendations, and insight generation.

When to Use

  • Analyzing datasets
  • Exploratory data analysis
  • Statistical hypothesis testing
  • Data visualization planning
  • Finding patterns and insights
  • Preparing analysis reports

Prerequisites

  • Basic statistics knowledge (mean, median, standard deviation, etc.)
  • Understanding of data types (categorical, numerical)
  • Familiarity with common data issues (missing values, outliers)

Instructions

Analysis Workflow

  1. Understand the Data

    • What is the source?
    • What questions need answering?
    • What’s the business context?
  2. Initial Exploration

    • Number of rows and columns
    • Data types of each column
    • Missing value patterns
    • Basic statistics (min, max, mean, etc.)
  3. Data Quality Check

    • Missing values: count and pattern
    • Outliers: identify using IQR or z-scores
    • Duplicates: check for repeated rows
    • Inconsistencies: data type issues, invalid values
  4. Descriptive Statistics

    • Central tendency (mean, median, mode)
    • Spread (standard deviation, range)
    • Distribution shape (skewness, kurtosis)
    • Percentiles (quartiles, deciles)
  5. Relationships

    • Correlations between variables
    • Group comparisons
    • Trends over time
    • Categorical associations
  6. Visualization Planning

    • Choose appropriate chart types
    • Identify key variables to visualize
    • Consider audience and purpose
  7. Insights & Recommendations

    • Summarize key findings
    • Answer original questions
    • Suggest next steps

Statistical Tests Decision Tree

Comparing Two Groups:

  • Numerical outcome → T-test or Mann-Whitney U
  • Categorical outcome → Chi-square or Fisher’s exact

Comparing Multiple Groups:

  • Numerical outcome → ANOVA or Kruskal-Wallis
  • Categorical outcome → Chi-square

Relationships:

  • Two numerical variables → Pearson or Spearman correlation
  • Numerical outcome with predictors → Regression analysis

Visualization Guidelines

Distributions:

  • Histogram: single numerical variable
  • Box plot: compare distributions across groups
  • Density plot: smooth distribution view

Comparisons:

  • Bar chart: categorical comparisons
  • Box plot: numerical across categories
  • Violin plot: distribution + comparison

Relationships:

  • Scatter plot: two numerical variables
  • Line chart: trends over time
  • Heatmap: correlation matrix

Compositions:

  • Pie chart: parts of a whole (use sparingly)
  • Stacked bar: composition over categories
  • Treemap: hierarchical composition

Examples

Example 1: Sales Data Analysis

Data:

date,product,sales,region
2024-01-01,A,150,North
2024-01-01,B,200,South
2024-01-02,A,180,North
...

Analysis Approach:

  1. Initial Questions:

    • Which products sell best?
    • Are there regional differences?
    • What are the trends over time?
  2. Exploration:

    # Shape and types
    print(df.shape)  # (365, 4)
    print(df.dtypes)
    
    # Missing values
    print(df.isnull().sum())
    
    # Basic statistics
    print(df['sales'].describe())
    
  3. Key Metrics:

    • Total sales by product
    • Average sales per region
    • Monthly sales trends
    • Sales distribution
  4. Visualizations:

    • Line chart: sales trend over time
    • Bar chart: total sales by product
    • Grouped bar: sales by product and region
    • Box plot: sales distribution by region
  5. Statistical Tests:

    • ANOVA: Do regions differ significantly?
    • Correlation: Relationship between date and sales
    • T-test: Product A vs B sales comparison
  6. Insights:

    • “Product B outsells Product A by 25% on average”
    • “North region shows 15% higher sales than South”
    • “Sales peak in Q4, suggesting seasonal pattern”

Example 2: Customer Survey Analysis

Data:

satisfaction,age,subscription_length
5,34,12
3,45,3
4,28,18
...

Analysis:

  1. Data Quality:

    • Check satisfaction scale (1-5)
    • Verify age ranges (18-100)
    • Identify missing values
  2. Descriptive Stats:

    • Mean satisfaction: 3.8
    • Age distribution: mostly 25-45
    • Subscription length: median 12 months
  3. Relationships:

    • Correlation: satisfaction vs. subscription_length (r = 0.6)
    • Age groups: No significant difference in satisfaction
  4. Segmentation:

    • New users (<6 months): satisfaction = 3.2
    • Established users (6-24 months): satisfaction = 4.1
    • Long-term users (>24 months): satisfaction = 4.3
  5. Visualization:

    • Histogram: satisfaction score distribution
    • Scatter: age vs satisfaction (colored by subscription length)
    • Box plot: satisfaction across subscription length groups
  6. Key Findings:

    • “Satisfaction strongly correlates with subscription length”
    • “New users are 28% less satisfied than established users”
    • “Focus retention efforts on first 6 months”

Best Practices

Do’s

  • Always start with data quality checks
  • Use appropriate statistical tests
  • Consider practical significance, not just statistical
  • Visualize before and during analysis
  • Document assumptions
  • Provide context for findings

Don’ts

  • Don’t assume causation from correlation
  • Don’t ignore missing data patterns
  • Don’t over-interpret small effects
  • Don’t use inappropriate tests
  • Don’t cherry-pick results
  • Don’t forget to check assumptions

Common Pitfalls

Simpson’s Paradox

A trend in aggregated data can reverse when data is split into groups. Always check subgroups.

Multiple Testing

Running many tests increases false positive risk. Adjust significance levels accordingly.

Outliers

Outliers can dramatically affect results. Always identify and consider their impact.

Sample Size

Small samples may not generalize. Report sample sizes and confidence intervals.

Output Format

Structure analysis as:

## Executive Summary
[Brief overview of key findings]

## Data Overview
- Rows: X
- Columns: Y
- Time period: Z
- Key variables: ...

## Data Quality
[Missing values, outliers, issues]

## Descriptive Statistics
[Central tendency, spread, distribution]

## Key Relationships
[Correlations, comparisons, trends]

## Visualizations
[Recommended charts and what they show]

## Statistical Tests
[Tests performed and results]

## Insights
[Bullet points of key findings]

## Recommendations
[Action items based on analysis]

## Limitations
[Data quality issues, assumptions, caveats]

Tools and Resources

Python libraries:

  • pandas: Data manipulation
  • numpy: Numerical operations
  • matplotlib/seaborn: Visualization
  • scipy: Statistical tests
  • statsmodels: Advanced statistics

R packages:

  • dplyr: Data manipulation
  • ggplot2: Visualization
  • tidyr: Data tidying
  • stats: Statistical tests

Limitations

  • Can’t execute code (provide code suggestions only)
  • Can’t see actual data (work from descriptions)
  • Statistical advice is general (not specialized consulting)
  • Can’t create visualizations (can recommend them)

Related Skills

  • Data Cleaning (preprocessing focus)
  • Statistical Modeling (advanced techniques)
  • Data Visualization (chart design)

Author

Claude Toolkit Community

Version

  • Created: 2025-01-18
  • Last Updated: 2025-01-18
  • Version: 1.0.0