data-visualization
npx skills add https://github.com/delphine-l/claude_global --skill data-visualization
Data Visualization Best Practices
Expert guidance for creating publication-quality scientific visualizations, avoiding common pitfalls, and optimizing figure clarity.
When to Use This Skill
- Creating figures for scientific publications
- Debugging misleading or distorted visualizations
- Optimizing figure layouts and element sizes
- Choosing appropriate plot types for data characteristics
- Ensuring statistical annotations fit properly
- Generating images for sharing with Claude or other AI tools
Common Pitfalls with Log-Scale Plots
Violin Plots on Log Scales
Problem: Violin plots use Kernel Density Estimation (KDE) in linear space, then the axis is transformed to log scale. This causes severe visual distortion where the violin shape doesn’t accurately represent the actual data distribution.
Symptoms:
- Smooth, blob-like violin shapes on log axes
- Visual representation suggests even distribution but histogram shows heavy clustering
- Particularly problematic with right-skewed data heavily concentrated in one region
Example of the problem:
# ❌ BAD: Violin plot on log scale
import matplotlib.pyplot as plt
import numpy as np
data = np.random.exponential(10, 1000) # Right-skewed data
fig, ax = plt.subplots()
ax.violinplot([data])
ax.set_yscale('log') # Distorts the violin shape!
# Result: Smooth violin that doesn't show the true concentration at low values
Solution 1: Use boxplots instead
# ✅ GOOD: Boxplot on log scale
ax.boxplot([data])
ax.set_yscale('log') # Boxplot statistics remain meaningful
Why boxplots work: Boxplot statistics (median, quartiles, outliers) are calculated as specific values, not density estimates, so they remain meaningful on log scales.
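This claim can be checked numerically; a minimal sketch using NumPy (an odd sample size is used so the median is a single data point):

```python
import numpy as np

# Quantiles commute with strictly monotonic transforms like log10,
# so the median/quartiles drawn by a boxplot stay meaningful on a log axis.
rng = np.random.default_rng(0)
data = rng.exponential(10, 1001)  # right-skewed, odd n

median_then_log = np.log10(np.median(data))
log_then_median = np.median(np.log10(data))
print(np.isclose(median_then_log, log_then_median))  # True
```

A KDE, by contrast, does not commute with the transform, which is exactly why violin shapes distort.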
Solution 2: Log-transform data first
# ✅ ALTERNATIVE: Log-transform data first, then use violin on linear axis
log_data = np.log10(data[data > 0])
ax.violinplot([log_data])
ax.set_ylabel('log10(Value)')
# Keep linear axis - violin now accurately represents log-space distribution
Solution 3: Use histograms with log axes
# ✅ GOOD: Histogram with log y-axis shows true frequency distribution
ax.hist(data, bins=30)
ax.set_xscale('log') # Data axis
ax.set_yscale('log') # Frequency axis - shows concentration clearly
Impact
This pitfall can lead to misleading figures in publications where the visual representation contradicts the actual data distribution. In our VGP curation analysis, this affected 4 different figures before correction.
Outlier Handling: Show First, Decide Later
Default stance: Show ALL data points in initial visualizations
# CORRECT - show all data
plt.boxplot(data, showfliers=True)
# AVOID initially - hides potentially important data
plt.boxplot(data, showfliers=False)
Rationale:
- Outliers may be biologically meaningful
- Filtering decisions should be informed by seeing complete data
- Easy to filter later, hard to know what you missed
- Patterns in outliers can reveal data quality issues
Workflow:
- Generate figures with all data (showfliers=True)
- Review with domain expert
- Decide case-by-case if outliers should be excluded
- Document rationale for any exclusions
Only filter outliers when:
- Technical artifact confirmed (e.g., processing error)
- Prevents seeing relevant patterns in bulk of data
- Documented and justified in methods
- Alternative view with all data provided in supplement
Example documentation:
# Remove known technical outlier
"""
Excluded assembly GCA_123456 from Figure 2 analysis:
- Scaffold N50 = 500 Gb (500× larger than genome size)
- Confirmed as assembly processing error in NCBI notes
- Other metrics for this assembly are valid and included in other figures
"""
Publication Figure Refinement
Element Sizing for Clarity
When figures are cluttered or elements overlap:
# Point sizes: Reduce for dense data
ax.scatter(..., s=25, alpha=0.5) # Down from s=60
# Line widths: Thinner lines reduce visual clutter
ax.plot(..., linewidth=1.5) # Down from 2.5
# Text sizes: Prevent overlap
ax.text(..., fontsize=8) # Down from 10
# Error bar cap sizes
ax.errorbar(..., capsize=5) # Standard readable size
P-value and Annotation Positioning
Problem: Statistical annotations (p-values, significance stars) often placed outside plot bounds
Solution: Position relative to data range with explicit limits
# Calculate data range
y_max = max([d.max() for d in data_list])
y_min = min([d.min() for d in data_list])
# Position annotations within plot
y_pos = y_max * 0.92 # 92% of max, not 105% which goes outside
ax.text(x_pos, y_pos, 'p < 0.001***', ha='center', fontsize=9)
# Set explicit limits with headroom
ax.set_ylim(y_min * 0.95 if y_min > 0 else -5, y_max * 1.05)
Panel Layout Testing
Test both orientations to find clearest presentation:
# Side-by-side (good for comparing distributions)
fig, axes = plt.subplots(1, 2, figsize=(16, 7))
# Stacked vertically (good for larger individual panels)
fig, axes = plt.subplots(2, 1, figsize=(10, 14))
Decision criteria:
- Side-by-side: Better for direct left-right comparison
- Stacked: Better when each panel needs more space
- Let user feedback guide the choice
Adding Sample Sizes to Legends
Why Sample Sizes Matter: Readers need to assess statistical power at a glance. Include sample sizes directly in legend labels for scientific figures.
Pattern 1: Simple Legend with Sample Sizes
# Calculate sample sizes once
category_sizes = df.groupby('category').size().to_dict()
# Use in scatter plot legend
for category in categories:
    data = df[df['category'] == category]
    ax.scatter(data['x'], data['y'],
               label=f"{category} (n={category_sizes[category]})")
ax.legend(loc='best')
Pattern 2: Custom Legend for Complex Plots
When you have multiple marker types (e.g., technology + category), create custom legend:
from matplotlib.lines import Line2D
# Calculate sizes
category_sizes = df.groupby('category').size().to_dict()
# Create custom legend elements
custom_lines = [
Line2D([0], [0], color=colors['Cat1'], marker='o', linestyle='', markersize=8),
Line2D([0], [0], color=colors['Cat2'], marker='o', linestyle='', markersize=8),
]
custom_labels = [
f"Category 1 (n={category_sizes['Cat1']})",
f"Category 2 (n={category_sizes['Cat2']})",
]
ax.legend(custom_lines, custom_labels, loc='best', fontsize=9)
Pattern 3: Multi-Panel Figures – Show Legend Once
For 2×3 or similar grids, show legend only in first subplot:
# Calculate once, use in all panels
category_sizes = df.groupby('category').size().to_dict()
for idx, metric in enumerate(metrics):
    ax = axes[idx]
    for category in categories:
        # Only add label for first subplot
        if idx == 0:
            label_text = f"{category} (n={category_sizes[category]})"
        else:
            label_text = ''
        ax.scatter(..., label=label_text)
    if idx == 0:
        ax.legend(loc='best')
Best Practices:
- Calculate sizes once at the top (efficient, avoids repeated computation)
- Use consistent format: Category Name (n=123)
- For small panels, use fontsize=7-9
- Consider ncol=1 for vertical layout if space allows
- Place sample sizes in legend OR as text annotations, not both
Publication Standards:
- Nature/Science: Strongly recommended for all comparative figures
- PLOS: Required for sample size transparency
- Cell: Expected in methods or figure legends
- General guideline: Always include when comparing groups
Example: Temporal Analysis with Categories
# Temporal trends by category
category_sizes = df.groupby('category').size().to_dict()
fig, ax = plt.subplots(figsize=(10, 6))
for category in ['Phased+Dual', 'Phased+Single', 'Pri/alt+Single']:
    data = df[df['category'] == category]
    ax.scatter(data['year'], data['quality_metric'],
               color=colors[category],
               label=f"{category} (n={category_sizes[category]})",
               alpha=0.6, s=40)
ax.set_xlabel('Year')
ax.set_ylabel('Assembly Quality')
ax.legend(loc='best', fontsize=9)
Visualizing Category Proportions Over Time
Use Case: Show how the relative proportions of categories have changed over time.
Dual-Panel Approach: Proportions + Absolute Counts
Show both relative and absolute trends using side-by-side panels.
Left Panel: Stacked area chart (proportions sum to 100%) Right Panel: Stacked bar chart (shows actual sample sizes)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Calculate counts and proportions by year
year_category_counts = df.groupby(['year', 'category']).size().unstack(fill_value=0)
year_category_proportions = year_category_counts.div(
year_category_counts.sum(axis=1), axis=0
) * 100
# Dual-panel figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
# Panel 1: Stacked area (proportions)
years = year_category_proportions.index
categories = ['Cat1', 'Cat2', 'Cat3']
# Calculate total sizes for legend
total_counts = df.groupby('category').size().to_dict()
bottom = np.zeros(len(years))
for category in categories:
    values = year_category_proportions[category].values
    ax1.fill_between(years, bottom, bottom + values,
                     label=f"{category} (n={total_counts[category]})",
                     color=colors[category], alpha=0.7)
    bottom += values
ax1.set_xlabel('Year', fontsize=12)
ax1.set_ylabel('Proportion (%)', fontsize=12)
ax1.set_title('Category Proportions Over Time', fontsize=14, fontweight='bold')
ax1.set_ylim(0, 100)
ax1.legend(loc='best', fontsize=10)
ax1.grid(axis='y', alpha=0.3)
ax1.xaxis.set_major_locator(plt.MaxNLocator(integer=True))
# Panel 2: Stacked bar (absolute counts)
year_category_counts[categories].plot(
kind='bar', stacked=True, ax=ax2,
color=[colors[c] for c in categories],
width=0.7, edgecolor='black', linewidth=0.5
)
ax2.set_xlabel('Year', fontsize=12)
ax2.set_ylabel('Number of Samples', fontsize=12)
ax2.set_title('Absolute Counts by Category', fontsize=14, fontweight='bold')
ax2.legend(title='Category',
labels=[f"{c} (n={total_counts[c]})" for c in categories],
loc='upper left', fontsize=9)
ax2.grid(axis='y', alpha=0.3)
ax2.set_xticklabels([int(y) for y in years], rotation=0)
# Add totals on top of bars
for i, year in enumerate(years):
    total = year_category_counts.loc[year].sum()
    ax2.text(i, total + 2, str(int(total)),
             ha='center', va='bottom', fontsize=9, fontweight='bold')
plt.tight_layout()
plt.savefig('category_proportions.png', dpi=150, bbox_inches='tight')
Why Both Panels?
Proportions (Area Chart):
- Shows relative shifts in category usage
- Easy to see if one category is growing/declining
- Sums to 100% (intuitive interpretation)
Absolute Counts (Bar Chart):
- Shows actual sample sizes (statistical power)
- Reveals total data volume changes
- Helps interpret proportion changes (growing proportion of shrinking pie?)
Together: Complete picture of temporal trends
Styling Tips:
- Colors: Use colorblind-safe palette consistently across both panels
- Edge colors: Black edges on bars improve readability (linewidth=0.5)
- Totals: Add count labels above stacked bars for context
- X-axis: Integer years, not decimals (use MaxNLocator(integer=True))
- Legend: Include total sample sizes: Category (n=123)
When to Use:
- Tracking category adoption over time
- Showing methodology shifts in field
- Demonstrating changing experimental approaches
- Any temporal categorical composition analysis
Color Schemes
Colorblind-Friendly Palettes
Standard palette for dual comparisons:
COLORS = {
'Group1': '#0173B2', # Blue
'Group2': '#DE8F05' # Orange
}
Accessible to most common color vision deficiencies.
Comprehensive Colorblind-Safe Color Palettes for Scientific Figures
Problem: Poor Color Accessibility
Common issue: Default color schemes often use green-blue or red-green combinations that are indistinguishable for colorblind viewers (~8% of population).
Examples of problematic combinations:
- Green + Blue (similar for deuteranopia/protanopia)
- Red + Green (classic colorblindness issue)
- Light blue + Dark blue (insufficient contrast)
Okabe-Ito Palette (Recommended by Nature)
The gold standard for scientific figures, developed by Masataka Okabe and Kei Ito.
Complete 8-color palette (hex codes):
okabe_ito = {
'orange': '#E69F00',
'sky_blue': '#56B4E9',
'bluish_green': '#009E73',
'yellow': '#F0E442',
'blue': '#0072B2',
'vermillion': '#D55E00',
'reddish_purple': '#CC79A7',
'black': '#000000'
}
For 3 categories (maximum distinction):
# Best combination for 3 categories
category_colors = {
'Category_A': '#0072B2', # Blue
'Category_B': '#E69F00', # Orange
'Category_C': '#CC79A7' # Reddish Purple
}
Why this combination:
- Blue (cool) + Orange (warm) + Purple (neutral) = maximum perceptual separation
- Works for all types of colorblindness (deuteranopia, protanopia, tritanopia)
- Blue-orange is universally distinguishable
- No green-blue or red-green confusion
For 5+ categories, use additional colors from the palette:
five_colors = {
'Cat_1': '#0072B2', # Blue
'Cat_2': '#E69F00', # Orange
'Cat_3': '#CC79A7', # Reddish Purple
'Cat_4': '#D55E00', # Vermillion
'Cat_5': '#F0E442' # Yellow
}
Paul Tol’s Bright Palette (Alternative)
Another scientifically validated option:
paul_tol_bright = {
'blue': '#4477AA',
'red': '#EE6677',
'green': '#228833',
'yellow': '#CCBB44',
'cyan': '#66CCEE',
'purple': '#AA3377',
'grey': '#BBBBBB'
}
Implementation in Matplotlib/Seaborn
Set up colorblind-safe palette:
import matplotlib.pyplot as plt
import seaborn as sns
# Okabe-Ito colors for 3 categories
colors = ['#0072B2', '#E69F00', '#CC79A7']
# Apply to matplotlib
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=colors)
# Or use directly in plots
fig, ax = plt.subplots()
for i, category in enumerate(['A', 'B', 'C']):
    ax.plot(x, y[i], color=colors[i], label=category)
For categorical plots (seaborn):
# Define palette dictionary
palette = {
'Phased+Dual': '#0072B2',
'Phased+Single': '#E69F00',
'Pri/alt+Single': '#CC79A7'
}
# Use in seaborn
sns.boxplot(data=df, x='category', y='value', palette=palette)
Best Practices
- Avoid red-green combinations – Most common colorblindness type
- Use patterns/markers too – Combine color with shapes for redundancy
- Test your figures – Use colorblindness simulators online
- Document your palette – Add comment explaining choice
- Be consistent – Use same colors for same categories across all figures
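One quick, scriptable sanity check (a rough heuristic, not a substitute for a proper colorblindness simulator) is to compare relative luminance, which is roughly what survives a grayscale print. A sketch using matplotlib's color utilities and BT.709 luminance weights:

```python
from matplotlib.colors import to_rgb

def relative_luminance(hex_color):
    """Approximate relative luminance (ITU-R BT.709 weights)."""
    r, g, b = to_rgb(hex_color)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

# Okabe-Ito trio recommended above
palette = {'Blue': '#0072B2', 'Orange': '#E69F00', 'Purple': '#CC79A7'}
for name, hex_color in palette.items():
    print(f"{name}: luminance {relative_luminance(hex_color):.2f}")
```

Well-separated luminance values suggest the categories will remain distinguishable in grayscale; close values mean you should lean harder on markers and line styles.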
Example: Complete Figure Setup
# Okabe-Ito palette for 3 categories
category_colors = {
'Method_A': '#0072B2', # Blue (Okabe-Ito)
'Method_B': '#E69F00', # Orange (Okabe-Ito)
'Method_C': '#CC79A7' # Reddish Purple (Okabe-Ito)
}
# Also use different markers for redundancy
markers = {
'Method_A': 'o', # circle
'Method_B': 's', # square
'Method_C': '^' # triangle
}
# Plot with both color and marker distinction
for method in ['Method_A', 'Method_B', 'Method_C']:
    data = df[df['method'] == method]
    plt.scatter(data['x'], data['y'],
                color=category_colors[method],
                marker=markers[method],
                label=method, s=50, alpha=0.7)
plt.legend()
plt.title('Analysis Results (Colorblind-Safe)')
When to Use Which Palette
Okabe-Ito:
- Scientific publications (recommended by Nature)
- 3-8 categorical variables
- Need maximum accessibility
- Standard for academic figures
Paul Tol:
- Alternative when you want different aesthetics
- Good for presentations
- Widely used in Europe
Seaborn ‘colorblind’:
- Quick matplotlib/seaborn integration
- Based on similar principles
- Built-in convenience
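If you are already using seaborn, the built-in palette is one call away; a sketch (it returns RGB tuples accepted anywhere matplotlib expects a color):

```python
import seaborn as sns

# Fetch three colorblind-safe colors and make them the default cycle
colors = sns.color_palette('colorblind', n_colors=3)
sns.set_palette('colorblind')
print(len(colors))  # 3
```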
Resources
- Okabe-Ito palette: https://jfly.uni-koeln.de/color/
- Paul Tol’s schemes: https://personal.sron.nl/~pault/
- Colorblind simulator: https://www.color-blindness.com/coblis-color-blindness-simulator/
- Venngage guide: https://venngage.com/blog/color-blind-friendly-palette/
Real Example: VGP Assembly Analysis
# Before: Similar blues caused confusion
old_colors = {
'Phased+Dual': '#1976D2', # Dark blue
'Phased+Single': '#4FC3F7', # Light blue - TOO SIMILAR!
'Pri/alt+Single': '#66BB6A' # Green - confusing with blue
}
# After: Okabe-Ito palette with maximum distinction
new_colors = {
'Phased+Dual': '#0072B2', # Blue
'Phased+Single': '#E69F00', # Orange - DISTINCT!
'Pri/alt+Single': '#CC79A7' # Purple - DISTINCT!
}
This ensures all readers can distinguish categories regardless of color vision deficiency.
Image Size Constraints for Claude API
CRITICAL: When generating images to share with Claude (for review, debugging, etc.), images must not exceed 8000 pixels in either dimension.
Check Image Size Before Opening
Always verify image dimensions before trying to display them in Claude:
from PIL import Image
# Check dimensions
img = Image.open('figure.png')
print(f"Image size: {img.width}x{img.height}")
if img.width > 8000 or img.height > 8000:
    print("⚠️ WARNING: Image too large for Claude API!")
    print("   Claude limit: 8000px max dimension")
    print(f"   Your image: {img.width}x{img.height}")
Set Size Constraints When Generating Figures
For matplotlib/seaborn figures:
import matplotlib.pyplot as plt
# Set figure size to stay under Claude's limits
# Rule of thumb: Keep figsize under (80, 80) at 100 DPI
# Or under (26, 26) at 300 DPI
fig, ax = plt.subplots(figsize=(16, 12)) # Safe: 1600x1200 at 100 DPI
# When saving, control DPI to stay under limits
# 7999px / 300 DPI = 26.6 inches max
# 7999px / 100 DPI = 79.9 inches max
plt.savefig('figure.png', dpi=300, bbox_inches='tight') # Max ~26x26 inches
# For very large figures, use lower DPI
plt.savefig('large_figure.png', dpi=100, bbox_inches='tight') # Max ~80x80 inches
Safe figure size presets:
# Publication quality (300 DPI) - fits Claude limit
FIG_SIZES = {
'single_column': (3.5, 4), # 1050x1200 px
'double_column': (7, 5), # 2100x1500 px
'full_page': (7, 9), # 2100x2700 px
'poster': (20, 15), # 6000x4500 px - safe for Claude
'max_claude': (26, 26), # 7800x7800 px - maximum safe size
}
fig, ax = plt.subplots(figsize=FIG_SIZES['double_column'])
plt.savefig('figure.png', dpi=300, bbox_inches='tight')
Resize Oversized Images
If you have an existing image that’s too large:
from PIL import Image
def resize_for_claude(image_path, max_dim=7999, output_path=None):
    """
    Resize image to fit Claude's API constraints.
    Args:
        image_path: Path to input image
        max_dim: Maximum dimension (default 7999 for safety margin)
        output_path: Output path (default: adds '_resized' to filename)
    """
    img = Image.open(image_path)
    # Check if resize needed
    if img.width <= max_dim and img.height <= max_dim:
        print(f"✅ Image OK: {img.width}x{img.height}")
        return image_path
    # Calculate new size preserving aspect ratio
    img.thumbnail((max_dim, max_dim), Image.Resampling.LANCZOS)
    # Save
    if output_path is None:
        base = image_path.rsplit('.', 1)[0]
        ext = image_path.rsplit('.', 1)[1]
        output_path = f"{base}_resized.{ext}"
    img.save(output_path)
    print(f"✅ Resized: {image_path}")
    print(f"   Original: {Image.open(image_path).size}")
    print(f"   New: {img.size}")
    print(f"   Saved: {output_path}")
    return output_path
# Usage
resize_for_claude('large_figure.png')
Quick Checks
Bash one-liner to check size:
# Using ImageMagick
identify figure.png | grep -o '[0-9]*x[0-9]*'
# Check if oversized
python3 -c "from PIL import Image; img=Image.open('figure.png'); print(f'{img.width}x{img.height}'); exit(0 if img.width<=7999 and img.height<=7999 else 1)" && echo "OK" || echo "TOO LARGE"
Add to notebook imports:
# Standard imports for Claude-compatible figures
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
# Set global figure size limit
plt.rcParams['figure.max_open_warning'] = 50
MAX_CLAUDE_DIM = 7999 # Claude API limit: 8000px, use 7999 for safety
def save_figure(filename, dpi=300, **kwargs):
    """Save figure with Claude size constraint check."""
    plt.savefig(filename, dpi=dpi, bbox_inches='tight', **kwargs)
    # Verify size
    img = Image.open(filename)
    if img.width > MAX_CLAUDE_DIM or img.height > MAX_CLAUDE_DIM:
        print(f"⚠️ WARNING: {filename} exceeds Claude limit!")
        print(f"   Size: {img.width}x{img.height} (max: {MAX_CLAUDE_DIM})")
        print("   Resizing...")
        img.thumbnail((MAX_CLAUDE_DIM, MAX_CLAUDE_DIM), Image.Resampling.LANCZOS)
        img.save(filename)
        print(f"   ✅ Resized to: {img.width}x{img.height}")
    else:
        print(f"✅ Saved {filename}: {img.width}x{img.height}")
Common Scenarios
High-DPI screenshots from Retina displays:
- Retina screenshots are 2x pixel density
- A full-screen 4K monitor screenshot can be 7680×4320 (OK)
- A 5K monitor screenshot is 10240×5760 (TOO LARGE!)
- Solution: Resize before sharing or take partial screenshots
Multi-panel figures:
# Instead of one huge figure with many panels
fig, axes = plt.subplots(4, 4, figsize=(40, 40)) # Could be 12000x12000 px!
# Split into smaller figures
for i in range(4):
    fig, axes = plt.subplots(2, 2, figsize=(12, 12))  # 3600x3600 px - safe!
    # Plot subset of panels
    plt.savefig(f'figure_part{i}.png', dpi=300, bbox_inches='tight')
Error Recovery
If you get the error:
API Error: 400 ... image dimensions exceed max allowed size: 8000 pixels
The error is stuck in conversation history. To recover:
- Skip the message: “Please ignore the oversized image in the previous message”
- Resize and resend: Use the resize_for_claude() function above
- Use /safe-clear: Save context and start fresh (if command available)
Jupyter Notebook Image Size Issues
Oversized Images from Combined Output
Problem: Jupyter notebook saves figures as extremely tall images (e.g., 1541 x 42,011 pixels) that exceed the 8000 pixel limit.
Cause: When a cell generates both a figure AND text output (print statements, statistical results), Jupyter captures both as a single tall image. The text output is rendered as image pixels below the figure, creating a massive combined image.
Symptoms:
- Image dimensions like 1541 x 42,011 pixels (height >> 8000)
- Figure displays fine in notebook but won’t display in Claude or other tools
- Error: “image dimensions exceed max allowed size: 8000 pixels”
Example of the problem:
# Cell that creates oversized image
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
# ... plotting code ...
plt.tight_layout()
plt.savefig('figure.png', dpi=150, bbox_inches='tight')
plt.show()
# Text output after figure (PROBLEM!)
print("Statistical Results:")
print(f"Spearman correlation: rho={rho:.3f}, p={pval:.4f}")
# Multiple print statements create tall text output
# Jupyter combines this with figure into one 42K pixel tall image
Solution 1: Split into multiple figures
Instead of creating one large multi-panel figure, split into smaller figures:
# ✅ GOOD: Split 2×3 grid into two 1×3 grids
# Figure 1: First 3 panels
fig1, axes1 = plt.subplots(1, 3, figsize=(10, 3.5))
# ... plot first 3 panels ...
plt.savefig('figure_part1.png', dpi=150, bbox_inches='tight')
plt.show()
# Figure 2: Second 3 panels
fig2, axes2 = plt.subplots(1, 3, figsize=(10, 3.5))
# ... plot second 3 panels ...
plt.savefig('figure_part2.png', dpi=150, bbox_inches='tight')
plt.show()
Solution 2: Separate text output into different cell
Move print statements to a separate cell after the figure:
# Cell 1: Just the figure
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
# ... plotting code ...
plt.savefig('figure.png', dpi=150, bbox_inches='tight')
plt.show()
# Cell 2: Text output (separate!)
print("Statistical Results:")
print(f"Spearman correlation: rho={rho:.3f}, p={pval:.4f}")
Solution 3: Suppress text output in figure cell
# Capture results without printing
results = []
for category in categories:
    rho, pval = stats.spearmanr(x, y)
    results.append({'category': category, 'rho': rho, 'pval': pval})
# Create figure (no print statements!)
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
# ... plotting code ...
plt.savefig('figure.png', dpi=150, bbox_inches='tight')
plt.show()
# Display results in separate cell or as DataFrame
results_df = pd.DataFrame(results)
When to split figures:
- Multi-panel figures with many subplots (3+ rows × 2+ columns)
- Any figure where dimensions approach 8000 pixels
- When cell has significant text output after figure
- When total cell output height feels very long in notebook
Prevention:
- Use the save_figure() helper from the jupyter-notebook skill (auto-checks size)
- Keep figure cells focused on visualization only
- Save statistical results to CSV files instead of printing
- Use separate markdown cells for result interpretation
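For the last two points, writing results to a file keeps the figure cell's output clean; a minimal sketch with pandas (the categories and statistics here are illustrative placeholders):

```python
import pandas as pd

# Collect statistics instead of printing them in the figure cell
results = [
    {'category': 'Phased+Dual', 'rho': 0.32, 'pval': 2.7e-4},
    {'category': 'Pri/alt+Single', 'rho': -0.17, 'pval': 0.0057},
]
results_df = pd.DataFrame(results)
results_df.to_csv('stats_results.csv', index=False)  # review in a later cell
```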
Matplotlib Text Positioning Creates Empty Space
Problem: Saved figure has large area of empty white space above the actual plot, making the figure much taller than necessary.
Cause: Using transform=ax.get_xaxis_transform() with data coordinate y-values positions text far outside the plot bounds. The transform uses axis-relative coordinates (0-1) for x but data coordinates for y, so large y-values create huge positioning errors.
Symptoms:
- Empty white space at top (or bottom) of saved figure
- Text annotations not visible or way off the plot
- tight_layout() and pad_inches adjustments don’t fix it
- Problem persists even after reducing figure size
Example of the problem:
# ❌ BAD: Mixing coordinate systems
for category in categories:
    data = df[df['category'] == category]
    ax.scatter(data['x'], data['y'])
# Add significance marker
y_max = data['y'].max() # e.g., y_max = 2000000000 (2 billion)
# PROBLEM: y_max is in data coordinates, transform expects 0-1!
ax.text(0.5, y_max * 0.95, '***',
transform=ax.get_xaxis_transform(), # x in 0-1, y in data coords
ha='center', fontsize=12)
# This positions text at y = 1.9 billion in the mixed coordinate system!
plt.savefig('figure.png', dpi=150, bbox_inches='tight')
# Result: Massive empty space with text way above visible plot
Why this happens:
- ax.get_xaxis_transform() uses axis coordinates (0-1) for x, data coordinates for y
- y_max * 0.95 for scaffold N50 might be 1.9 billion (1.9e9)
- Transform interprets this as 1.9 billion axis units above the plot
- bbox_inches='tight' includes this invisible text, creating empty space
Solution 1: Position within data range (RECOMMENDED)
Calculate position within the actual data coordinate system:
# ✅ GOOD: Position within plot bounds using pure data coordinates
for category in categories:
    data = df[df['category'] == category]
    ax.scatter(data['x'], data['y'])
# Get actual data range
y_min, y_max = ax.get_ylim()
# Position at 90% of the visible range
y_pos = y_min + (y_max - y_min) * 0.90
# Use data coordinates (no transform needed)
ax.text(0.5, y_pos, '***',
ha='center', va='center', fontsize=12)
Solution 2: Use axis transform correctly with 0-1 coordinates
If you want to use the transform, use 0-1 range for y:
# ✅ GOOD: Both x and y in 0-1 axis coordinates
ax.text(0.5, 0.90, '***',
transform=ax.transAxes, # Both x and y in 0-1 range
ha='center', va='center', fontsize=12)
Solution 3: Use annotate with xycoords
# ✅ GOOD: Explicit coordinate specification
ax.annotate('***',
xy=(x_pos, y_pos), # Data coordinates
xycoords='data',
ha='center', va='center', fontsize=12)
Coordinate Transform Quick Reference:
| Transform | X coordinate | Y coordinate | Use case |
|---|---|---|---|
| None (default) | Data | Data | Normal plotting |
| ax.transAxes | 0-1 (axis) | 0-1 (axis) | Position relative to axes |
| ax.get_xaxis_transform() | 0-1 (axis) | Data | Span markers, axis labels |
| ax.get_yaxis_transform() | Data | 0-1 (axis) | Y-axis annotations |
When to use each approach:
- Data coordinates (no transform): Annotations tied to specific data points
- Axis coordinates (transAxes): Labels in fixed positions (e.g., panel letters)
- Mixed transforms: Advanced use only, requires careful coordinate scaling
Debugging tips:
- If empty space appears, check for text/annotation calls with large y-values
- Use ax.get_ylim() to verify reasonable y-coordinate range
- Temporarily comment out text/annotation calls to identify culprit
- Verify saved figure dimensions match expected size
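The first two tips can be automated: because text artists do not participate in autoscaling, a stray annotation can silently land far off-plot. A sketch that flags any text outside the current y-limits:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripting
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 100])
ax.text(0.5, 1.9e9, '***')  # accidentally placed far outside the data range

# Flag any text artist whose y-position falls outside the visible range
y_lo, y_hi = ax.get_ylim()
stray = [t for t in ax.texts
         if not (y_lo <= t.get_position()[1] <= y_hi)]
for t in stray:
    print(f"Text {t.get_text()!r} at y={t.get_position()[1]:.3g} "
          f"is outside ylim ({y_lo:.3g}, {y_hi:.3g})")
```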
Prevention:
- Prefer pure data coordinates for most annotations
- Only use transforms when specifically needed
- Always verify coordinate ranges match transform type
- Test saved figure size after adding annotations
Common Matplotlib Issues and Fixes
Float Year Labels on X-Axis
Problem: X-axis shows decimal years (2021.0, 2021.5, 2022.0) instead of clean integers (2021, 2022, 2023).
Cause: Matplotlib’s default tick formatter displays float values with decimals when the data type is float.
Solution: Use MaxNLocator with integer=True
import matplotlib.pyplot as plt
# After creating plot
ax.scatter(df['year'], df['value'])
# Fix x-axis to show only integer years
ax.xaxis.set_major_locator(plt.MaxNLocator(integer=True))
Why This Works:
- MaxNLocator(integer=True) constrains tick locations to integers
- Works even when underlying data is float (e.g., release_year column as float64)
- Automatically chooses appropriate spacing (won’t show every year if range is large)
Common Use Case: Temporal analyses where year data is stored as float but should display as integer for readability.
Example:
# Data with float years
df['release_year'] = [2021.0, 2022.0, 2023.0, 2024.0, 2025.0]
fig, ax = plt.subplots()
ax.scatter(df['release_year'], df['metric'])
# Without fix: x-axis shows 2021.0, 2021.5, 2022.0, 2022.5, ...
# With fix: x-axis shows 2021, 2022, 2023, 2024, 2025
ax.xaxis.set_major_locator(plt.MaxNLocator(integer=True))
Best Practices
- Always check log-scale plots: If using KDE-based plots (violin, ridge) on log axes, verify against histogram
- Test element sizes: Regenerate figures with different sizes to find optimal clarity
- Explicit axis limits: Don’t rely on automatic limits when annotations are added
- Consistent styling: Use seaborn context and style for publication consistency
- High DPI: Save at 300 DPI minimum for publication (dpi=300, bbox_inches='tight')
- Optimize axis ranges for data distribution: When data is concentrated in a narrow range, adjust axis limits to improve visibility
- Check image dimensions: Verify size before sharing with Claude (max 7999×7999 pixels)
- Set size constraints in scripts: Use safe figure sizes when generating images programmatically
Axis Range Optimization for Compressed Distributions
When data is concentrated at one end of range:
Problem: Cumulative distributions all at 80-100% look compressed with 0-100% Y-axis
Solution: Adjust axis limits to focus on data range
# For chromosome assignment cumulative distribution (mostly 80-100%)
ax.set_ylim(50, 100) # Start at 50% instead of 0%
# For legend placement with adjusted range
ax.legend(loc='upper left') # Prevents overlap with curves at top-right
When to adjust axis ranges:
- ✅ Data concentrated in narrow range (e.g., 80-100%)
- ✅ Improves visibility of differences
- ✅ All relevant data still visible
- ✅ Makes small differences more apparent
When NOT to adjust:
- ❌ Would hide meaningful outliers
- ❌ Creates misleading visual impression
- ❌ Data actually spans full range
- ❌ Standard in field to show full range (0-100%)
Best practice: Show both views if controversial
- Main figure: Zoomed range for clarity
- Supplementary: Full range for context
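The dual-view approach can be sketched in one figure during drafting (synthetic data; assuming matplotlib/NumPy):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
pct = np.sort(80 + 20 * rng.random(50))  # values concentrated in 80-100%

# Same data twice: zoomed main view plus full-range supplementary view
fig, (ax_main, ax_supp) = plt.subplots(1, 2, figsize=(10, 4))
for ax, ylim, title in [(ax_main, (50, 100), 'Main: zoomed for clarity'),
                        (ax_supp, (0, 100), 'Supplement: full range')]:
    ax.plot(pct)
    ax.set_ylim(*ylim)
    ax.set_title(title)
plt.tight_layout()
```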
Scientific Figure Descriptions for Publications
Structure for Multi-Panel Figures
Opening Sentence: Overview + sample size + stratification
**[Analysis type] of [N] [unit]** across [timeframe/condition], stratified by
[categories]: Category1 (n=X), Category2 (n=Y), Category3 (n=Z).
Panel Descriptions: For each panel:
**[Metric Name]** (panel location): [Pattern observed]. [Statistical test]
(ρ=[value], p=[value]) shows [interpretation]. [Biological/technical context].
Closing Interpretation: Synthesize findings
**Interpretation:** [Overall pattern]. [Comparison across categories].
[Methodological implications]. [Connection to study goals].
Example: Temporal Trends Figure
### Figure 4. Temporal trends in assembly quality metrics for HiFi-only assemblies (2021-2025)
**Temporal analysis of six assembly quality metrics across 268 HiFi assemblies**
spanning 2021-2025, stratified by assembly and curation method: Phased+Dual
(n=101, blue), Phased+Single (n=42, orange), and Pri/alt+Single (n=125, purple).
Each panel displays individual assembly measurements (points) with linear
regression trend lines (dashed) for each category. Trend significance was
assessed using Spearman correlation (α=0.05).
**Key Findings:**
**Scaffold N50** (upper left): Pri/alt+Single assemblies show significant
improvement over time (ρ=0.32, p=2.7×10⁻⁴), increasing from ~100 Mb to ~700 Mb,
while Phased assemblies remain stable at ~100-200 Mb. This suggests technological
improvements in single-assembly methods during the HiFi era.
**Gap Density** (upper middle): All HiFi assemblies collectively show decreasing
gap density over time (ρ=-0.17, p=0.0057), indicating improved sequence continuity.
[Additional panels...]
**Interpretation:** Temporal trends are category-specific and metric-dependent.
Pri/alt+Single assemblies show quality improvements (N50, gap density) consistent
with technological advancement during 2021-2025. Phased assemblies remain stable
across most metrics, suggesting their quality is primarily methodology-determined.
Quantitative Details to Include
Always include:
- ✅ Sample sizes (n=X) for each group
- ✅ Statistical test used (Spearman, Mann-Whitney, etc.)
- ✅ Effect sizes (ρ, r², effect magnitude)
- ✅ p-values with scientific notation (p=2.7×10⁻⁴)
- ✅ Temporal/spatial ranges (2021-2025, 100-700 Mb)
- ✅ Significance threshold (α=0.05)
Avoid:
- ❌ Vague terms (“improved”, “changed”) without quantification
- ❌ p-values without effect sizes
- ❌ Missing sample sizes
- ❌ Unspecified statistical methods
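A small helper can enforce this annotation style consistently across figures; `stat_label` is a hypothetical function, not part of any library:

```python
def stat_label(test, effect_symbol, effect, p, alpha=0.05):
    """Format a statistical annotation with test name, effect size,
    p-value in scientific notation, and the significance threshold."""
    sig = 'significant' if p < alpha else 'not significant'
    return f'{test} {effect_symbol}={effect:.2f}, p={p:.1e} ({sig} at α={alpha})'

stat_label('Spearman', 'ρ', 0.32, 2.7e-4)
# → 'Spearman ρ=0.32, p=2.7e-04 (significant at α=0.05)'
```

Centralizing the format avoids the mix of quantified and vague annotations that the checklist warns against.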
Adding to Jupyter Notebooks
import json

# Load the notebook (path is illustrative)
with open('analysis.ipynb') as f:
    nb = json.load(f)

fig_description = {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
        "### Figure X. [Title]\n",
        "\n",
        "**[Opening with sample sizes]**\n",
        "\n",
        "**Key Findings:**\n",
        "\n",
        "**[Metric 1]**: [Statistical result]. [Interpretation].\n",
        "\n",
        "**Interpretation:** [Synthesis]."
    ]
}

# Insert after the plotting cell (plot_cell_idx is the index of the
# plotting cell, determined beforehand by scanning nb['cells'])
nb['cells'].insert(plot_cell_idx + 1, fig_description)

# Write the notebook back out
with open('analysis.ipynb', 'w') as f:
    json.dump(nb, f, indent=1)
iTOL (Interactive Tree of Life) Dataset Creation
Overview
iTOL is a web-based tool for phylogenetic tree visualization. Creating annotation datasets requires specific formats and understanding format differences between legacy and modern approaches.
Key Format Types
1. DATASET_STYLE (Modern Format for Branch/Node Coloring)
Use for coloring individual terminal branches or nodes.
Critical requirements:
- Use `SEPARATOR COMMA` (not TAB)
- Format: `species,branch,node,#color,width,style`
- The three fields (branch/node/style) are all required even though only one is used
DATASET_STYLE
SEPARATOR COMMA
DATASET_LABEL,Terminal Branch Colors by Taxonomy
COLOR,#ff0000
DATA
Homo_sapiens,branch,node,#C084C0,2,normal
Mus_musculus,branch,node,#C084C0,2,normal
Common errors avoided:
- ❌ Using TAB separator → causes “Invalid color definition” errors
- ❌ Using `clade` instead of individual species → all branches get same color
- ❌ Omitting required fields → format errors
2. DATASET_BINARY (Presence/Absence Markers)
Use for adding symbols (checkmarks, stars, etc.) to specific species.
DATASET_BINARY
SEPARATOR TAB
DATASET_LABEL Dual Curation
FIELD_SHAPES 6
FIELD_LABELS Dual Curation
FIELD_COLORS #FF0000
LEGEND_TITLE Curation Status
LEGEND_SHAPES 6
LEGEND_COLORS #FF0000
LEGEND_LABELS Dual Curation
DATA
Homo_sapiens 1
Mus_musculus 1
Symbol codes:
- 1 = circle, 2 = square, 3 = diamond, 4 = triangle, 5 = filled square, 6 = checkmark
3. DATASET_COLORSTRIP (Colored Rectangles)
DATASET_COLORSTRIP
SEPARATOR TAB
DATASET_LABEL Taxonomic Lineage
DATA
Homo_sapiens #C084C0 Mammals
Mus_musculus #C084C0 Mammals
Species Name Synchronization
Problem: Tree species names often differ from metadata due to:
- TimeTree database replacements (standardization)
- Spelling variants (e.g., `Chiropotes_utahickae` vs `Chiropotes_utahicki`)
- Case differences (e.g., `Alca_torda` vs `Alca_Torda`)
- Trailing spaces in CSV files
Solution workflow:
- Export tree species list: `grep -oE "[A-Z][a-z]+_[a-z]+" Tree.nwk | sort -u`
- Compare with metadata species list
- Create replacement mapping JSON
- Apply systematically to tree AND all annotation files
- Document replacements for reproducibility
Best practice: Create separate versions:
- `*_corrected.*` – After TimeTree replacements
- `*_final.*` – After all name variant corrections
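A minimal sketch of the replacement step in this workflow, assuming a JSON mapping of old→new names; the file names and `apply_replacements` helper are illustrative:

```python
import json
from pathlib import Path

def apply_replacements(paths, mapping_file):
    """Apply an old→new species-name mapping to the tree
    and every annotation file, writing *_corrected copies."""
    mapping = json.loads(Path(mapping_file).read_text())
    for path in paths:
        p = Path(path)
        text = p.read_text()
        for old, new in mapping.items():
            text = text.replace(old, new)
        out = p.with_name(p.stem + '_corrected' + p.suffix)
        out.write_text(text)

# apply_replacements(['Tree.nwk', 'itol_branch_colors.txt'],
#                    'species_replacements.json')
```

Applying the same mapping to the tree and all annotation files in one pass is what keeps them synchronized.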
Handling Reference Species Added by Tree Builders
TimeTree and similar tools may add reference species not in your original dataset for:
- Phylogenetic completeness
- Temporal calibration
- Topological constraints
Document these additions:
- Identify species in tree but not in metadata
- Research their phylogenetic role
- Create separate iTOL dataset to highlight them
- Document why they were added
Example:
# Create dataset for reference species
timetree_additions = ["Species_one", "Species_two"]
# Use different symbol/color to distinguish from your species
Color Schemes for Taxonomy
Standard color palette for major vertebrate groups:
colors = {
'Mammals': '#C084C0', # Purple
'Birds': '#FFD700', # Gold
'Reptiles': '#9370DB', # Medium Purple
'Amphibians': '#98D8C8', # Turquoise
'Fishes': '#87CEEB', # Sky Blue
'Invertebrates': '#8B4513' # Brown
}
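This palette can drive dataset generation directly, so colors never drift between files. A sketch generating DATASET_COLORSTRIP data rows; the species→lineage mapping is illustrative:

```python
colors = {
    'Mammals': '#C084C0',  # Purple
    'Birds': '#FFD700',    # Gold
}
lineages = {
    'Homo_sapiens': 'Mammals',
    'Alca_torda': 'Birds',
}

# One tab-separated row per species: name, hex color, lineage label
rows = [f'{sp}\t{colors[lin]}\t{lin}' for sp, lin in lineages.items()]
print('\n'.join(rows))
```

Generating all iTOL files from one `colors` dict means a palette change only has to be made in one place.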
Troubleshooting iTOL Errors
| Error Message | Cause | Solution |
|---|---|---|
| “Invalid color definition ‘normal’” | Wrong field order in TREE_COLORS | Switch to DATASET_STYLE format |
| “Invalid color definition ‘node'” | Using TAB separator with DATASET_STYLE | Change to SEPARATOR COMMA |
| “All branches same color” | Using clade-based coloring with overlapping definitions | Color individual terminal branches instead |
| Species missing from dataset | Name mismatch between tree and metadata | Create name mapping and apply to all files |
| “Other” lineage shown | New names from replacements lack lineage info | Map new names to lineages from original names |
File Organization Best Practice
phylo/
├── Tree.nwk                             # Original
├── Tree_corrected.nwk                   # After TimeTree replacements
├── Tree_final.nwk                       # After all name corrections
├── itol_branch_colors_final.txt         # Terminal branch colors
├── itol_taxonomic_colorstrip_final.txt  # Colored strips
├── itol_dual_curation_binary_final.txt  # Binary markers
├── itol_timetree_additions_final.txt    # Reference species markers
├── species_replacements.json            # TimeTree replacements
├── name_variant_replacements.json       # Spelling/case fixes
└── SPECIES_CORRECTIONS_SUMMARY.md       # Full documentation
Updating iTOL Config Color Schemes
When updating color schemes across multiple iTOL configuration files, colors appear in multiple locations with different syntax:
Files requiring updates (for 3-category example):
- Colorstrip configs (`itol_3category_colorstrip_UPDATED.txt`):
  - `LEGEND_COLORS` line: tab-separated hex values
  - Individual species rows: `species_name<tab>category<tab>#HEXCODE`
- Label configs (`itol_3category_labels_UPDATED.txt`):
  - `LEGEND_COLORS` line: comma-separated hex values
  - DATA rows: `species,label,label,#HEXCODE,1,normal`
- Branch color configs (`itol_3category_branch_colors_UPDATED.txt`):
  - DATA rows: `species<tab>branch<tab>#HEXCODE<tab>normal<tab>2`
- Binary highlight configs (one per category):
  - `COLOR` line: single hex value
  - `LEGEND_COLORS` line: single hex value
  - `FIELD_COLORS` line: single hex value
Efficient Update Strategy:
Use Edit tool with `replace_all=true` for each old→new color mapping:
# Update all instances of old color across file
Edit(
file_path="itol_3category_colorstrip_UPDATED.txt",
old_string="#3498db",
new_string="#FF8C00",
replace_all=True
)
Typical color update sequence:
- Map old→new colors (e.g., blue→orange, orange→green, green→blue)
- Update all files with first mapping (old blue→new orange)
- Update all files with second mapping (old orange→new green)
- Update all files with third mapping (old green→new blue)
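One caveat with a sequential old→new sequence: if any new hex code equals a later old hex code, an earlier replacement gets clobbered by a later one. A single-pass regex substitution sidesteps this for cyclic mappings; `remap_colors` is a hypothetical helper, not part of any tool:

```python
import re

def remap_colors(text, mapping):
    """Replace every old hex code in one left-to-right pass, so a
    cyclic mapping (blue→orange, orange→green, green→blue) never
    re-replaces an already-substituted color."""
    pattern = re.compile('|'.join(re.escape(old) for old in mapping))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

mapping = {'#3498db': '#FF8C00', '#FF8C00': '#50C878', '#50C878': '#3498db'}
remap_colors('LEGEND_COLORS\t#3498db\t#FF8C00\t#50C878', mapping)
# → 'LEGEND_COLORS\t#FF8C00\t#50C878\t#3498db'
```

Because `re.sub` never rescans its own replacements, the mapping order no longer matters.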
Files to update (for 3-category system):
- `itol_3category_colorstrip_UPDATED.txt`
- `itol_3category_labels_UPDATED.txt`
- `itol_3category_branch_colors_UPDATED.txt`
- `itol_3category_phased_dual_binary_UPDATED.txt`
- `itol_3category_phased_single_binary_UPDATED.txt`
- `itol_3category_pri_alt_single_binary_UPDATED.txt`
Verification:
- Grep for old hex codes to confirm all replaced
- Check LEGEND_COLORS lines match DATA row colors
- Verify binary files use correct category color
Common Color Scheme Examples:
VGP Curation 3-Category System:
COLORS = {
'Phased+Dual': '#FF8C00', # Dark orange
'Phased+Single': '#50C878', # Emerald green
'Pri/alt+Single': '#4169E1' # Royal blue
}
References
- Matplotlib documentation: https://matplotlib.org/
- Seaborn visualization: https://seaborn.pydata.org/
- iTOL documentation: https://itol.embl.de/help.cgi
- VGP curation analysis: Real-world example of these patterns