data-visualization
npx skills add https://github.com/delphine-l/claude_global --skill data-visualization
Data Visualization Best Practices
Expert guidance for creating publication-quality scientific visualizations, avoiding common pitfalls, and optimizing figure clarity.
When to Use This Skill
- Creating figures for scientific publications
- Debugging misleading or distorted visualizations
- Optimizing figure layouts and element sizes
- Choosing appropriate plot types for data characteristics
- Ensuring statistical annotations fit properly
- Generating images for sharing with Claude or other AI tools
Common Pitfalls with Log-Scale Plots
Violin Plots on Log Scales
Problem: Violin plots use Kernel Density Estimation (KDE) in linear space, then the axis is transformed to log scale. This causes severe visual distortion where the violin shape doesn’t accurately represent the actual data distribution.
Symptoms:
- Smooth, blob-like violin shapes on log axes
- Visual representation suggests even distribution but histogram shows heavy clustering
- Particularly problematic with right-skewed data heavily concentrated in one region
Example of the problem:
# ❌ BAD: Violin plot on log scale
import matplotlib.pyplot as plt
import numpy as np
data = np.random.exponential(10, 1000) # Right-skewed data
fig, ax = plt.subplots()
ax.violinplot([data])
ax.set_yscale('log') # Distorts the violin shape!
# Result: Smooth violin that doesn't show the true concentration at low values
Solution 1: Use boxplots instead
# ✅ GOOD: Boxplot on log scale
ax.boxplot([data])
ax.set_yscale('log') # Boxplot statistics remain meaningful
Why boxplots work: Boxplot statistics (median, quartiles, outliers) are calculated as specific values, not density estimates, so they remain meaningful on log scales.
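This claim can be checked numerically; a minimal sketch using NumPy (an odd sample size is used so the median is a single data point):

```python
import numpy as np

# Quantiles commute with strictly monotonic transforms like log10,
# so the median/quartiles drawn by a boxplot stay meaningful on a log axis.
rng = np.random.default_rng(0)
data = rng.exponential(10, 1001)  # right-skewed, odd n

median_then_log = np.log10(np.median(data))
log_then_median = np.median(np.log10(data))
print(np.isclose(median_then_log, log_then_median))  # True
```

A KDE, by contrast, does not commute with the transform, which is exactly why violin shapes distort.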
Solution 2: Log-transform data first
# ✅ ALTERNATIVE: Log-transform data first, then use violin on linear axis
log_data = np.log10(data[data > 0])
ax.violinplot([log_data])
ax.set_ylabel('log10(Value)')
# Keep linear axis - violin now accurately represents log-space distribution
Solution 3: Use histograms with log axes
# ✅ GOOD: Histogram with log y-axis shows true frequency distribution
ax.hist(data, bins=30)
ax.set_xscale('log') # Data axis
ax.set_yscale('log') # Frequency axis - shows concentration clearly
Impact
This pitfall can lead to misleading figures in publications where the visual representation contradicts the actual data distribution. In our VGP curation analysis, this affected 4 different figures before correction.
Outlier Handling: Show First, Decide Later
Default stance: Show ALL data points in initial visualizations
# CORRECT - show all data
plt.boxplot(data, showfliers=True)
# AVOID initially - hides potentially important data
plt.boxplot(data, showfliers=False)
Rationale:
- Outliers may be biologically meaningful
- Filtering decisions should be informed by seeing complete data
- Easy to filter later, hard to know what you missed
- Patterns in outliers can reveal data quality issues
Workflow:
- Generate figures with all data (showfliers=True)
- Review with domain expert
- Decide case-by-case if outliers should be excluded
- Document rationale for any exclusions
Only filter outliers when:
- Technical artifact confirmed (e.g., processing error)
- Prevents seeing relevant patterns in bulk of data
- Documented and justified in methods
- Alternative view with all data provided in supplement
Example documentation:
# Remove known technical outlier
"""
Excluded assembly GCA_123456 from Figure 2 analysis:
- Scaffold N50 = 500 Gb (500× larger than genome size)
- Confirmed as assembly processing error in NCBI notes
- Other metrics for this assembly are valid and included in other figures
"""
Publication Figure Refinement
Element Sizing for Clarity
When figures are cluttered or elements overlap:
# Point sizes: Reduce for dense data
ax.scatter(..., s=25, alpha=0.5) # Down from s=60
# Line widths: Thinner lines reduce visual clutter
ax.plot(..., linewidth=1.5) # Down from 2.5
# Text sizes: Prevent overlap
ax.text(..., fontsize=8) # Down from 10
# Error bar cap sizes
ax.errorbar(..., capsize=5) # Standard readable size
P-value and Annotation Positioning
Problem: Statistical annotations (p-values, significance stars) often placed outside plot bounds
Solution: Position relative to data range with explicit limits
# Calculate data range
y_max = max([d.max() for d in data_list])
y_min = min([d.min() for d in data_list])
# Position annotations within plot
y_pos = y_max * 0.92 # 92% of max, not 105% which goes outside
ax.text(x_pos, y_pos, 'p < 0.001***', ha='center', fontsize=9)
# Set explicit limits with headroom
ax.set_ylim(y_min * 0.95 if y_min > 0 else -5, y_max * 1.05)
Panel Layout Testing
Test both orientations to find clearest presentation:
# Side-by-side (good for comparing distributions)
fig, axes = plt.subplots(1, 2, figsize=(16, 7))
# Stacked vertically (good for larger individual panels)
fig, axes = plt.subplots(2, 1, figsize=(10, 14))
Decision criteria:
- Side-by-side: Better for direct left-right comparison
- Stacked: Better when each panel needs more space
- Let user feedback guide the choice
Adding Sample Sizes to Legends
Why Sample Sizes Matter: Readers need to assess statistical power at a glance. Include sample sizes directly in legend labels for scientific figures.
Pattern 1: Simple Legend with Sample Sizes
# Calculate sample sizes once
category_sizes = df.groupby('category').size().to_dict()
# Use in scatter plot legend
for category in categories:
    data = df[df['category'] == category]
    ax.scatter(data['x'], data['y'],
               label=f"{category} (n={category_sizes[category]})")
ax.legend(loc='best')
Pattern 2: Custom Legend for Complex Plots
When you have multiple marker types (e.g., technology + category), create custom legend:
from matplotlib.lines import Line2D
# Calculate sizes
category_sizes = df.groupby('category').size().to_dict()
# Create custom legend elements
custom_lines = [
Line2D([0], [0], color=colors['Cat1'], marker='o', linestyle='', markersize=8),
Line2D([0], [0], color=colors['Cat2'], marker='o', linestyle='', markersize=8),
]
custom_labels = [
f"Category 1 (n={category_sizes['Cat1']})",
f"Category 2 (n={category_sizes['Cat2']})",
]
ax.legend(custom_lines, custom_labels, loc='best', fontsize=9)
Pattern 3: Multi-Panel Figures – Show Legend Once
For 2×3 or similar grids, show legend only in first subplot:
# Calculate once, use in all panels
category_sizes = df.groupby('category').size().to_dict()
for idx, metric in enumerate(metrics):
    ax = axes[idx]
    for category in categories:
        # Only add label for first subplot
        if idx == 0:
            label_text = f"{category} (n={category_sizes[category]})"
        else:
            label_text = ''
        ax.scatter(..., label=label_text)
    if idx == 0:
        ax.legend(loc='best')
Best Practices:
- Calculate sizes once at the top (efficient, avoids repeated computation)
- Use consistent format: Category Name (n=123)
- For small panels, use fontsize=7-9
- Consider ncol=1 for vertical layout if space allows
- Place sample sizes in legend OR as text annotations, not both
Publication Standards:
- Nature/Science: Strongly recommended for all comparative figures
- PLOS: Required for sample size transparency
- Cell: Expected in methods or figure legends
- General guideline: Always include when comparing groups
Example: Temporal Analysis with Categories
# Temporal trends by category
category_sizes = df.groupby('category').size().to_dict()
fig, ax = plt.subplots(figsize=(10, 6))
for category in ['Phased+Dual', 'Phased+Single', 'Pri/alt+Single']:
    data = df[df['category'] == category]
    ax.scatter(data['year'], data['quality_metric'],
               color=colors[category],
               label=f"{category} (n={category_sizes[category]})",
               alpha=0.6, s=40)
ax.set_xlabel('Year')
ax.set_ylabel('Assembly Quality')
ax.legend(loc='best', fontsize=9)
Visualizing Category Proportions Over Time
Use Case: Show how the relative proportions of categories have changed over time.
Dual-Panel Approach: Proportions + Absolute Counts
Show both relative and absolute trends using side-by-side panels.
Left Panel: Stacked area chart (proportions sum to 100%) Right Panel: Stacked bar chart (shows actual sample sizes)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Calculate counts and proportions by year
year_category_counts = df.groupby(['year', 'category']).size().unstack(fill_value=0)
year_category_proportions = year_category_counts.div(
year_category_counts.sum(axis=1), axis=0
) * 100
# Dual-panel figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
# Panel 1: Stacked area (proportions)
years = year_category_proportions.index
categories = ['Cat1', 'Cat2', 'Cat3']
# Calculate total sizes for legend
total_counts = df.groupby('category').size().to_dict()
bottom = np.zeros(len(years))
for category in categories:
    values = year_category_proportions[category].values
    ax1.fill_between(years, bottom, bottom + values,
                     label=f"{category} (n={total_counts[category]})",
                     color=colors[category], alpha=0.7)
    bottom += values
ax1.set_xlabel('Year', fontsize=12)
ax1.set_ylabel('Proportion (%)', fontsize=12)
ax1.set_title('Category Proportions Over Time', fontsize=14, fontweight='bold')
ax1.set_ylim(0, 100)
ax1.legend(loc='best', fontsize=10)
ax1.grid(axis='y', alpha=0.3)
ax1.xaxis.set_major_locator(plt.MaxNLocator(integer=True))
# Panel 2: Stacked bar (absolute counts)
year_category_counts[categories].plot(
kind='bar', stacked=True, ax=ax2,
color=[colors[c] for c in categories],
width=0.7, edgecolor='black', linewidth=0.5
)
ax2.set_xlabel('Year', fontsize=12)
ax2.set_ylabel('Number of Samples', fontsize=12)
ax2.set_title('Absolute Counts by Category', fontsize=14, fontweight='bold')
ax2.legend(title='Category',
labels=[f"{c} (n={total_counts[c]})" for c in categories],
loc='upper left', fontsize=9)
ax2.grid(axis='y', alpha=0.3)
ax2.set_xticklabels([int(y) for y in years], rotation=0)
# Add totals on top of bars
for i, year in enumerate(years):
    total = year_category_counts.loc[year].sum()
    ax2.text(i, total + 2, str(int(total)),
             ha='center', va='bottom', fontsize=9, fontweight='bold')
plt.tight_layout()
plt.savefig('category_proportions.png', dpi=150, bbox_inches='tight')
Why Both Panels?
Proportions (Area Chart):
- Shows relative shifts in category usage
- Easy to see if one category is growing/declining
- Sums to 100% (intuitive interpretation)
Absolute Counts (Bar Chart):
- Shows actual sample sizes (statistical power)
- Reveals total data volume changes
- Helps interpret proportion changes (growing proportion of shrinking pie?)
Together: Complete picture of temporal trends
Styling Tips:
- Colors: Use colorblind-safe palette consistently across both panels
- Edge colors: Black edges on bars improve readability (linewidth=0.5)
- Totals: Add count labels above stacked bars for context
- X-axis: Integer years, not decimals (use MaxNLocator(integer=True))
- Legend: Include total sample sizes: Category (n=123)
When to Use:
- Tracking category adoption over time
- Showing methodology shifts in field
- Demonstrating changing experimental approaches
- Any temporal categorical composition analysis
Color Schemes
Colorblind-Friendly Palettes
Standard palette for dual comparisons:
COLORS = {
'Group1': '#0173B2', # Blue
'Group2': '#DE8F05' # Orange
}
Accessible to most common color vision deficiencies.
Comprehensive Colorblind-Safe Color Palettes for Scientific Figures
Problem: Poor Color Accessibility
Common issue: Default color schemes often use green-blue or red-green combinations that are indistinguishable for colorblind viewers (~8% of population).
Examples of problematic combinations:
- Green + Blue (similar for deuteranopia/protanopia)
- Red + Green (classic colorblindness issue)
- Light blue + Dark blue (insufficient contrast)
Okabe-Ito Palette (Recommended by Nature)
The gold standard for scientific figures, developed by Masataka Okabe and Kei Ito.
Complete 8-color palette (hex codes):
okabe_ito = {
'orange': '#E69F00',
'sky_blue': '#56B4E9',
'bluish_green': '#009E73',
'yellow': '#F0E442',
'blue': '#0072B2',
'vermillion': '#D55E00',
'reddish_purple': '#CC79A7',
'black': '#000000'
}
For 3 categories (maximum distinction):
# Best combination for 3 categories
category_colors = {
'Category_A': '#0072B2', # Blue
'Category_B': '#E69F00', # Orange
'Category_C': '#CC79A7' # Reddish Purple
}
Why this combination:
- Blue (cool) + Orange (warm) + Purple (neutral) = maximum perceptual separation
- Works for all types of colorblindness (deuteranopia, protanopia, tritanopia)
- Blue-orange is universally distinguishable
- No green-blue or red-green confusion
For 5+ categories, use additional colors from the palette:
five_colors = {
'Cat_1': '#0072B2', # Blue
'Cat_2': '#E69F00', # Orange
'Cat_3': '#CC79A7', # Reddish Purple
'Cat_4': '#D55E00', # Vermillion
'Cat_5': '#F0E442' # Yellow
}
Paul Tol’s Bright Palette (Alternative)
Another scientifically validated option:
paul_tol_bright = {
'blue': '#4477AA',
'red': '#EE6677',
'green': '#228833',
'yellow': '#CCBB44',
'cyan': '#66CCEE',
'purple': '#AA3377',
'grey': '#BBBBBB'
}
Implementation in Matplotlib/Seaborn
Set up colorblind-safe palette:
import matplotlib.pyplot as plt
import seaborn as sns
# Okabe-Ito colors for 3 categories
colors = ['#0072B2', '#E69F00', '#CC79A7']
# Apply to matplotlib
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=colors)
# Or use directly in plots
fig, ax = plt.subplots()
for i, category in enumerate(['A', 'B', 'C']):
    ax.plot(x, y[i], color=colors[i], label=category)
For categorical plots (seaborn):
# Define palette dictionary
palette = {
'Phased+Dual': '#0072B2',
'Phased+Single': '#E69F00',
'Pri/alt+Single': '#CC79A7'
}
# Use in seaborn
sns.boxplot(data=df, x='category', y='value', palette=palette)
Best Practices
- Avoid red-green combinations – Most common colorblindness type
- Use patterns/markers too – Combine color with shapes for redundancy
- Test your figures – Use colorblindness simulators online
- Document your palette – Add comment explaining choice
- Be consistent – Use same colors for same categories across all figures
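One quick, scriptable sanity check (a rough heuristic, not a substitute for a proper colorblindness simulator) is to compare relative luminance, which is roughly what survives a grayscale print. A sketch using matplotlib's color utilities and BT.709 luminance weights:

```python
from matplotlib.colors import to_rgb

def relative_luminance(hex_color):
    """Approximate relative luminance (ITU-R BT.709 weights)."""
    r, g, b = to_rgb(hex_color)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

# Okabe-Ito trio recommended above
palette = {'Blue': '#0072B2', 'Orange': '#E69F00', 'Purple': '#CC79A7'}
for name, hex_color in palette.items():
    print(f"{name}: luminance {relative_luminance(hex_color):.2f}")
```

Well-separated luminance values suggest the categories will remain distinguishable in grayscale; close values mean you should lean harder on markers and line styles.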
Example: Complete Figure Setup
# Okabe-Ito palette for 3 categories
category_colors = {
'Method_A': '#0072B2', # Blue (Okabe-Ito)
'Method_B': '#E69F00', # Orange (Okabe-Ito)
'Method_C': '#CC79A7' # Reddish Purple (Okabe-Ito)
}
# Also use different markers for redundancy
markers = {
'Method_A': 'o', # circle
'Method_B': 's', # square
'Method_C': '^' # triangle
}
# Plot with both color and marker distinction
for method in ['Method_A', 'Method_B', 'Method_C']:
    data = df[df['method'] == method]
    plt.scatter(data['x'], data['y'],
                color=category_colors[method],
                marker=markers[method],
                label=method, s=50, alpha=0.7)
plt.legend()
plt.title('Analysis Results (Colorblind-Safe)')
When to Use Which Palette
Okabe-Ito:
- Scientific publications (recommended by Nature)
- 3-8 categorical variables
- Need maximum accessibility
- Standard for academic figures
Paul Tol:
- Alternative when you want different aesthetics
- Good for presentations
- Widely used in Europe
Seaborn ‘colorblind’:
- Quick matplotlib/seaborn integration
- Based on similar principles
- Built-in convenience
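If you are already using seaborn, the built-in palette is one call away; a sketch (it returns RGB tuples accepted anywhere matplotlib expects a color):

```python
import seaborn as sns

# Fetch three colorblind-safe colors and make them the default cycle
colors = sns.color_palette('colorblind', n_colors=3)
sns.set_palette('colorblind')
print(len(colors))  # 3
```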
Resources
- Okabe-Ito palette: https://jfly.uni-koeln.de/color/
- Paul Tol’s schemes: https://personal.sron.nl/~pault/
- Colorblind simulator: https://www.color-blindness.com/coblis-color-blindness-simulator/
- Venngage guide: https://venngage.com/blog/color-blind-friendly-palette/
Real Example: VGP Assembly Analysis
# Before: Similar blues caused confusion
old_colors = {
'Phased+Dual': '#1976D2', # Dark blue
'Phased+Single': '#4FC3F7', # Light blue - TOO SIMILAR!
'Pri/alt+Single': '#66BB6A' # Green - confusing with blue
}
# After: Okabe-Ito palette with maximum distinction
new_colors = {
'Phased+Dual': '#0072B2', # Blue
'Phased+Single': '#E69F00', # Orange - DISTINCT!
'Pri/alt+Single': '#CC79A7' # Purple - DISTINCT!
}
This ensures all readers can distinguish categories regardless of color vision deficiency.
Image Size Constraints for Claude API
CRITICAL: When generating images to share with Claude (for review, debugging, etc.), images must not exceed 8000 pixels in either dimension.
Check Image Size Before Opening
Always verify image dimensions before trying to display them in Claude:
from PIL import Image
# Check dimensions
img = Image.open('figure.png')
print(f"Image size: {img.width}x{img.height}")
if img.width > 8000 or img.height > 8000:
    print("⚠️ WARNING: Image too large for Claude API!")
    print("   Claude limit: 8000px max dimension")
    print(f"   Your image: {img.width}x{img.height}")
Set Size Constraints When Generating Figures
For matplotlib/seaborn figures:
import matplotlib.pyplot as plt
# Set figure size to stay under Claude's limits
# Rule of thumb: Keep figsize under (80, 80) at 100 DPI
# Or under (26, 26) at 300 DPI
fig, ax = plt.subplots(figsize=(16, 12)) # Safe: 1600x1200 at 100 DPI
# When saving, control DPI to stay under limits
# 7999px / 300 DPI = 26.6 inches max
# 7999px / 100 DPI = 79.9 inches max
plt.savefig('figure.png', dpi=300, bbox_inches='tight') # Max ~26x26 inches
# For very large figures, use lower DPI
plt.savefig('large_figure.png', dpi=100, bbox_inches='tight') # Max ~80x80 inches
Safe figure size presets:
# Publication quality (300 DPI) - fits Claude limit
FIG_SIZES = {
'single_column': (3.5, 4), # 1050x1200 px
'double_column': (7, 5), # 2100x1500 px
'full_page': (7, 9), # 2100x2700 px
'poster': (20, 15), # 6000x4500 px - safe for Claude
'max_claude': (26, 26), # 7800x7800 px - maximum safe size
}
fig, ax = plt.subplots(figsize=FIG_SIZES['double_column'])
plt.savefig('figure.png', dpi=300, bbox_inches='tight')
Resize Oversized Images
If you have an existing image that’s too large:
from PIL import Image
def resize_for_claude(image_path, max_dim=7999, output_path=None):
    """
    Resize image to fit Claude's API constraints.
    Args:
        image_path: Path to input image
        max_dim: Maximum dimension (default 7999 for safety margin)
        output_path: Output path (default: adds '_resized' to filename)
    """
    img = Image.open(image_path)
    # Check if resize needed
    if img.width <= max_dim and img.height <= max_dim:
        print(f"✅ Image OK: {img.width}x{img.height}")
        return image_path
    # Calculate new size preserving aspect ratio
    img.thumbnail((max_dim, max_dim), Image.Resampling.LANCZOS)
    # Save
    if output_path is None:
        base = image_path.rsplit('.', 1)[0]
        ext = image_path.rsplit('.', 1)[1]
        output_path = f"{base}_resized.{ext}"
    img.save(output_path)
    print(f"✅ Resized: {image_path}")
    print(f"   Original: {Image.open(image_path).size}")
    print(f"   New: {img.size}")
    print(f"   Saved: {output_path}")
    return output_path
# Usage
resize_for_claude('large_figure.png')
Quick Checks
Bash one-liner to check size:
# Using ImageMagick
identify figure.png | grep -o '[0-9]*x[0-9]*'
# Check if oversized
python3 -c "from PIL import Image; img=Image.open('figure.png'); print(f'{img.width}x{img.height}'); exit(0 if img.width<=7999 and img.height<=7999 else 1)" && echo "OK" || echo "TOO LARGE"
Add to notebook imports:
# Standard imports for Claude-compatible figures
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
# Set global figure size limit
plt.rcParams['figure.max_open_warning'] = 50
MAX_CLAUDE_DIM = 7999 # Claude API limit: 8000px, use 7999 for safety
def save_figure(filename, dpi=300, **kwargs):
    """Save figure with Claude size constraint check."""
    plt.savefig(filename, dpi=dpi, bbox_inches='tight', **kwargs)
    # Verify size
    img = Image.open(filename)
    if img.width > MAX_CLAUDE_DIM or img.height > MAX_CLAUDE_DIM:
        print(f"⚠️ WARNING: {filename} exceeds Claude limit!")
        print(f"   Size: {img.width}x{img.height} (max: {MAX_CLAUDE_DIM})")
        print("   Resizing...")
        img.thumbnail((MAX_CLAUDE_DIM, MAX_CLAUDE_DIM), Image.Resampling.LANCZOS)
        img.save(filename)
        print(f"   ✅ Resized to: {img.width}x{img.height}")
    else:
        print(f"✅ Saved {filename}: {img.width}x{img.height}")
Common Scenarios
High-DPI screenshots from Retina displays:
- Retina screenshots are 2x pixel density
- A full-screen 4K monitor screenshot can be 7680×4320 (OK)
- A 5K monitor screenshot is 10240×5760 (TOO LARGE!)
- Solution: Resize before sharing or take partial screenshots
Multi-panel figures:
# Instead of one huge figure with many panels
fig, axes = plt.subplots(4, 4, figsize=(40, 40)) # Could be 12000x12000 px!
# Split into smaller figures
for i in range(4):
    fig, axes = plt.subplots(2, 2, figsize=(12, 12))  # 3600x3600 px - safe!
    # Plot subset of panels
    plt.savefig(f'figure_part{i}.png', dpi=300, bbox_inches='tight')
Error Recovery
If you get the error:
API Error: 400 ... image dimensions exceed max allowed size: 8000 pixels
The error is stuck in conversation history. To recover:
- Skip the message: “Please ignore the oversized image in the previous message”
- Resize and resend: Use the resize_for_claude() function above
- Use /safe-clear: Save context and start fresh (if command available)
Jupyter Notebook Image Size Issues
Oversized Images from Combined Output
Problem: Jupyter notebook saves figures as extremely tall images (e.g., 1541 x 42,011 pixels) that exceed the 8000 pixel limit.
Cause: When a cell generates both a figure AND text output (print statements, statistical results), Jupyter captures both as a single tall image. The text output is rendered as image pixels below the figure, creating a massive combined image.
Symptoms:
- Image dimensions like 1541 x 42,011 pixels (height >> 8000)
- Figure displays fine in notebook but won’t display in Claude or other tools
- Error: “image dimensions exceed max allowed size: 8000 pixels”
Example of the problem:
# Cell that creates oversized image
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
# ... plotting code ...
plt.tight_layout()
plt.savefig('figure.png', dpi=150, bbox_inches='tight')
plt.show()
# Text output after figure (PROBLEM!)
print("Statistical Results:")
print(f"Spearman correlation: rho={rho:.3f}, p={pval:.4f}")
# Multiple print statements create tall text output
# Jupyter combines this with figure into one 42K pixel tall image
Solution 1: Split into multiple figures
Instead of creating one large multi-panel figure, split into smaller figures:
# ✅ GOOD: Split 2×3 grid into two 1×3 grids
# Figure 1: First 3 panels
fig1, axes1 = plt.subplots(1, 3, figsize=(10, 3.5))
# ... plot first 3 panels ...
plt.savefig('figure_part1.png', dpi=150, bbox_inches='tight')
plt.show()
# Figure 2: Second 3 panels
fig2, axes2 = plt.subplots(1, 3, figsize=(10, 3.5))
# ... plot second 3 panels ...
plt.savefig('figure_part2.png', dpi=150, bbox_inches='tight')
plt.show()
Solution 2: Separate text output into different cell
Move print statements to a separate cell after the figure:
# Cell 1: Just the figure
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
# ... plotting code ...
plt.savefig('figure.png', dpi=150, bbox_inches='tight')
plt.show()
# Cell 2: Text output (separate!)
print("Statistical Results:")
print(f"Spearman correlation: rho={rho:.3f}, p={pval:.4f}")
Solution 3: Suppress text output in figure cell
# Capture results without printing
results = []
for category in categories:
    rho, pval = stats.spearmanr(x, y)
    results.append({'category': category, 'rho': rho, 'pval': pval})
# Create figure (no print statements!)
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
# ... plotting code ...
plt.savefig('figure.png', dpi=150, bbox_inches='tight')
plt.show()
# Display results in separate cell or as DataFrame
results_df = pd.DataFrame(results)
When to split figures:
- Multi-panel figures with many subplots (3+ rows × 2+ columns)
- Any figure where dimensions approach 8000 pixels
- When cell has significant text output after figure
- When total cell output height feels very long in notebook
Prevention:
- Use the save_figure() helper from the jupyter-notebook skill (auto-checks size)
- Keep figure cells focused on visualization only
- Save statistical results to CSV files instead of printing
- Use separate markdown cells for result interpretation
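For the last two points, writing results to a file keeps the figure cell's output clean; a minimal sketch with pandas (the categories and statistics here are illustrative placeholders):

```python
import pandas as pd

# Collect statistics instead of printing them in the figure cell
results = [
    {'category': 'Phased+Dual', 'rho': 0.32, 'pval': 2.7e-4},
    {'category': 'Pri/alt+Single', 'rho': -0.17, 'pval': 0.0057},
]
results_df = pd.DataFrame(results)
results_df.to_csv('stats_results.csv', index=False)  # review in a later cell
```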
Matplotlib Text Positioning Creates Empty Space
Problem: Saved figure has large area of empty white space above the actual plot, making the figure much taller than necessary.
Cause: Using transform=ax.get_xaxis_transform() with data coordinate y-values positions text far outside the plot bounds. The transform uses axis-relative coordinates (0-1) for x but data coordinates for y, so large y-values create huge positioning errors.
Symptoms:
- Empty white space at top (or bottom) of saved figure
- Text annotations not visible or way off the plot
- tight_layout() and pad_inches adjustments don’t fix it
- Problem persists even after reducing figure size
Example of the problem:
# ❌ BAD: Mixing coordinate systems
for category in categories:
    data = df[df['category'] == category]
    ax.scatter(data['x'], data['y'])
# Add significance marker
y_max = data['y'].max() # e.g., y_max = 2000000000 (2 billion)
# PROBLEM: y_max is in data coordinates, transform expects 0-1!
ax.text(0.5, y_max * 0.95, '***',
transform=ax.get_xaxis_transform(), # x in 0-1, y in data coords
ha='center', fontsize=12)
# This positions text at y = 1.9 billion in the mixed coordinate system!
plt.savefig('figure.png', dpi=150, bbox_inches='tight')
# Result: Massive empty space with text way above visible plot
Why this happens:
- ax.get_xaxis_transform() uses axis coordinates (0-1) for x, data coordinates for y
- y_max * 0.95 for scaffold N50 might be 1.9 billion (1.9e9)
- Transform interprets this as 1.9 billion axis units above the plot
- bbox_inches='tight' includes this invisible text, creating empty space
Solution 1: Position within data range (RECOMMENDED)
Calculate position within the actual data coordinate system:
# ✅ GOOD: Position within plot bounds using pure data coordinates
for category in categories:
    data = df[df['category'] == category]
    ax.scatter(data['x'], data['y'])
# Get actual data range
y_min, y_max = ax.get_ylim()
# Position at 90% of the visible range
y_pos = y_min + (y_max - y_min) * 0.90
# Use data coordinates (no transform needed)
ax.text(0.5, y_pos, '***',
ha='center', va='center', fontsize=12)
Solution 2: Use axis transform correctly with 0-1 coordinates
If you want to use the transform, use 0-1 range for y:
# ✅ GOOD: Both x and y in 0-1 axis coordinates
ax.text(0.5, 0.90, '***',
transform=ax.transAxes, # Both x and y in 0-1 range
ha='center', va='center', fontsize=12)
Solution 3: Use annotate with xycoords
# ✅ GOOD: Explicit coordinate specification
ax.annotate('***',
xy=(x_pos, y_pos), # Data coordinates
xycoords='data',
ha='center', va='center', fontsize=12)
Coordinate Transform Quick Reference:
| Transform | X coordinate | Y coordinate | Use case |
|---|---|---|---|
| None (default) | Data | Data | Normal plotting |
| ax.transAxes | 0-1 (axis) | 0-1 (axis) | Position relative to axes |
| ax.get_xaxis_transform() | 0-1 (axis) | Data | Span markers, axis labels |
| ax.get_yaxis_transform() | Data | 0-1 (axis) | Y-axis annotations |
When to use each approach:
- Data coordinates (no transform): Annotations tied to specific data points
- Axis coordinates (transAxes): Labels in fixed positions (e.g., panel letters)
- Mixed transforms: Advanced use only, requires careful coordinate scaling
Debugging tips:
- If empty space appears, check for text/annotation calls with large y-values
- Use ax.get_ylim() to verify reasonable y-coordinate range
- Temporarily comment out text/annotation calls to identify culprit
- Verify saved figure dimensions match expected size
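The first two tips can be automated: because text artists do not participate in autoscaling, a stray annotation can silently land far off-plot. A sketch that flags any text outside the current y-limits:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for scripting
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 100])
ax.text(0.5, 1.9e9, '***')  # accidentally placed far outside the data range

# Flag any text artist whose y-position falls outside the visible range
y_lo, y_hi = ax.get_ylim()
stray = [t for t in ax.texts
         if not (y_lo <= t.get_position()[1] <= y_hi)]
for t in stray:
    print(f"Text {t.get_text()!r} at y={t.get_position()[1]:.3g} "
          f"is outside ylim ({y_lo:.3g}, {y_hi:.3g})")
```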
Prevention:
- Prefer pure data coordinates for most annotations
- Only use transforms when specifically needed
- Always verify coordinate ranges match transform type
- Test saved figure size after adding annotations
Common Matplotlib Issues and Fixes
Float Year Labels on X-Axis
Problem: X-axis shows decimal years (2021.0, 2021.5, 2022.0) instead of clean integers (2021, 2022, 2023).
Cause: Matplotlib’s default tick formatter displays float values with decimals when the data type is float.
Solution: Use MaxNLocator with integer=True
import matplotlib.pyplot as plt
# After creating plot
ax.scatter(df['year'], df['value'])
# Fix x-axis to show only integer years
ax.xaxis.set_major_locator(plt.MaxNLocator(integer=True))
Why This Works:
- MaxNLocator(integer=True) constrains tick locations to integers
- Works even when underlying data is float (e.g., release_year column as float64)
- Automatically chooses appropriate spacing (won’t show every year if range is large)
Common Use Case: Temporal analyses where year data is stored as float but should display as integer for readability.
Example:
# Data with float years
df['release_year'] = [2021.0, 2022.0, 2023.0, 2024.0, 2025.0]
fig, ax = plt.subplots()
ax.scatter(df['release_year'], df['metric'])
# Without fix: x-axis shows 2021.0, 2021.5, 2022.0, 2022.5, ...
# With fix: x-axis shows 2021, 2022, 2023, 2024, 2025
ax.xaxis.set_major_locator(plt.MaxNLocator(integer=True))
Best Practices
- Always check log-scale plots: If using KDE-based plots (violin, ridge) on log axes, verify against histogram
- Test element sizes: Regenerate figures with different sizes to find optimal clarity
- Explicit axis limits: Don’t rely on automatic limits when annotations are added
- Consistent styling: Use seaborn context and style for publication consistency
- High DPI: Save at 300 DPI minimum for publication (dpi=300, bbox_inches='tight')
- Optimize axis ranges for data distribution: When data is concentrated in a narrow range, adjust axis limits to improve visibility
- Check image dimensions: Verify size before sharing with Claude (max 7999×7999 pixels)
- Set size constraints in scripts: Use safe figure sizes when generating images programmatically
Axis Range Optimization for Compressed Distributions
When data is concentrated at one end of range:
Problem: Cumulative distributions all at 80-100% look compressed with 0-100% Y-axis
Solution: Adjust axis limits to focus on data range
# For chromosome assignment cumulative distribution (mostly 80-100%)
ax.set_ylim(50, 100) # Start at 50% instead of 0%
# For legend placement with adjusted range
ax.legend(loc='upper left') # Prevents overlap with curves at top-right
When to adjust axis ranges:
- ✅ Data concentrated in narrow range (e.g., 80-100%)
- ✅ Improves visibility of differences
- ✅ All relevant data still visible
- ✅ Makes small differences more apparent
When NOT to adjust:
- ❌ Would hide meaningful outliers
- ❌ Creates misleading visual impression
- ❌ Data actually spans full range
- ❌ Standard in field to show full range (0-100%)
Best practice: Show both views if controversial
- Main figure: Zoomed range for clarity
- Supplementary: Full range for context
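The dual-view approach can be sketched in one figure during drafting (synthetic data; assuming matplotlib/NumPy):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
pct = np.sort(80 + 20 * rng.random(50))  # values concentrated in 80-100%

# Same data twice: zoomed main view plus full-range supplementary view
fig, (ax_main, ax_supp) = plt.subplots(1, 2, figsize=(10, 4))
for ax, ylim, title in [(ax_main, (50, 100), 'Main: zoomed for clarity'),
                        (ax_supp, (0, 100), 'Supplement: full range')]:
    ax.plot(pct)
    ax.set_ylim(*ylim)
    ax.set_title(title)
plt.tight_layout()
```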
Scientific Figure Descriptions for Publications
Structure for Multi-Panel Figures
Opening Sentence: Overview + sample size + stratification
**[Analysis type] of [N] [unit]** across [timeframe/condition], stratified by
[categories]: Category1 (n=X), Category2 (n=Y), Category3 (n=Z).
Panel Descriptions: For each panel:
**[Metric Name]** (panel location): [Pattern observed]. [Statistical test]
(ρ=[value], p=[value]) shows [interpretation]. [Biological/technical context].
Closing Interpretation: Synthesize findings
**Interpretation:** [Overall pattern]. [Comparison across categories].
[Methodological implications]. [Connection to study goals].
Example: Temporal Trends Figure
### Figure 4. Temporal trends in assembly quality metrics for HiFi-only assemblies (2021-2025)
**Temporal analysis of six assembly quality metrics across 268 HiFi assemblies**
spanning 2021-2025, stratified by assembly and curation method: Phased+Dual
(n=101, blue), Phased+Single (n=42, orange), and Pri/alt+Single (n=125, purple).
Each panel displays individual assembly measurements (points) with linear
regression trend lines (dashed) for each category. Trend significance was
assessed using Spearman correlation (α=0.05).
**Key Findings:**
**Scaffold N50** (upper left): Pri/alt+Single assemblies show significant
improvement over time (ρ=0.32, p=2.7×10⁻⁴), increasing from ~100 Mb to ~700 Mb,
while Phased assemblies remain stable at ~100-200 Mb. This suggests technological
improvements in single-assembly methods during the HiFi era.
**Gap Density** (upper middle): All HiFi assemblies collectively show decreasing
gap density over time (ρ=-0.17, p=0.0057), indicating improved sequence continuity.
[Additional panels...]
**Interpretation:** Temporal trends are category-specific and metric-dependent.
Pri/alt+Single assemblies show quality improvements (N50, gap density) consistent
with technological advancement during 2021-2025. Phased assemblies remain stable
across most metrics, suggesting their quality is primarily methodology-determined.
Quantitative Details to Include
Always include:
- ✅ Sample sizes (n=X) for each group
- ✅ Statistical test used (Spearman, Mann-Whitney, etc.)
- ✅ Effect sizes (ρ, r², effect magnitude)
- ✅ p-values with scientific notation (p=2.7×10⁻⁴)
- ✅ Temporal/spatial ranges (2021-2025, 100-700 Mb)
- ✅ Significance threshold (α=0.05)
Avoid:
- ❌ Vague terms (“improved”, “changed”) without quantification
- ❌ p-values without effect sizes
- ❌ Missing sample sizes
- ❌ Unspecified statistical methods
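A small helper can enforce this annotation style consistently across figures; `stat_label` is a hypothetical function, not part of any library:

```python
def stat_label(test, effect_symbol, effect, p, alpha=0.05):
    """Format a statistical annotation with test name, effect size,
    p-value in scientific notation, and the significance threshold."""
    sig = 'significant' if p < alpha else 'not significant'
    return f'{test} {effect_symbol}={effect:.2f}, p={p:.1e} ({sig} at α={alpha})'

stat_label('Spearman', 'ρ', 0.32, 2.7e-4)
# → 'Spearman ρ=0.32, p=2.7e-04 (significant at α=0.05)'
```

Centralizing the format avoids the mix of quantified and vague annotations that the checklist warns against.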
Adding to Jupyter Notebooks
import json

# Load the notebook (path is illustrative)
with open('analysis.ipynb') as f:
    nb = json.load(f)

fig_description = {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
        "### Figure X. [Title]\n",
        "\n",
        "**[Opening with sample sizes]**\n",
        "\n",
        "**Key Findings:**\n",
        "\n",
        "**[Metric 1]**: [Statistical result]. [Interpretation].\n",
        "\n",
        "**Interpretation:** [Synthesis]."
    ]
}

# Insert after the plotting cell (plot_cell_idx is the index of the
# plotting cell, determined beforehand by scanning nb['cells'])
nb['cells'].insert(plot_cell_idx + 1, fig_description)

# Write the notebook back out
with open('analysis.ipynb', 'w') as f:
    json.dump(nb, f, indent=1)
iTOL (Interactive Tree of Life) Dataset Creation
Overview
iTOL is a web-based tool for phylogenetic tree visualization. Creating annotation datasets requires specific formats and understanding format differences between legacy and modern approaches.
Key Format Types
1. DATASET_STYLE (Modern Format for Branch/Node Coloring)
Use for coloring individual terminal branches or nodes.
Critical requirements:
- Use `SEPARATOR COMMA` (not TAB)
- Format: `species,branch,node,#color,width,style`
- The three fields (branch/node/style) are all required even though only one is used
DATASET_STYLE
SEPARATOR COMMA
DATASET_LABEL,Terminal Branch Colors by Taxonomy
COLOR,#ff0000
DATA
Homo_sapiens,branch,node,#C084C0,2,normal
Mus_musculus,branch,node,#C084C0,2,normal
Common errors avoided:
- ❌ Using TAB separator → causes “Invalid color definition” errors
- ❌ Using `clade` instead of individual species → all branches get same color
- ❌ Omitting required fields → format errors
2. DATASET_BINARY (Presence/Absence Markers)
Use for adding symbols (checkmarks, stars, etc.) to specific species.
DATASET_BINARY
SEPARATOR TAB
DATASET_LABEL Dual Curation
FIELD_SHAPES 6
FIELD_LABELS Dual Curation
FIELD_COLORS #FF0000
LEGEND_TITLE Curation Status
LEGEND_SHAPES 6
LEGEND_COLORS #FF0000
LEGEND_LABELS Dual Curation
DATA
Homo_sapiens 1
Mus_musculus 1
Symbol codes:
- 1 = circle, 2 = square, 3 = diamond, 4 = triangle, 5 = filled square, 6 = checkmark
3. DATASET_COLORSTRIP (Colored Rectangles)
DATASET_COLORSTRIP
SEPARATOR TAB
DATASET_LABEL Taxonomic Lineage
DATA
Homo_sapiens #C084C0 Mammals
Mus_musculus #C084C0 Mammals
Species Name Synchronization
Problem: Tree species names often differ from metadata due to:
- TimeTree database replacements (standardization)
- Spelling variants (e.g., `Chiropotes_utahickae` vs `Chiropotes_utahicki`)
- Case differences (e.g., `Alca_torda` vs `Alca_Torda`)
- Trailing spaces in CSV files
Solution workflow:
- Export tree species list: `grep -oE "[A-Z][a-z]+_[a-z]+" Tree.nwk | sort -u`
- Compare with metadata species list
- Create replacement mapping JSON
- Apply systematically to tree AND all annotation files
- Document replacements for reproducibility
Best practice: Create separate versions:
- `*_corrected.*` – After TimeTree replacements
- `*_final.*` – After all name variant corrections
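A minimal sketch of the replacement step in this workflow, assuming a JSON mapping of old→new names; the file names and `apply_replacements` helper are illustrative:

```python
import json
from pathlib import Path

def apply_replacements(paths, mapping_file):
    """Apply an old→new species-name mapping to the tree
    and every annotation file, writing *_corrected copies."""
    mapping = json.loads(Path(mapping_file).read_text())
    for path in paths:
        p = Path(path)
        text = p.read_text()
        for old, new in mapping.items():
            text = text.replace(old, new)
        out = p.with_name(p.stem + '_corrected' + p.suffix)
        out.write_text(text)

# apply_replacements(['Tree.nwk', 'itol_branch_colors.txt'],
#                    'species_replacements.json')
```

Applying the same mapping to the tree and all annotation files in one pass is what keeps them synchronized.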
Handling Reference Species Added by Tree Builders
TimeTree and similar tools may add reference species not in your original dataset for:
- Phylogenetic completeness
- Temporal calibration
- Topological constraints
Document these additions:
- Identify species in tree but not in metadata
- Research their phylogenetic role
- Create separate iTOL dataset to highlight them
- Document why they were added
Example:
# Create dataset for reference species
timetree_additions = ["Species_one", "Species_two"]
# Use different symbol/color to distinguish from your species
Color Schemes for Taxonomy
Standard color palette for major vertebrate groups:
colors = {
'Mammals': '#C084C0', # Purple
'Birds': '#FFD700', # Gold
'Reptiles': '#9370DB', # Medium Purple
'Amphibians': '#98D8C8', # Turquoise
'Fishes': '#87CEEB', # Sky Blue
'Invertebrates': '#8B4513' # Brown
}
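This palette can drive dataset generation directly, so colors never drift between files. A sketch generating DATASET_COLORSTRIP data rows; the species→lineage mapping is illustrative:

```python
colors = {
    'Mammals': '#C084C0',  # Purple
    'Birds': '#FFD700',    # Gold
}
lineages = {
    'Homo_sapiens': 'Mammals',
    'Alca_torda': 'Birds',
}

# One tab-separated row per species: name, hex color, lineage label
rows = [f'{sp}\t{colors[lin]}\t{lin}' for sp, lin in lineages.items()]
print('\n'.join(rows))
```

Generating all iTOL files from one `colors` dict means a palette change only has to be made in one place.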
Troubleshooting iTOL Errors
| Error Message | Cause | Solution |
|---|---|---|
| “Invalid color definition ‘normal’” | Wrong field order in TREE_COLORS | Switch to DATASET_STYLE format |
| “Invalid color definition ‘node'” | Using TAB separator with DATASET_STYLE | Change to SEPARATOR COMMA |
| “All branches same color” | Using clade-based coloring with overlapping definitions | Color individual terminal branches instead |
| Species missing from dataset | Name mismatch between tree and metadata | Create name mapping and apply to all files |
| “Other” lineage shown | New names from replacements lack lineage info | Map new names to lineages from original names |
File Organization Best Practice
phylo/
├── Tree.nwk                             # Original
├── Tree_corrected.nwk                   # After TimeTree replacements
├── Tree_final.nwk                       # After all name corrections
├── itol_branch_colors_final.txt         # Terminal branch colors
├── itol_taxonomic_colorstrip_final.txt  # Colored strips
├── itol_dual_curation_binary_final.txt  # Binary markers
├── itol_timetree_additions_final.txt    # Reference species markers
├── species_replacements.json            # TimeTree replacements
├── name_variant_replacements.json       # Spelling/case fixes
└── SPECIES_CORRECTIONS_SUMMARY.md       # Full documentation
Updating iTOL Config Color Schemes
When updating color schemes across multiple iTOL configuration files, colors appear in multiple locations with different syntax:
Files requiring updates (for 3-category example):
- Colorstrip configs (`itol_3category_colorstrip_UPDATED.txt`):
  - `LEGEND_COLORS` line: tab-separated hex values
  - Individual species rows: `species_name<tab>category<tab>#HEXCODE`
- Label configs (`itol_3category_labels_UPDATED.txt`):
  - `LEGEND_COLORS` line: comma-separated hex values
  - DATA rows: `species,label,label,#HEXCODE,1,normal`
- Branch color configs (`itol_3category_branch_colors_UPDATED.txt`):
  - DATA rows: `species<tab>branch<tab>#HEXCODE<tab>normal<tab>2`
- Binary highlight configs (one per category):
  - `COLOR` line: single hex value
  - `LEGEND_COLORS` line: single hex value
  - `FIELD_COLORS` line: single hex value
Efficient Update Strategy:
Use Edit tool with `replace_all=true` for each old→new color mapping:
# Update all instances of old color across file
Edit(
file_path="itol_3category_colorstrip_UPDATED.txt",
old_string="#3498db",
new_string="#FF8C00",
replace_all=True
)
Typical color update sequence:
- Map old→new colors (e.g., blue→orange, orange→green, green→blue)
- Update all files with first mapping (old blue→new orange)
- Update all files with second mapping (old orange→new green)
- Update all files with third mapping (old green→new blue)
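One caveat with a sequential old→new sequence: if any new hex code equals a later old hex code, an earlier replacement gets clobbered by a later one. A single-pass regex substitution sidesteps this for cyclic mappings; `remap_colors` is a hypothetical helper, not part of any tool:

```python
import re

def remap_colors(text, mapping):
    """Replace every old hex code in one left-to-right pass, so a
    cyclic mapping (blue→orange, orange→green, green→blue) never
    re-replaces an already-substituted color."""
    pattern = re.compile('|'.join(re.escape(old) for old in mapping))
    return pattern.sub(lambda m: mapping[m.group(0)], text)

mapping = {'#3498db': '#FF8C00', '#FF8C00': '#50C878', '#50C878': '#3498db'}
remap_colors('LEGEND_COLORS\t#3498db\t#FF8C00\t#50C878', mapping)
# → 'LEGEND_COLORS\t#FF8C00\t#50C878\t#3498db'
```

Because `re.sub` never rescans its own replacements, the mapping order no longer matters.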
Files to update (for 3-category system):
- `itol_3category_colorstrip_UPDATED.txt`
- `itol_3category_labels_UPDATED.txt`
- `itol_3category_branch_colors_UPDATED.txt`
- `itol_3category_phased_dual_binary_UPDATED.txt`
- `itol_3category_phased_single_binary_UPDATED.txt`
- `itol_3category_pri_alt_single_binary_UPDATED.txt`
Verification:
- Grep for old hex codes to confirm all replaced
- Check LEGEND_COLORS lines match DATA row colors
- Verify binary files use correct category color
Common Color Scheme Examples:
VGP Curation 3-Category System:
COLORS = {
'Phased+Dual': '#FF8C00', # Dark orange
'Phased+Single': '#50C878', # Emerald green
'Pri/alt+Single': '#4169E1' # Royal blue
}
References
- Matplotlib documentation: https://matplotlib.org/
- Seaborn visualization: https://seaborn.pydata.org/
- iTOL documentation: https://itol.embl.de/help.cgi
- VGP curation analysis: Real-world example of these patterns