data-analysis-patterns

📁 delphine-l/claude_global 📅 1 day ago

总安装量

周安装量

#76274

全站排名

安装命令

npx skills add https://github.com/delphine-l/claude_global --skill data-analysis-patterns

Agent 安装分布

amp 1

cline 1

opencode 1

cursor 1

continue 1

kimi-cli 1

Skill 文档

Data Analysis Patterns

Expert guidance for making critical decisions in data analysis workflows, particularly around aggregation, recalculation, and maintaining analytical integrity.

When to Use This Skill

Deciding whether to recalculate from raw data vs reuse aggregated data
Changing category definitions in existing analyses
Ensuring accuracy in publication-quality analyses
Handling conflated features that need separation
Optimizing analysis pipelines without sacrificing correctness
Merging multi-source datasets with composite keys
Handling DataFrame type conversion issues during enrichment

Recalculating vs Reusing Aggregated Data

The Core Decision

When you have pre-aggregated data but need different categories or groupings, you face a choice:

Recalculate from raw data (slower but accurate)
Remap existing aggregates (faster but potentially inaccurate)

When to Recalculate from Raw Data

Scenario: You have species-level counts in 4 categories, but need different categories based on different criteria.

Example:

# Existing aggregated data
# Species | cat0 | cat1 | cat2 | cat3
# A       | 10   | 20   | 15   | 5

# OLD categories:
# cat0: Other (0 terminal OR >2 terminal)
# cat1: 2 terminal AND 0 interstitial
# cat2: 1 terminal AND 0 interstitial
# cat3: Any interstitial present

# NEW categories needed:
# cat1: 2 terminal (regardless of interstitial)
# cat2: 1 terminal (regardless of interstitial)
# cat3: 0 terminal (regardless of interstitial)

Wrong approach – Try to remap aggregates:

# â BAD: Approximation loses critical information
new_cat1 = old_cat1  # Only 2 terminal with 0 interstitial
new_cat2 = old_cat2  # Only 1 terminal with 0 interstitial
new_cat3 = old_cat0 + old_cat3  # But old_cat3 includes 0, 1, AND 2 terminal!
# Result: Inaccurate categorization

Correct approach – Recalculate from raw data:

# â GOOD: Recalculate from scaffold-level data
df_scaffolds = pd.read_csv('scaffold_level_data.csv')

def new_categorization(row):
    """Categorize based ONLY on terminal telomere count."""
    if row['num_terminal_telomeres'] == 2:
        return 'cat1_2terminal'
    elif row['num_terminal_telomeres'] == 1:
        return 'cat2_1terminal'
    else:  # 0 or other values
        return 'cat3_0terminal'

df_scaffolds['new_category'] = df_scaffolds.apply(new_categorization, axis=1)

# Aggregate to species level with NEW categories
species_summary = []
for species in df_scaffolds['species'].unique():
    df_sp = df_scaffolds[df_scaffolds['species'] == species]
    total = len(df_sp)

    species_summary.append({
        'species': species,
        'pct_cat1': (df_sp['new_category'] == 'cat1_2terminal').sum() / total * 100,
        'pct_cat2': (df_sp['new_category'] == 'cat2_1terminal').sum() / total * 100,
        'pct_cat3': (df_sp['new_category'] == 'cat3_0terminal').sum() / total * 100
    })

df_new = pd.DataFrame(species_summary)

When Recalculation is REQUIRED

Category definitions fundamentally change
- Example: Old categories mixed multiple features, new ones separate them
Previously conflated features need separation
- Example: “Has interstitial” conflated terminal count with interstitial presence
Aggregation criteria change
- Example: Per-scaffold â Per-chromosome â Per-species
Publication accuracy is critical
- Approximations acceptable for exploration, not for figures
The mapping is ambiguous or lossy
- If you can’t confidently map oldânew without information loss, recalculate

When Approximation May Be Acceptable

Exploratory analysis (not final publication figures)
Categories align closely with existing ones
Small differences acceptable for the research question
Raw data unavailable or extremely expensive to reprocess

IF using approximation:

# Document the approximation clearly
"""
NOTE: These percentages are APPROXIMATIONS because:
- Old cat3 included chromosomes with interstitial + varying terminal counts
- We approximate by combining cat0 + cat3 â new cat3
- Actual values may differ by ~5-10% from true recalculation
- Use for exploratory analysis only
"""

Composite Keys for Multi-Source Data Merging

The Problem: Simple Keys Aren’t Always Unique

When merging datasets from multiple sources (e.g., Google Sheets + enriched AWS data + unified CSV), a single identifier often isn’t enough to ensure uniqueness:

Example scenario:

ToLID alone: Not unique (same genome has hap1, hap2, maternal, paternal variants)
ToLID + Assembly version: Still not unique (same assembly processed with different pipelines)
ToLID + Assembly version + Pipeline version: Finally unique!

Creating Composite Keys

Pattern: Concatenate multiple fields with a delimiter to create unique identifiers.

# 3-part composite key for genome assemblies
df['_composite_key'] = (
    df['ToLID'].astype(str) + '|' +
    df['Assembly_version'].astype(str) + '|' +
    df['Pipeline_version'].astype(str)
)

# Verify uniqueness
assert df['_composite_key'].nunique() == len(df), f"Composite key not unique! {df['_composite_key'].nunique()} unique vs {len(df)} rows"

Key design decisions:

Delimiter choice: Use | or :: (avoid - or _ which may appear in field values)
Field order: Most to least specific (ToLID > Assembly > Pipeline)
Type casting: Always .astype(str) to avoid concatenation errors
Temporary column: Use _composite_key (underscore prefix) to indicate temporary/internal use

Handling Duplicates After Composite Key

Even with composite keys, you may still find duplicates due to:

Multiple uploads of the same assembly
Different curated dates for the same genome
Partial vs complete records

Resolution strategy:

def resolve_duplicates(df, key_column='_composite_key', date_column='Curated'):
    """
    For duplicate composite keys, keep:
    1. Latest record (by date) if dates differ
    2. Most complete record (most non-null values) if dates same/missing
    """
    duplicated_keys = df[df.duplicated(key_column, keep=False)][key_column].unique()

    rows_to_keep = []

    for key in duplicated_keys:
        dup_group = df[df[key_column] == key].copy()

        # Strategy 1: Keep latest by date
        if date_column in dup_group.columns:
            dup_group[date_column] = pd.to_datetime(dup_group[date_column], errors='coerce')
            if dup_group[date_column].notna().any():
                latest = dup_group.loc[dup_group[date_column].idxmax()]
                rows_to_keep.append(latest)
                continue

        # Strategy 2: Keep most complete record
        dup_group['_completeness'] = dup_group.notna().sum(axis=1)
        most_complete = dup_group.loc[dup_group['_completeness'].idxmax()]
        rows_to_keep.append(most_complete)

    # Combine deduplicated rows with non-duplicated rows
    df_deduped = pd.concat([
        df[~df[key_column].isin(duplicated_keys)],
        pd.DataFrame(rows_to_keep)
    ], ignore_index=True)

    return df_deduped

# Apply deduplication
df_clean = resolve_duplicates(df_new, key_column='_composite_key', date_column='Curated')

Merging with Composite Keys

Scenario: Merge new data (from Google Sheets) with previous enrichments (AWS-fetched QC data).

Critical requirement: Preserve manually enriched data while updating base fields.

# Create composite keys in both DataFrames
df_new['_composite_key'] = create_composite_key(df_new)
df_previous['_composite_key'] = create_composite_key(df_previous)

# Identify rows in both datasets
merged = df_new.merge(
    df_previous,
    on='_composite_key',
    how='left',
    suffixes=('_new', '_old')
)

# Detect conflicts (both have non-null values but they differ)
conflicts = []
for col in base_columns:  # Columns from Google Sheets
    col_new = f"{col}_new"
    col_old = f"{col}_old"

    if col_new in merged.columns and col_old in merged.columns:
        mask = (merged[col_new].notna() &
                merged[col_old].notna() &
                (merged[col_new] != merged[col_old]))

        if mask.any():
            conflicts.append({
                'column': col,
                'num_conflicts': mask.sum(),
                'examples': merged.loc[mask, ['_composite_key', col_new, col_old]].head(3)
            })

# Resolve conflicts based on strategy
CONFLICT_RESOLUTION = "NEW"  # or "OLD"

for col in all_columns:
    col_new = f"{col}_new"
    col_old = f"{col}_old"

    if col_new in merged.columns and col_old in merged.columns:
        if CONFLICT_RESOLUTION == "NEW":
            # Use new data, fallback to old if new is null
            merged[col] = merged[col_new].fillna(merged[col_old])
        else:  # "OLD"
            # Use old data, fallback to new if old is null
            merged[col] = merged[col_old].fillna(merged[col_new])
    elif col_new in merged.columns:
        merged[col] = merged[col_new]
    elif col_old in merged.columns:
        merged[col] = merged[col_old]

# Clean up suffixed columns
merged = merged[[col for col in merged.columns if not col.endswith('_new') and not col.endswith('_old')]]

Best Practices

Always verify uniqueness after creating composite key
Document composite key fields in data documentation
Handle duplicates explicitly before merging
Report conflicts to user for review
Provide conflict resolution options (NEW vs OLD)
Remove composite key before final save (it’s a temporary working column)
Test on small subset before processing full dataset

Example Use Case: VGP Genome Data Updates

Workflow:

Download latest genome table from Google Sheets
Create 3-part composite key (ToLID + Assembly + Pipeline)
Resolve duplicates (latest date, then most complete)
Merge with previous table to preserve AWS-enriched QC data
Detect and report conflicts
Let user choose NEW (Google Sheets) or OLD (preserved enrichments) values
Save merged result with composite key removed

Result: Updated genome metadata that preserves valuable enriched data while incorporating new/updated base information.

Separating Conflated Features

Pattern: When One Metric Combines Multiple Independent Features

Example: Telomere categories that mixed terminal AND interstitial presence

Old (conflated):

Cat 1: 2 terminal, 0 interstitial â
Cat 2: 1 terminal, 0 interstitial â
Cat 3: Has interstitial (could have 0, 1, OR 2 terminal!) â Conflated
Cat 4: 0 terminal, 0 interstitial â

Problem: Can’t answer “How many have 2 terminal telomeres?” because Cat 3 mixed terminal counts.

Solution: Separate into two independent analyses

Terminal telomere presence (3 categories: 2, 1, 0)
Interstitial telomere presence (separate binary or count analysis)

Implementation:

# Analysis 1: Terminal telomeres ONLY
df['terminal_category'] = df['num_terminal'].map({2: 'cat1', 1: 'cat2', 0: 'cat3'})

# Analysis 2: Interstitial telomeres ONLY (separate figure)
df['has_interstitial'] = df['num_interstitial'] > 0
df['interstitial_count'] = df['num_interstitial']

Benefits:

Clear interpretation of each feature independently
Can correlate features if needed (terminal vs interstitial)
No ambiguity in categories
Enables future analyses of each feature separately

AWS Data Enrichment Patterns

When enriching tabular data from AWS S3 repositories:

Multi-Source S3 Path Resolution

Problem: Primary data source (e.g., genome table) has incomplete S3 path coverage (e.g., 48/716 = 6.7%)

Solution: Two-tier path resolution strategy:

Direct Lookup: Map known paths from primary source
Path Inference: Construct and validate paths for missing entries

Implementation Pattern:

# Tier 1: Direct lookup from primary source
s3_lookup = {}
for _, row in df_primary.iterrows():
    key = row['identifier']
    path = row['s3_path']
    if pd.notna(key) and pd.notna(path):
        if key not in s3_lookup:
            s3_lookup[key] = path

df_target['_s3_path'] = df_target['identifier'].map(s3_lookup)

# Tier 2: Infer missing paths with validation
def infer_s3_path(identifier, metadata):
    """Construct S3 path from metadata and validate existence"""
    path = f"s3://bucket/{construct_path(metadata, identifier)}/"

    # CRITICAL: Validate path exists before using
    result = subprocess.run(
        ['aws', 's3', 'ls', path, '--no-sign-request'],
        capture_output=True, timeout=10
    )
    return path if result.returncode == 0 else None

# Apply inference to missing paths
for idx in df_target[df_target['_s3_path'].isna()].index:
    inferred = infer_s3_path(df_target.at[idx, 'id'],
                            df_target.at[idx, 'metadata'])
    if inferred:
        df_target.at[idx, '_s3_path'] = inferred

Key Considerations:

Validation is essential: Always verify inferred paths exist (prevent 404s during data fetch)
Performance: Path validation takes ~5-10s per entry (budget time for large datasets)
Test mode: Use TEST_MODE to validate inference on small sample first
Temporary columns: Use _s3_path prefix for columns removed before saving

Time Savings: For 668 missing paths @ 7s each = ~78 minutes of validation, but prevents hours of debugging failed fetches

Input File Priority Detection

Pattern: When enrichment notebooks can run multiple times, auto-detect the most complete input file:

# Check for enriched table with current date first
enriched_today = f"Genome_table_{TODAY}_enriched.tsv"

if os.path.exists(enriched_today):
    INPUT_FILE = enriched_today
    print(f"â Found enriched genome table from today: {INPUT_FILE}")
    print(f"  (Continuing enrichment of today's data)")
else:
    # Check for merged table with current date
    merged_today = f"Genome_table_{TODAY}_merged.tsv"

    if os.path.exists(merged_today):
        INPUT_FILE = merged_today
        print(f"â Found merged genome table from today: {INPUT_FILE}")
    else:
        # Fall back to latest merged or raw data
        merged_files = glob.glob("Genome_table_*_merged.tsv")
        merged_files.sort(reverse=True)

        if merged_files:
            INPUT_FILE = merged_files[0]
            print(f"â Using latest merged file: {INPUT_FILE}")
        else:
            INPUT_FILE = "VGP_VGL_genomes - raw data.tsv"
            print(f"â Using raw data: {INPUT_FILE}")

Priority Order: enriched_today > merged_today > latest_merged > raw

Benefits:

In-place enrichment: Updating same file prevents version proliferation
Idempotent: Re-running adds missing data without duplicates
Recovery: Can continue after interruption
Clear state: User knows exactly what data is being used

User Experience:

â Found enriched genome table from today: Genome_table_20260220_enriched.tsv
  (Continuing enrichment of today's data)

Idempotent Column Addition

When adding new columns to existing DataFrames (especially in notebooks that may be re-run):

# Pattern: Check before adding
new_columns = {
    'BUSCO completeness': float,
    'BUSCO lineage': str,
    'Merqury QV': float
}

columns_added = []
for col, dtype in new_columns.items():
    if col not in df.columns:
        if dtype == float:
            df[col] = np.nan
        else:
            df[col] = None  # or pd.NA for pandas 1.0+
        columns_added.append(col)

if columns_added:
    print(f"â Added {len(columns_added)} columns: {', '.join(columns_added)}")

Why idempotent operations matter:

Notebooks can be re-run without errors
Works for both initial runs and continuation of enrichment
Clear feedback on what changed
Supports incremental data updates

Anti-pattern:

# â Fails on re-run if column exists
df['New Column'] = np.nan
# Raises: ValueError: The column label 'New Column' is not unique

Type handling during enrichment:

# When source and target have different dtypes
for idx, row in df.iterrows():
    if pd.isna(row['Numeric Column']):
        value = external_data.get('value')  # Returns float or int

        # Handle object dtype columns (common after TSV loading)
        if df['Numeric Column'].dtype == 'object':
            value = str(int(value)) if value == int(value) else str(value)

        df.at[idx, 'Numeric Column'] = value

Critical Data Validation: Column Name Verification

The Problem: Column Names Can Be Misleading

Column names don’t always match their content. This can happen due to:

Copy-paste errors during data generation
Misunderstanding of source code logic
Legacy naming from previous versions
Automatic column naming that reverses order

Real-world example from VGP analysis: Dataset columns were named opposite of their actual content:

telomere_cat0_both_terminal â Actually contained “no telomeres” data (cat0)
telomere_cat1_one_terminal â Actually contained “both terminal” data (cat1)
telomere_cat2_no_terminal â Actually contained “one terminal” data (cat2)

Impact: Would have resulted in completely inverted biological interpretation (showing most chromosomes WITHOUT telomeres when the opposite was true).

The Solution: Always Verify Against Source Code

Before using any data column in analysis:

Find the source script that generated the data
Read the actual code that assigns values to columns
Verify column names match the logic
Add validation checks comparing to known controls

Example verification workflow:

# 1. Find source script (e.g., classify_telomeres.py)

# 2. Check the actual assignment logic in source:
"""
if terminal == 2 and interstitial == 0:
    category = 1  # Both terminal telomeres
elif terminal == 1 and interstitial == 0:
    category = 2  # One terminal telomere
else:
    category = 0  # No telomeres
"""

# 3. Verify column names match this logic
# Expected: cat0 = no telomeres, cat1 = both terminal, cat2 = one terminal

# 4. Check actual data against expectations
print("Cat1 (should be 'both terminal'):")
print(df['telomere_cat1_one_terminal'].describe())

# 5. Add biological plausibility check
# Most chromosomes should have at least some telomeres
pct_with_telomeres = ((df['telomere_cat1'] + df['telomere_cat2']) /
                       df['total_chr_scaffolds']).mean()
assert pct_with_telomeres > 0.5, f"Only {pct_with_telomeres:.1%} with telomeres - check column mapping!"

# 6. Cross-validate with known control samples
# Example: Chr1 of species X is known to have both terminal telomeres
control = df[(df['species'] == 'Homo_sapiens') & (df['chr'] == 'chr1')]
assert control['telomere_cat1_one_terminal'].sum() > 0, "Control chr1 should have cat1"

Prevention Checklist

Before finalizing any analysis using categorical columns:

Located source script that generated the data
Verified column assignment logic in source code
Cross-referenced column names with source logic
Added assertions for biological plausibility
Tested with known control samples
Documented any discrepancies found
Updated column names or added mapping if needed

When This Is Critical

Column verification is essential when:

Processing data from external scripts
Working with categorical classifications
Using data with non-obvious column names (cat0, cat1, cat2)
Column names contain ambiguous terms
Any time results seem biologically implausible

Red flags that suggest column verification needed:

â ï¸ Results contradict biological expectations
â ï¸ Percentages don’t sum to 100% when they should
â ï¸ Categories show inverse patterns to literature
â ï¸ Column names use generic terms (cat0, cat1, cat2)
â ï¸ Values seem swapped or inverted
â ï¸ Known controls don’t match expected values

Fixing Column Name Mismatches

Option 1: Rename columns in code

# Document the correction
"""
IMPORTANT: Column names are MISLABELED in the dataset!
Actual meanings based on classify_telomeres.py:
  cat0 = 0 terminal (no telomeres) - but column named "both_terminal"
  cat1 = 2 terminal (both terminal) - but column named "one_terminal"
  cat2 = 1 terminal (one terminal) - but column named "no_terminal"
"""

# Correct mapping with clear comments
df['telomere_pct_both_terminal'] = (df['telomere_cat1_one_terminal'] /  # cat1 = both terminal!
                                     df['total_chr_scaffolds'] * 100)
df['telomere_pct_one_terminal'] = (df['telomere_cat2_no_terminal'] /   # cat2 = one terminal!
                                    df['total_chr_scaffolds'] * 100)
df['telomere_pct_no_terminal'] = (df['telomere_cat0_both_terminal'] /   # cat0 = no terminal!
                                   df['total_chr_scaffolds'] * 100)

Option 2: Create corrected dataset

# Rename columns correctly
df_corrected = df.rename(columns={
    'telomere_cat0_both_terminal': 'telomere_cat0_no_terminal',
    'telomere_cat1_one_terminal': 'telomere_cat1_both_terminal',
    'telomere_cat2_no_terminal': 'telomere_cat2_one_terminal'
})

# Save with clear documentation
df_corrected.to_csv('data_corrected_columns.csv', index=False)

# Document in README
"""
CORRECTED: 2026-02-16
- Column names were mislabeled in original data
- Verified against source code: classify_telomeres.py
- Corrected column names now match actual content
"""

Documentation Template

When you discover and fix column mismatches, document thoroughly:

## Data Quality Issue: Column Name Mismatch

**Discovered**: [Date] during [Analysis name]

**Problem**: Column names did not match their content

**Columns affected**:
- `[column_name]` - Named as [X] but actually contains [Y]
- `[column_name]` - Named as [A] but actually contains [B]

**Source of truth**: [script name and line numbers]

**Impact**: [What would have gone wrong without correction]

**Fix**: [How columns were remapped or renamed]

**Validation**: [How correctness was verified]

DataFrame Type Conversion During Enrichment

The Problem: Type Mismatches When Enriching DataFrames

When enriching a DataFrame with data from external sources (CSV files, AWS, APIs), you often encounter type conversion errors:

# Common error:
TypeError: Invalid value '12345' for dtype object
# OR
TypeError: Cannot assign float64 to object column

Root cause: Target DataFrame column has a specific dtype (often object/string), but you’re trying to assign numeric or other typed values directly.

Why This Happens

Scenario: Genome metadata table with string columns being enriched from numeric sources.

# Original DataFrame (from Google Sheets or CSV)
df = pd.read_csv('genome_table.csv', dtype=str)  # All columns loaded as strings
print(df['Genome_size'].dtype)  # dtype: object

# Enrichment source (unified CSV with proper types)
unified_df = pd.read_csv('vgp_assemblies_unified.csv')
print(unified_df['asm_stats_haploid_number'].dtype)  # dtype: int64

# â FAILS: Trying to assign int to string column
df.loc[mask, 'Genome_size'] = unified_df['asm_stats_haploid_number']
# TypeError: Invalid value '4077481159' for dtype object

Why columns are strings:

Mixed data types in original source (numbers + text like “N/A”)
Explicit dtype=str during loading to preserve formatting
Previous text values in column
Pandas inference chose object dtype

The Solution: Type-Safe Assignment Pattern

Before assigning, check target dtype and convert if needed:

def safe_assign_to_string_column(df, row_mask, target_col, value):
    """
    Safely assign a value to a DataFrame column, handling type conversion.

    If target column is string type and value is numeric, convert to string first.
    """
    # Check if target column is string/object type
    if df[target_col].dtype == 'object' or pd.api.types.is_string_dtype(df[target_col]):
        # Convert numeric values to string
        if isinstance(value, (int, float, np.integer, np.floating)):
            # Remove unnecessary decimal places for integers
            if isinstance(value, float) and value == int(value):
                value = str(int(value))
            else:
                value = str(value)

    # Now safe to assign
    df.loc[row_mask, target_col] = value

# Usage
for idx, row in df_genome.iterrows():
    if condition:
        unified_value = unified_df.loc[unified_mask, 'asm_stats_haploid_number'].values[0]
        safe_assign_to_string_column(df_genome, idx, 'Genome_size', unified_value)

Vectorized Approach for Multiple Columns

For enriching many rows/columns efficiently:

# Define column mapping: genome_table_col -> unified_csv_col
column_mapping = {
    'Genome_size': 'asm_stats_haploid_number',
    'Heterozygosity': 'heterozygosity',
    'Repeat_content': 'repeat_percent',
    'Scaffold_N50': 'asm_stats_n50',
    'GC_content': 'gc_percent'
}

for genome_col, unified_col in column_mapping.items():
    # Get values from unified CSV
    enrichment_values = unified_df[unified_col]

    # Check if target column is string type
    is_string_target = (df_genome[genome_col].dtype == 'object' or
                        pd.api.types.is_string_dtype(df_genome[genome_col]))

    if is_string_target:
        # Convert numeric values to strings
        if pd.api.types.is_numeric_dtype(enrichment_values):
            # Handle integers vs floats
            enrichment_values = enrichment_values.apply(
                lambda x: str(int(x)) if pd.notna(x) and x == int(x) else str(x) if pd.notna(x) else x
            )

    # Now safe to assign
    df_genome.loc[mask, genome_col] = enrichment_values.values

Common Patterns and Solutions

Pattern 1: Integer values in string columns

# Source: int64 (4077481159)
# Target: object column
# â Solution: Convert to string
unified_value = 4077481159
if isinstance(unified_value, (int, np.integer)):
    unified_value = str(unified_value)
df.loc[idx, 'Genome_size'] = unified_value  # "4077481159"

Pattern 2: Float percentages in string columns

# Source: float64 (1.47)
# Target: object column
# â Solution: Convert to string, optionally format
unified_value = 1.47696
if isinstance(unified_value, (float, np.floating)):
    unified_value = f"{unified_value:.2f}"  # "1.48" or str(unified_value) # "1.47696"
df.loc[idx, 'Heterozygosity'] = unified_value

Pattern 3: Mixed type columns (some strings, some numbers)

# Some rows have "N/A", others should have numbers
# Target MUST be object dtype to hold both
df['Genome_size'] = df['Genome_size'].astype('object')  # Ensure object dtype

# Then assign with type checking
for idx, value in enrichments.items():
    if pd.notna(value):
        if isinstance(value, (int, float)):
            value = str(value)
    df.loc[idx, 'Genome_size'] = value

When to Convert Target Column Type Instead

Consider changing target dtype if:

Column should be numeric for calculations
All existing values are numeric or NaN
No need to preserve string formatting
Column will be used in statistical analysis

# Check if safe to convert
if df['Genome_size'].apply(lambda x: pd.isna(x) or str(x).replace('.','').isdigit()).all():
    # All values are numeric or NaN - safe to convert
    df['Genome_size'] = pd.to_numeric(df['Genome_size'], errors='coerce')
    # Now can assign numeric values directly
    df.loc[mask, 'Genome_size'] = unified_df['asm_stats_haploid_number']

Don’t convert if:

Source data mixes text and numbers (“N/A”, “Unknown”, etc.)
Need to preserve exact string representation
Column used for display/export, not calculations

Debugging Type Errors

When you get a type error, check:

# 1. Check target column dtype
print(f"Target column dtype: {df['Genome_size'].dtype}")
print(f"Target column type: {type(df['Genome_size'].iloc[0])}")

# 2. Check source value type
print(f"Source value type: {type(unified_value)}")
print(f"Source value: {unified_value}")

# 3. Check for mixed types in target column
print(df['Genome_size'].apply(type).value_counts())

# 4. Sample target column values
print(df['Genome_size'].head(10))

Real-World Example: VGP Genome Enrichment

Situation: Enriching genome table from two sources:

vgp_assemblies_unified.csv: Numeric dtypes (int64, float64)
GenomeArk AWS: Parsed from text files (strings)
Target: Genome table loaded with dtype=str to preserve original formatting

Problem:

df_genome.loc[idx, 'Genome_size'] = 4077481159  # int64
# TypeError: Invalid value '4077481159' for dtype object

Solution: Type-safe assignment pattern

# Check target dtype before assignment
genome_col = 'Genome_size'
if df_genome[genome_col].dtype == 'object':
    if isinstance(unified_value, (int, float)):
        unified_value = str(int(unified_value)) if unified_value == int(unified_value) else str(unified_value)

df_genome.loc[idx, genome_col] = unified_value  # Now works: "4077481159"

Result: Successfully enriched 11 fields across 716 genomes without type errors.

Best Practices

Check dtype before assignment – Don’t assume column types
Convert values to match target dtype – Easier than converting whole column
Use helper functions – Encapsulate type checking logic
Handle NaN explicitly – pd.notna() checks before conversion
Preserve integers as integers – Use int(x) before str() to avoid “.0”
Test on small sample first – Catch type errors early
Document dtype expectations – Comment why columns are string vs numeric

Prevention Checklist

Before enriching a DataFrame:

Check target column dtypes: df.dtypes
Check source value types: type(value) or df_source.dtypes
Decide: Convert values or convert column?
Implement type-safe assignment function
Test on 2-3 rows before full enrichment
Handle NaN/None cases explicitly

Best Practices

1. Assess Information Loss

Before deciding to reuse aggregated data, check:

# Can you perfectly reconstruct raw data from aggregates?
# If NO â Recalculate

2. Document Your Decision

"""
Data source: scaffold_telomere_data.csv (n=6,356 scaffolds)
Recalculated: 2026-01-29
Reason: Previous aggregation conflated terminal and interstitial presence
Method: [describe categorization logic]
"""

3. Validate Against Original if Possible

# If reusing aggregates, check consistency
original_total = df['cat1'] + df['cat2'] + df['cat3'] + df['cat4']
new_total = df['new_cat1'] + df['new_cat2'] + df['new_cat3']
assert (original_total == new_total).all(), "Category totals don't match!"

4. Time vs Accuracy Trade-off

Exploration phase: Approximations okay, clearly documented
Publication phase: Always recalculate for accuracy
Intermediate: Recalculate once, save results, reuse those

Real-World Example: VGP Telomere Analysis

Situation: Initial figure used 4 categories that conflated terminal and interstitial telomeres. Needed simplified 3-category system based ONLY on terminal count.

Attempted: Approximation by combining old categories Problem: Old cat3 included chromosomes with 0, 1, OR 2 terminal telomeres + interstitial

Solution:

Returned to scaffold-level data (11,812 scaffolds)
Recategorized based purely on num_terminal_telomeres field
Aggregated to species level (310 species)
Generated accurate Figure 5 for publication

Result:

Dual: 52% chromosomes with 1 terminal (accurate)
vs approximate: 12% (old cat2 only, missed many in old cat3)
40 percentage point difference – approximation would have been wildly wrong

Lesson: When category semantics change fundamentally, recalculation is not optional.

Performance Considerations

Recalculation is often faster than you think:

# Modern pandas on 10,000+ rows
start = time.time()
df['new_cat'] = df.apply(categorize_func, axis=1)
result = df.groupby('species').agg({'new_cat': 'value_counts'})
print(f"Recalculation: {time.time() - start:.2f}s")  # Often < 1 second

Optimize recalculation:

Use vectorized operations instead of apply() when possible
Filter to relevant columns before processing
Cache intermediate results if running multiple times

Organizing Analysis Text for Token Efficiency

Use Case

Scientific projects with:

Large Jupyter notebooks containing analysis + interpretation
Multiple figures requiring detailed analysis text
Need to reference specific analyses without loading entire notebook

The Problem

Notebook: 4.2 MB with code + analysis text â ~1,050,000 tokens
Can’t efficiently load just the analysis for one figure
Difficult to maintain and update analysis text in notebook cells
Hard to reuse analysis text in manuscript preparation

The Solution: analysis_files/ Directory

Separate computation (notebooks) from interpretation (markdown files):

Directory structure:

project/
âââ analysis_files/
â   âââ MANIFEST.md           # Guide to the analysis files
â   âââ Method.md             # Methods section
â   âââ figures/
â       âââ 01_figure1.md     # Analysis for figure 1
â       âââ 02_figure2.md     # Analysis for figure 2
â       âââ ...
âââ notebooks/
â   âââ Analysis.ipynb        # Code for computation & figures
âââ figures/
â   âââ output/
â       âââ 01_figure1.png
â       âââ 02_figure2.png
âââ data/

What Goes in Each File

analysis_files/figures/NN_name.md:

Figure description (figure legend format)
Statistical methods used
Analysis framework and interpretation
Mechanistic explanations
Context from other results
Biological/technical considerations
Publication-ready prose

analysis_files/Method.md:

Complete methods section
Dataset description
Statistical approaches
Data sources
Limitations

notebooks/*.ipynb:

Data loading and processing
Statistical computations
Figure generation code
Minimal text (link to analysis files instead)

Writing Style Guidelines

Base style on existing notebook/paper:

Read current notebook for analysis patterns
Read paper draft (if exists) for writing style
Match level of detail and technical depth
Include same types of explanations (statistical, mechanistic, biological)

Common elements in scientific analysis files:

Clear figure descriptions with n values
Statistical test details (test name, p-values, why chosen)
Interpretation framework (what comparison shows)
Mechanistic explanations (why differences occur)
Context (how relates to other results)
Practical implications
Limitations and caveats

Token Efficiency

Example savings:

Full notebook: 1,135,000 tokens
All analysis files: 22,000 tokens
Single figure analysis: 5,000 tokens
Reduction: 98%

Practical benefit: In 200K context window, can load all analyses + have 175K tokens for conversation.

Integration with MANIFEST System

Update manifests to link figures to analyses:

figures/MANIFEST.md:

**01_figure_name.png**
- **Description**: Brief description
- **Analysis file**: `../analysis_files/figures/01_figure_name.md`

analysis_files/MANIFEST.md:

Document purpose and usage
Explain token efficiency gains
Provide usage examples
Link to figure files

Managing TODOs

Critical: Keep analysis files clean and publication-ready

In analysis files (â DON’T):

## Analysis
[TO BE COMPLETED - add results here]
- [ ] Run statistical test
- [ ] Fill in p-values

In separate TODO note (â DO): Create Obsidian note or similar tracking document:

# Figure Analysis TODOs
## Figure 1
- [ ] Run Kruskal-Wallis test
- [ ] Get n for each category
- [ ] Fill in p-values in 01_figure1.md

Workflow

Setup: Create directory structure
Extract: Pull analysis text from notebooks
Clean: Create publication-ready markdown files
Track: Move TODOs to separate tracking system
Link: Update MANIFESTs with cross-references
Maintain: Update analysis files, not notebooks

When to Use This Pattern

â Good fit:

Multiple figures with detailed analyses
Large notebooks (>1 MB)
Preparing for publication
Frequent reference to specific analyses
Collaborative writing

â Overkill for:

Single figure projects
Exploratory analysis (not publication-bound)
Small notebooks (<500 KB)
Code-heavy, minimal text notebooks

Populating Analysis Files with Statistical Results

When framework analysis files have been created but need statistical results filled in:

Workflow pattern:

Use TodoWrite to track progress – Create todos for each file to fill in
Work sequentially through files – Complete one file before moving to next
Read framework file – Understand existing structure and placeholders
Add Statistical Results section with:
- Sample sizes and descriptive statistics table
- Statistical test results (with exact values)
- Achievement thresholds or accuracy metrics (if applicable)
Add Interpretation section with:
- Effect type analysis (isolating different experimental factors)
- Mechanistic explanations
- Practical implications
- Context from other metrics
- Limitations
- Conclusion
Mark todo completed immediately – Don’t batch completions

Example structure for each analysis file:

## Statistical Results

### Sample Sizes and Descriptive Statistics
| Category | n | Median | Mean Â± SEM | Q1 - Q3 |
|----------|---|--------|------------|---------|
[data table]

### Statistical Tests
**Kruskal-Wallis test** (three-group comparison):
- H statistic = X.XX
- p-value = X.XXX
- **Result**: [HIGHLY significant/Significant/NO significant] differences

**Post-hoc pairwise tests**:
- Category A vs B: p = X.XXX (interpretation)
- Category A vs C: p = X.XXX (interpretation)
- Category B vs C: p = X.XXX (interpretation)

## Interpretation

### [Primary Finding Title]
[Statistical interpretation paragraph]

### Implications by Effect Type
**1. Factor 1 Effect**: [Analysis isolating first factor]
**2. Factor 2 Effect**: [Analysis isolating second factor]
**3. Combined Effect**: [Overall pattern]

### [Mechanistic Explanation Section]
[Why differences occur]

### [Context from Other Metrics]
[How this relates to other findings]

### Limitations
[Study-specific caveats]

## Conclusion
[Summary paragraph with key takeaways]

Key practices:

Maintain consistent formatting across all analysis files
Include exact statistical values (H statistic, p-values, sample sizes)
Provide mechanistic explanations beyond just reporting significance
Cross-reference other findings to build coherent narrative
Document data issues clearly (e.g., implausible values, mismatched columns)
Use significance markers consistently: *** p<0.001, ** p<0.01, * p<0.05, ns

Data Quality Verification During Analysis

Detecting Implausible Values

When populating analysis files with statistical results, verify biological plausibility:

Red flags for data issues:

Values outside expected range:
- Ratios >1.0 when measuring partial detection (<1.0 expected)
- Ratios ~2.0 that match patterns from different metrics (e.g., diploid/haploid)
Inconsistency with related metrics:
- Figure 5 shows 34.5% telomere detection â Figure 5b should show ratio ~0.35, NOT 2.0
Biological impossibility:
- 200% detection of telomeres (can’t detect more than exist)
Pattern matching wrong metric:
- Values matching chromosome counts when expecting telomere counts

Example from Figure 5b issue:

## Statistical Results

### Sample Sizes and Descriptive Statistics

**IMPORTANT DATA ISSUE**: The values obtained appear to represent **chromosome achievement ratios** (similar to Figure 3) rather than **telomere detection ratios**. Expected telomere detection ratios based on Figure 5 results should be ~0.2-0.5, but observed values are ~1.9-2.0, matching chromosome ratio patterns.

| Category | n | Median | Mean Â± SEM | Q1 - Q3 |
|----------|---|--------|------------|---------|
[data table with problematic values]

**Note**: These ratios are consistent with diploid chromosome counts relative to haploid karyotype (expected ~2.0), NOT telomere detection ratios (which should be <1.0 and typically 0.2-0.5 based on Figure 5).

Documentation approach for data issues:

Flag clearly at the start of Statistical Results section
Provide evidence for the mismatch (multiple lines of reasoning)
Explain what data likely represents instead
Document implications of missing correct data
Recommend corrective actions for future analysis
Provide context from valid related metrics

Benefits:

Prevents propagation of erroneous conclusions
Creates clear audit trail for data quality issues
Guides future data collection or correction efforts
Maintains scientific integrity of analysis documentation

Multi-Factor Experimental Design Analysis

Analyzing Multi-Factor Experimental Designs

When experimental design has multiple factors (e.g., assembly architecture Ã curation method):

Three-category design pattern:

Category 1: Factor A + Factor B (e.g., Phased + Dual)
Category 2: Factor A + ~B (e.g., Phased + Single)
Category 3: ~A + ~B (e.g., Pri/alt + Single)

Enables isolation of effects:

Factor B effect: Compare Category 1 vs Category 2 (controls Factor A)
Factor A effect: Compare Category 2 vs Category 3 (controls Factor B)
Combined effect: Compare Category 1 vs Category 3 (both factors)

Interpretation framework:

### Implications by Effect Type

**1. [Factor B] Effect (Category1 vs Category2):**
- Comparison: [values]
- p-value: [X.XXX]
- **Finding**: [significant/not significant]
- **Interpretation**: [what this means for Factor B in isolation]

**2. [Factor A] Effect (Category2 vs Category3):**
- Comparison: [values]
- p-value: [X.XXX]
- **Finding**: [significant/not significant]
- **Interpretation**: [what this means for Factor A in isolation]

**3. Combined Effect:**
- If Category1 vs Category3 shows larger difference than either individual factor â synergistic
- If similar to one factor â that factor dominates
- If no difference despite individual effects â antagonistic

Example from VGP analysis:

Curation effect (Dual vs Single): Compare Phased+Dual vs Phased+Single
Assembly effect (Phased vs Pri/alt): Compare Phased+Single vs Pri/alt+Single
Result: Assembly architecture dominates (8Ã gap density difference), curation has no effect

Benefits:

Cleanly separates confounded effects
Identifies which factor drives observed differences
Enables mechanistic interpretation

Interpreting Paradoxical or Contradictory Results

Recognizing Paradoxes

Pattern: When one category performs BETTER on metric X but WORSE on related metrics Y and Z:

Example from VGP analysis:

Pri/alt: â HIGHER chromosome assignment (98.7%)
Pri/alt: â 8Ã MORE gaps, â 2-3Ã FEWER telomeres
Paradox: Higher assignment despite worse quality

Trade-off Hypothesis Framework

When contradictory patterns emerge, consider quality trade-offs:

### The Trade-off Hypothesis

The contrasting patterns suggest a fundamental **[dimension 1] vs [dimension 2] trade-off**:

**Maximize [dimension 1] ([approach A])**:
- â Better on metric X
- â Worse on metric Y
- â Worse on metric Z
- ð¤· Equivalent on metric W

**Maximize [dimension 2] ([approach B])**:
- â Worse on metric X
- â Better on metric Y
- â Better on metric Z
- ð¤· Equivalent on metric W

**Interpretation**: [Approach A] prioritizes [goal 1], while [Approach B] prioritizes [goal 2]. Neither approach is universally superior - the optimal choice depends on [application requirements].

Example application:

**Pri/alt: Maximize chromosome assignment**
- Liberal assignment criteria
- More sequence on chromosomes
- Result: High assignment%, but chromosomes have gaps/incomplete ends

**Phased: Maximize chromosome accuracy**
- Conservative assignment criteria
- Only high-confidence sequences on chromosomes
- Result: Lower assignment%, but chromosomes are higher quality

Benefits of trade-off framing:

Resolves apparent contradictions
Identifies different optimization strategies
Guides methodology selection based on priorities
Prevents oversimplified “best method” conclusions

Documenting Counter-Intuitive Results

When results contradict initial hypotheses:

Documentation pattern:

State the expectation clearly: “Dual curation was expected to improve telomere detection”
Present the contradictory finding: “Phased+Single median 34.5% vs Phased+Dual 19.5%”
Acknowledge the surprise: “Counter-intuitive finding”, “Opposite to expectation”
Explore mechanistic explanations: Why might the opposite be true?
- Reduced complexity in single curation
- Clearer Hi-C signals
- More focused curation effort
Consider confounds: Temporal factors, sample composition
State statistical significance clearly: p=0.213 (not significant, but trend exists)

Example:

**1. Curation Method Effect (Dual vs Single): Marginal, OPPOSITE Direction**

The comparison shows **no statistically significant difference** (p = 0.213), but reveals a **surprising trend favoring single curation**:

- **Phased+Single performs better**: Median 34.5% vs 19.5% (1.8Ã higher)
- **Counter-intuitive finding**: This is **opposite** to the expectation that dual curation would improve telomere detection

**Mechanistic interpretation**: Why might single curation perform better?
- Reduced complexity: Curators make more conservative, accurate decisions
- Clearer Hi-C signal: Single haplotype maps show clearer terminus signals
- Temporal factors: May benefit from more recent algorithms

Benefits:

Transparent about unexpected results
Stimulates mechanistic thinking
Prevents confirmation bias
Guides future experimental design

Species Name Reconciliation Across Data Sources

Problem

When using external phylogenetic tree services (TimeTree, NCBI Taxonomy, etc.), species names often don’t match your metadata due to:

Database standardization
Taxonomic updates
Spelling variants
OCR/data entry errors
Trailing whitespace

Three-Category Classification

Analyze mismatches by categorizing into:

Exact matches (already synchronized)
Systematic replacements (database standardization)
- Example: TimeTree replacing subspecies with species names
- Document in species_replacements.json
Name variants (spelling/case/whitespace)
- Example: Alca_torda vs Alca_Torda
- Document in name_variant_replacements.json

Reconciliation Workflow

# Step 1: Extract all species from tree
tree_species = set(re.findall(r'[A-Z][a-z]+_[a-z]+', tree_content))

# Step 2: Load metadata species
metadata_species = set(df['Species'].str.replace(' ', '_'))

# Step 3: Identify categories
exact_matches = tree_species & metadata_species
in_tree_only = tree_species - metadata_species
in_metadata_only = metadata_species - tree_species

# Step 4: Fuzzy match for variants
from difflib import get_close_matches
variants = {}
for tree_sp in in_tree_only:
    matches = get_close_matches(tree_sp, in_metadata_only, n=1, cutoff=0.8)
    if matches:
        variants[tree_sp] = matches[0]

Critical Decision: Which Version to Keep?

Rule of thumb:

Use metadata version when it’s your authoritative source
Exception: Remove trailing whitespace (causes file format issues)
Document the choice in README

Propagating Corrections

Once replacements are defined, apply to ALL related files:

replacements = json.load('name_replacements.json')

files_to_update = [
    'tree.nwk',
    'annotation_dataset1.txt',
    'annotation_dataset2.txt',
    # ... all files referencing species names
]

for filepath in files_to_update:
    content = read_file(filepath)
    for old, new in replacements.items():
        content = content.replace(old, new)
    write_file(filepath.replace('.txt', '_corrected.txt'), content)

Versioning Strategy

Use suffixes to track correction stages:

_original – Untouched file from external source
_corrected – After first round of replacements
_final – After all corrections applied

This enables:

Reproducibility
Easy rollback if errors found
Clear audit trail

Phylogenetic Tree Coverage Analysis

Coverage Metric Definition

When reconciling phylogenetic trees with species datasets, track coverage:

Coverage = (Species in both tree AND dataset) / (Total species in tree) Ã 100%

This metric indicates what percentage of the phylogenetic tree has data available for analysis.

Identifying Missing Species

Workflow:

Extract species from tree (Newick format):

import re
with open('Tree_final.nwk', 'r') as f:
    tree_content = f.read()
# Extract species names (underscored format)
tree_species = set(re.findall(r'([A-Z][a-z]+_[a-z]+)', tree_content))

Extract species from dataset:

import pandas as pd
df = pd.read_csv('species_methods.csv')
dataset_species = set(df['Species'].str.replace(' ', '_'))

Calculate coverage:

matched_species = tree_species & dataset_species
missing_from_dataset = tree_species - dataset_species
coverage_pct = (len(matched_species) / len(tree_species)) * 100

print(f"Coverage: {len(matched_species)}/{len(tree_species)} ({coverage_pct:.1f}%)")
print(f"Missing: {len(missing_from_dataset)} species")

Categorizing Missing Species

Not all missing species are equal. Categorize them:

Recoverable from data:
- Time Tree replacements (proxy species used)
- Species in deprecated datasets
- Species with different naming conventions
Phylogenetic context only:
- Species added by tree builder for phylogenetic completeness
- Reference species for temporal calibration
- Not in your study scope
Unknown/Uncategorizable:
- Species in dataset but cannot classify with current criteria
- May need different analysis approach

Recovery Workflow for Time Tree Replacements

Problem: Time Tree uses proxy species when exact species lacks data.

Example:

Tree contains: Anniella_pulchra (proxy with available phylogenetic data)
Dataset contains: Anniella_stebbinsi (actual species being studied)

Solution:

Document replacements in species_replacements.json:

{
  "actual_species_name": "tree_proxy_name",
  "Anniella_stebbinsi": "Anniella_pulchra",
  "Pelomedusa_somalica": "Pelomedusa_subrufa"
}

Check for actual species in deprecated datasets:

import json
replacements = json.load(open('species_replacements.json'))

for actual_sp, tree_sp in replacements.items():
    if tree_sp in missing_from_dataset:
        # Check deprecated datasets for actual_sp
        matches = df_deprecated[df_deprecated['Species'] == actual_sp.replace('_', ' ')]
        if not matches.empty:
            print(f"Found {actual_sp} in deprecated data - can recover!")

Update tree to use actual names:

# Replace proxy names with actual species names in tree file
tree_content_updated = tree_content
for actual_sp, tree_sp in replacements.items():
    tree_content_updated = tree_content_updated.replace(tree_sp, actual_sp)

with open('Tree_final.nwk', 'w') as f:
    f.write(tree_content_updated)

Synchronize all annotation files with the updated names.

Acceptable Coverage Levels

Guidelines for phylogenetic analysis:

100%: Ideal, all tree species have data
99%+: Excellent, few phylogenetic context species only
95-99%: Good, some context species expected
<95%: Investigate for recovery opportunities

Interpretation:

Missing 1-3 species at 99%+ often represents phylogenetic context species
These are acceptable and expected when tree includes reference taxa
Focus recovery efforts on species that should have data

Example Recovery Impact

Case study from VGP analysis:

Initial coverage: 506/511 species (99.0%)
Identified 2 Time Tree replacement species
Recovered from deprecated datasets
Updated tree with actual VGP species names
Final coverage: 508/511 (99.4%)
Remaining 3: Phylogenetic context species (acceptable)

Outcome: Improved coverage by identifying and correcting Time Tree proxy usage.

Best Practices

Always check for Time Tree replacements when coverage < 100%
Document all replacements in JSON for reproducibility
Update tree file to match dataset (not the reverse)
Synchronize ALL config files after tree updates
Accept phylogenetic context species as valid missing data
Track coverage metrics throughout analysis

Integration with Species Name Reconciliation

This coverage analysis complements the “Species Name Reconciliation Across Data Sources” section:

Reconciliation fixes name variants and spellings
Coverage analysis identifies true missing species vs naming issues
Together they ensure maximal tree-dataset alignment

Distinguishing True Variation from Power Limitations

Challenge: When analyzing multiple groups (clades, populations, cohorts), how to determine if lack of effect is real or just insufficient power?

Key Indicators of Power Limitations (not true null effects):

Sample size much smaller than groups showing effects
- Example: Amphibians n=28 vs Mammals n=160
- Effect could exist but be undetectable
Trend in same direction but non-significant
- Medians show similar pattern
- p-value >0.05 but effect direction matches larger groups
- Suggests real effect masked by noise
Category imbalance within group
- One category has only 2-5 samples
- Can’t perform robust pairwise comparisons
- Example: Amphibians Phased+Single n=2 (only 7% of clade)
Wider confidence intervals than other groups
- Large IQR or SD relative to median
- Indicates high variance from small sample

Key Indicators of True Null Effect:

Large sample with p>0.05 and narrow confidence intervals
- Sufficient power to detect effect if present
- Tight distribution suggests true lack of difference
Opposite direction from other groups
- Not just non-significant, but reversed
- Suggests biological difference is real
Significant in some metrics but not others within same group
- Group has power (shown by significant effects elsewhere)
- Null result more likely to be real

Reporting Recommendations:

For power-limited groups:

"No significant effects detected (all p>0.13), likely reflecting
insufficient statistical power (n=55) rather than true absence of
effects. The modest sample size and relatively balanced category
distribution reduce power to detect effects observed in larger clades."

For true null effects:

"Despite excellent statistical power (n=124), dual curation showed
no improvement in scaffold N50 (p=0.378), in marked contrast to
mammals (p=0.037). This suggests fundamental differences in how avian
genomes respond to curation intensity."

Example from Session (Reptiles vs Birds):

Reptiles (n=55): No significant effects â Power limitation
- Small sample, balanced categories, trends in expected direction
Birds (n=124): No N50 curation effect â True null
- Large sample, significant effects in OTHER metrics (gap density)
- Biological explanation: genomes already near-optimal for N50

Summary

Default to recalculation when:

Category definitions change
Features were conflated
Publication accuracy needed
Raw data is available

Document approximations when used, and validate them against subsets of recalculated data.

Separate conflated features into independent analyses for clarity.

For species name reconciliation:

Classify mismatches into systematic vs variant changes
Apply corrections consistently across all files
Version files to track correction stages
Document decisions and maintain reproducibility

Data Consolidation and Enrichment Workflows

Pattern: Consolidate â Enrich â Verify

When working with multiple intermediate dataset versions and external data sources:

1. Consolidation Phase

Identify the single source of truth (most recent, most complete version)
Archive all intermediate/backup versions with clear documentation
Create a consolidated dataset with enriched columns
Use descriptive backups: deprecated/data_backups_YYYYMMDD/

2. Enrichment Phase (AWS/External Data)

Create dedicated enrichment notebook (e.g., enrich_unified_csv.ipynb)
Use TEST_MODE for initial validation (small subset)
Add new columns to existing dataset (don’t replace)
Track enrichment coverage (% filled for each new column)

Example Configuration:

# Safety defaults for AWS enrichment
ENABLE_AWS_FETCH = False  # Start disabled
TEST_MODE = True          # Start with small sample
TEST_SAMPLE_SIZE = 5      # Validate before full run

# After validation
ENABLE_AWS_FETCH = True
TEST_MODE = False  # Full enrichment

3. Verification Phase

Verify all notebooks/scripts reference correct consolidated file
Check for deprecated file references in code
Update MANIFEST with new structure
Document enrichment coverage and sources

Key Files to Update:

Main dataset (add new columns, don’t replace)
Derivative datasets (e.g., 3-category subset)
MANIFEST.md (document changes, archive locations)
Column metadata documentation

Traceability: Always preserve:

Pre-enrichment backups
Enrichment logs (what was fetched, when, coverage)
README in deprecated folders explaining what was archived

Filtered Dataset Rebuilding Pattern

Problem

When you enrich a master dataset with new columns, all filtered/subset versions become out of sync.

Solution: Rebuild from Master

Pattern:

Update master dataset with new columns
Identify all filtered subsets (e.g., 3-category subset)
Rebuild each subset by re-applying the original filter to updated master
Verify row counts and categories match expectations

Example: 3-Category Dataset

# Load enriched master
df_master = pd.read_csv('data/vgp_assemblies_unified_corrected.csv')

# Re-apply original filter
valid_categories = ['Phased+Dual', 'Phased+Single', 'Pri/alt+Single']
df_3cat = df_master[df_master['category_combined'].isin(valid_categories)].copy()

# Verify
assert len(df_3cat) == expected_count
assert set(df_3cat['category_combined'].unique()) == set(valid_categories)

# Save
df_3cat.to_csv('data/vgp_assemblies_3categories.csv', index=False)

Key Steps:

Don’t manually merge new columns into subset – rebuild from master
Preserve filter logic – document the exact filter criteria
Verify counts – ensure category breakdown matches expectations
Update documentation – note rebuild date and reason in MANIFEST

Common Mistake: Trying to merge new columns into existing subset â leads to mismatches

Correct Approach: Always rebuild from enriched master using original filter

Confounding Analysis: Technology and Temporal Effects

Pattern: Technology Confounding in Temporal Analysis

Problem: Temporal trends may reflect technology adoption rather than methodology improvements

Example from VGP assemblies:

Gap density shows strong improvement over time (Ï = -0.54, p < 1e-40)
But is this better assembly methods or just CLRâHiFi technology shift?

Solution: Technology-Stratified Temporal Analysis

Three-stage approach:

Stage 1: Mixed-Technology Analysis (Baseline)

# All assemblies, mixed technologies
temporal_results_mixed = []
for category in ['Method_A', 'Method_B', 'All']:
    data = df[df['category'] == category][['metric', 'year']].dropna()
    rho, pvalue = stats.spearmanr(data['year'], data['metric'])
    temporal_results_mixed.append({
        'category': category,
        'rho': rho,
        'p_value': pvalue
    })

Stage 2: Technology-Only Subset Analysis

# Filter to single technology (e.g., HiFi only)
df_single_tech = df[df['technology'] == 'HiFi'].copy()

temporal_results_controlled = []
for category in ['Method_A', 'Method_B', 'All HiFi']:
    data = df_single_tech[df_single_tech['category'] == category][['metric', 'year']].dropna()
    rho, pvalue = stats.spearmanr(data['year'], data['metric'])
    temporal_results_controlled.append({
        'category': category,
        'rho': rho,
        'p_value': pvalue
    })

Stage 3: Compare Mixed vs Controlled

# Identify technology artifacts vs real trends
for metric in metrics:
    mixed_row = mixed_df[mixed_df['metric'] == metric]
    controlled_row = controlled_df[controlled_df['metric'] == metric]

    delta_rho = controlled_row['rho'] - mixed_row['rho']

    # Technology artifact: significant in mixed, not in controlled
    if mixed_row['p_value'] < 0.05 and controlled_row['p_value'] >= 0.05:
        print(f"â ï¸  TECHNOLOGY ARTIFACT: {metric}")
        print(f"    Mixed: Ï={mixed_row['rho']:.3f} (p={mixed_row['p_value']:.4e}) SIGNIFICANT")
        print(f"    Controlled: Ï={controlled_row['rho']:.3f} (p={controlled_row['p_value']:.4e}) NOT SIG")

    # Real trend: significant in both or only in controlled
    elif controlled_row['p_value'] < 0.05:
        print(f"â  REAL TEMPORAL TREND: {metric}")

Interpretation Guidelines

Technology Artifact Indicators:

Trend disappears when technology is held constant
|ÎÏ| > 0.15 (large change in correlation strength)
Timeline coincides with technology adoption (e.g., 2021 CLRâHiFi shift)

Real Temporal Trend Indicators:

Trend persists or strengthens in technology-controlled analysis
Consistent direction across different technology subsets
Gradual change rather than step-function at technology transition

Visualization: Side-by-Side Temporal Plots

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Mixed technology (with technology markers)
for tech, marker in [('HiFi', 'o'), ('CLR', 's')]:
    tech_data = df[df['technology'] == tech]
    ax1.scatter(tech_data['year'], tech_data['metric'],
                marker=marker, label=tech, alpha=0.6)
ax1.set_title('Mixed Technology (Confounded)')

# Single technology only
ax2.scatter(df_single_tech['year'], df_single_tech['metric'],
            marker='o', alpha=0.6)
ax2.set_title('HiFi Only (Controlled)')

When to Use This Pattern

Sequencing technology evolution: CLRâHiFi, IlluminaâNanopore
Software version changes: Major algorithm updates over time
Hardware improvements: Sequencer model changes
Protocol evolution: Wet lab method changes over study period

Output: Comparison Statistics Table

Save detailed comparison for manuscript reporting:

comparison_df = pd.DataFrame({
    'metric': metrics,
    'mixed_rho': [mixed results],
    'controlled_rho': [controlled results],
    'delta_rho': [differences],
    'mixed_pval': [p-values],
    'controlled_pval': [p-values],
    'interpretation': ['artifact' or 'real trend']
})
comparison_df.to_csv('temporal_trends_comparison.csv')

Real Example: VGP Assembly Quality (2019-2025)

Gap Density:

Mixed: Ï = -0.54 (p < 1e-40) – strong improvement
HiFi-only: Ï = -0.35 (p < 0.001) – weaker but still significant
Interpretation: Some improvement is technology shift, but real temporal improvement exists within HiFi

Scaffold N50:

Mixed: Ï = +0.15 (p < 0.001)
HiFi-only: Ï = +0.18 (p < 0.01)
Interpretation: Real temporal trend, not technology artifact

This pattern ensures conclusions about temporal improvements are robust to technology confounding.

GitHub 仓库 ↗ ← 返回陌讯 Skills 聚合平台