bio-population-genetics-association-testing

📁 gptomics/bioskills 📅 Jan 24, 2026
3
总安装量
3
周安装量
#60703
全站排名
安装命令
npx skills add https://github.com/gptomics/bioskills --skill bio-population-genetics-association-testing

Agent 安装分布

opencode 2
codex 2
claude-code 2
windsurf 1
openclaw 1
trae 1

Skill 文档

Association Testing

GWAS analysis using PLINK 2.0’s unified --glm command for case-control and quantitative traits.

PLINK 2.0 Association Testing

Basic Case-Control (Binary Phenotype)

# Basic logistic regression
plink2 --bfile data --glm --out results

# With phenotype file
plink2 --bfile data --pheno pheno.txt --glm --out results

Quantitative Trait (Continuous Phenotype)

# Linear regression for quantitative traits
plink2 --bfile data --pheno pheno.txt --glm --out results

With Covariates

# Include covariates (sex, age, PCs)
plink2 --bfile data \
    --pheno pheno.txt \
    --covar covariates.txt \
    --glm --out results

# Specify which covariates to use
plink2 --bfile data \
    --pheno pheno.txt \
    --covar covariates.txt \
    --covar-name PC1,PC2,PC3,age,sex \
    --glm --out results

Covariate Files

Phenotype File Format

# pheno.txt: FID IID pheno
# For binary: 1=control, 2=case, -9=missing
# For quantitative: continuous values
FAM001 IND001 2
FAM002 IND002 1
FAM003 IND003 1.5

Covariate File Format

# covariates.txt: FID IID cov1 cov2 ...
FAM001 IND001 0.15 35 1
FAM002 IND002 -0.22 42 2
FAM003 IND003 0.08 28 1

GLM Options

Phenotype Handling

# Multiple phenotypes (test all)
plink2 --bfile data --pheno pheno_multi.txt --glm --out results

# Specific phenotype column
plink2 --bfile data --pheno pheno_multi.txt --pheno-name trait1 --glm --out results

# Missing phenotype handling
plink2 --bfile data --glm allow-no-covars --out results

Model Options

# Additive model (default)
plink2 --bfile data --glm --out results

# Dominant model
plink2 --bfile data --glm dominant --out results

# Recessive model
plink2 --bfile data --glm recessive --out results

# Genotypic (2df test)
plink2 --bfile data --glm genotypic --out results

# Hide covariates from output (cleaner output)
plink2 --bfile data --covar cov.txt --glm hide-covar --out results

Firth Regression (Rare Variants)

# Enable Firth fallback for case-control (default in PLINK 2.0)
plink2 --bfile data --glm firth-fallback --out results

# Force Firth regression
plink2 --bfile data --glm firth --out results

# Disable Firth
plink2 --bfile data --glm no-firth --out results

Output Format

Output Columns

# Default output: results.PHENO1.glm.logistic or results.PHENO1.glm.linear
# Columns: CHROM, POS, ID, REF, ALT, A1, FIRTH?, TEST, OBS_CT, OR/BETA, SE, Tstat, P

Custom Output Columns

# Add specific columns
plink2 --bfile data --glm cols=+a1freq,+machr2 --out results

# Available columns:
# +a1freq: A1 allele frequency
# +machr2: MaCH R-squared
# +ax: Reference allele dosage
# +err: Standard errors

Population Stratification Control

Include Principal Components

# 1. Run PCA
plink2 --bfile data --pca 10 --out pca_results

# 2. Use PCs as covariates
plink2 --bfile data \
    --pheno pheno.txt \
    --covar pca_results.eigenvec \
    --covar-name PC1,PC2,PC3,PC4,PC5 \
    --glm --out results

Combined Workflow

# QC, PCA, and GWAS in sequence
plink2 --bfile raw --maf 0.01 --geno 0.05 --hwe 1e-6 --make-bed --out qc
plink2 --bfile qc --pca 10 --out pca
plink2 --bfile qc \
    --pheno pheno.txt \
    --covar pca.eigenvec \
    --covar-name PC1-PC5 \
    --glm hide-covar --out gwas

Result Filtering

Command Line Filtering

# Filter significant results
awk 'NR==1 || $13 < 5e-8' results.PHENO1.glm.logistic > significant.txt

# Extract top hits
sort -k13 -g results.PHENO1.glm.logistic | head -100 > top_hits.txt

Python Analysis

import pandas as pd

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')

significant = results[results['P'] < 5e-8]
print(f'Genome-wide significant hits: {len(significant)}')

suggestive = results[results['P'] < 1e-5]
print(f'Suggestive hits: {len(suggestive)}')

Visualization

Manhattan Plot (Python)

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
results = results[results['TEST'] == 'ADD']
results['-log10P'] = -np.log10(results['P'])

chrom_colors = ['#1f77b4', '#ff7f0e']
results['color'] = results['#CHROM'].apply(lambda x: chrom_colors[x % 2])

cumulative_pos = []
offset = 0
for chrom in sorted(results['#CHROM'].unique()):
    chrom_data = results[results['#CHROM'] == chrom]
    cumulative_pos.extend(chrom_data['POS'] + offset)
    offset += chrom_data['POS'].max()

results['cumulative_pos'] = cumulative_pos

plt.figure(figsize=(14, 6))
plt.scatter(results['cumulative_pos'], results['-log10P'], c=results['color'], s=1)
plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--', label='Genome-wide (5e-8)')
plt.axhline(y=-np.log10(1e-5), color='blue', linestyle='--', label='Suggestive (1e-5)')
plt.xlabel('Chromosome')
plt.ylabel('-log10(P)')
plt.legend()
plt.savefig('manhattan.png', dpi=150)

QQ Plot (Python)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
observed_p = results[results['TEST'] == 'ADD']['P'].dropna().sort_values()

n = len(observed_p)
expected_p = np.arange(1, n + 1) / (n + 1)

plt.figure(figsize=(6, 6))
plt.scatter(-np.log10(expected_p), -np.log10(observed_p), s=1)
plt.plot([0, 8], [0, 8], 'r--')
plt.xlabel('Expected -log10(P)')
plt.ylabel('Observed -log10(P)')

lambda_gc = np.median(stats.chi2.ppf(1 - observed_p, 1)) / stats.chi2.ppf(0.5, 1)
plt.title(f'QQ Plot (λ = {lambda_gc:.3f})')
plt.savefig('qqplot.png', dpi=150)

Genomic Inflation

from scipy import stats
import numpy as np

results = pd.read_csv('results.PHENO1.glm.logistic', sep='\t')
pvalues = results[results['TEST'] == 'ADD']['P'].dropna()

chisq = stats.chi2.ppf(1 - pvalues, 1)
lambda_gc = np.median(chisq) / stats.chi2.ppf(0.5, 1)
print(f'Genomic inflation factor: {lambda_gc:.3f}')
# Good: 1.0-1.05, Acceptable: 1.05-1.1, Concerning: >1.1

Related Skills

  • plink-basics – Data preparation and QC
  • population-structure – PCA for stratification control
  • linkage-disequilibrium – LD pruning before analysis