bio-gatk-variant-calling
3
总安装量
3
周安装量
#56980
全站排名
Install command
npx skills add https://github.com/gptomics/bioskills --skill bio-gatk-variant-calling
GATK Variant Calling
GATK HaplotypeCaller is a widely used caller for germline SNPs and indels. This skill covers the GATK Best Practices workflow: per-sample calling, joint genotyping, and variant filtering.
Prerequisites
BAM files should be preprocessed:
- Mark duplicates
- Base quality score recalibration (BQSR) – optional but recommended
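In addition to the preprocessing listed above, GATK tools expect a FASTA index (`.fai`) and sequence dictionary (`.dict`) next to the reference, and an index next to the BAM. A minimal pre-flight sketch, assuming an uncompressed reference such as `reference.fa`; `check_gatk_inputs` is a hypothetical helper, not a GATK command:

```shell
# Hypothetical helper: verify that the companion files GATK expects exist.
# Returns 0 when everything is present, 1 otherwise.
check_gatk_inputs() {
    local ref=$1 bam=$2 missing=0
    # Sequence dictionary path: reference.fa -> reference.dict
    local dict="${ref%.*}.dict"

    [ -f "$ref" ]     || { echo "missing reference: $ref" >&2; missing=1; }
    [ -f "$ref.fai" ] || { echo "missing FASTA index: $ref.fai (samtools faidx)" >&2; missing=1; }
    [ -f "$dict" ]    || { echo "missing dictionary: $dict (gatk CreateSequenceDictionary)" >&2; missing=1; }
    [ -f "$bam" ]     || { echo "missing BAM: $bam" >&2; missing=1; }
    # Accept either sample.bam.bai or sample.bai as the BAM index
    [ -f "$bam.bai" ] || [ -f "${bam%.bam}.bai" ] \
        || { echo "missing BAM index for $bam (samtools index)" >&2; missing=1; }

    return "$missing"
}

# Usage: check_gatk_inputs reference.fa sample.bam || exit 1
```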
Single-Sample Calling
Basic HaplotypeCaller
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-O sample.vcf.gz
With Standard Annotations
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-O sample.vcf.gz \
-A Coverage \
-A QualByDepth \
-A FisherStrand \
-A StrandOddsRatio \
-A MappingQualityRankSumTest \
-A ReadPosRankSumTest
Target Intervals (Exome/Panel)
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-L targets.interval_list \
-O sample.vcf.gz
Adjust Calling Confidence
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-O sample.vcf.gz \
--standard-min-confidence-threshold-for-calling 20
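The calling threshold is Phred-scaled, so a value Q corresponds to an error probability of 10^(-Q/10). A small sketch of that conversion; `phred_to_prob` is a hypothetical helper for illustration, not part of GATK:

```shell
# Convert a Phred-scaled confidence to the corresponding error probability.
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4g\n", 10 ^ (-q / 10) }'
}

# Usage: phred_to_prob 20   (a threshold of 20 tolerates a 1-in-100 error rate)
```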
GVCF Workflow (Recommended for Cohorts)
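The GVCF workflow records per-sample genotype likelihoods at every site, so all samples can be genotyped jointly and new samples can be added later without re-running HaplotypeCaller on the whole cohort.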
Step 1: Generate GVCFs per Sample
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-O sample.g.vcf.gz \
-ERC GVCF
Step 2: Combine GVCFs (GenomicsDBImport)
# Create sample map file: one line per sample, tab-separated (sample name, then GVCF path)
# sample_map.txt:
# sample1<TAB>/path/to/sample1.g.vcf.gz
# sample2<TAB>/path/to/sample2.g.vcf.gz
gatk GenomicsDBImport \
--genomicsdb-workspace-path genomicsdb \
--sample-name-map sample_map.txt \
-L intervals.interval_list
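For more than a handful of samples the map is easier to generate than to type. A minimal sketch, assuming every GVCF sits in one directory and that the file name stem is the sample name you want in the workspace; `make_sample_map` is a hypothetical helper:

```shell
# Print a tab-separated sample map (sample name <TAB> GVCF path) for every
# *.g.vcf.gz file in the given directory.
make_sample_map() {
    local dir=$1 gvcf name
    for gvcf in "$dir"/*.g.vcf.gz; do
        [ -e "$gvcf" ] || continue   # no matches: the glob stays literal, skip it
        name=$(basename "$gvcf" .g.vcf.gz)
        printf '%s\t%s\n' "$name" "$gvcf"
    done
}

# Usage: make_sample_map /path/to/gvcfs > sample_map.txt
```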
Alternative: CombineGVCFs (smaller cohorts)
gatk CombineGVCFs \
-R reference.fa \
-V sample1.g.vcf.gz \
-V sample2.g.vcf.gz \
-V sample3.g.vcf.gz \
-O cohort.g.vcf.gz
Step 3: Joint Genotyping
# From GenomicsDB
gatk GenotypeGVCFs \
-R reference.fa \
-V gendb://genomicsdb \
-O cohort.vcf.gz
# From combined GVCF
gatk GenotypeGVCFs \
-R reference.fa \
-V cohort.g.vcf.gz \
-O cohort.vcf.gz
Variant Quality Score Recalibration (VQSR)
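VQSR fits a Gaussian mixture model to the annotation values of calls that overlap known variant sites, then uses that model to score every call. It needs a large number of variants to train well, so whole-genome data or a large cohort is preferred.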
SNP Recalibration
# Build SNP model
gatk VariantRecalibrator \
-R reference.fa \
-V cohort.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12.0 omni.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP \
-O snp.recal \
--tranches-file snp.tranches
# Apply SNP filter
gatk ApplyVQSR \
-R reference.fa \
-V cohort.vcf.gz \
-O cohort.snp_recal.vcf.gz \
--recal-file snp.recal \
--tranches-file snp.tranches \
--truth-sensitivity-filter-level 99.5 \
-mode SNP
Indel Recalibration
# Build Indel model
gatk VariantRecalibrator \
-R reference.fa \
-V cohort.snp_recal.vcf.gz \
--resource:mills,known=false,training=true,truth=true,prior=12.0 Mills.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf.gz \
-an QD -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode INDEL \
--max-gaussians 4 \
-O indel.recal \
--tranches-file indel.tranches
# Apply Indel filter
gatk ApplyVQSR \
-R reference.fa \
-V cohort.snp_recal.vcf.gz \
-O cohort.vqsr.vcf.gz \
--recal-file indel.recal \
--tranches-file indel.tranches \
--truth-sensitivity-filter-level 99.0 \
-mode INDEL
Hard Filtering (When VQSR Not Suitable)
Use hard filters for small cohorts, targeted panels, single exomes, or any callset with too few variants for VQSR to train a model.
Extract SNPs and Indels
gatk SelectVariants \
-R reference.fa \
-V cohort.vcf.gz \
--select-type-to-include SNP \
-O snps.vcf.gz
gatk SelectVariants \
-R reference.fa \
-V cohort.vcf.gz \
--select-type-to-include INDEL \
-O indels.vcf.gz
Apply Hard Filters
# Filter SNPs
gatk VariantFiltration \
-R reference.fa \
-V snps.vcf.gz \
-O snps.filtered.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 60.0" --filter-name "FS60" \
--filter-expression "MQ < 40.0" --filter-name "MQ40" \
--filter-expression "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
--filter-expression "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
--filter-expression "SOR > 3.0" --filter-name "SOR3"
# Filter Indels
gatk VariantFiltration \
-R reference.fa \
-V indels.vcf.gz \
-O indels.filtered.vcf.gz \
--filter-expression "QD < 2.0" --filter-name "QD2" \
--filter-expression "FS > 200.0" --filter-name "FS200" \
--filter-expression "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20" \
--filter-expression "SOR > 10.0" --filter-name "SOR10"
Merge Filtered Variants
gatk MergeVcfs \
-I snps.filtered.vcf.gz \
-I indels.filtered.vcf.gz \
-O cohort.filtered.vcf.gz
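After merging, it is worth checking how many records each filter removed. A minimal sketch that tallies the FILTER column from VCF text on stdin; `filter_summary` is a hypothetical helper, and it relies on bgzipped VCFs being readable with plain `gzip -dc`:

```shell
# Count records per FILTER value (column 7) in uncompressed VCF text on stdin.
# Output: "<FILTER><TAB><count>", sorted by FILTER name.
filter_summary() {
    awk -F '\t' '!/^#/ { n[$7]++ }
                 END   { for (f in n) printf "%s\t%d\n", f, n[f] }' | sort
}

# Usage: gzip -dc cohort.filtered.vcf.gz | filter_summary
```

A site that fails several filters carries them as one semicolon-joined FILTER value, so it is counted under that combined name.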
Base Quality Score Recalibration (BQSR)
Preprocessing step to correct systematic errors in base quality scores.
Step 1: BaseRecalibrator
gatk BaseRecalibrator \
-R reference.fa \
-I sample.bam \
--known-sites dbsnp.vcf.gz \
--known-sites known_indels.vcf.gz \
-O recal_data.table
Step 2: ApplyBQSR
gatk ApplyBQSR \
-R reference.fa \
-I sample.bam \
--bqsr-recal-file recal_data.table \
-O sample.recal.bam
Parallel Processing
Scatter by Interval
# Split calling across intervals (starts all 24 jobs at once; size this to your CPU and memory)
for interval in chr{1..22} chrX chrY; do
    gatk HaplotypeCaller \
    -R reference.fa \
    -I sample.bam \
    -L "$interval" \
    -O "sample.${interval}.g.vcf.gz" \
    -ERC GVCF &
done
wait
# Gather GVCFs (list the shards in reference order)
gatk GatherVcfs \
-I sample.chr1.g.vcf.gz \
-I sample.chr2.g.vcf.gz \
... \
-O sample.g.vcf.gz
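GatherVcfs expects its inputs in genomic order, so the shard list is better generated than typed. A sketch under the same assumption as the scatter loop (chr-prefixed contigs chr1 to chr22, chrX, chrY); `list_shards` and `gather_args` are hypothetical helpers:

```shell
# Print the per-interval GVCF shards for one sample in scatter order,
# one path per line.
list_shards() {
    local sample=$1 n
    for n in $(seq 1 22) X Y; do
        printf '%s.chr%s.g.vcf.gz\n' "$sample" "$n"
    done
}

# Emit one "-I <shard>" pair per line, ready to hand to GatherVcfs.
gather_args() {
    local sample=$1 shard
    list_shards "$sample" | while IFS= read -r shard; do
        printf -- '-I %s\n' "$shard"
    done
}

# Usage (paths must not contain whitespace):
#   gather_args sample | xargs gatk GatherVcfs -O sample.g.vcf.gz
```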
Native PairHMM Multithreading
gatk HaplotypeCaller \
-R reference.fa \
-I sample.bam \
-O sample.vcf.gz \
--native-pair-hmm-threads 4
CNN Score Variant Filter (Deep Learning)
An alternative to VQSR that scores each variant with a convolutional neural network, typically used for single-sample callsets.
Score Variants
gatk CNNScoreVariants \
-R reference.fa \
-V cohort.vcf.gz \
-O cohort.cnn_scored.vcf.gz \
--tensor-type reference
Filter by CNN Score
gatk FilterVariantTranches \
-V cohort.cnn_scored.vcf.gz \
-O cohort.cnn_filtered.vcf.gz \
--resource hapmap.vcf.gz \
--resource mills.vcf.gz \
--info-key CNN_1D \
--snp-tranche 99.95 \
--indel-tranche 99.4
Complete Single-Sample Pipeline
#!/bin/bash
# Expects ${SAMPLE}.bam to be coordinate-sorted, duplicate-marked, and indexed.
set -euo pipefail
SAMPLE=${1:?usage: $0 <sample>}
REF=reference.fa
DBSNP=dbsnp.vcf.gz
KNOWN_INDELS=known_indels.vcf.gz
# BQSR: model systematic base-quality errors, then rewrite the BAM
gatk BaseRecalibrator -R "$REF" -I "${SAMPLE}.bam" \
--known-sites "$DBSNP" --known-sites "$KNOWN_INDELS" \
-O "${SAMPLE}.recal.table"
gatk ApplyBQSR -R "$REF" -I "${SAMPLE}.bam" \
--bqsr-recal-file "${SAMPLE}.recal.table" \
-O "${SAMPLE}.recal.bam"
# Call variants in GVCF mode
gatk HaplotypeCaller -R "$REF" -I "${SAMPLE}.recal.bam" \
-O "${SAMPLE}.g.vcf.gz" -ERC GVCF
# Single-sample genotyping
gatk GenotypeGVCFs -R "$REF" -V "${SAMPLE}.g.vcf.gz" \
-O "${SAMPLE}.vcf.gz"
# Hard filter (SNP-oriented thresholds, applied here to all variant types)
gatk VariantFiltration -R "$REF" -V "${SAMPLE}.vcf.gz" \
-O "${SAMPLE}.filtered.vcf.gz" \
--filter-expression "QD < 2.0" --filter-name "LowQD" \
--filter-expression "FS > 60.0" --filter-name "HighFS" \
--filter-expression "MQ < 40.0" --filter-name "LowMQ"
Key Annotations
| Annotation | Description | Typical passing range |
|---|---|---|
| QD | Variant quality normalized by depth | > 2.0 |
| FS | Fisher strand bias (Phred-scaled) | < 60 (SNP), < 200 (Indel) |
| SOR | Strand odds ratio | < 3 (SNP), < 10 (Indel) |
| MQ | RMS mapping quality | > 40 |
| MQRankSum | Mapping quality rank-sum test (alt vs. ref reads) | > -12.5 |
| ReadPosRankSum | Read position rank-sum test (alt vs. ref reads) | > -8.0 (SNP), > -20.0 (Indel) |
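To see how the SNP thresholds in the table translate into filter flags, the sketch below evaluates them for a single site. `snp_filter_flags` is a hypothetical helper that reuses the filter names from the hard-filtering commands. It treats a missing annotation (`.`) as passing, which is an assumption of this sketch; check how your GATK version handles missing values.

```shell
# Report which SNP hard filters a site would fail.
# Arguments: QD FS MQ MQRankSum ReadPosRankSum SOR   ("." = annotation absent)
# Prints "PASS" or a semicolon-joined list of failed filter names.
snp_filter_flags() {
    awk -v qd="$1" -v fs="$2" -v mq="$3" -v mqrs="$4" -v rprs="$5" -v sor="$6" '
    # Return 1 when value v violates threshold t; missing values never fail.
    function fail(v, op, t) {
        if (v == ".") return 0
        return (op == "<") ? (v + 0 < t) : (v + 0 > t)
    }
    BEGIN {
        out = ""
        if (fail(qd,   "<",   2.0)) out = out ";QD2"
        if (fail(fs,   ">",  60.0)) out = out ";FS60"
        if (fail(mq,   "<",  40.0)) out = out ";MQ40"
        if (fail(mqrs, "<", -12.5)) out = out ";MQRankSum-12.5"
        if (fail(rprs, "<",  -8.0)) out = out ";ReadPosRankSum-8"
        if (fail(sor,  ">",   3.0)) out = out ";SOR3"
        print (out == "") ? "PASS" : substr(out, 2)
    }'
}

# Usage: snp_filter_flags 1.5 75.0 60.0 . . 0.8
```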
Resource Files
| Resource | Use |
|---|---|
| dbSNP | Known variants (prior=2.0) |
| HapMap | Training/truth SNPs (prior=15.0) |
| Omni | Training SNPs (prior=12.0) |
| 1000G SNPs | Training SNPs (prior=10.0) |
| Mills Indels | Training/truth indels (prior=12.0) |
Related Skills
- variant-calling – bcftools alternative
- alignment-files – BAM preprocessing
- filtering-best-practices – Post-calling filtering
- variant-normalization – Normalize before annotation
- vep-snpeff-annotation – Annotate final calls