SNP calling exercise answers part 1

From 22126
Revision as of 14:51, 25 November 2025 by Mick (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Answers

Q1

First index the gVCF:

tabix -f -p vcf NA24694.gvcf.gz

Then run HaplotypeCaller:

gatk --java-options "-Xmx10g" HaplotypeCaller \
  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \
  -L chr20 \
  -O NA24694.gvcf.gz \
  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -ERC GVCF

Then genotype the gVCF:

gatk GenotypeGVCFs \
  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -V NA24694.gvcf.gz \
  -L chr20 \
  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -O NA24694.vcf.gz

Finally, count SNPs using bcftools stats:

bcftools stats NA24694.vcf.gz

You should see something like:

SN    0    number of SNPs:    75684

Answer: There are 75,684 SNPs on chromosome 20 in this sample.


Q2

Index the VCF if not already indexed:

tabix -p vcf NA24694.vcf.gz

Then query the 1 Mb region:

tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l

Answer: 1290 variant sites in this region (all variant types).


Q3

Count only SNPs (exclude indels and multi-allelic sites):

bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l

or equivalently:

tabix -h NA24694.vcf.gz chr20:32000000-33000000 \
  | bcftools view -H --type snps - \
  | wc -l

Answer: 956 SNPs in the region.


Q4

Variant at chr20:32011209

tabix NA24694.vcf.gz chr20:32011209-32011209
chr20  32011209  rs147652161  G  A  264.64  .  
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552  
GT:AD:DP:GQ:PL  0/1:15,9:24:99:272,0,533

Interpretation:

  • Genotype: 0/1 (heterozygous G/A)
  • Allele depth (AD): 15 G and 9 A
  • Depth (DP): 24
  • Genotype quality (GQ): 99
  • Genotype likelihoods (PL): 272,0,533 (het most likely)

Variant at chr20:32044279

tabix NA24694.vcf.gz chr20:32044279-32044279
chr20  32044279  rs4525768  C  T  799.06  .  
AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99  
GT:AD:DP:GQ:PL  1/1:0,21:21:63:813,63,0

Interpretation:

  • Genotype: 1/1 (homozygous T/T)
  • Allele depth (AD): 0 C, 21 T
  • Depth (DP): 21
  • Genotype quality (GQ): 63
  • Genotype likelihoods (PL): 813,63,0 (homozygous alt most likely)

Q5

Higher-quality heterozygous SNP:

chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL  0/1:8,13:21:99:411,0,247

Lower-quality heterozygous SNP:

chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL  0/1:4,2:6:38:38,0,114

Why?

  • The first site has much higher depth (21× vs 6×)
  • Allele balance is more reasonable: 8/13 vs 4/2
  • The genotype quality (GQ) is much higher: 99 vs 38
  • Overall likelihoods strongly support the correct genotype

Conclusion: More data ⇒ higher confidence.


Q6

Count SNPs with no dbSNP ID (column 3 = .):

bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
  | cut -f 3 \
  | grep -v rs \
  | wc -l

or equivalently:

bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
  | cut -f 3 \
  | grep "\." \
  | wc -l

Answer: 17 novel SNPs


Q7

You found:

  • 956 total SNPs in the region (Q3)
  • 17 novel SNPs (Q6)

Only ~1.8% of SNPs are novel.

This is expected because:

  • Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.
  • dbSNP contains >100 million known variants, so most common variation is already catalogued.
  • Novel variants tend to be rare or extremely rare.

Conclusion: The number of novel variants is small and biologically reasonable.