SNP calling exercise answers part 1
Answers
Q1
First index the gVCF:
tabix -f -p vcf NA24694.gvcf.gz
Then run HaplotypeCaller:
gatk --java-options "-Xmx10g" HaplotypeCaller \ -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \ -L chr20 \ -O NA24694.gvcf.gz \ --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \ -ERC GVCF
Then genotype the gVCF:
gatk GenotypeGVCFs \ -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ -V NA24694.gvcf.gz \ -L chr20 \ --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \ -O NA24694.vcf.gz
Finally, count SNPs using bcftools stats:
bcftools stats NA24694.vcf.gz
You should see something like:
SN 0 number of SNPs: 75684
Answer: There are 75,684 SNPs on chromosome 20 in this sample.
Q2
Index the VCF if not already indexed:
tabix -p vcf NA24694.vcf.gz
Then query the 1 Mb region:
tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l
Answer: 1290 variant sites in this region (all variant types).
Q3
Count only SNPs (exclude indels and multi-allelic sites):
bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l
or equivalently:
tabix -h NA24694.vcf.gz chr20:32000000-33000000 \ | bcftools view -H --type snps - \ | wc -l
Answer: 956 SNPs in the region.
Q4
Variant at chr20:32011209
tabix NA24694.vcf.gz chr20:32011209-32011209
chr20 32011209 rs147652161 G A 264.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552 GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533
Interpretation:
- Genotype: 0/1 (heterozygous G/A)
- Allele depth (AD): 15 G and 9 A
- Depth (DP): 24
- Genotype quality (GQ): 99
- Genotype likelihoods (PL): 272,0,533 (het most likely)
Variant at chr20:32044279
tabix NA24694.vcf.gz chr20:32044279-32044279
chr20 32044279 rs4525768 C T 799.06 . AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99 GT:AD:DP:GQ:PL 1/1:0,21:21:63:813,63,0
Interpretation:
- Genotype: 1/1 (homozygous T/T)
- Allele depth (AD): 0 C, 21 T
- Depth (DP): 21
- Genotype quality (GQ): 63
- Genotype likelihoods (PL): 813,63,0 (homozygous alt most likely)
Q5
Higher-quality heterozygous SNP:
chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL 0/1:8,13:21:99:411,0,247
Lower-quality heterozygous SNP:
chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL 0/1:4,2:6:38:38,0,114
Why?
- The first site has much higher depth (21× vs 6×)
- Allele balance is more reasonable: 8/13 vs 4/2
- The genotype quality (GQ) is much higher: 99 vs 38
- Overall likelihoods strongly support the correct genotype
Conclusion: More data ⇒ higher confidence.
Q6
Count SNPs with no dbSNP ID (column 3 = .):
bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | cut -f 3 \ | grep -v rs \ | wc -l
or equivalently:
bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | cut -f 3 \ | grep "\." \ | wc -l
Answer: 17 novel SNPs
Q7
You found:
- 956 total SNPs in the region (Q3)
- 17 novel SNPs (Q6)
Only ~1.8% of SNPs are novel.
This is expected because:
- Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.
- dbSNP contains >100 million known variants, so most common variation is already catalogued.
- Novel variants tend to be rare or extremely rare.
Conclusion: The number of novel variants is small and biologically reasonable.