SNP calling exercise answers part 1
Q1
First, running:
tabix -f -p vcf NA24694.gvcf.gz
then
gatk --java-options "-Xmx10g" HaplotypeCaller -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam -L chr20 -O NA24694.gvcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF
gatk GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V NA24694.gvcf.gz -L chr20 --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -O NA24694.vcf.gz
bcftools stats NA24694.vcf.gz
Should give you:
SN 0 number of SNPs: 75684
so 75684 SNPs.
Q2
First run:
tabix -p vcf NA24694.vcf.gz
Then:
tabix NA24694.vcf.gz chr20:32000000-33000000 |wc -l
So 1290 variant sites.
Q3
There are 2 ways to do this:
bcftools view -H --type=snps NA24694.vcf.gz chr20:32000000-33000000 |wc -l
or
tabix -h NA24694.vcf.gz chr20:32000000-33000000 |bcftools view -H --type=snps - |wc -l
Both will give you: 956.
Q4
tabix NA24694.vcf.gz chr20:32011209-32011209
You get:
chr20 32011209 rs147652161 G A 264.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.010e-01;DB;DP=24;ExcessHet=3.0103;FS=3.949;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=11.03;ReadPosRankSum=-3.580e-01;SOR=0.552 GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533
The line above says that you probably have G and A and the site is heterozygous. G is the reference and A the alternative allele. The allele depth is 15Gs, 9As, the depth is 24, the genotype quality is 99, the PHRED genotype likelihoods are homo ref, hetero, homo alt: 272,0,533
tabix NA24694.vcf.gz chr20:32044279-32044279
You get:
chr20 32044279 rs4525768 C T 799.06 . AC=2;AF=1.00;AN=2;DB;DP=21;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.99;SOR=0.990 GT:AD:DP:GQ:PL 1/1:0,21:21:63:813,63,0
The line above says that you are probably homozygous for T. C is the reference and T the alternative allele. The allele depth is 0Cs, 21Ts, the depth is 21, the genotype quality is 63, the PHRED genotype likelihoods are homo ref, hetero, homo alt: 813,63,0.
Q5
Both are heterozygous sites however this is the better one:
chr20 32974911 rs6088051 A G 403.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.693;DB;DP=22;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.78;MQRankSum=-6.860e-01;QD=19.22;ReadPosRankSum=-1.703e+00;SOR=0.871 GT:AD:DP:GQ:PL 0/1:8,13:21:99:411,0,247
and the worse one:
chr20 64291638 rs369221086 C T 30.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.530e-01;DB;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.18;MQRankSum=-9.670e-01;QD=5.11;ReadPosRankSum=-8.420e-01;SOR=0.693 GT:AD:DP:GQ:PL 0/1:4,2:6:38:38,0,114
The first site has a depth of 21, 8As and 13Gs. In the second case, the genotype quality is 38 and the first was 99. The second one only has 4Cs and 2 Ts for a total depth of 6 which is not sufficient to confidently call a heterozygous site.
Q6
Either use:
bcftools view -H --types=snps NA24694.vcf.gz chr20:32000000-33000000 |cut -f 3 |grep -v rs |wc -l
or
bcftools view -H --types=snps NA24694.vcf.gz chr20:32000000-33000000 |cut -f 3 |grep "\." |wc -l
you will get 17.
Q7
17 is very little compared to the number of SNPs 956. However, this is very expected. Given that the individual is Han Chinese, this ethnic group is very well represented in dbSNP.
mkdir var_recal/ gatk --java-options "-Xmx10G" VariantRecalibrator -V NA24694.vcf.gz --rscript-file var_recal/NA24694_plots.R -O var_recal/NA24694_recal -mode SNP --tranches-file var_recal/NA24694_tranches -tranche 99.0 -tranche 95.0 -tranche 90.0 -tranche 85.0 -tranche 80.0 -tranche 75.0 -tranche 70.0 -tranche 65.0 -tranche 60.0 -tranche 58.0 -an QD -an DP -an FS -an MQRankSum -an ReadPosRankSum -an SOR -an MQ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /home/databases/databases/GRCh38/hapmap_3.3.hg38.vcf.gz -resource:omni,known=false,training=true,truth=false,prior=12.0 /home/databases/databases/GRCh38/1000G_omni2.5.hg38.vcf.gz -resource:1000G,known=false,training=true,truth=false,prior=10.0 /home/databases/databases/GRCh38/1000G_phase1.snps.high_confidence.hg38.vcf.gz -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz