SNP calling exercise answers part 1: Difference between revisions

From 22126
Jump to navigation Jump to search
No edit summary
No edit summary
 
Line 1: Line 1:
'''Q1'''
<h2>Answers</h2>
 
<h3>Q1</h3>
 
<p>First index the gVCF:</p>


First, running:
<pre>
<pre>
tabix -f -p vcf NA24694.gvcf.gz
tabix -f -p vcf NA24694.gvcf.gz
</pre>
</pre>
then
 
<p>Then run HaplotypeCaller:</p>
 
<pre>
<pre>
gatk --java-options "-Xmx10g" HaplotypeCaller   -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa   -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam -L chr20 -O NA24694.gvcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF  
gatk --java-options "-Xmx10g" HaplotypeCaller \
  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \
  -L chr20 \
  -O NA24694.gvcf.gz \
  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -ERC GVCF
</pre>
</pre>
<p>Then genotype the gVCF:</p>


<pre>
<pre>
gatk GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V NA24694.gvcf.gz -L chr20 --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -O NA24694.vcf.gz
gatk GenotypeGVCFs \
  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -V NA24694.gvcf.gz \
  -L chr20 \
  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -O NA24694.vcf.gz
</pre>
</pre>
<p>Finally, count SNPs using <code>bcftools stats</code>:</p>


<pre>
<pre>
bcftools stats NA24694.vcf.gz
bcftools stats NA24694.vcf.gz
</pre>
</pre>


Should give you:
<p>You should see something like:</p>
 
<pre>
<pre>
SN 0 number of SNPs: 75684
SN   0   number of SNPs:   75684
</pre>
</pre>
so 75684 SNPs.


<p><b>Answer:</b> There are <b>75,684 SNPs</b> on chromosome 20 in this sample.</p>
<hr>
<h3>Q2</h3>


'''Q2'''
<p>Index the VCF if not already indexed:</p>


First run:
<pre>
<pre>
tabix -p vcf NA24694.vcf.gz
tabix -p vcf NA24694.vcf.gz
</pre>
</pre>


Then:
<p>Then query the 1 Mb region:</p>
 
<pre>
<pre>
tabix NA24694.vcf.gz chr20:32000000-33000000 |wc -l
tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l
</pre>
</pre>
So 1290 variant sites.


'''Q3'''
<p><b>Answer:</b> <b>1290 variant sites</b> in this region (all variant types).</p>
 
<hr>
 
<h3>Q3</h3>
 
<p>Count only SNPs (exclude indels and multi-allelic sites):</p>


There are 2 ways to do this:
<pre>
<pre>
bcftools view -H --type=snps NA24694.vcf.gz chr20:32000000-33000000 |wc -l  
bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l
</pre>
</pre>
or
 
<p>or equivalently:</p>
 
<pre>
<pre>
tabix -h NA24694.vcf.gz chr20:32000000-33000000 |bcftools view -H --type=snps - |wc -l
tabix -h NA24694.vcf.gz chr20:32000000-33000000 \
  | bcftools view -H --type snps - \
  | wc -l
</pre>
</pre>
Both will give you: 956.


<p><b>Answer:</b> <b>956 SNPs</b> in the region.</p>
<hr>


'''Q4'''
<h3>Q4</h3>


<p><b>Variant at chr20:32011209</b></p>


<pre>
<pre>
tabix NA24694.vcf.gz chr20:32011209-32011209
tabix NA24694.vcf.gz chr20:32011209-32011209
</pre>
</pre>
You get:
 
<pre>
<pre>
chr20 32011209 rs147652161 G A 264.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.010e-01;DB;DP=24;ExcessHet=3.0103;FS=3.949;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=11.03;ReadPosRankSum=-3.580e-01;SOR=0.552 GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533
chr20 32011209 rs147652161 G A 264.64 .
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552
GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533
</pre>
</pre>


The line above says that you probably have G and A and the site is heterozygous. G is the reference and A the alternative allele. The allele depth is 15Gs, 9As, the depth is 24, the genotype quality is 99, the PHRED genotype likelihoods are homo ref, hetero, homo alt: 272,0,533
<p><b>Interpretation</b>:</p>
<ul>
  <li>Genotype: <b>0/1</b> (heterozygous G/A)</li>
  <li>Allele depth (AD): <b>15 G</b> and <b>9 A</b></li>
  <li>Depth (DP): <b>24</b></li>
  <li>Genotype quality (GQ): <b>99</b></li>
  <li>Genotype likelihoods (PL): <b>272,0,533</b> (het most likely)</li>
</ul>
 
<p><b>Variant at chr20:32044279</b></p>


<pre>
<pre>
tabix NA24694.vcf.gz chr20:32044279-32044279
tabix NA24694.vcf.gz chr20:32044279-32044279
</pre>
</pre>
You get:
 
<pre>
<pre>
chr20 32044279 rs4525768 C T 799.06 . AC=2;AF=1.00;AN=2;DB;DP=21;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.99;SOR=0.990 GT:AD:DP:GQ:PL 1/1:0,21:21:63:813,63,0
chr20 32044279 rs4525768 C T 799.06 .
AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99
GT:AD:DP:GQ:PL 1/1:0,21:21:63:813,63,0
</pre>
</pre>


The line above says that you are probably homozygous for T. C is the reference and T the alternative allele. The allele depth is 0Cs, 21Ts, the depth is 21, the genotype quality is 63, the PHRED genotype likelihoods are homo ref, hetero, homo alt: 813,63,0.
<p><b>Interpretation</b>:</p>
<ul>
  <li>Genotype: <b>1/1</b> (homozygous T/T)</li>
  <li>Allele depth (AD): <b>0 C</b>, <b>21 T</b></li>
  <li>Depth (DP): <b>21</b></li>
  <li>Genotype quality (GQ): <b>63</b></li>
  <li>Genotype likelihoods (PL): <b>813,63,0</b> (homozygous alt most likely)</li>
</ul>


<hr>


'''Q5'''
<h3>Q5</h3>
 
<p><b>Higher-quality heterozygous SNP</b>:</p>


Both are heterozygous sites however this is the better one:
<pre>
<pre>
chr20 32974911 rs6088051 A G 403.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.693;DB;DP=22;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.78;MQRankSum=-6.860e-01;QD=19.22;ReadPosRankSum=-1.703e+00;SOR=0.871 GT:AD:DP:GQ:PL 0/1:8,13:21:99:411,0,247
chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL 0/1:8,13:21:99:411,0,247
</pre>
</pre>


and the worse one:
<p><b>Lower-quality heterozygous SNP</b>:</p>
 
<pre>
<pre>
chr20 64291638 rs369221086 C T 30.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.530e-01;DB;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.18;MQRankSum=-9.670e-01;QD=5.11;ReadPosRankSum=-8.420e-01;SOR=0.693 GT:AD:DP:GQ:PL 0/1:4,2:6:38:38,0,114
chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL 0/1:4,2:6:38:38,0,114
</pre>
</pre>


The first site has a depth of 21, 8As and 13Gs. In the second case, the genotype quality is 38 and the first was 99. The second one only has 4Cs and 2 Ts for a total depth of 6 which is not sufficient to confidently call a heterozygous site.
<p><b>Why?</b></p>
<ul>
  <li>The first site has <b>much higher depth</b> (21× vs 6×)</li>
  <li>Allele balance is more reasonable: 8/13 vs 4/2</li>
  <li>The genotype quality (GQ) is much higher: <b>99 vs 38</b></li>
  <li>Overall likelihoods strongly support the correct genotype</li>
</ul>


 
<p><b>Conclusion:</b> More data ⇒ higher confidence.</p>
'''Q6'''
 
<hr>
 
<h3>Q6</h3>
 
<p>Count SNPs with no dbSNP ID (column 3 = <code>.</code>):</p>


Either use:
<pre>
<pre>
bcftools view -H --types=snps NA24694.vcf.gz chr20:32000000-33000000 |cut -f 3 |grep -v rs |wc -l
bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
  | cut -f 3 \
  | grep -v rs \
  | wc -l
</pre>
</pre>


or  
<p>or equivalently:</p>


<pre>
<pre>
bcftools view -H --types=snps NA24694.vcf.gz chr20:32000000-33000000 |cut -f 3 |grep "\." |wc -l
bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
  | cut -f 3 \
  | grep "\." \
  | wc -l
</pre>
</pre>


you will get 17.
<p><b>Answer:</b> <b>17 novel SNPs</b></p>
 
<hr>
 
<h3>Q7</h3>
 
<p>You found:</p>
<ul>
  <li><b>956 total SNPs</b> in the region (Q3)</li>
  <li><b>17 novel SNPs</b> (Q6)</li>
</ul>


<p><b>Only ~1.8% of SNPs are novel.</b></p>


'''Q7'''
<p>This is expected because:</p>
<ul>
  <li>Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.</li>
  <li>dbSNP contains >100 million known variants, so most common variation is already catalogued.</li>
  <li>Novel variants tend to be rare or extremely rare.</li>
</ul>


17 is very little compared to the number of SNPs 956. However, this is very expected. Given that the individual is Han Chinese, this ethnic group is very well represented in dbSNP.
<p><b>Conclusion:</b> The number of novel variants is small and biologically reasonable.</p>

Latest revision as of 14:51, 25 November 2025

Answers

Q1

First index the gVCF:

tabix -f -p vcf NA24694.gvcf.gz

Then run HaplotypeCaller:

gatk --java-options "-Xmx10g" HaplotypeCaller \
  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \
  -L chr20 \
  -O NA24694.gvcf.gz \
  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -ERC GVCF

Then genotype the gVCF:

gatk GenotypeGVCFs \
  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -V NA24694.gvcf.gz \
  -L chr20 \
  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -O NA24694.vcf.gz

Finally, count SNPs using bcftools stats:

bcftools stats NA24694.vcf.gz

You should see something like:

SN    0    number of SNPs:    75684

Answer: There are 75,684 SNPs on chromosome 20 in this sample.


Q2

Index the VCF if not already indexed:

tabix -p vcf NA24694.vcf.gz

Then query the 1 Mb region:

tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l

Answer: 1290 variant sites in this region (all variant types).


Q3

Count only SNPs (exclude indels and multi-allelic sites):

bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l

or equivalently:

tabix -h NA24694.vcf.gz chr20:32000000-33000000 \
  | bcftools view -H --type snps - \
  | wc -l

Answer: 956 SNPs in the region.


Q4

Variant at chr20:32011209

tabix NA24694.vcf.gz chr20:32011209-32011209
chr20  32011209  rs147652161  G  A  264.64  .  
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552  
GT:AD:DP:GQ:PL  0/1:15,9:24:99:272,0,533

Interpretation:

  • Genotype: 0/1 (heterozygous G/A)
  • Allele depth (AD): 15 G and 9 A
  • Depth (DP): 24
  • Genotype quality (GQ): 99
  • Genotype likelihoods (PL): 272,0,533 (het most likely)

Variant at chr20:32044279

tabix NA24694.vcf.gz chr20:32044279-32044279
chr20  32044279  rs4525768  C  T  799.06  .  
AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99  
GT:AD:DP:GQ:PL  1/1:0,21:21:63:813,63,0

Interpretation:

  • Genotype: 1/1 (homozygous T/T)
  • Allele depth (AD): 0 C, 21 T
  • Depth (DP): 21
  • Genotype quality (GQ): 63
  • Genotype likelihoods (PL): 813,63,0 (homozygous alt most likely)

Q5

Higher-quality heterozygous SNP:

chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL  0/1:8,13:21:99:411,0,247

Lower-quality heterozygous SNP:

chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL  0/1:4,2:6:38:38,0,114

Why?

  • The first site has much higher depth (21× vs 6×)
  • Allele balance is more reasonable: 8/13 vs 4/2
  • The genotype quality (GQ) is much higher: 99 vs 38
  • Overall likelihoods strongly support the correct genotype

Conclusion: More data ⇒ higher confidence.


Q6

Count SNPs with no dbSNP ID (column 3 = .):

bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
  | cut -f 3 \
  | grep -v rs \
  | wc -l

or equivalently:

bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
  | cut -f 3 \
  | grep "\." \
  | wc -l

Answer: 17 novel SNPs


Q7

You found:

  • 956 total SNPs in the region (Q3)
  • 17 novel SNPs (Q6)

Only ~1.8% of SNPs are novel.

This is expected because:

  • Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.
  • dbSNP contains >100 million known variants, so most common variation is already catalogued.
  • Novel variants tend to be rare or extremely rare.

Conclusion: The number of novel variants is small and biologically reasonable.