SNP calling exercise answers part 1: Difference between revisions

Latest revision as of 14:51, 25 November 2025

Answers

Q1

First index the gVCF:

tabix -f -p vcf NA24694.gvcf.gz

Then run HaplotypeCaller:

gatk --java-options "-Xmx10g" HaplotypeCaller \
  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \
  -L chr20 \
  -O NA24694.gvcf.gz \
  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -ERC GVCF

Then genotype the gVCF:

gatk GenotypeGVCFs \
  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -V NA24694.gvcf.gz \
  -L chr20 \
  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -O NA24694.vcf.gz

Finally, count SNPs using bcftools stats:

bcftools stats NA24694.vcf.gz

You should see something like:

SN    0    number of SNPs:    75684

Answer: There are 75,684 SNPs on chromosome 20 in this sample.

Q2

Index the VCF if not already indexed:

tabix -p vcf NA24694.vcf.gz

Then query the 1 Mb region:

tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l

Answer: 1290 variant sites in this region (all variant types).

Q3

Count only SNPs (exclude indels and multi-allelic sites):

bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l

or equivalently:

tabix -h NA24694.vcf.gz chr20:32000000-33000000 \
  | bcftools view -H --type snps - \
  | wc -l

Answer: 956 SNPs in the region.

Q4

Variant at chr20:32011209

tabix NA24694.vcf.gz chr20:32011209-32011209

chr20  32011209  rs147652161  G  A  264.64  .  
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552  
GT:AD:DP:GQ:PL  0/1:15,9:24:99:272,0,533

Interpretation:

Genotype: 0/1 (heterozygous G/A)
Allele depth (AD): 15 G and 9 A
Depth (DP): 24
Genotype quality (GQ): 99
Genotype likelihoods (PL): 272,0,533 (het most likely)

Variant at chr20:32044279

tabix NA24694.vcf.gz chr20:32044279-32044279

chr20  32044279  rs4525768  C  T  799.06  .  
AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99  
GT:AD:DP:GQ:PL  1/1:0,21:21:63:813,63,0

Interpretation:

Genotype: 1/1 (homozygous T/T)
Allele depth (AD): 0 C, 21 T
Depth (DP): 21
Genotype quality (GQ): 63
Genotype likelihoods (PL): 813,63,0 (homozygous alt most likely)

Q5

Higher-quality heterozygous SNP:

chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL  0/1:8,13:21:99:411,0,247

Lower-quality heterozygous SNP:

chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL  0/1:4,2:6:38:38,0,114

Why?

The first site has much higher depth (21× vs 6×)
Allele balance is more reasonable: 8/13 vs 4/2
The genotype quality (GQ) is much higher: 99 vs 38
Overall likelihoods strongly support the correct genotype

Conclusion: More data ⇒ higher confidence.

Q6

Count SNPs with no dbSNP ID (column 3 = .):

bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
  | cut -f 3 \
  | grep -v rs \
  | wc -l

or equivalently:

bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
  | cut -f 3 \
  | grep "\." \
  | wc -l

Answer: 17 novel SNPs

Q7

You found:

956 total SNPs in the region (Q3)
17 novel SNPs (Q6)

Only ~1.8% of SNPs are novel.

This is expected because:

Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.
dbSNP contains >100 million known variants, so most common variation is already catalogued.
Novel variants tend to be rare or extremely rare.

Conclusion: The number of novel variants is small and biologically reasonable.

SNP calling exercise answers part 1: Difference between revisions

Latest revision as of 14:51, 25 November 2025

Contents

Answers

Q1

Q2

Q3

Q4

Q5

Q6

Q7

Navigation menu

@@ Line 1: / Line 1: @@
-'''Q1'''
+<h2>Answers</h2>
+<h3>Q1</h3>
+<p>First index the gVCF:</p>
-First, running:
 <pre>
 tabix -f -p vcf NA24694.gvcf.gz
 </pre>
-then
+<p>Then run HaplotypeCaller:</p>
 <pre>
-gatk --java-options "-Xmx10g" HaplotypeCaller    -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa    -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam  -L chr20  -O NA24694.gvcf.gz  --dbsnp  /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz  -ERC GVCF
+gatk --java-options "-Xmx10g" HaplotypeCaller \
+  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
+  -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \
+  -L chr20 \
+  -O NA24694.gvcf.gz \
+  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
+  -ERC GVCF
 </pre>
+<p>Then genotype the gVCF:</p>
 <pre>
-gatk  GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V NA24694.gvcf.gz  -L chr20 --dbsnp  /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -O NA24694.vcf.gz
+gatk GenotypeGVCFs \
+  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
+  -V NA24694.gvcf.gz \
+  -L chr20 \
+  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
+  -O NA24694.vcf.gz
 </pre>
+<p>Finally, count SNPs using <code>bcftools stats</code>:</p>
 <pre>
-bcftools stats  NA24694.vcf.gz
+bcftools stats NA24694.vcf.gz
 </pre>
-Should give you:
+<p>You should see something like:</p>
 <pre>
-SN	0	number of SNPs:	75684
+SN    0    number of SNPs:    75684
 </pre>
-so 75684 SNPs.
+<p><b>Answer:</b> There are <b>75,684 SNPs</b> on chromosome 20 in this sample.</p>
+<hr>
+<h3>Q2</h3>
-'''Q2'''
+<p>Index the VCF if not already indexed:</p>
-First run:
 <pre>
 tabix -p vcf NA24694.vcf.gz
 </pre>
-Then:
+<p>Then query the 1 Mb region:</p>
 <pre>
-tabix  NA24694.vcf.gz chr20:32000000-33000000 |wc -l
+tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l
 </pre>
-So 1290 variant sites.
-'''Q3'''
+<p><b>Answer:</b> <b>1290 variant sites</b> in this region (all variant types).</p>
+<hr>
+<h3>Q3</h3>
+<p>Count only SNPs (exclude indels and multi-allelic sites):</p>
-There are 2 ways to do this:
 <pre>
-bcftools view -H --type=snps NA24694.vcf.gz  chr20:32000000-33000000 |wc -l
+bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l
 </pre>
-or
+<p>or equivalently:</p>
 <pre>
-tabix -h NA24694.vcf.gz chr20:32000000-33000000 |bcftools view -H --type=snps -  |wc -l
+tabix -h NA24694.vcf.gz chr20:32000000-33000000 \
+  | bcftools view -H --type snps - \
+  | wc -l
 </pre>
-Both will give you: 956.
+<p><b>Answer:</b> <b>956 SNPs</b> in the region.</p>
+<hr>
-'''Q4'''
+<h3>Q4</h3>
+<p><b>Variant at chr20:32011209</b></p>
 <pre>
-tabix NA24694.vcf.gz  chr20:32011209-32011209
+tabix NA24694.vcf.gz chr20:32011209-32011209
 </pre>
-You get:
 <pre>
-chr20	32011209	rs147652161	G	A	264.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=-3.010e-01;DB;DP=24;ExcessHet=3.0103;FS=3.949;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=11.03;ReadPosRankSum=-3.580e-01;SOR=0.552	GT:AD:DP:GQ:PL	0/1:15,9:24:99:272,0,533
+chr20  32011209  rs147652161  G  A  264.64  .
+AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552
+GT:AD:DP:GQ:PL  0/1:15,9:24:99:272,0,533
 </pre>
-The line above says that you probably have G and A and the site is heterozygous. G is the reference and A the alternative allele. The allele depth is 15Gs, 9As, the depth is 24, the genotype quality is 99, the PHRED genotype likelihoods are homo ref, hetero, homo alt: 272,0,533
+<p><b>Interpretation</b>:</p>
+<ul>
+  <li>Genotype: <b>0/1</b> (heterozygous G/A)</li>
+  <li>Allele depth (AD): <b>15 G</b> and <b>9 A</b></li>
+  <li>Depth (DP): <b>24</b></li>
+  <li>Genotype quality (GQ): <b>99</b></li>
+  <li>Genotype likelihoods (PL): <b>272,0,533</b> (het most likely)</li>
+</ul>
+<p><b>Variant at chr20:32044279</b></p>
 <pre>
-tabix NA24694.vcf.gz  chr20:32044279-32044279
+tabix NA24694.vcf.gz chr20:32044279-32044279
 </pre>
-You get:
 <pre>
-chr20	32044279	rs4525768	C	T	799.06	.	AC=2;AF=1.00;AN=2;DB;DP=21;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.99;SOR=0.990	GT:AD:DP:GQ:PL	1/1:0,21:21:63:813,63,0
+chr20  32044279  rs4525768  C  T  799.06  .
+AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99
+GT:AD:DP:GQ:PL  1/1:0,21:21:63:813,63,0
 </pre>
-The line above says that you are probably homozygous for T. C is the reference and T the alternative allele. The allele depth is 0Cs, 21Ts, the depth is 21, the genotype quality is 63, the PHRED genotype likelihoods are homo ref, hetero, homo alt: 813,63,0.
+<p><b>Interpretation</b>:</p>
+<ul>
+  <li>Genotype: <b>1/1</b> (homozygous T/T)</li>
+  <li>Allele depth (AD): <b>0 C</b>, <b>21 T</b></li>
+  <li>Depth (DP): <b>21</b></li>
+  <li>Genotype quality (GQ): <b>63</b></li>
+  <li>Genotype likelihoods (PL): <b>813,63,0</b> (homozygous alt most likely)</li>
+</ul>
+<hr>
-'''Q5'''
+<h3>Q5</h3>
+<p><b>Higher-quality heterozygous SNP</b>:</p>
-Both are heterozygous sites however this is the better one:
 <pre>
-chr20	32974911	rs6088051	A	G	403.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=0.693;DB;DP=22;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.78;MQRankSum=-6.860e-01;QD=19.22;ReadPosRankSum=-1.703e+00;SOR=0.871	GT:AD:DP:GQ:PL	0/1:8,13:21:99:411,0,247
+chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL  0/1:8,13:21:99:411,0,247
 </pre>
-and the worse one:
+<p><b>Lower-quality heterozygous SNP</b>:</p>
 <pre>
-chr20	64291638	rs369221086	C	T	30.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=-2.530e-01;DB;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.18;MQRankSum=-9.670e-01;QD=5.11;ReadPosRankSum=-8.420e-01;SOR=0.693	GT:AD:DP:GQ:PL	0/1:4,2:6:38:38,0,114
+chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL  0/1:4,2:6:38:38,0,114
 </pre>
-The first site has a depth of 21, 8As and 13Gs. In the second case, the genotype quality is 38 and the first was 99. The second one only has 4Cs and 2 Ts for a total depth of 6 which is not sufficient to confidently call a heterozygous site.
+<p><b>Why?</b></p>
+<ul>
+  <li>The first site has <b>much higher depth</b> (21× vs 6×)</li>
+  <li>Allele balance is more reasonable: 8/13 vs 4/2</li>
+  <li>The genotype quality (GQ) is much higher: <b>99 vs 38</b></li>
+  <li>Overall likelihoods strongly support the correct genotype</li>
+</ul>
+<p><b>Conclusion:</b> More data ⇒ higher confidence.</p>
-'''Q6'''
+<hr>
+<h3>Q6</h3>
+<p>Count SNPs with no dbSNP ID (column 3 = <code>.</code>):</p>
-Either use:
 <pre>
-bcftools view -H --types=snps  NA24694.vcf.gz chr20:32000000-33000000 |cut -f 3 |grep -v rs |wc -l
+bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
+  | cut -f 3 \
+  | grep -v rs \
+  | wc -l
 </pre>
-or
+<p>or equivalently:</p>
 <pre>
-bcftools view -H --types=snps  NA24694.vcf.gz chr20:32000000-33000000 |cut -f 3 |grep "\." |wc -l
+bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
+  | cut -f 3 \
+  | grep "\." \
+  | wc -l
 </pre>
-you will get 17.
+<p><b>Answer:</b> <b>17 novel SNPs</b></p>
+<hr>
+<h3>Q7</h3>
+<p>You found:</p>
+<ul>
+  <li><b>956 total SNPs</b> in the region (Q3)</li>
+  <li><b>17 novel SNPs</b> (Q6)</li>
+</ul>
+<p><b>Only ~1.8% of SNPs are novel.</b></p>
-'''Q7'''
+<p>This is expected because:</p>
+<ul>
+  <li>Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.</li>
+  <li>dbSNP contains >100 million known variants, so most common variation is already catalogued.</li>
+  <li>Novel variants tend to be rare or extremely rare.</li>
+</ul>
-is very little compared to the number of SNPs 956. However, this is very expected. Given that the individual is Han Chinese, this ethnic group is very well represented in dbSNP.
+<p><b>Conclusion:</b> The number of novel variants is small and biologically reasonable.</p>

SNP calling exercise answers part 1: Difference between revisions

Latest revision as of 14:51, 25 November 2025

Answers

Q1

Q2

Q3

Q4

Q5

Q6

Q7

Navigation menu

Search