SNP calling exercise answers part 1: Difference between revisions
No edit summary |
No edit summary |
||
| Line 1: | Line 1: | ||
<h2>Answers</h2> | |||
<h3>Q1</h3> | |||
<p>First index the gVCF:</p> | |||
<pre> | <pre> | ||
tabix -f -p vcf NA24694.gvcf.gz | tabix -f -p vcf NA24694.gvcf.gz | ||
</pre> | </pre> | ||
<p>Then run HaplotypeCaller:</p> | |||
<pre> | <pre> | ||
gatk --java-options "-Xmx10g" HaplotypeCaller | gatk --java-options "-Xmx10g" HaplotypeCaller \ | ||
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ | |||
-I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \ | |||
-L chr20 \ | |||
-O NA24694.gvcf.gz \ | |||
--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \ | |||
-ERC GVCF | |||
</pre> | </pre> | ||
<p>Then genotype the gVCF:</p> | |||
<pre> | <pre> | ||
gatk | gatk GenotypeGVCFs \ | ||
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ | |||
-V NA24694.gvcf.gz \ | |||
-L chr20 \ | |||
--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \ | |||
-O NA24694.vcf.gz | |||
</pre> | </pre> | ||
<p>Finally, count SNPs using <code>bcftools stats</code>:</p> | |||
<pre> | <pre> | ||
bcftools stats | bcftools stats NA24694.vcf.gz | ||
</pre> | </pre> | ||
<p>You should see something like:</p> | |||
<pre> | <pre> | ||
SN 0 number of SNPs: 75684 | SN 0 number of SNPs: 75684 | ||
</pre> | </pre> | ||
<p><b>Answer:</b> There are <b>75,684 SNPs</b> on chromosome 20 in this sample.</p> | |||
<hr> | |||
<h3>Q2</h3> | |||
<p>Index the VCF if not already indexed:</p> | |||
<pre> | <pre> | ||
tabix -p vcf NA24694.vcf.gz | tabix -p vcf NA24694.vcf.gz | ||
</pre> | </pre> | ||
Then: | <p>Then query the 1 Mb region:</p> | ||
<pre> | <pre> | ||
tabix | tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l | ||
</pre> | </pre> | ||
<p><b>Answer:</b> <b>1290 variant sites</b> in this region (all variant types).</p> | |||
<hr> | |||
<h3>Q3</h3> | |||
<p>Count only SNPs (exclude indels and multi-allelic sites):</p> | |||
<pre> | <pre> | ||
bcftools view -H --type | bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l | ||
</pre> | </pre> | ||
or | |||
<p>or equivalently:</p> | |||
<pre> | <pre> | ||
tabix -h NA24694.vcf.gz chr20:32000000-33000000 |bcftools view -H --type | tabix -h NA24694.vcf.gz chr20:32000000-33000000 \ | ||
| bcftools view -H --type snps - \ | |||
| wc -l | |||
</pre> | </pre> | ||
<p><b>Answer:</b> <b>956 SNPs</b> in the region.</p> | |||
<hr> | |||
<h3>Q4</h3> | |||
<p><b>Variant at chr20:32011209</b></p> | |||
<pre> | <pre> | ||
tabix NA24694.vcf.gz | tabix NA24694.vcf.gz chr20:32011209-32011209 | ||
</pre> | </pre> | ||
<pre> | <pre> | ||
chr20 32011209 rs147652161 G A 264.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=- | chr20 32011209 rs147652161 G A 264.64 . | ||
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552 | |||
GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533 | |||
</pre> | </pre> | ||
<p><b>Interpretation</b>:</p> | |||
<ul> | |||
<li>Genotype: <b>0/1</b> (heterozygous G/A)</li> | |||
<li>Allele depth (AD): <b>15 G</b> and <b>9 A</b></li> | |||
<li>Depth (DP): <b>24</b></li> | |||
<li>Genotype quality (GQ): <b>99</b></li> | |||
<li>Genotype likelihoods (PL): <b>272,0,533</b> (het most likely)</li> | |||
</ul> | |||
<p><b>Variant at chr20:32044279</b></p> | |||
<pre> | <pre> | ||
tabix NA24694.vcf.gz | tabix NA24694.vcf.gz chr20:32044279-32044279 | ||
</pre> | </pre> | ||
<pre> | <pre> | ||
chr20 32044279 rs4525768 C T 799.06 . AC=2;AF=1.00;AN=2;DB;DP=21 | chr20 32044279 rs4525768 C T 799.06 . | ||
AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99 | |||
GT:AD:DP:GQ:PL 1/1:0,21:21:63:813,63,0 | |||
</pre> | </pre> | ||
<p><b>Interpretation</b>:</p> | |||
<ul> | |||
<li>Genotype: <b>1/1</b> (homozygous T/T)</li> | |||
<li>Allele depth (AD): <b>0 C</b>, <b>21 T</b></li> | |||
<li>Depth (DP): <b>21</b></li> | |||
<li>Genotype quality (GQ): <b>63</b></li> | |||
<li>Genotype likelihoods (PL): <b>813,63,0</b> (homozygous alt most likely)</li> | |||
</ul> | |||
<hr> | |||
<h3>Q5</h3> | |||
<p><b>Higher-quality heterozygous SNP</b>:</p> | |||
<pre> | <pre> | ||
chr20 32974911 rs6088051 A G | chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL 0/1:8,13:21:99:411,0,247 | ||
</pre> | </pre> | ||
<p><b>Lower-quality heterozygous SNP</b>:</p> | |||
<pre> | <pre> | ||
chr20 64291638 rs369221086 C T | chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL 0/1:4,2:6:38:38,0,114 | ||
</pre> | </pre> | ||
The first site has | <p><b>Why?</b></p> | ||
<ul> | |||
<li>The first site has <b>much higher depth</b> (21× vs 6×)</li> | |||
<li>Allele balance is more reasonable: 8/13 vs 4/2</li> | |||
<li>The genotype quality (GQ) is much higher: <b>99 vs 38</b></li> | |||
<li>Overall likelihoods strongly support the correct genotype</li> | |||
</ul> | |||
<p><b>Conclusion:</b> More data ⇒ higher confidence.</p> | |||
<hr> | |||
<h3>Q6</h3> | |||
<p>Count SNPs with no dbSNP ID (column 3 = <code>.</code>):</p> | |||
<pre> | <pre> | ||
bcftools view -H --types | bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | ||
| cut -f 3 \ | |||
| grep -v rs \ | |||
| wc -l | |||
</pre> | </pre> | ||
or | <p>or equivalently:</p> | ||
<pre> | <pre> | ||
bcftools view -H --types | bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | ||
| cut -f 3 \ | |||
| grep "\." \ | |||
| wc -l | |||
</pre> | </pre> | ||
<p><b>Answer:</b> <b>17 novel SNPs</b></p> | |||
<hr> | |||
<h3>Q7</h3> | |||
<p>You found:</p> | |||
<ul> | |||
<li><b>956 total SNPs</b> in the region (Q3)</li> | |||
<li><b>17 novel SNPs</b> (Q6)</li> | |||
</ul> | |||
<p><b>Only ~1.8% of SNPs are novel.</b></p> | |||
<p>This is expected because:</p> | |||
<ul> | |||
<li>Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.</li> | |||
<li>dbSNP contains >100 million known variants, so most common variation is already catalogued.</li> | |||
<li>Novel variants tend to be rare or extremely rare.</li> | |||
</ul> | |||
<p><b>Conclusion:</b> The number of novel variants is small and biologically reasonable.</p> | |||
Latest revision as of 14:51, 25 November 2025
Answers
Q1
First index the gVCF:
tabix -f -p vcf NA24694.gvcf.gz
Then run HaplotypeCaller:
gatk --java-options "-Xmx10g" HaplotypeCaller \ -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \ -L chr20 \ -O NA24694.gvcf.gz \ --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \ -ERC GVCF
Then genotype the gVCF:
gatk GenotypeGVCFs \ -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ -V NA24694.gvcf.gz \ -L chr20 \ --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \ -O NA24694.vcf.gz
Finally, count SNPs using bcftools stats:
bcftools stats NA24694.vcf.gz
You should see something like:
SN 0 number of SNPs: 75684
Answer: There are 75,684 SNPs on chromosome 20 in this sample.
Q2
Index the VCF if not already indexed:
tabix -p vcf NA24694.vcf.gz
Then query the 1 Mb region:
tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l
Answer: 1290 variant sites in this region (all variant types).
Q3
Count only SNPs (exclude indels and multi-allelic sites):
bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l
or equivalently:
tabix -h NA24694.vcf.gz chr20:32000000-33000000 \ | bcftools view -H --type snps - \ | wc -l
Answer: 956 SNPs in the region.
Q4
Variant at chr20:32011209
tabix NA24694.vcf.gz chr20:32011209-32011209
chr20 32011209 rs147652161 G A 264.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552 GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533
Interpretation:
- Genotype: 0/1 (heterozygous G/A)
- Allele depth (AD): 15 G and 9 A
- Depth (DP): 24
- Genotype quality (GQ): 99
- Genotype likelihoods (PL): 272,0,533 (het most likely)
Variant at chr20:32044279
tabix NA24694.vcf.gz chr20:32044279-32044279
chr20 32044279 rs4525768 C T 799.06 . AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99 GT:AD:DP:GQ:PL 1/1:0,21:21:63:813,63,0
Interpretation:
- Genotype: 1/1 (homozygous T/T)
- Allele depth (AD): 0 C, 21 T
- Depth (DP): 21
- Genotype quality (GQ): 63
- Genotype likelihoods (PL): 813,63,0 (homozygous alt most likely)
Q5
Higher-quality heterozygous SNP:
chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL 0/1:8,13:21:99:411,0,247
Lower-quality heterozygous SNP:
chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL 0/1:4,2:6:38:38,0,114
Why?
- The first site has much higher depth (21× vs 6×)
- Allele balance is more reasonable: 8/13 vs 4/2
- The genotype quality (GQ) is much higher: 99 vs 38
- Overall likelihoods strongly support the correct genotype
Conclusion: More data ⇒ higher confidence.
Q6
Count SNPs with no dbSNP ID (column 3 = .):
bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | cut -f 3 \ | grep -v rs \ | wc -l
or equivalently:
bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | cut -f 3 \ | grep "\." \ | wc -l
Answer: 17 novel SNPs
Q7
You found:
- 956 total SNPs in the region (Q3)
- 17 novel SNPs (Q6)
Only ~1.8% of SNPs are novel.
This is expected because:
- Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.
- dbSNP contains >100 million known variants, so most common variation is already catalogued.
- Novel variants tend to be rare or extremely rare.
Conclusion: The number of novel variants is small and biologically reasonable.