SNP calling exercise answers part 2: Difference between revisions
(Created page with " '''Q1''' First running: <pre> gatk VariantFiltration -V NA24694.vcf.gz -O NA24694_hf.vcf.gz -filter "DP < 10.0" --filter-name "DP" -filter "QUAL < 30.0" --filter-name "QUAL30" -filter "SOR > 3.0" --filter-name "SOR3" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" </pre> and <pre> bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |wc -l </pre> Gives us 4005 sites. <pre> bcftools view -H --type=...") |
No edit summary |
||
| Line 1: | Line 1: | ||
<h2>Answers</h2> | |||
<h3>Q1 — How many sites were filtered out?</h3> | |||
<p>After running:</p> | |||
<pre> | <pre> | ||
gatk VariantFiltration | gatk VariantFiltration \ | ||
-V NA24694.vcf.gz \ | |||
-O NA24694_hf.vcf.gz \ | |||
-filter "DP < 10.0" --filter-name "DP" \ | |||
-filter "QUAL < 30.0" --filter-name "QUAL30" \ | |||
-filter "SOR > 3.0" --filter-name "SOR3" \ | |||
-filter "FS > 60.0" --filter-name "FS60" \ | |||
-filter "MQ < 40.0" --filter-name "MQ40" | |||
</pre> | </pre> | ||
<p>Count all filtered variants (non-PASS):</p> | |||
<pre> | <pre> | ||
bcftools view -H NA24694_hf.vcf.gz | grep -v PASS | wc -l | |||
</pre> | </pre> | ||
<p><b>4005 sites</b> were filtered out (all variant types).</p> | |||
<p>Count only SNPs filtered out:</p> | |||
<pre> | <pre> | ||
bcftools view -H --type=snps NA24694_hf.vcf.gz | grep -v PASS | wc -l | |||
</pre> | </pre> | ||
<p><b>2630 SNPs</b> were filtered out.</p> | |||
<hr> | |||
<h3>Q2 — Which filter removed the most sites?</h3> | |||
<p>One approach:</p> | |||
<pre> | |||
bcftools view -H NA24694_hf.vcf.gz \ | |||
| grep -v PASS \ | |||
| cut -f7 \ | |||
| sort \ | |||
| uniq -c \ | |||
| sort -n | |||
</pre> | |||
<p>This produces:</p> | |||
<pre> | <pre> | ||
5 FS60;SOR3 | 5 FS60;SOR3 | ||
30 DP;MQ40;SOR3 | 30 DP;MQ40;SOR3 | ||
| Line 38: | Line 60: | ||
</pre> | </pre> | ||
<p><b>Depth filter (DP)</b> removed the most sites, followed by <b>SOR</b> (strand bias) and <b>MQ</b> (mapping quality).</p> | |||
<hr> | |||
<h3>Q3 — How many sites remain after applying the mappability filter?</h3> | |||
<p>First extract sites that passed all hard filters:</p> | |||
<pre> | |||
bcftools view -f PASS NA24694_hf.vcf.gz \ | |||
| bgzip -c > NA24694_hf_pass.vcf.gz | |||
</pre> | |||
<p>Count variants that passed all filters:</p> | |||
<pre> | <pre> | ||
bcftools view -H NA24694_hf_pass.vcf.gz | wc -l | |||
bcftools view -H NA24694_hf_pass.vcf.gz |wc -l | |||
</pre> | </pre> | ||
<p><b>88,594 total variants</b> (SNPs + indels + multi-allelic sites).</p> | |||
<p>Now retain only variants in high-mappability regions:</p> | |||
<pre> | <pre> | ||
bedtools intersect -header -a | bedtools intersect -header \ | ||
-a NA24694_hf_pass.vcf.gz \ | |||
-b /home/databases/databases/GRCh38/filter99.bed.gz \ | |||
| bgzip -c > NA24694_hf_map99.vcf.gz | |||
</pre> | </pre> | ||
<p>Count remaining sites:</p> | |||
<pre> | <pre> | ||
bcftools view -H NA24694_hf_map99.vcf.gz |wc -l | bcftools view -H NA24694_hf_map99.vcf.gz | wc -l | ||
</pre> | </pre> | ||
<p><b>51,624 variants</b> remain after mappability filtering.</p> | |||
<hr> | |||
<h3>Q4 — Most common genomic category (snpEff)</h3> | |||
<p>Run snpEff:</p> | |||
<pre> | <pre> | ||
java -jar /home/ctools/snpEff/snpEff.jar | java -jar /home/ctools/snpEff/snpEff.jar eff \ | ||
-dataDir /home/databases/databases/snpEff/ \ | |||
-htmlStats NA24694_hf.html \ | |||
GRCh38.99 \ | |||
NA24694_hf.vcf.gz \ | |||
| bgzip -c > NA24694_hf_ann.vcf.gz | |||
</pre> | </pre> | ||
<p>From the snpEff HTML output:</p> | |||
Intron 64.368% | |||
<p><b>Intron variants = 64.368%</b></p> | |||
<p><b>Answer:</b> Most variants fall in <b>intronic regions</b>.</p> | |||
<hr> | |||
<h3>Q5 — Variants that cause a codon change</h3> | |||
<p>From the snpEff HTML report:</p> | |||
<p><b>MISSENSE = 584 variants (44.242%)</b></p> | |||
<p><b>Answer:</b> <b>584 missense mutations</b> are predicted to change a codon and alter the protein sequence.</p> | |||
<hr> | |||
<p><b>End of Answers</b></p> | |||
Latest revision as of 10:15, 26 November 2025
Answers
Q1 — How many sites were filtered out?
After running:
gatk VariantFiltration \ -V NA24694.vcf.gz \ -O NA24694_hf.vcf.gz \ -filter "DP < 10.0" --filter-name "DP" \ -filter "QUAL < 30.0" --filter-name "QUAL30" \ -filter "SOR > 3.0" --filter-name "SOR3" \ -filter "FS > 60.0" --filter-name "FS60" \ -filter "MQ < 40.0" --filter-name "MQ40"
Count all filtered variants (non-PASS):
bcftools view -H NA24694_hf.vcf.gz | grep -v PASS | wc -l
4005 sites were filtered out (all variant types).
Count only SNPs filtered out:
bcftools view -H --type=snps NA24694_hf.vcf.gz | grep -v PASS | wc -l
2630 SNPs were filtered out.
Q2 — Which filter removed the most sites?
One approach:
bcftools view -H NA24694_hf.vcf.gz \ | grep -v PASS \ | cut -f7 \ | sort \ | uniq -c \ | sort -n
This produces:
5 FS60;SOR3
30 DP;MQ40;SOR3
74 MQ40;SOR3
158 DP;SOR3
197 DP;MQ40
390 MQ40
1340 SOR3
1811 DP
Depth filter (DP) removed the most sites, followed by SOR (strand bias) and MQ (mapping quality).
Q3 — How many sites remain after applying the mappability filter?
First extract sites that passed all hard filters:
bcftools view -f PASS NA24694_hf.vcf.gz \ | bgzip -c > NA24694_hf_pass.vcf.gz
Count variants that passed all filters:
bcftools view -H NA24694_hf_pass.vcf.gz | wc -l
88,594 total variants (SNPs + indels + multi-allelic sites).
Now retain only variants in high-mappability regions:
bedtools intersect -header \ -a NA24694_hf_pass.vcf.gz \ -b /home/databases/databases/GRCh38/filter99.bed.gz \ | bgzip -c > NA24694_hf_map99.vcf.gz
Count remaining sites:
bcftools view -H NA24694_hf_map99.vcf.gz | wc -l
51,624 variants remain after mappability filtering.
Q4 — Most common genomic category (snpEff)
Run snpEff:
java -jar /home/ctools/snpEff/snpEff.jar eff \ -dataDir /home/databases/databases/snpEff/ \ -htmlStats NA24694_hf.html \ GRCh38.99 \ NA24694_hf.vcf.gz \ | bgzip -c > NA24694_hf_ann.vcf.gz
From the snpEff HTML output:
Intron variants = 64.368%
Answer: Most variants fall in intronic regions.
Q5 — Variants that cause a codon change
From the snpEff HTML report:
MISSENSE = 584 variants (44.242%)
Answer: 584 missense mutations are predicted to change a codon and alter the protein sequence.
End of Answers