SNP calling exercise answers part 2: Difference between revisions

From 22126
Jump to navigation Jump to search
(Created page with " '''Q1''' First running: <pre> gatk VariantFiltration -V NA24694.vcf.gz -O NA24694_hf.vcf.gz -filter "DP < 10.0" --filter-name "DP" -filter "QUAL < 30.0" --filter-name "QUAL30" -filter "SOR > 3.0" --filter-name "SOR3" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40" </pre> and <pre> bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |wc -l </pre> Gives us 4005 sites. <pre> bcftools view -H --type=...")
 
No edit summary
 
Line 1: Line 1:
<h2>Answers</h2>


'''Q1'''
<h3>Q1 — How many sites were filtered out?</h3>


First running:
<p>After running:</p>


<pre>
<pre>
gatk VariantFiltration     -V NA24694.vcf.gz -O NA24694_hf.vcf.gz   -filter "DP < 10.0" --filter-name "DP"    -filter "QUAL < 30.0" --filter-name "QUAL30"     -filter "SOR > 3.0" --filter-name "SOR3"     -filter "FS > 60.0" --filter-name "FS60"     -filter "MQ < 40.0" --filter-name "MQ40"    
gatk VariantFiltration \
 
  -V NA24694.vcf.gz \
  -O NA24694_hf.vcf.gz \
  -filter "DP < 10.0"   --filter-name "DP" \
   -filter "QUAL < 30.0" --filter-name "QUAL30" \
  -filter "SOR > 3.0"   --filter-name "SOR3" \
  -filter "FS > 60.0"   --filter-name "FS60" \
  -filter "MQ < 40.0"   --filter-name "MQ40"
</pre>
</pre>


and
<p>Count all filtered variants (non-PASS):</p>


<pre>
<pre>
bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |wc -l  
bcftools view -H NA24694_hf.vcf.gz | grep -v PASS | wc -l
</pre>
</pre>


Gives us 4005 sites.
<p><b>4005 sites</b> were filtered out (all variant types).</p>
 
<p>Count only SNPs filtered out:</p>


<pre>
<pre>
bcftools view -H --type=snps NA24694_hf.vcf.gz |grep -v PASS |wc -l  
bcftools view -H --type=snps NA24694_hf.vcf.gz | grep -v PASS | wc -l
</pre>
</pre>
2630 SNPs


<p><b>2630 SNPs</b> were filtered out.</p>
<hr>
<h3>Q2 — Which filter removed the most sites?</h3>
<p>One approach:</p>
<pre>
bcftools view -H NA24694_hf.vcf.gz \
| grep -v PASS \
| cut -f7 \
| sort \
| uniq -c \
| sort -n
</pre>


'''Q2'''
<p>This produces:</p>


One possibility is:
<pre>
<pre>
bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |cut -f 7  |sort |uniq -c  |sort -n
       5 FS60;SOR3
       5 FS60;SOR3
     30 DP;MQ40;SOR3
     30 DP;MQ40;SOR3
Line 38: Line 60:
</pre>
</pre>


This says remove all lines with the string "PASS", extract the seventh column, sort them, unique and count them, sort again but according to numerical order. At the bottom, you have the most used filter which is depth of coverage.
<p><b>Depth filter (DP)</b> removed the most sites, followed by <b>SOR</b> (strand bias) and <b>MQ</b> (mapping quality).</p>
 
<hr>
 
<h3>Q3 — How many sites remain after applying the mappability filter?</h3>
 
<p>First extract sites that passed all hard filters:</p>
 
<pre>
bcftools view -f PASS NA24694_hf.vcf.gz \
  | bgzip -c > NA24694_hf_pass.vcf.gz
</pre>


'''Q3'''
<p>Count variants that passed all filters:</p>


Initially, we isolate the ones that pass the filter:
<pre>
<pre>
bcftools view  -f PASS NA24694_hf.vcf.gz |bgzip -c  > NA24694_hf_pass.vcf.gz
bcftools view -H NA24694_hf_pass.vcf.gz | wc -l
bcftools view -H NA24694_hf_pass.vcf.gz |wc -l  
</pre>
</pre>
88594 total sites (SNPS+indels+multi-allelic).


Then we retain the sites using bedtools:
<p><b>88,594 total variants</b> (SNPs + indels + multi-allelic sites).</p>
 
<p>Now retain only variants in high-mappability regions:</p>
 
<pre>
<pre>
bedtools intersect -header -a NA24694_hf_pass.vcf.gz -b /home/databases/databases/GRCh38/filter99.bed.gz |bgzip -c > NA24694_hf_map99.vcf.gz
bedtools intersect -header \
  -a NA24694_hf_pass.vcf.gz \
  -b /home/databases/databases/GRCh38/filter99.bed.gz \
| bgzip -c > NA24694_hf_map99.vcf.gz
</pre>
</pre>
<p>Count remaining sites:</p>


<pre>
<pre>
bcftools view -H NA24694_hf_map99.vcf.gz |wc -l  
bcftools view -H NA24694_hf_map99.vcf.gz | wc -l
</pre>
</pre>
51624 total sites remain


<p><b>51,624 variants</b> remain after mappability filtering.</p>
<hr>
<h3>Q4 — Most common genomic category (snpEff)</h3>


<p>Run snpEff:</p>


'''Q4'''
Using:
<pre>
<pre>
java -jar /home/ctools/snpEff/snpEff.jar eff -dataDir /home/databases/databases/snpEff/ -htmlStats NA24694_hf.html GRCh38.99 NA24694_hf.vcf.gz  |bgzip -c > NA24694_hf_ann.vcf.gz
java -jar /home/ctools/snpEff/snpEff.jar eff \
  -dataDir /home/databases/databases/snpEff/ \
  -htmlStats NA24694_hf.html \
  GRCh38.99 \
  NA24694_hf.vcf.gz \
  | bgzip -c > NA24694_hf_ann.vcf.gz
</pre>
</pre>


In the HTML file you see:
<p>From the snpEff HTML output:</p>
Intron 64.368%  
 
<p><b>Intron variants = 64.368%</b></p>
 
<p><b>Answer:</b> Most variants fall in <b>intronic regions</b>.</p>
 
<hr>
 
<h3>Q5 — Variants that cause a codon change</h3>
 
<p>From the snpEff HTML report:</p>
 
<p><b>MISSENSE = 584 variants (44.242%)</b></p>


'''Q5'''
<p><b>Answer:</b> <b>584 missense mutations</b> are predicted to change a codon and alter the protein sequence.</p>
In the HTML file you see:


MISSENSE 584 44.242%
<hr>


So a total of 584 detecting mutations can have an impact on the protein sequence.
<p><b>End of Answers</b></p>

Latest revision as of 10:15, 26 November 2025

Answers

Q1 — How many sites were filtered out?

After running:

gatk VariantFiltration \
   -V NA24694.vcf.gz \
   -O NA24694_hf.vcf.gz \
   -filter "DP < 10.0"   --filter-name "DP" \
   -filter "QUAL < 30.0" --filter-name "QUAL30" \
   -filter "SOR > 3.0"   --filter-name "SOR3" \
   -filter "FS > 60.0"   --filter-name "FS60" \
   -filter "MQ < 40.0"   --filter-name "MQ40"

Count all filtered variants (non-PASS):

bcftools view -H NA24694_hf.vcf.gz | grep -v PASS | wc -l

4005 sites were filtered out (all variant types).

Count only SNPs filtered out:

bcftools view -H --type=snps NA24694_hf.vcf.gz | grep -v PASS | wc -l

2630 SNPs were filtered out.


Q2 — Which filter removed the most sites?

One approach:

bcftools view -H NA24694_hf.vcf.gz \
 | grep -v PASS \
 | cut -f7 \
 | sort \
 | uniq -c \
 | sort -n

This produces:

      5 FS60;SOR3
     30 DP;MQ40;SOR3
     74 MQ40;SOR3
    158 DP;SOR3
    197 DP;MQ40
    390 MQ40
   1340 SOR3
   1811 DP

Depth filter (DP) removed the most sites, followed by SOR (strand bias) and MQ (mapping quality).


Q3 — How many sites remain after applying the mappability filter?

First extract sites that passed all hard filters:

bcftools view -f PASS NA24694_hf.vcf.gz \
  | bgzip -c > NA24694_hf_pass.vcf.gz

Count variants that passed all filters:

bcftools view -H NA24694_hf_pass.vcf.gz | wc -l

88,594 total variants (SNPs + indels + multi-allelic sites).

Now retain only variants in high-mappability regions:

bedtools intersect -header \
  -a NA24694_hf_pass.vcf.gz \
  -b /home/databases/databases/GRCh38/filter99.bed.gz \
 | bgzip -c > NA24694_hf_map99.vcf.gz

Count remaining sites:

bcftools view -H NA24694_hf_map99.vcf.gz | wc -l

51,624 variants remain after mappability filtering.


Q4 — Most common genomic category (snpEff)

Run snpEff:

java -jar /home/ctools/snpEff/snpEff.jar eff \
   -dataDir /home/databases/databases/snpEff/ \
   -htmlStats NA24694_hf.html \
   GRCh38.99 \
   NA24694_hf.vcf.gz \
 | bgzip -c > NA24694_hf_ann.vcf.gz

From the snpEff HTML output:

Intron variants = 64.368%

Answer: Most variants fall in intronic regions.


Q5 — Variants that cause a codon change

From the snpEff HTML report:

MISSENSE = 584 variants (44.242%)

Answer: 584 missense mutations are predicted to change a codon and alter the protein sequence.


End of Answers