SNP calling exercise answers part 2

From 22126
Revision as of 10:15, 26 November 2025 by Mick (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Answers

Q1 — How many sites were filtered out?

After running:

gatk VariantFiltration \
   -V NA24694.vcf.gz \
   -O NA24694_hf.vcf.gz \
   -filter "DP < 10.0"   --filter-name "DP" \
   -filter "QUAL < 30.0" --filter-name "QUAL30" \
   -filter "SOR > 3.0"   --filter-name "SOR3" \
   -filter "FS > 60.0"   --filter-name "FS60" \
   -filter "MQ < 40.0"   --filter-name "MQ40"

Count all filtered variants (non-PASS):

bcftools view -H NA24694_hf.vcf.gz | grep -v PASS | wc -l

4005 sites were filtered out (all variant types).

Count only SNPs filtered out:

bcftools view -H --type=snps NA24694_hf.vcf.gz | grep -v PASS | wc -l

2630 SNPs were filtered out.


Q2 — Which filter removed the most sites?

One approach:

bcftools view -H NA24694_hf.vcf.gz \
 | grep -v PASS \
 | cut -f7 \
 | sort \
 | uniq -c \
 | sort -n

This produces:

      5 FS60;SOR3
     30 DP;MQ40;SOR3
     74 MQ40;SOR3
    158 DP;SOR3
    197 DP;MQ40
    390 MQ40
   1340 SOR3
   1811 DP

Depth filter (DP) removed the most sites, followed by SOR (strand bias) and MQ (mapping quality).


Q3 — How many sites remain after applying the mappability filter?

First extract sites that passed all hard filters:

bcftools view -f PASS NA24694_hf.vcf.gz \
  | bgzip -c > NA24694_hf_pass.vcf.gz

Count variants that passed all filters:

bcftools view -H NA24694_hf_pass.vcf.gz | wc -l

88,594 total variants (SNPs + indels + multi-allelic sites).

Now retain only variants in high-mappability regions:

bedtools intersect -header \
  -a NA24694_hf_pass.vcf.gz \
  -b /home/databases/databases/GRCh38/filter99.bed.gz \
 | bgzip -c > NA24694_hf_map99.vcf.gz

Count remaining sites:

bcftools view -H NA24694_hf_map99.vcf.gz | wc -l

51,624 variants remain after mappability filtering.


Q4 — Most common genomic category (snpEff)

Run snpEff:

java -jar /home/ctools/snpEff/snpEff.jar eff \
   -dataDir /home/databases/databases/snpEff/ \
   -htmlStats NA24694_hf.html \
   GRCh38.99 \
   NA24694_hf.vcf.gz \
 | bgzip -c > NA24694_hf_ann.vcf.gz

From the snpEff HTML output:

Intron variants = 64.368%

Answer: Most variants fall in intronic regions.


Q5 — Variants that cause a codon change

From the snpEff HTML report:

MISSENSE = 584 variants (44.242%)

Answer: 584 missense mutations are predicted to change a codon and alter the protein sequence.


End of Answers