SNP calling exercise answers part 2
Answers
Q1 — How many sites were filtered out?
After running:
gatk VariantFiltration \ -V NA24694.vcf.gz \ -O NA24694_hf.vcf.gz \ -filter "DP < 10.0" --filter-name "DP" \ -filter "QUAL < 30.0" --filter-name "QUAL30" \ -filter "SOR > 3.0" --filter-name "SOR3" \ -filter "FS > 60.0" --filter-name "FS60" \ -filter "MQ < 40.0" --filter-name "MQ40"
Count all filtered variants (non-PASS):
bcftools view -H NA24694_hf.vcf.gz | grep -v PASS | wc -l
4005 sites were filtered out (all variant types).
Count only SNPs filtered out:
bcftools view -H --type=snps NA24694_hf.vcf.gz | grep -v PASS | wc -l
2630 SNPs were filtered out.
Q2 — Which filter removed the most sites?
One approach:
bcftools view -H NA24694_hf.vcf.gz \ | grep -v PASS \ | cut -f7 \ | sort \ | uniq -c \ | sort -n
This produces:
5 FS60;SOR3
30 DP;MQ40;SOR3
74 MQ40;SOR3
158 DP;SOR3
197 DP;MQ40
390 MQ40
1340 SOR3
1811 DP
Depth filter (DP) removed the most sites, followed by SOR (strand bias) and MQ (mapping quality).
Q3 — How many sites remain after applying the mappability filter?
First extract sites that passed all hard filters:
bcftools view -f PASS NA24694_hf.vcf.gz \ | bgzip -c > NA24694_hf_pass.vcf.gz
Count variants that passed all filters:
bcftools view -H NA24694_hf_pass.vcf.gz | wc -l
88,594 total variants (SNPs + indels + multi-allelic sites).
Now retain only variants in high-mappability regions:
bedtools intersect -header \ -a NA24694_hf_pass.vcf.gz \ -b /home/databases/databases/GRCh38/filter99.bed.gz \ | bgzip -c > NA24694_hf_map99.vcf.gz
Count remaining sites:
bcftools view -H NA24694_hf_map99.vcf.gz | wc -l
51,624 variants remain after mappability filtering.
Q4 — Most common genomic category (snpEff)
Run snpEff:
java -jar /home/ctools/snpEff/snpEff.jar eff \ -dataDir /home/databases/databases/snpEff/ \ -htmlStats NA24694_hf.html \ GRCh38.99 \ NA24694_hf.vcf.gz \ | bgzip -c > NA24694_hf_ann.vcf.gz
From the snpEff HTML output:
Intron variants = 64.368%
Answer: Most variants fall in intronic regions.
Q5 — Variants that cause a codon change
From the snpEff HTML report:
MISSENSE = 584 variants (44.242%)
Answer: 584 missense mutations are predicted to change a codon and alter the protein sequence.
End of Answers