SNP calling exercise part 2 answers
Q1
First running:
gatk VariantFiltration -V NA24694.vcf.gz -O NA24694_hf.vcf.gz -filter "DP < 10.0" --filter-name "DP" -filter "QUAL < 30.0" --filter-name "QUAL30" -filter "SOR > 3.0" --filter-name "SOR3" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40"
and
bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |wc -l
Gives us 4005 sites.
bcftools view -H --type=snps NA24694_hf.vcf.gz |grep -v PASS |wc -l
2630 SNPs
Q2
One possibility is:
bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |cut -f 7 |sort |uniq -c |sort -n 5 FS60;SOR3 30 DP;MQ40;SOR3 74 MQ40;SOR3 158 DP;SOR3 197 DP;MQ40 390 MQ40 1340 SOR3 1811 DP
This says remove all lines with the string "PASS", extract the seventh column, sort them, unique and count them, sort again but according to numerical order. At the bottom, you have the most used filter which is depth of coverage.
Q3
Initially, we isolate the ones that pass the filter:
bcftools view -f PASS NA24694_hf.vcf.gz |bgzip -c > NA24694_hf_pass.vcf.gz bcftools view -H NA24694_hf_pass.vcf.gz |wc -l
88594 total sites (SNPS+indels+multi-allelic).
Then we retain the sites using bedtools:
bedtools intersect -header -a NA24694_hf_pass.vcf.gz -b /home/databases/databases/GRCh38/filter99.bed.gz |bgzip -c > NA24694_hf_map99.vcf.gz
bcftools view -H NA24694_hf_map99.vcf.gz |wc -l
51624 total sites remain
Q4 Using:
java -jar /home/ctools/snpEff/snpEff.jar eff -dataDir /home/databases/databases/snpEff/ -htmlStats NA24694_hf.html GRCh38.99 NA24694_hf.vcf.gz |bgzip -c > NA24694_hf_ann.vcf.gz
In the HTML file you see: Intron 64.368%
Q5 In the HTML file you see:
MISSENSE 584 44.242%
So a total of 584 detecting mutations can have an impact on the protein sequence.