SNP calling exercise answers part 2
Q1
First running:
gatk VariantFiltration -V NA24694.vcf.gz -O NA24694_hf.vcf.gz -filter "DP < 10.0" --filter-name "DP" -filter "QUAL < 30.0" --filter-name "QUAL30" -filter "SOR > 3.0" --filter-name "SOR3" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40"
and
bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |wc -l
Gives us 4005 sites.
bcftools view -H --type=snps NA24694_hf.vcf.gz |grep -v PASS |wc -l
2630 SNPs
Q2
One possibility is:
bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |cut -f 7 |sort |uniq -c |sort -n
5 FS60;SOR3
30 DP;MQ40;SOR3
74 MQ40;SOR3
158 DP;SOR3
197 DP;MQ40
390 MQ40
1340 SOR3
1811 DP
This says remove all lines with the string "PASS", extract the seventh column, sort them, unique and count them, sort again but according to numerical order. At the bottom, you have the most used filter which is depth of coverage.
Q3
Initially, we isolate the ones that pass the filter:
bcftools view -f PASS NA24694_hf.vcf.gz |bgzip -c > NA24694_hf_pass.vcf.gz bcftools view -H NA24694_hf_pass.vcf.gz |wc -l
88594 total sites (SNPS+indels+multi-allelic).
Then we retain the sites using bedtools:
bedtools intersect -header -a NA24694_hf_pass.vcf.gz -b /home/databases/databases/GRCh38/filter99.bed.gz |bgzip -c > NA24694_hf_map99.vcf.gz
bcftools view -H NA24694_hf_map99.vcf.gz |wc -l
51624 total sites remain
Q4 Using:
java -jar /home/ctools/snpEff/snpEff.jar eff -dataDir /home/databases/databases/snpEff/ -htmlStats NA24694_hf.html GRCh38.99 NA24694_hf.vcf.gz |bgzip -c > NA24694_hf_ann.vcf.gz
In the HTML file you see: Intron 64.368%
Q5 In the HTML file you see:
MISSENSE 584 44.242%
So a total of 584 detecting mutations can have an impact on the protein sequence.