SNP calling exercise part 2 answers

From 22126
Jump to navigation Jump to search

Q1

First running:

gatk VariantFiltration     -V NA24694.vcf.gz  -O NA24694_hf.vcf.gz   -filter "DP < 10.0" --filter-name "DP"    -filter "QUAL < 30.0" --filter-name "QUAL30"     -filter "SOR > 3.0" --filter-name "SOR3"     -filter "FS > 60.0" --filter-name "FS60"     -filter "MQ < 40.0" --filter-name "MQ40"      
  

and

 bcftools view  -H NA24694_hf.vcf.gz |grep -v PASS |wc -l 

Gives us 4005 sites.

 bcftools view  -H  --type=snps NA24694_hf.vcf.gz |grep -v PASS |wc -l 

2630 SNPs


Q2

One possibility is:

 bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |cut -f 7  |sort |uniq -c  |sort -n 
      5 FS60;SOR3
     30 DP;MQ40;SOR3
     74 MQ40;SOR3
    158 DP;SOR3
    197 DP;MQ40
    390 MQ40
   1340 SOR3
   1811 DP

This says remove all lines with the string "PASS", extract the seventh column, sort them, unique and count them, sort again but according to numerical order. At the bottom, you have the most used filter which is depth of coverage.

Q3

Initially, we isolate the ones that pass the filter:

bcftools view  -f PASS NA24694_hf.vcf.gz |bgzip -c  > NA24694_hf_pass.vcf.gz
bcftools view -H NA24694_hf_pass.vcf.gz |wc -l 

88594 total sites (SNPS+indels+multi-allelic).

Then we retain the sites using bedtools:

bedtools intersect -header -a  NA24694_hf_pass.vcf.gz  -b /home/databases/databases/GRCh38/filter99.bed.gz |bgzip -c > NA24694_hf_map99.vcf.gz
bcftools view -H NA24694_hf_map99.vcf.gz |wc -l 

51624 total sites remain


Q4 Using:

java -jar /home/ctools/snpEff/snpEff.jar  eff  -dataDir /home/databases/databases/snpEff/  -htmlStats NA24694_hf.html GRCh38.99 NA24694_hf.vcf.gz  |bgzip -c > NA24694_hf_ann.vcf.gz

In the HTML file you see: Intron 64.368%

Q5 In the HTML file you see:

MISSENSE 584 44.242%

So a total of 584 detecting mutations can have an impact on the protein sequence.