SNP calling exercise part 2
Filtering
We saw that the data contains some calls of poor quality. Ideally, we do not want to carry over these calls into downstream analyses. We will explore how to filter out genotype data.
Hard filtering
A major downside of soft filtering is that does not apply for samples for which we do not have a good representation of the genetic diversity. An alternative is to do hard filtering meaning filtering the variants according to predetermined cutoffs. This has the downside of potentially introducing bias if the filter is correlated with the type of variant (ex: if heterozygous sites have higher genotype quality).
First, we will consider the following mask file:
/home/databases/databases/GRCh38/mask99.bed.gz
It is in BED format which is for genomic intervals. These files are extensively used in next-generation sequencing analyses, have a look at the first lines, the format is:
[chromosome] [start coordinate (0-based)] [end coordinate (1-based)]
0-based means that the first base is at coordinates 0 (i.e 0 1 2 3 ...), 1-based means that the first base is at coordinate 1 (i.e 1 2 3 4...).
This mask file contains a set of genomic regions that you want to remove from downstream analyses. It is important to note that most genotypers do not take into account the fact that a genomic region can be duplicated. This is why it's that a mappability filter is also a good idea to use.
gatk VariantFiltration -V [INPUT VCF] -O [OUTPUT VCF] -filter "DP < 10.0" --filter-name "DP" -filter "QUAL < 30.0" --filter-name "QUAL30" -filter "SOR > 3.0" --filter-name "SOR3" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40"
This is what the different filter mean:
Filter | Meaning |
-filter "DP < 10.0" | sites with less than 10X depth of coverage are removed |
-filter "QUAL < 30.0" | sites with a variant quality less than 30 are removed (see the difference between variant quality and genotype quality (GQ) here ) |
-filter "SOR > 3.0" | sites with a strand odds ratio less than 3.0 are removed (strand bias is when a strand is favored over the other, read more here ) |
-filter "FS > 60.0" | sites with a Fisher's exact test (FS) for strand bias is greater than 60 are removed (read more about FS here ). |
-filter "MQ < 40.0" | sites where the median mapping quality of reads supporting is less than 40 are removed. |
In real life, there are no perfect filters, I suggest you progressively add them, measure their effectiveness and make sure that you do not introduce unwanted bias to your analyses.
Q11
How many sites have been filtered out? Remember that sites that passed the filter will have the string PASS as the seventh column, you can use grep with the trick we mentioned to remove lines with a specific string. How many SNPs were filtered out?
Q12
The filtering command leaves in the seventh column a string (specified via --filter-name above) to determine which filter failed sites that were removed. Can you use the command above and modify it using the commands like cut, sort and uniq to determine was the filter that filtered out the most sites? If you are starting to use UNIX, this question might be challenging.
Let's further remove sites that are in genomic regions of poor mappability. We can use bedtools which is a set of utilities to deal with BED files (merge, intersect, etc).
bedtools intersect -header -a [INPUT VCF] -b /home/databases/databases/GRCh38/filter99.bed.gz |bgzip -c > [OUTPUT VCF]
The "-header" just says print the header of [INPUT VCF]. Retain the sites that have passed the hard filtering that are contained in filter99.bed.gz and call the output NA24694_hf_map99.vcf.gz. The 99 is from the percentage of DNA fragments required to map uniquely at that particular position (read more here.
Q13
How many sites did you retain?
Annotation of variants
An interesting question is always would was the genomic context of the variants, are they in introns, exons, intergenic. We will use a program called snpEff to characterize the different genomic variants that we found. You can run the program as such:
java -jar /home/ctools/snpEff/snpEff.jar eff -dataDir /home/databases/databases/snpEff/ -htmlStats [OUTPUT HTML] GRCh38.99 [INPUT VCF] |bgzip -c > [OUTPUT VCF]
The -dataDir option specifies where to find the data. A human gene database was downloaded here: /home/databases/databases/snpEff/. "GRCh38.99" represents a specific version (hg38) of the human genome. As mentioned in previous exercises, be very careful to use the exact same version of the genome in your analyses.
Run it on file resulting from hard-filtering prior to the filtering by mappability (NA24694_hf.vcf.gz), create an output HTML named: NA24694_hf.html and a VCF output named: NA24694_hf_ann.vcf.gz.
Use firefox to open the HTML report:
firefox NA24694_hf.html
and answer the following questions:
Q14
What is the most common genomic region (exon, downstream, intron, UTR) of the variants we detected?
Q15
How many variants can lead to a codon change? See explanation about point mutations on Wikipedia
Please find answers here
Congratulations you finished the exercise!
Note: in these exercises and answers, sometimes piped the output of "bcftools view" into other programs, ideally you should use the flag:
-O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
and think about which one is appropriate for your situation whether you're storing data or piping to another program.