SNP calling exercise part 2
Filtering
We have seen that the VCF contains some low-quality or unreliable variant calls. Before downstream analyses, we generally want to remove poor-quality sites or annotate them so they can be excluded later. In this exercise we explore how to apply hard filters and then how to remove variants in regions of poor mappability.
Please use the VCF file generated in Part 1.
Hard Filtering
Soft filtering approaches (e.g. VQSR) attempt to statistically learn which variants are “true.” However, soft filtering requires large cohorts or population-level resources, which may not exist for many organisms or under-sampled populations. For this reason, we often fall back on hard filtering, i.e. applying fixed cutoffs.
Hard filtering is simple but may introduce bias if the filter correlates with variant type (e.g. heterozygous sites often have lower depth). Therefore filters should be chosen thoughtfully.
We will use the following genomic mask file:
/home/databases/databases/GRCh38/mask99.bed.gz
This file is in the BED interval format, which stores genomic regions as:
chromosome start(0-based) end(1-based)
- 0-based means the first base is position 0 (0,1,2,…)
- 1-based means the first base is position 1 (1,2,3,…)
This mask contains genomic regions to exclude (often low-quality or repetitive regions). Most genotypers do not account for duplicated regions, so combining hard filtering with mappability filters is best practice.
A typical hard-filtering command using GATK is:
gatk VariantFiltration \ -V [INPUT VCF] \ -O [OUTPUT VCF] \ -filter "DP < 10.0" --filter-name "DP" \ -filter "QUAL < 30.0" --filter-name "QUAL30" \ -filter "SOR > 3.0" --filter-name "SOR3" \ -filter "FS > 60.0" --filter-name "FS60" \ -filter "MQ < 40.0" --filter-name "MQ40"
Explanation of filters:
| Filter | Meaning |
|---|---|
DP < 10 | Remove sites with <10× coverage |
QUAL < 30 | Remove sites where variant quality <30 (variant QUAL ≠ genotype quality GQ — see <a href="https://gatk.broadinstitute.org/hc/en-us/articles/360035531392?id=7258">here</a>) |
SOR > 3.0 | Remove sites with strong strand bias (see <a href="https://gatk.broadinstitute.org/hc/en-us/articles/360036361772-StrandOddsRatio">here</a>) |
FS > 60 | Remove sites with significant strand bias based on Fisher’s exact test (<a href="https://gatk.broadinstitute.org/hc/en-us/articles/360036361992-FisherStrand">FS explanation</a>) |
MQ < 40 | Remove variants supported by reads with low mapping quality |
Important: No filter is perfect — always assess how each filter affects variant counts and potential biases.
Q1
How many sites were filtered out?
Hint: sites that pass all filters have PASS in the 7th column. Use grep to count both PASS and non-PASS entries.
Q2
The 7th column contains the filter name(s) for variants that failed.
Using cut, sort, and uniq -c, determine which filter removed the largest number of sites.
Filtering by mappability
Next, we remove variants that fall inside low-mappability regions. These regions cannot uniquely map short reads and often produce false positives.
Use bedtools intersect to retain only variants in high-mappability intervals (≥99% unique mappability):
bedtools intersect -header \ -a [INPUT VCF] \ -b /home/databases/databases/GRCh38/filter99.bed.gz \ | bgzip -c > [OUTPUT VCF]
Name your output:
NA24694_hf_map99.vcf.gz
Q3
How many variants remain after masking low-mappability regions?
Annotation of variants
We now want to characterize the biological context of variants: Are they intronic? Exonic? UTR? Upstream of a promoter?
We will use snpEff for genomic annotation.
java -jar /home/ctools/snpEff/snpEff.jar eff \ -dataDir /home/databases/databases/snpEff/ \ -htmlStats [OUTPUT HTML] \ GRCh38.99 \ [INPUT VCF] \ | bgzip -c > [OUTPUT VCF]
-dataDir: location of snpEff reference databasesGRCh38.99: genome version — MUST match your reference genome exactly
Run this on your hard-filtered VCF (before mappability filtering):
- HTML output:
NA24694_hf.html - Annotated VCF:
NA24694_hf_ann.vcf.gz
View the HTML report:
firefox NA24694_hf.html
Q4
According to the snpEff report, which genomic category contains the most variants (exon, intron, downstream, UTR, etc.)?
Q5
How many variants are predicted to cause a codon change? (See definitions at <a href="https://en.wikipedia.org/wiki/Point_mutation">Point mutation</a>.)
Please find answers here: <a href="SNP_calling_exercise_part_2_answers">SNP_calling_exercise_part_2_answers</a>
Congratulations — you finished the exercise!
Note: When piping bcftools view into other tools, consider specifying the output type using:
-O {b|u|z|v}
This avoids unnecessary compression/decompression steps and speeds up workflows.