<h1>SNP calling exercise part 2</h1>
<h2>Filtering</h2>


<p>We have seen that the VCF contains some low-quality or unreliable variant calls. Before downstream analyses, we generally want to <b>remove poor-quality sites</b> or annotate them so they can be excluded later. In this exercise we explore how to apply <b>hard filters</b> and then how to remove variants in regions of poor mappability.</p>


<p>Please use the VCF file generated in Part 1.</p>


<h3>Hard Filtering</h3>


<p>Soft filtering approaches (e.g. VQSR) attempt to statistically learn which variants are “true.” However, soft filtering requires large cohorts or population-level resources, which may not exist for many organisms or under-sampled populations. For this reason, we often fall back on <b>hard filtering</b>, i.e. applying fixed cutoffs.</p>


<p>Hard filtering is simple but may introduce bias if the filter correlates with variant type (e.g. heterozygous sites often have lower depth). Therefore filters should be chosen thoughtfully.</p>


<p>We will use the following genomic mask file:</p>
 


<pre>
/home/databases/databases/GRCh38/mask99.bed.gz
</pre>


<p>This file is in the <b>BED interval format</b>, which stores genomic regions as:</p>


<pre>
chromosome   start(0-based)   end(1-based)
</pre>


<ul>
  <li><b>0-based</b> means the first base is position 0 (0,1,2,…)</li>
  <li><b>1-based</b> means the first base is position 1 (1,2,3,…)</li>
</ul>
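<p>To get a feel for the format, peek at the first lines of the mask (a quick sketch using standard command-line tools):</p>

<pre>
# print the first three intervals of the bgzipped BED mask
zcat /home/databases/databases/GRCh38/mask99.bed.gz | head -3
</pre>

<p>A line such as <code>chr1  13200  13500</code> (hypothetical values) covers 1-based positions 13201 to 13500.</p>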


<p>This mask contains genomic regions to <b>exclude</b> (often low-quality or repetitive regions). Most genotypers do not account for duplicated regions, so combining hard filtering with mappability filters is best practice.</p>


<p>A typical hard-filtering command using GATK is:</p>


<pre>
gatk VariantFiltration \
  -V [INPUT VCF] \
  -O [OUTPUT VCF] \
  -filter "DP < 10.0"   --filter-name "DP" \
   -filter "QUAL < 30.0" --filter-name "QUAL30" \
  -filter "SOR > 3.0"   --filter-name "SOR3" \
  -filter "FS > 60.0"   --filter-name "FS60" \
  -filter "MQ < 40.0"   --filter-name "MQ40"
</pre>
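<p>For example, if your VCF from Part 1 is called <code>NA24694.vcf.gz</code> (a hypothetical name; substitute your own), the filled-in command could look like the sketch below. Naming the output <code>NA24694_hf.vcf.gz</code> matches the file name used later in this exercise:</p>

<pre>
# flag failing sites in the FILTER column; VariantFiltration annotates
# rather than removes them, so both PASS and failed sites remain in the file
gatk VariantFiltration \
  -V NA24694.vcf.gz \
  -O NA24694_hf.vcf.gz \
  -filter "DP < 10.0"   --filter-name "DP" \
  -filter "QUAL < 30.0" --filter-name "QUAL30" \
  -filter "SOR > 3.0"   --filter-name "SOR3" \
  -filter "FS > 60.0"   --filter-name "FS60" \
  -filter "MQ < 40.0"   --filter-name "MQ40"
</pre>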


<p>Explanation of filters:</p>


<table class="wikitable">
<tr><th>Filter</th><th>Meaning</th></tr>
<tr><td><code>DP &lt; 10</code></td><td>Remove sites with &lt;10× coverage</td></tr>
<tr><td><code>QUAL &lt; 30</code></td><td>Remove sites where variant quality &lt;30 
(variant QUAL ≠ genotype quality GQ — see <a href="https://gatk.broadinstitute.org/hc/en-us/articles/360035531392?id=7258">here</a>)</td></tr>
<tr><td><code>SOR &gt; 3.0</code></td><td>Remove sites with strong strand bias (see <a href="https://gatk.broadinstitute.org/hc/en-us/articles/360036361772-StrandOddsRatio">here</a>)</td></tr>
<tr><td><code>FS &gt; 60</code></td><td>Remove sites with significant strand bias based on Fisher’s exact test (<a href="https://gatk.broadinstitute.org/hc/en-us/articles/360036361992-FisherStrand">FS explanation</a>)</td></tr>
<tr><td><code>MQ &lt; 40</code></td><td>Remove variants supported by reads with low mapping quality</td></tr>
</table>


<p><b>Important:</b> No filter is perfect — always assess how each filter affects variant counts and potential biases.</p>


<h4>Q1</h4>
<p>How many sites were filtered out? <br>
Hint: sites that pass all filters have <code>PASS</code> in the 7th column. Use <code>grep</code> to count both PASS and non-PASS entries.</p>
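<p>One possible way to count (a sketch, assuming the hard-filtered output is named <code>NA24694_hf.vcf.gz</code> as above):</p>

<pre>
# count sites that passed every filter (7th column is PASS)
zcat NA24694_hf.vcf.gz | grep -v "^#" | cut -f 7 | grep -c "PASS"
# count sites that failed at least one filter
zcat NA24694_hf.vcf.gz | grep -v "^#" | cut -f 7 | grep -c -v "PASS"
</pre>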


<h4>Q2</h4>
<p>The 7th column contains the filter name(s) for variants that failed. 
Using <code>cut</code>, <code>sort</code>, and <code>uniq -c</code>, determine which filter removed the largest number of sites.</p>
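<p>A possible pipeline (a sketch; note that a site failing several filters lists all of their names in the 7th column, separated by semicolons):</p>

<pre>
# tabulate how often each FILTER value occurs, most frequent first
zcat NA24694_hf.vcf.gz | grep -v "^#" | cut -f 7 | sort | uniq -c | sort -rn
</pre>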


<hr>


<h3>Filtering by mappability</h3>


<p>Next, we remove variants that fall inside <b>low-mappability regions</b>. These regions cannot uniquely map short reads and often produce false positives.</p>


<p>Use <b>bedtools intersect</b> to retain only variants in high-mappability intervals (≥99% unique mappability):</p>


<pre>
bedtools intersect -header \
  -a [INPUT VCF] \
  -b /home/databases/databases/GRCh38/filter99.bed.gz \
| bgzip -c > [OUTPUT VCF]
</pre>


<p>The <code>-header</code> option simply tells bedtools to also print the header of the input VCF. bedtools is a suite of utilities for working with BED files (merge, intersect, etc.). The "99" refers to the percentage of DNA fragments required to map uniquely at a given position (read more <a href="http://lh3lh3.users.sourceforge.net/snpable.shtml">here</a>).</p>

<p>Retain the sites that <b>passed the hard filtering</b> and name your output:</p>


<pre>
NA24694_hf_map99.vcf.gz
</pre>
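<p>Putting it together, one possible invocation (a sketch: subsetting to PASS sites first with <code>bcftools view -f PASS</code> is our reading of the instructions, not a prescribed step):</p>

<pre>
# keep PASS sites, then intersect with the high-mappability intervals
bcftools view -f PASS NA24694_hf.vcf.gz \
  | bedtools intersect -header -a stdin -b /home/databases/databases/GRCh38/filter99.bed.gz \
  | bgzip -c > NA24694_hf_map99.vcf.gz
</pre>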


The "-header" just says print the header of [INPUT VCF]. Retain the sites that have '''passed the hard filtering''' that are contained in filter99.bed.gz and call the output NA24694_hf_map99.vcf.gz. The 99 is from the percentage of DNA fragments required to map uniquely at that particular position (read more [http://lh3lh3.users.sourceforge.net/snpable.shtml here].
<h4>Q3</h4>


<p>How many variants remain after masking low-mappability regions?</p>
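<p>One way to count the retained records (a sketch, assuming the bgzipped output name above):</p>

<pre>
# count non-header lines in the masked VCF
zcat NA24694_hf_map99.vcf.gz | grep -c -v "^#"
</pre>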


<hr>


<h2>Annotation of variants</h2>


<p>We now want to characterize the biological context of the variants:
are they intronic? Exonic? In a UTR? Upstream of a promoter?</p>


<p>We will use <b>snpEff</b> for genomic annotation. You can run the program as follows:</p>


<pre>
java -jar /home/ctools/snpEff/snpEff.jar eff \
  -dataDir /home/databases/databases/snpEff/ \
  -htmlStats [OUTPUT HTML] \
  GRCh38.99 \
  [INPUT VCF] \
  | bgzip -c > [OUTPUT VCF]
</pre>


<ul>
  <li><code>-dataDir</code>: location of the snpEff reference databases (a human gene database has been downloaded to <code>/home/databases/databases/snpEff/</code>)</li>
  <li><code>GRCh38.99</code>: genome version — MUST match your reference genome exactly</li>
</ul>
 
<p>Run this on the file resulting from hard filtering, prior to the filtering by mappability (<code>NA24694_hf.vcf.gz</code>), and name the outputs:</p>


<ul>
  <li>HTML output: <code>NA24694_hf.html</code></li>
  <li>Annotated VCF: <code>NA24694_hf_ann.vcf.gz</code></li>
</ul>


<p>View the HTML report:</p>


<pre>
firefox NA24694_hf.html
</pre>


<h4>Q4</h4>
 
<p>According to the snpEff report, which genomic category contains the most variants (exon, intron, downstream, UTR, etc.)?</p>


<h4>Q5</h4>
<p>How many variants are predicted to cause a <b>codon change</b>? 
(See definitions at <a href="https://en.wikipedia.org/wiki/Point_mutation">Point mutation</a>.)</p>


<hr>


<p>Please find answers here: <a href="SNP_calling_exercise_part_2_answers">SNP_calling_exercise_part_2_answers</a></p>


<p><b>Congratulations — you finished the exercise!</b></p>


<p><i>Note:</i> When piping <code>bcftools view</code> into other tools, consider specifying the output type using:</p>


<pre>
-O, --output-type <b|u|z|v>    b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
</pre>
<p>Think about which type is appropriate for your situation: compressed VCF/BCF when storing data, uncompressed BCF when piping into another program. This avoids unnecessary compression/decompression steps and speeds up workflows.</p>
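<p>For instance (a sketch with hypothetical file names), stream uncompressed BCF between commands and compress only at the end:</p>

<pre>
# -Ou: fast uncompressed BCF for the pipe; -Oz: compressed VCF for storage
bcftools view -Ou -f PASS NA24694_hf.vcf.gz \
  | bcftools view -Oz -o NA24694_hf_pass.vcf.gz -
</pre>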
