<h1>SNP calling exercise part 2</h1>
<h2>Filtering</h2>


<p>We have seen that the VCF contains some low-quality or unreliable variant calls. Before downstream analyses, we generally want to <b>remove poor-quality sites</b> or annotate them so they can be excluded later. In this exercise we explore how to apply <b>hard filters</b> and how to remove variants in regions of poor mappability.</p>


<p>Please use the VCF file generated in Part 1.</p>
<h3>Hard Filtering</h3>


<p>Soft filtering approaches (e.g. VQSR) attempt to statistically learn which variants are “true.” However, these approaches require large cohorts or population-level resources, which may not exist for many organisms or under-sampled populations. For this reason, we often fall back on <b>hard filtering</b>, i.e. applying fixed cutoffs.</p>


<p>Hard filtering is simple but may introduce bias if the filter correlates with variant type (e.g. heterozygous sites often have lower depth). Filters should be chosen thoughtfully.</p>


<p>We will use the following genomic mask file:</p>

<pre>
/home/databases/databases/GRCh38/mask99.bed.gz
</pre>

<p>This file is in the BED interval format, which stores genomic regions as (see the example below):</p>

<pre>
chromosome   start(0-based)   end(1-based)
</pre>


<ul>
   <li><b>0-based</b>: first base has coordinate 0</li>
   <li><b>1-based</b>: first base has coordinate 1</li>
</ul>
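<p>As a minimal illustration (the interval below is made up), a BED line such as:</p>

<pre>
chr1    99    200
</pre>

<p>covers 1-based positions 100–200: the 0-based start 99 is the 100th base, and the end coordinate is the 1-based position of the last base included in the interval.</p>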


<p>This mask contains genomic regions to <b>exclude</b> (often low-quality or repetitive regions). Because most genotypers do not recognize duplicated regions, combining hard filtering with mappability filters is best practice.</p>


<p>A typical hard-filtering command using GATK is:</p>

<pre>
gatk VariantFiltration \
   -V [INPUT VCF] \
   -O [OUTPUT VCF] \
   -filter "DP < 10.0"   --filter-name "DP" \
   -filter "QUAL < 30.0" --filter-name "QUAL30" \
   -filter "SOR > 3.0"   --filter-name "SOR3" \
   -filter "FS > 60.0"   --filter-name "FS60" \
   -filter "MQ < 40.0"   --filter-name "MQ40"
</pre>
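<p>For instance, assuming your VCF from Part 1 is named <code>NA24694.vcf.gz</code> (a hypothetical name; substitute your actual file) and naming the output to match the <code>NA24694_hf</code> files used later in this exercise, the filled-in command could look like:</p>

<pre>
# filenames are assumptions; adjust them to your own data
gatk VariantFiltration \
   -V NA24694.vcf.gz \
   -O NA24694_hf.vcf.gz \
   -filter "DP < 10.0"   --filter-name "DP" \
   -filter "QUAL < 30.0" --filter-name "QUAL30" \
   -filter "SOR > 3.0"   --filter-name "SOR3" \
   -filter "FS > 60.0"   --filter-name "FS60" \
   -filter "MQ < 40.0"   --filter-name "MQ40"
</pre>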
<p>Explanation of filters:</p>

<table class="wikitable">
<tr><th>Filter</th><th>Meaning</th></tr>
<tr><td><code>DP &lt; 10</code></td><td>Remove sites with &lt;10× coverage</td></tr>
<tr><td><code>QUAL &lt; 30</code></td><td>Remove sites where the variant quality is &lt;30 (variant QUAL ≠ genotype quality GQ — see <a href="https://gatk.broadinstitute.org/hc/en-us/articles/360035531392">Variant QUAL vs GQ</a>)</td></tr>
<tr><td><code>SOR &gt; 3.0</code></td><td>Remove sites with strong strand bias (see <a href="https://gatk.broadinstitute.org/hc/en-us/articles/360036361772-StrandOddsRatio">StrandOddsRatio</a>)</td></tr>
<tr><td><code>FS &gt; 60</code></td><td>Remove sites with significant strand bias based on Fisher’s exact test (see <a href="https://gatk.broadinstitute.org/hc/en-us/articles/360036361992-FisherStrand">FisherStrand</a>)</td></tr>
<tr><td><code>MQ &lt; 40</code></td><td>Remove sites supported by reads with low mapping quality</td></tr>
</table>


<p><b>Note:</b> No filter is perfect — you should progressively add filters, evaluate their impact, and ensure that you do not introduce unwanted biases.</p>


<h4>Q1</h4>
<p>How many sites were filtered out?
Sites that pass all filters have <code>PASS</code> in the 7th column. Use <code>grep</code> to count PASS vs non-PASS entries.</p>
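<p>One possible approach (assuming the hard-filtered VCF is named <code>NA24694_hf.vcf.gz</code>; adjust to your filename):</p>

<pre>
# count records whose FILTER column (7th field) is exactly PASS
zcat NA24694_hf.vcf.gz | grep -v "^#" | cut -f7 | grep -c "^PASS$"

# count records whose FILTER column is anything other than PASS
zcat NA24694_hf.vcf.gz | grep -v "^#" | cut -f7 | grep -vc "^PASS$"
</pre>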


<h4>Q2</h4>
<p>The 7th column contains the name(s) of the filters that failed.
Using <code>cut</code>, <code>sort</code>, and <code>uniq -c</code>, determine which filter removed the most sites.</p>
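<p>A sketch of such a pipeline (same filename assumption as above; note that sites failing several filters carry a semicolon-separated list, which is tallied as its own label):</p>

<pre>
# tally the values of the FILTER column, most frequent first
zcat NA24694_hf.vcf.gz | grep -v "^#" | cut -f7 | sort | uniq -c | sort -rn
</pre>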


<hr>


<h3>Filtering by Mappability</h3>


<p>Next, we remove variants that fall inside <b>low-mappability regions</b>, because reads cannot be uniquely mapped there and false positives are common.</p>


<p>Use <b>bedtools intersect</b> to retain only variants located in high-mappability intervals (≥99% unique mappability):</p>


<pre>
bedtools intersect -header \
   -a [INPUT VCF] \
   -b /home/databases/databases/GRCh38/filter99.bed.gz \
 | bgzip -c > [OUTPUT VCF]
</pre>

<p>Name your output:</p>

<pre>
NA24694_hf_map99.vcf.gz
</pre>
<p>The "99" refers to the proportion of synthetic reads that map uniquely at that position. Explanation: [http://lh3lh3.users.sourceforge.net/snpable.shtml SNPable / mappability].</p>


<h4>Q3</h4>
<p>How many variants remain after masking low-mappability regions?</p>
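<p>A quick way to count the remaining records:</p>

<pre>
# count all non-header lines in the mappability-filtered VCF
zcat NA24694_hf_map99.vcf.gz | grep -vc "^#"
</pre>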


<hr>


<h2>Annotation of Variants</h2>

<p>Next, we examine the <b>genomic context</b> of variants: intronic, exonic, intergenic, UTR, etc. We use <b>snpEff</b> for variant annotation.</p>


<pre>
java -jar /home/ctools/snpEff/snpEff.jar eff \
   -dataDir /home/databases/databases/snpEff/ \
   -htmlStats [OUTPUT HTML] \
   GRCh38.99 \
   [INPUT VCF] \
 | bgzip -c > [OUTPUT VCF]
</pre>


<ul>
   <li><code>-dataDir</code>: location of snpEff databases</li>
   <li><code>GRCh38.99</code>: genome version — must match the reference genome you used earlier</li>
</ul>


<p>Run snpEff on your hard-filtered VCF (before mappability filtering), naming the outputs as follows (an example invocation follows the list):</p>


<ul>
   <li>HTML report: <code>NA24694_hf.html</code></li>
   <li>Annotated VCF: <code>NA24694_hf_ann.vcf.gz</code></li>
</ul>
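<p>Assuming the hard-filtered VCF is named <code>NA24694_hf.vcf.gz</code> (a hypothetical name; use your own), the filled-in command could look like:</p>

<pre>
# input filename is an assumption; outputs match the names requested above
java -jar /home/ctools/snpEff/snpEff.jar eff \
   -dataDir /home/databases/databases/snpEff/ \
   -htmlStats NA24694_hf.html \
   GRCh38.99 \
   NA24694_hf.vcf.gz \
 | bgzip -c > NA24694_hf_ann.vcf.gz
</pre>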


<p>Open the HTML report:</p>


<pre>
firefox NA24694_hf.html
</pre>


<h4>Q4</h4>
<p>Which genomic region category contains the most variants (exon, intron, upstream, downstream, UTR, etc.)?</p>


<h4>Q5</h4>
<p>How many variants are predicted to cause a <b>codon change</b>?
(See the definitions at <a href="https://en.wikipedia.org/wiki/Point_mutation">Point mutation</a>.)</p>
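<p>Besides the report, one way to cross-check is to search the annotated VCF for the Sequence Ontology term snpEff uses for codon changes that alter the amino acid, <code>missense_variant</code> (this counts lines, so a site with several annotations is counted once):</p>

<pre>
zcat NA24694_hf_ann.vcf.gz | grep -v "^#" | grep -c "missense_variant"
</pre>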


<hr>


<p>Please find answers here: <a href="SNP_calling_exercise_part_2_answers">SNP_calling_exercise_part_2_answers</a></p>


<p><b>Congratulations — you finished the exercise!</b></p>
<p><b>Note:</b> When piping <code>bcftools view</code> into other tools, consider specifying the output type using:</p>

<pre>
-O {b|u|z|v}
</pre>


<p>This avoids unnecessary compression/decompression and speeds up workflows.</p>
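<p>A minimal sketch (filenames are hypothetical): stream uncompressed BCF (<code>-O u</code>) between commands and compress (<code>-O z</code>) only at the final step:</p>

<pre>
# keep only PASS sites, piping uncompressed BCF to the next command;
# the trailing "-" tells the second bcftools to read from stdin
bcftools view -f PASS -O u NA24694_hf.vcf.gz \
 | bcftools view -O z -o NA24694_hf_pass.vcf.gz -
</pre>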
