SNP calling exercise part 1: Difference between revisions

From 22126
Jump to navigation Jump to search
No edit summary
No edit summary
Line 1: Line 1:
<H2>Overview</H2>
<h2>Overview</h2>


First:
<p>In this exercise you will perform basic variant calling and start exploring VCF files. You will:</p>
<OL>
<LI>Navigate to your home directory:
<LI>Create a directory called "variant_call"
<LI>Navigate to the directory you just created.
</OL>


We will:
<ol>
<OL>
  <li>Run a germline variant caller on whole-genome sequencing data</li>
<LI>Genotype some whole-genome sequencing data.
  <li>Get acquainted with VCF and gVCF formats</li>
<LI>Get acquainted with VCF files
  <li>Count and subset variants using command-line tools</li>
<LI>Soft filtering
  <li>Compare “known” vs “novel” variants</li>
<LI>Hard filtering
</ol>
<LI> Annotation of variants
</OL>


----
<p>First:</p>
<ol>
  <li>Navigate to your home directory</li>
  <li>Create a directory called <code>variant_call</code></li>
  <li>Enter the <code>variant_call</code> directory</li>
</ol>


<H2>Genotyping</H2>
<hr>
 
<h2>Genotyping</h2>
 
<p>We will genotype chromosome 20 from a BAM file that has been pre-processed (sorted, duplicate-marked, etc.)</p>
 
<p>The sample is a <a href="https://en.wikipedia.org/wiki/Han_Chinese">Han Chinese</a> male, sequenced to approximately 24.6× coverage.</p>


We will genotype a chromosome from a BAM file that has been processed using the steps we detailed before. It is from a [https://en.wikipedia.org/wiki/Han_Chinese Han Chinese] Male with a depth of coverage of 24.6X.
<pre>
<pre>
/home/projects/22126_NGS/exercises/snp_calling/NA24694.bam
/home/projects/22126_NGS/exercises/snp_calling/NA24694.bam
</pre>
</pre>


It has been indexed. We will first use GATK's HaplotypeCaller, the command-line look something like this:
<p>The BAM file is already indexed.</p>
 
<p>We will use <b>GATK HaplotypeCaller</b> to generate a gVCF file. A typical command looks like this:</p>


<pre>
<pre>
gatk --java-options "-Xmx10g" HaplotypeCaller   -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa   -I [INPUT BAM] -L chr20 -O [OUTPUT ] --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF  
gatk --java-options "-Xmx10g" HaplotypeCaller \
    -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    -I [INPUT_BAM] \
    -L chr20 \
    -O [OUTPUT_GVCF] \
    --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
    -ERC GVCF
</pre>
</pre>


" -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa " specifies a reference, "-L chr20" specifies only to consider chromosome 20, "-dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz ", add annotation from SNPs which are known to exist in the human population (see [https://www.ncbi.nlm.nih.gov/snp/ dbSNP]). We should point out that the variation coming from Eurasians is more extensively represented in this database. However, the most genetically diverse populations are found in [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1740-1 Africa]. Finally, " -ERC GVCF" will generate a GVCF (Genomic Variant Call Format) which essentially contains all sites instead of just variant sites. The input is specified above, you can call the output: NA24694.gvcf.gz
<p>Explanation of the key options:</p>
<ul>
  <li><code>-R</code> – reference genome (GRCh38)</li>
  <li><code>-I</code> – input BAM file (here: <code>NA24694.bam</code>)</li>
  <li><code>-L chr20</code> – restrict calling to chromosome 20</li>
  <li><code>--dbsnp</code> – annotate with known variants from dbSNP</li>
  <li><code>-ERC GVCF</code> – emit a gVCF (includes both variant and non-variant blocks)</li>
</ul>


The command takes between '''20 to 30 minutes to run''', so feel free to take a break. Or feel free to copy the one I generated:
<p>Suggested output name:</p>


<pre>
<pre>
/home/projects/22126_NGS/exercises/snp_calling/NA24694.gvcf.gz
NA24694.gvcf.gz
</pre>
</pre>


We will have a brief look at the file:
<p>This command may take a while to complete. If you are short on time, you can use the precomputed file instead:</p>


<pre>
<pre>
zcat NA24694.gvcf.gz|less -S
/home/projects/22126_NGS/exercises/snp_calling/NA24694.gvcf.gz
</pre>
</pre>


You will see several lines starting with '#', this is the header portion, scroll down to see the calls. The first 5 columns represent, the name of the chromosome, the coordinate, snp ID (from dbSNP), the reference base and the alternative base.  You should see a mix of sites, some with the mention <NON_REF> in the 5th column which corresponds to the alternative base, those sites are probably invariant, and others where there is a variant in the fifth column.
<p>Take a quick look at the gVCF:</p>


Prior to running GATK, we need to index it:
<pre>
<pre>
tabix -f -p vcf [VCF to index]
zcat NA24694.gvcf.gz | less -S
</pre>
</pre>


Next, we will use the preliminary variant to actually generate only variant sites:
<p>Notes:</p>
<ul>
  <li>Lines starting with <code>#</code> form the header.</li>
  <li>Data lines have at least 10 columns. The first five are:
    <ul>
      <li>CHROM – chromosome name</li>
      <li>POS – genomic coordinate</li>
      <li>ID – variant identifier (e.g. dbSNP ID, or <code>.</code> if unknown)</li>
      <li>REF – reference allele</li>
      <li>ALT – alternate allele(s)</li>
    </ul>
  </li>
  <li>In gVCFs, you will often see <code>&lt;NON_REF&gt;</code> as the ALT allele for invariant blocks.</li>
</ul>
 
<h3>Indexing the gVCF</h3>
 
<p>Before using the gVCF as input for other tools, index it with <b>tabix</b>:</p>
 
<pre>
<pre>
gatk  GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V [INPUT GVCF]  -O [OUTPUT VCF] -L chr20 --dbsnp  /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
tabix -f -p vcf [INPUT_GVCF]
</pre>
</pre>


In your case, the input is the file you generated above and the output can be called NA24694.vcf.gz.
<p>This creates an index file with extension <code>.tbi</code>, allowing fast random access by position.</p>
This command takes less than 5 minutes to run.
 
<h3>Genotyping the gVCF (producing a standard VCF)</h3>
 
<p>Next, we convert the gVCF into a standard VCF with genotypes only at variant sites using <b>GATK GenotypeGVCFs</b>:</p>


Then let us index the file with the same tabix command:
<pre>
<pre>
tabix -f -p vcf [VCF to index]
gatk GenotypeGVCFs \
    -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    -V [INPUT_GVCF] \
    -O [OUTPUT_VCF] \
    -L chr20 \
    --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
</pre>
</pre>


Just like "samtools index" can allow us to create an index to retrieve portions of a BAM file, tabix is another utility that allows us to retrieve portions of a GVCF file.  
<p>Suggested output name:</p>
 
<pre>
NA24694.vcf.gz
</pre>


<H3> Get acquainted with VCF files </H3>
<p>This step is usually faster than HaplotypeCaller.</p>


'''Q1'''
<p>Index the VCF with tabix:</p>


Using:
<pre>
<pre>
bcftools stats  [input vcf]
tabix -f -p vcf NA24694.vcf.gz
</pre>
</pre>
How many SNPs do you have? (hint: find the line with: "SN 0 number of SNPs")
 
<p>As with BAM indices, the VCF index allows fast region-based queries, but here using tabix rather than samtools.</p>
 
<hr>
 
<h3>Getting Acquainted with VCF Files</h3>
 
<p><b>Q1.</b> Using:</p>
 
<pre>
bcftools stats [INPUT_VCF]
</pre>
 
<p>Find the line that starts with:</p>
 
<pre>
SN   0   number of SNPs
</pre>
 
<p>How many SNPs are present in your VCF?</p>
 
<hr>
 
<p>You can query specific regions of the VCF using tabix:</p>


<pre>
<pre>
tabix -f -p vcf [VCF to index]
tabix [INPUT_VCF] [CHROM]:[START]-[END]
</pre>
</pre>


The index will be stored as a "tbi" file. Then, the VCF file can be queried:
<p>For a single coordinate:</p>


<pre>
<pre>
tabix [VCF to index] [CHROMOSOME]:[START]-[END]
tabix [INPUT_VCF] [CHROM]:[POS]-[POS]
</pre>
</pre>


where [CHROMOSOME] is the chromosome name and [START] [END] are the start and end coordinates respectively.
<hr>
Or for a single coordinate:
 
<p><b>Q2.</b> Using tabix and <code>wc -l</code>, how many total variants are present in the 1 Mb region:</p>
 
<pre>
<pre>
tabix  [VCF to index] [CHROMOSOME]:[COORDINATE]-[COORDINATE]
chr20:32000000-33000000
</pre>
</pre>


'''Q2'''
<p>(Remember that tabix only returns data lines, so you can safely count lines with <code>wc -l</code>.)</p>


Use the command above, determine how many total variants are in the 1 million base pair region "chr20:32000000-33000000"? (hint, remember "wc -l" to count lines).
<hr>


'''Q3'''
<p><b>Q3.</b> <code>bcftools</code> can subset and filter VCF files.</p>


bcftools is a nifty utility that allows us to do various operations on VCF files. Type:
<p>Type:</p>


<pre>
<pre>
bcftools view
bcftools view
</pre>  
</pre>


It contains several options. Try to determine how many SNPs (excluding indels and multi-allelic sites) are there in the '''same region''' as above? (hint: you need to filter for a certain '''type''' of variant, hint2: be careful not to include the header in the number of lines (option: -H)).
<p>and look at the help text. Using <code>bcftools view</code>, determine how many <b>SNPs</b> (excluding indels and multi-allelic variants) are present in the <b>same region</b>:</p>


'''Q4'''
<pre>
chr20:32000000-33000000
</pre>


Use either tabix or bcftools to retrieve the SNP located on chromosome 20 at coordinate 32011209. Tell me in your own words what is the genotype? What about chromosome 20, coordinate 32044279. Become familiar with the different fields in the [https://samtools.github.io/hts-specs/VCFv4.2.pdf VCF specifications]. Pay attention to "1.4 Data lines". For the genotype fields (column 9 and 10) there are more info on the GATK fields: [https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format GATK's VCF-Variant-Call-Format].
<p>Hints:</p>
<ul>
  <li>Filter for variant type (SNPs only).</li>
  <li>Use <code>-H</code> to avoid counting header lines.</li>
  <li>Pipe the result to <code>wc -l</code> to count variants.</li>
</ul>


Also, answer for each site:
<hr>


<OL>
<p><b>Q4.</b> Retrieve the variants at:</p>
<LI> what is the allele depth i.e. how many bases are there of each type?
<LI> what is the depth of coverage i.e. how many bases cover this site?
<LI> What is your genotype quality?
<LI> What are your genotype likelihoods?
</OL>
'''Q5'''


Inspect the SNPS at positions:
<pre>
<pre>
chr20 32974911
chr20:32011209
chr20 64291638
chr20:32044279
</pre>
</pre>


One SNP has poor quality the other has good quality. Which is which? Why do you think this is? (hint: remember the class, the more data you have, the more you are sure).  
<p>You can use either tabix or bcftools, for example:</p>
 
<pre>
tabix NA24694.vcf.gz chr20:32011209-32011209
tabix NA24694.vcf.gz chr20:32044279-32044279
</pre>
 
<p>For each site, answer:</p>
<ol>
  <li>What is the genotype (e.g. 0/0, 0/1, 1/1)?</li>
  <li>What is the allele depth (AD) – how many reads support each allele?</li>
  <li>What is the total depth of coverage (DP) at this site?</li>
  <li>What is the genotype quality (GQ)?</li>
  <li>What are the genotype likelihoods (PL)?</li>
</ol>
 
<p>Use the VCF specification (<a href="https://samtools.github.io/hts-specs/VCFv4.2.pdf">VCFv4.2</a>, especially section 1.4 “Data lines”) and GATK’s VCF documentation (<a href="https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format">GATK VCF Format</a>) to interpret the FORMAT fields.</p>
 
<hr>
 
<p><b>Q5.</b> Inspect the SNPs at positions:</p>
 
<pre>
chr20    32974911
chr20    64291638
</pre>
 
<p>One of these SNPs has poor quality, the other has good quality.</p>
 
<ul>
  <li>Which is which?</li>
  <li>Why do you think this is the case? (Hint: think about depth, allele balance, and overall evidence.)</li>
</ul>
 
<hr>
 
<p><b>Q6.</b> Using the same region as in Q2/Q3:</p>
 
<pre>
chr20:32000000-33000000
</pre>
 
<p>How many SNPs in this region are <b>novel</b>, i.e. do not have an ID in dbSNP?</p>
 
<p>Hints:</p>
<ul>
  <li>The 3rd column (ID) contains dbSNP IDs (typically starting with <code>rs</code>) or <code>.</code> for novel variants.</li>
  <li>You can use <code>cut</code> to extract the ID column.</li>
  <li><code>grep</code> with <code>-v</code> can be used to exclude lines containing <code>rs</code>, or to include only those lines.</li>
</ul>


'''Q6'''
<hr>


Using the same region, 32000000-33000000, how many are novel (i.e. does not exist in databases of a large number of sampled genomed) SNPs? Note that the third column is the ID of a known SNP in dbSNP. You can use the '''cut'''  command to get a specific column. Also, every ID in dbSNP starts with "rs" and you could use '''grep''' to either retain such lines or filter them out (see option -v).
<p><b>Q7.</b> Compare your result from Q6 (number of novel SNPs) to the number of SNPs you found in Q3 (total SNPs in the region).</p>


'''Q7'''
<ul>
  <li>What fraction of SNPs in this region are novel?</li>
  <li>Does this fraction seem reasonable, given that human variation databases are large but still incomplete?</li>
</ul>


Contrast the number you obtained for questions 6 to the one you obtained for question 3. Do you think it's expected?
<hr>


'''Congratulations you finished the exercise!'''
<p><b>Congratulations, you finished the exercise!</b></p>


Please find answers [[SNP_calling_exercise_part_1_answers_|here]]
<p>You can find suggested answers here: <a href="SNP_calling_exercise_part_1_answers_">SNP_calling_exercise_part_1_answers_</a></p>

Revision as of 14:46, 25 November 2025

Overview

In this exercise you will perform basic variant calling and start exploring VCF files. You will:

  1. Run a germline variant caller on whole-genome sequencing data
  2. Get acquainted with VCF and gVCF formats
  3. Count and subset variants using command-line tools
  4. Compare “known” vs “novel” variants

First:

  1. Navigate to your home directory
  2. Create a directory called variant_call
  3. Enter the variant_call directory

Genotyping

We will genotype chromosome 20 from a BAM file that has been pre-processed (sorted, duplicate-marked, etc.)

The sample is a <a href="https://en.wikipedia.org/wiki/Han_Chinese">Han Chinese</a> male, sequenced to approximately 24.6× coverage.

/home/projects/22126_NGS/exercises/snp_calling/NA24694.bam

The BAM file is already indexed.

We will use GATK HaplotypeCaller to generate a gVCF file. A typical command looks like this:

gatk --java-options "-Xmx10g" HaplotypeCaller \
    -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    -I [INPUT_BAM] \
    -L chr20 \
    -O [OUTPUT_GVCF] \
    --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
    -ERC GVCF

Explanation of the key options:

  • -R – reference genome (GRCh38)
  • -I – input BAM file (here: NA24694.bam)
  • -L chr20 – restrict calling to chromosome 20
  • --dbsnp – annotate with known variants from dbSNP
  • -ERC GVCF – emit a gVCF (includes both variant and non-variant blocks)

Suggested output name:

NA24694.gvcf.gz

This command may take a while to complete. If you are short on time, you can use the precomputed file instead:

/home/projects/22126_NGS/exercises/snp_calling/NA24694.gvcf.gz

Take a quick look at the gVCF:

zcat NA24694.gvcf.gz | less -S

Notes:

  • Lines starting with # form the header.
  • Data lines have at least 10 columns. The first five are:
    • CHROM – chromosome name
    • POS – genomic coordinate
    • ID – variant identifier (e.g. dbSNP ID, or . if unknown)
    • REF – reference allele
    • ALT – alternate allele(s)
  • In gVCFs, you will often see <NON_REF> as the ALT allele for invariant blocks.

Indexing the gVCF

Before using the gVCF as input for other tools, index it with tabix:

tabix -f -p vcf [INPUT_GVCF]

This creates an index file with extension .tbi, allowing fast random access by position.

Genotyping the gVCF (producing a standard VCF)

Next, we convert the gVCF into a standard VCF with genotypes only at variant sites using GATK GenotypeGVCFs:

gatk GenotypeGVCFs \
    -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    -V [INPUT_GVCF] \
    -O [OUTPUT_VCF] \
    -L chr20 \
    --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz

Suggested output name:

NA24694.vcf.gz

This step is usually faster than HaplotypeCaller.

Index the VCF with tabix:

tabix -f -p vcf NA24694.vcf.gz

As with BAM indices, the VCF index allows fast region-based queries, but here using tabix rather than samtools.


Getting Acquainted with VCF Files

Q1. Using:

bcftools stats [INPUT_VCF]

Find the line that starts with:

SN    0    number of SNPs

How many SNPs are present in your VCF?


You can query specific regions of the VCF using tabix:

tabix [INPUT_VCF] [CHROM]:[START]-[END]

For a single coordinate:

tabix [INPUT_VCF] [CHROM]:[POS]-[POS]

Q2. Using tabix and wc -l, how many total variants are present in the 1 Mb region:

chr20:32000000-33000000

(Remember that tabix only returns data lines, so you can safely count lines with wc -l.)


Q3. bcftools can subset and filter VCF files.

Type:

bcftools view

and look at the help text. Using bcftools view, determine how many SNPs (excluding indels and multi-allelic variants) are present in the same region:

chr20:32000000-33000000

Hints:

  • Filter for variant type (SNPs only).
  • Use -H to avoid counting header lines.
  • Pipe the result to wc -l to count variants.

Q4. Retrieve the variants at:

chr20:32011209
chr20:32044279

You can use either tabix or bcftools, for example:

tabix NA24694.vcf.gz chr20:32011209-32011209
tabix NA24694.vcf.gz chr20:32044279-32044279

For each site, answer:

  1. What is the genotype (e.g. 0/0, 0/1, 1/1)?
  2. What is the allele depth (AD) – how many reads support each allele?
  3. What is the total depth of coverage (DP) at this site?
  4. What is the genotype quality (GQ)?
  5. What are the genotype likelihoods (PL)?

Use the VCF specification (<a href="https://samtools.github.io/hts-specs/VCFv4.2.pdf">VCFv4.2</a>, especially section 1.4 “Data lines”) and GATK’s VCF documentation (<a href="https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format">GATK VCF Format</a>) to interpret the FORMAT fields.


Q5. Inspect the SNPs at positions:

chr20    32974911
chr20    64291638

One of these SNPs has poor quality, the other has good quality.

  • Which is which?
  • Why do you think this is the case? (Hint: think about depth, allele balance, and overall evidence.)

Q6. Using the same region as in Q2/Q3:

chr20:32000000-33000000

How many SNPs in this region are novel, i.e. do not have an ID in dbSNP?

Hints:

  • The 3rd column (ID) contains dbSNP IDs (typically starting with rs) or . for novel variants.
  • You can use cut to extract the ID column.
  • grep with -v can be used to exclude lines containing rs, or to include only those lines.

Q7. Compare your result from Q6 (number of novel SNPs) to the number of SNPs you found in Q3 (total SNPs in the region).

  • What fraction of SNPs in this region are novel?
  • Does this fraction seem reasonable, given that human variation databases are large but still incomplete?

Congratulations, you finished the exercise!

You can find suggested answers here: <a href="SNP_calling_exercise_part_1_answers_">SNP_calling_exercise_part_1_answers_</a>