SNP calling exercise part 1

From 22126
Revision as of 14:48, 25 November 2025 by Mick (talk | contribs)
Jump to navigation Jump to search

Overview

In this exercise you will perform basic variant calling and start exploring VCF files. You will:

  1. Run a germline variant caller on whole-genome sequencing data
  2. Get acquainted with VCF and gVCF formats
  3. Count and subset variants using command-line tools
  4. Compare “known” vs “novel” variants

First:

  1. Navigate to your home directory
  2. Create a directory called variant_call
  3. Enter the variant_call directory

Genotyping

We will genotype chromosome 20 from a BAM file that has been pre-processed (sorted, duplicate-marked, etc.)

The sample is a Han Chinese male, sequenced to approximately 24.6x coverage.

/home/projects/22126_NGS/exercises/snp_calling/NA24694.bam

The BAM file is already indexed.

We will use GATK HaplotypeCaller to generate a gVCF file. A typical command looks like this:

gatk --java-options "-Xmx10g" HaplotypeCaller \
    -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    -I [INPUT_BAM] \
    -L chr20 \
    -O [OUTPUT_GVCF] \
    --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
    -ERC GVCF

Explanation of the key options:

  • -R – reference genome (GRCh38)
  • -I – input BAM file (here: NA24694.bam)
  • -L chr20 – restrict calling to chromosome 20
  • --dbsnp – annotate with known variants from dbSNP
  • -ERC GVCF – emit a gVCF (includes both variant and non-variant blocks)

Suggested output name:

NA24694.gvcf.gz

This command may take a while to complete. If you are short on time, you can use the precomputed file instead:

/home/projects/22126_NGS/exercises/snp_calling/NA24694.gvcf.gz

Take a quick look at the gVCF:

zcat NA24694.gvcf.gz | less -S

Notes:

  • Lines starting with # form the header.
  • Data lines have at least 10 columns. The first five are:
    • CHROM – chromosome name
    • POS – genomic coordinate
    • ID – variant identifier (e.g. dbSNP ID, or . if unknown)
    • REF – reference allele
    • ALT – alternate allele(s)
  • In gVCFs, you will often see <NON_REF> as the ALT allele for invariant blocks.

Indexing the gVCF

Before using the gVCF as input for other tools, index it with tabix:

tabix -f -p vcf [INPUT_GVCF]

This creates an index file with extension .tbi, allowing fast random access by position.

Genotyping the gVCF (producing a standard VCF)

Next, we convert the gVCF into a standard VCF with genotypes only at variant sites using GATK GenotypeGVCFs:

gatk GenotypeGVCFs \
    -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    -V [INPUT_GVCF] \
    -O [OUTPUT_VCF] \
    -L chr20 \
    --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz

Suggested output name:

NA24694.vcf.gz

This step is usually faster than HaplotypeCaller.

Index the VCF with tabix:

tabix -f -p vcf NA24694.vcf.gz

As with BAM indices, the VCF index allows fast region-based queries, but here using tabix rather than samtools.


Getting Acquainted with VCF Files

Q1. Using:

bcftools stats [INPUT_VCF]

Find the line that starts with:

SN    0    number of SNPs

How many SNPs are present in your VCF?


You can query specific regions of the VCF using tabix:

tabix [INPUT_VCF] [CHROM]:[START]-[END]

For a single coordinate:

tabix [INPUT_VCF] [CHROM]:[POS]-[POS]

Q2. Using tabix and wc -l, how many total variants are present in the 1 Mb region:

chr20:32000000-33000000

(Remember that tabix only returns data lines, so you can safely count lines with wc -l.)


Q3. bcftools can subset and filter VCF files.

Type:

bcftools view

and look at the help text. Using bcftools view, determine how many SNPs (excluding indels and multi-allelic variants) are present in the same region:

chr20:32000000-33000000

Hints:

  • Filter for variant type (SNPs only).
  • Use -H to avoid counting header lines.
  • Pipe the result to wc -l to count variants.

Q4. Retrieve the variants at:

chr20:32011209
chr20:32044279

You can use either tabix or bcftools, for example:

tabix NA24694.vcf.gz chr20:32011209-32011209
tabix NA24694.vcf.gz chr20:32044279-32044279

For each site, answer:

  1. What is the genotype (e.g. 0/0, 0/1, 1/1)?
  2. What is the allele depth (AD) – how many reads support each allele?
  3. What is the total depth of coverage (DP) at this site?
  4. What is the genotype quality (GQ)?
  5. What are the genotype likelihoods (PL)?

Use the VCF specification (<a href="https://samtools.github.io/hts-specs/VCFv4.2.pdf">VCFv4.2</a>, especially section 1.4 “Data lines”) and GATK’s VCF documentation (<a href="https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format">GATK VCF Format</a>) to interpret the FORMAT fields.


Q5. Inspect the SNPs at positions:

chr20    32974911
chr20    64291638

One of these SNPs has poor quality, the other has good quality.

  • Which is which?
  • Why do you think this is the case? (Hint: think about depth, allele balance, and overall evidence.)

Q6. Using the same region as in Q2/Q3:

chr20:32000000-33000000

How many SNPs in this region are novel, i.e. do not have an ID in dbSNP?

Hints:

  • The 3rd column (ID) contains dbSNP IDs (typically starting with rs) or . for novel variants.
  • You can use cut to extract the ID column.
  • grep with -v can be used to exclude lines containing rs, or to include only those lines.

Q7. Compare your result from Q6 (number of novel SNPs) to the number of SNPs you found in Q3 (total SNPs in the region).

  • What fraction of SNPs in this region are novel?
  • Does this fraction seem reasonable, given that human variation databases are large but still incomplete?

Congratulations, you finished the exercise!

You can find suggested answers here: <a href="SNP_calling_exercise_part_1_answers_">SNP_calling_exercise_part_1_answers_</a>