SNP calling exercise part 1
Overview
In this exercise you will perform basic variant calling and start exploring VCF files. You will:
- Run a germline variant caller on whole-genome sequencing data
- Get acquainted with VCF and gVCF formats
- Count and subset variants using command-line tools
- Compare “known” vs “novel” variants
First:
- Navigate to your home directory
- Create a directory called
variant_call - Enter the
variant_calldirectory
Genotyping
We will genotype chromosome 20 from a BAM file that has been pre-processed (sorted, duplicate-marked, etc.)
The sample is a Han Chinese male, sequenced to approximately 24.6x coverage.
/home/projects/22126_NGS/exercises/snp_calling/NA24694.bam
The BAM file is already indexed.
We will use GATK HaplotypeCaller to generate a gVCF file. A typical command looks like this:
gatk --java-options "-Xmx10g" HaplotypeCaller \
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-I [INPUT_BAM] \
-L chr20 \
-O [OUTPUT_GVCF] \
--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
-ERC GVCF
Explanation of the key options:
-R– reference genome (GRCh38)-I– input BAM file (here:NA24694.bam)-L chr20– restrict calling to chromosome 20--dbsnp– annotate with known variants from dbSNP-ERC GVCF– emit a gVCF (includes both variant and non-variant blocks)
Suggested output name:
NA24694.gvcf.gz
This command may take a while to complete. If you are short on time, you can use the precomputed file instead:
/home/projects/22126_NGS/exercises/snp_calling/NA24694.gvcf.gz
Take a quick look at the gVCF:
zcat NA24694.gvcf.gz | less -S
Notes:
- Lines starting with
#form the header. - Data lines have at least 10 columns. The first five are:
- CHROM – chromosome name
- POS – genomic coordinate
- ID – variant identifier (e.g. dbSNP ID, or
.if unknown) - REF – reference allele
- ALT – alternate allele(s)
- In gVCFs, you will often see
<NON_REF>as the ALT allele for invariant blocks.
Indexing the gVCF
Before using the gVCF as input for other tools, index it with tabix:
tabix -f -p vcf [INPUT_GVCF]
This creates an index file with extension .tbi, allowing fast random access by position.
Genotyping the gVCF (producing a standard VCF)
Next, we convert the gVCF into a standard VCF with genotypes only at variant sites using GATK GenotypeGVCFs:
gatk GenotypeGVCFs \
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-V [INPUT_GVCF] \
-O [OUTPUT_VCF] \
-L chr20 \
--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
Suggested output name:
NA24694.vcf.gz
This step is usually faster than HaplotypeCaller.
Index the VCF with tabix:
tabix -f -p vcf NA24694.vcf.gz
As with BAM indices, the VCF index allows fast region-based queries, but here using tabix rather than samtools.
Getting Acquainted with VCF Files
Q1. Using:
bcftools stats [INPUT_VCF]
Find the line that starts with:
SN 0 number of SNPs
How many SNPs are present in your VCF?
You can query specific regions of the VCF using tabix:
tabix [INPUT_VCF] [CHROM]:[START]-[END]
For a single coordinate:
tabix [INPUT_VCF] [CHROM]:[POS]-[POS]
Q2. Using tabix and wc -l, how many total variants are present in the 1 Mb region:
chr20:32000000-33000000
(Remember that tabix only returns data lines, so you can safely count lines with wc -l.)
Q3. bcftools can subset and filter VCF files.
Type:
bcftools view
and look at the help text. Using bcftools view, determine how many SNPs (excluding indels and multi-allelic variants) are present in the same region:
chr20:32000000-33000000
Hints:
- Filter for variant type (SNPs only).
- Use
-Hto avoid counting header lines. - Pipe the result to
wc -lto count variants.
Q4. Retrieve the variants at:
chr20:32011209 chr20:32044279
You can use either tabix or bcftools, for example:
tabix NA24694.vcf.gz chr20:32011209-32011209 tabix NA24694.vcf.gz chr20:32044279-32044279
For each site, answer:
- What is the genotype (e.g. 0/0, 0/1, 1/1)?
- What is the allele depth (AD) – how many reads support each allele?
- What is the total depth of coverage (DP) at this site?
- What is the genotype quality (GQ)?
- What are the genotype likelihoods (PL)?
Use the VCF specification (<a href="https://samtools.github.io/hts-specs/VCFv4.2.pdf">VCFv4.2</a>, especially section 1.4 “Data lines”) and GATK’s VCF documentation (<a href="https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format">GATK VCF Format</a>) to interpret the FORMAT fields.
Q5. Inspect the SNPs at positions:
chr20 32974911 chr20 64291638
One of these SNPs has poor quality, the other has good quality.
- Which is which?
- Why do you think this is the case? (Hint: think about depth, allele balance, and overall evidence.)
Q6. Using the same region as in Q2/Q3:
chr20:32000000-33000000
How many SNPs in this region are novel, i.e. do not have an ID in dbSNP?
Hints:
- The 3rd column (ID) contains dbSNP IDs (typically starting with
rs) or.for novel variants. - You can use
cutto extract the ID column. grepwith-vcan be used to exclude lines containingrs, or to include only those lines.
Q7. Compare your result from Q6 (number of novel SNPs) to the number of SNPs you found in Q3 (total SNPs in the region).
- What fraction of SNPs in this region are novel?
- Does this fraction seem reasonable, given that human variation databases are large but still incomplete?
Congratulations, you finished the exercise!