SNP calling exercise
Overview
First:
- Navigate to your home directory:
- Create a directory called "variant_call"
- Navigate to the directory you just created.
We will:
- Genotype some whole-genome sequencing data.
- Get acquainted with VCF files
- Soft filtering
- Hard filtering
- Annotation of variants
Genotyping
We will genotype a chromosome from a BAM file that has been processed using the steps we detailed before. It is from a Han Chinese Male with a depth of coverage of 24.6X.
/home/projects/22126_NGS/exercises/snp_calling/NA24694.bam
It has been indexed. We will first use GATK's HaplotypeCaller, the command-line look something like this:
gatk --java-options "-Xmx10g" HaplotypeCaller -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I [INPUT BAM] -L chr20 -O [OUTPUT ] --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF
" -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa " specifies a reference, "-L chr20" specifies only to consider chromosome 20, "-dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz ", add annotation from SNPs which are known to exist in the human population (see dbSNP). We should point out that the variation coming from Eurasians is more extensively represented in this database. However, the most genetically diverse populations are found in Africa. Finally, " -ERC GVCF" will generate a GVCF (Genomic Variant Call Format) which essentially contains all sites instead of just variant sites. The input is specified above, you can call the output: NA24694.gvcf.gz
The command takes between 20 to 30 minutes to run, so feel free to take a break. Or feel free to copy the one I generated:
/home/projects/22126_NGS/exercises/snp_calling/NA24694.gvcf.gz
We will have a brief look at the file:
zcat NA24694.gvcf.gz|less -S
You will see several lines starting with '#', this is the header portion, scroll down to see the calls. The first 5 columns represent, the name of the chromosome, the coordinate, snp ID (from dbSNP), the reference base and the alternative base. You should see a mix of sites, some with the mention <NON_REF> in the 5th column which corresponds to the alternative base, those sites are probably invariant, and others where there is a variant in the fifth column.
Prior to running GATK, we need to index it:
tabix -f -p vcf [VCF to index]
Next, we will use the preliminary variant to actually generate only variant sites:
gatk GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V [INPUT GVCF] -O [OUTPUT VCF] -L chr20 --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
In your case, the input is the file you generated above and the output can be called NA24694.vcf.gz. This command takes less than 5 minutes to run.
Then let us index the file with the same tabix command:
tabix -f -p vcf [VCF to index]
Just like "samtools index" can allow us to create an index to retrieve portions of a BAM file, tabix is another utility that allows us to retrieve portions of a GVCF file.
Get acquainted with VCF files
Q1
Using:
bcftools stats [input vcf]
How many SNPs do you have? (hint: find the line with: "SN 0 number of SNPs")
tabix -f -p vcf [VCF to index]
The index will be stored as a "tbi" file. Then, the VCF file can be queried:
tabix [VCF to index] [CHROMOSOME]:[START]-[END]
where [CHROMOSOME] is the chromosome name and [START] [END] are the start and end coordinates respectively. Or for a single coordinate:
tabix [VCF to index] [CHROMOSOME]:[COORDINATE]-[COORDINATE]
Q2
Use the command above, determine how many total variants are in the 1 million base pair region "chr20:32000000-33000000"? (hint, remember "wc -l" to count lines).
Q3
bcftools is a nifty utility that allows us to do various operations on VCF files. Type:
bcftools view
It contains several options. Try to determine how many SNPs (excluding indels and multi-allelic sites) are there in the same region as above? (hint: you need to filter for a certain type of variant, hint2: be careful not to include the header in the number of lines (option: -H)).
Q4
Use either tabix or bcftools to retrieve the SNP located on chromosome 20 at coordinate 32011209. Tell me in your own words what is the genotype? What about chromosome 20, coordinate 32044279. Become familiar with the different fields in the VCF specifications. Pay attention to "1.4 Data lines". For the genotype fields (column 9 and 10) there are more info on the GATK fields: GATK's VCF-Variant-Call-Format.
Also, answer for each site:
- what is the allele depth i.e. how many bases are there of each type?
- what is the depth of coverage i.e. how many bases cover this site?
- What is your genotype quality?
- What are your genotype likelihoods?
Q5
Inspect the SNPS at positions:
chr20 32974911 chr20 64291638
One SNP has poor quality the other has good quality. Which is which? Why do you think this is? (hint: remember the class, the more data you have, the more you are sure).
Q6
Using the same region, 32000000-33000000, how many are novel (i.e. does not exist in databases of a large number of sampled genomed) SNPs? Note that the third column is the ID of a known SNP in dbSNP. You can use the cut command to get a specific column. Also, every ID in dbSNP starts with "rs" and you could use grep to either retain such lines or filter them out (see option -v).
Q7
Contrast the number you obtained for questions 6 to the one you obtained for question 3. Do you think it's expected?
Filtering
We saw that the data contains some calls of poor quality. Ideally, we do not want to carry over these calls into downstream analyses. We will explore two philosophies to filter out genotype data.
Soft filtering
We saw that with this individual, it is unlikely to get a lot of calls there are completely unknown in the human population. We can leverage that fact to classify existing calls to try to learn what a false positive looks like. we will use VariantRecalibrator the try to build a model of what false-positive calls look like. First make a directory called
var_recal/
And run the following:
gatk --java-options "-Xmx10G" VariantRecalibrator -V [INPUT VCF] --rscript-file var_recal/NA24694_plots.R -O var_recal/NA24694_recal -mode SNP --tranches-file var_recal/NA24694_tranches -tranche 99.0 -tranche 95.0 -tranche 90.0 -tranche 85.0 -tranche 80.0 -tranche 75.0 -tranche 70.0 -tranche 65.0 -tranche 60.0 -tranche 58.0 -an QD -an DP -an FS -an MQRankSum -an ReadPosRankSum -an SOR -an MQ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /home/databases/databases/GRCh38/hapmap_3.3.hg38.vcf.gz -resource:omni,known=false,training=true,truth=false,prior=12.0 /home/databases/databases/GRCh38/1000G_omni2.5.hg38.vcf.gz -resource:1000G,known=false,training=true,truth=false,prior=10.0 /home/databases/databases/GRCh38/1000G_phase1.snps.high_confidence.hg38.vcf.gz -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
The first options define the different input/output files. The next defines the different quality "tranche" which is a sensitivity stratum. The higher the value, the more sensitive, the more variants made it through. The lower the value, the less sensitive but only high-quality SNPs make it.
The "-an" define which fields in the VCF to consider when building the model and the "-resource" define the different SNPs that were found in humans.
One interesting file which contains the level of true/false positive calls at different quality cutoffs can be seen using "okular":
okular var_recal/NA24694_tranches.pdf
In blue, you have the true positives, in orange, you have the false negatives. If the bar is in full color, it means that this share of true/false negatives were called after considering this tranche. The diagonal striped ones entail that this share of true/false negatives had been previous called by the previous more stringent tranche. In our example, 55.5 is the most stringent this is why all the colors are full.
Also, note that on the left is the transition/transversion ratio. Ideally, for humans, it should be close to 2.1. You see that it becomes closer to that value as the filtering becomes more stringent.
To reduce false calls, a good tranche selection should allow for as much as possible as true positives while not having too many false negatives. If only false negatives are added this is probably not a good cutoff. Ideally, you want to have as much good quality data as you can.
Q8
Based on this plot, what would be a good sensitivity tranch to use?
Filter your variants based on the quality tranche you selected (please see answers prior to running it) using:
gatk --java-options "-Xmx10G" ApplyVQSR -V [INPUT VCF] -O [OUTPUT VCF] --recal-file var_recal/NA24694_recal --tranches-file var_recal/NA24694_tranches -truth-sensitivity-filter-level 65 --create-output-variant-index true -mode SNP
The input is NA24694.vcf.gz and you can name your output NA24694_sf.vcf.gz. This will add a "PASS" or not in the 7th field.
Use the following command:
bcftools view --type=snps -f PASS NA24694_sf.vcf.gz |bgzip -c > NA24694_sf_pass.vcf.gz
to retain only variants that passed the filtering. bgzip is exactly like gzip however it is a block gzip which allows indexing using tabix. The "-c" is for compressing.
Q9
Repeat the experiment where you count the total number of SNPs in the region chr20:32000000-33000000, how many SNPs are you getting now?
Q10
Had you taken a sensitivity tranche of 80, with the number you've previously answered:
- go up
- go down
- stay the same
- all of the above
Hard filtering
A major downside of soft filtering is that does not apply for samples for which we do not have a good representation of the genetic diversity. An alternative is to do hard filtering meaning filtering the variants according to predetermined cutoffs. This has the downside of potentially introducing bias if the filter is correlated with the type of variant (ex: if heterozygous sites have higher genotype quality).
First we will consider the following mask file:
/home/databases/databases/GRCh38/mask99.bed.gz
It is in BED format which is for genomic intervals. These files are extensively used in next-generation sequencing analyses, have a look at the first lines, the format is:
[chromosome] [start coordinate (0-based)] [end coordinate (1-based)]
0-based means that the first base is at coordinates 0 (i.e 0 1 2 3 ...), 1-based means that the first base is at coordinate 1 (i.e 1 2 3 4...).
This mask file contains a set of genomic regions that you want to remove from downstream analyses. It is important to note that most genotypers do not take into account the fact that a genomic region can be duplicated. This is why it's that a mappability filter is also a good idea to use.
gatk VariantFiltration -V [INPUT VCF] -O [OUTPUT VCF] -filter "DP < 10.0" --filter-name "DP" -filter "QUAL < 30.0" --filter-name "QUAL30" -filter "SOR > 3.0" --filter-name "SOR3" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40"
This is what the different filter mean:
Filter | Meaning |
-filter "DP < 10.0" | sites with less than 10X depth of coverage are removed |
-filter "QUAL < 30.0" | sites with a variant quality less than 30 are removed (see the difference between variant quality and genotype quality (GQ) here ) |
-filter "SOR > 3.0" | sites with a strand odds ratio less than 3.0 are removed (strand bias is when a strand is favored over the other, read more here ) |
-filter "FS > 60.0" | sites with a Fisher's exact test (FS) for strand bias is greater than 60 are removed (read more about FS here ). |
-filter "MQ < 40.0" | sites where the median mapping quality of reads supporting is less than 40 are removed. |
In real life, there are no perfect filters, I suggest you progressively add them, measure their effectiveness and make sure that you do not introduce unwanted bias to your analyses.
Q11
How many sites have been filtered out? Remember that sites that passed the filter will have the string PASS as the seventh column, you can use grep with the trick we mentioned to remove lines with a specific string. How many SNPs were filtered out?
Q12
The filtering command leaves in the seventh column a string (specified via --filter-name above) to determine which filter failed sites that were removed. Can you use the command above and modify it using the commands like cut, sort and uniq to determine was the filter that filtered out the most sites? If you are starting to use UNIX, this question might be challenging.
Let's further remove sites that are in genomic regions of poor mappability. We can use bedtools which is a set of utilities to deal with BED files (merge, intersect, etc).
bedtools intersect -header -a [INPUT VCF] -b /home/databases/databases/GRCh38/filter99.bed.gz |bgzip -c > [OUTPUT VCF]
The "-header" just says print the header of [INPUT VCF]. Retain the sites that have passed the hard filtering that are contained in filter99.bed.gz and call the output NA24694_hf_map99.vcf.gz. The 99 is from the percentage of DNA fragments required to map uniquely at that particular position (read more here.
Q13
How many sites did you retain?
Annotation of variants
An interesting question is always would was the genomic context of the variants, are they in introns, exons, intergenic. We will use a program called snpEff to characterize the different genomic variants that we found. You can run the program as such:
java -jar /home/ctools/snpEff/snpEff.jar eff -dataDir /home/databases/databases/snpEff/ -htmlStats [OUTPUT HTML] GRCh38.99 [INPUT VCF] |bgzip -c > [OUTPUT VCF]
The -dataDir option specifies where to find the data. A human gene database was downloaded here: /home/databases/databases/snpEff/. "GRCh38.99" represents a specific version (hg38) of the human genome. As mentioned in previous exercises, be very careful to use the exact same version of the genome in your analyses.
Run it on file resulting from hard-filtering prior to the filtering by mappability (NA24694_hf.vcf.gz), create an output HTML named: NA24694_hf.html and a VCF output named: NA24694_hf_ann.vcf.gz.
Use firefox to open the HTML report:
firefox NA24694_hf.html
and answer the following questions:
Q14
What is the most common genomic region (exon, downstream, intron, UTR) of the variants we detected?
Q15
How many variants can lead to a codon change? See explanation about point mutations on Wikipedia
Please find answers here
Congratulations you finished the exercise!
Note: in these exercises and answers, sometimes piped the output of "bcftools view" into other programs, ideally you should use the flag:
-O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
and think about which one is appropriate for your situation whether you're storing data or piping to another program.