SNP calling exercise part 1
Overview
First:
- Navigate to your home directory:
- Create a directory called "variant_call"
- Navigate to the directory you just created.
We will:
- Genotype some whole-genome sequencing data.
- Get acquainted with VCF files
- Soft filtering
- Hard filtering
- Annotation of variants
Genotyping
We will genotype a chromosome from a BAM file that has been processed using the steps we detailed before. It is from a Han Chinese Male with a depth of coverage of 24.6X.
/home/projects/22126_NGS/exercises/snp_calling/NA24694.bam
It has been indexed. We will first use GATK's HaplotypeCaller, the command-line look something like this:
gatk --java-options "-Xmx10g" HaplotypeCaller -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I [INPUT BAM] -L chr20 -O [OUTPUT ] --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF
" -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa " specifies a reference, "-L chr20" specifies only to consider chromosome 20, "-dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz ", add annotation from SNPs which are known to exist in the human population (see dbSNP). We should point out that the variation coming from Eurasians is more extensively represented in this database. However, the most genetically diverse populations are found in Africa. Finally, " -ERC GVCF" will generate a GVCF (Genomic Variant Call Format) which essentially contains all sites instead of just variant sites. The input is specified above, you can call the output: NA24694.gvcf.gz
The command takes between 20 to 30 minutes to run, so feel free to take a break. Or feel free to copy the one I generated:
/home/projects/22126_NGS/exercises/snp_calling/NA24694.gvcf.gz
We will have a brief look at the file:
zcat NA24694.gvcf.gz|less -S
You will see several lines starting with '#', this is the header portion, scroll down to see the calls. The first 5 columns represent, the name of the chromosome, the coordinate, snp ID (from dbSNP), the reference base and the alternative base. You should see a mix of sites, some with the mention <NON_REF> in the 5th column which corresponds to the alternative base, those sites are probably invariant, and others where there is a variant in the fifth column.
Prior to running GATK, we need to index it:
tabix -f -p vcf [VCF to index]
Next, we will use the preliminary variant to actually generate only variant sites:
gatk GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V [INPUT GVCF] -O [OUTPUT VCF] -L chr20 --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
In your case, the input is the file you generated above and the output can be called NA24694.vcf.gz. This command takes less than 5 minutes to run.
Then let us index the file with the same tabix command:
tabix -f -p vcf [VCF to index]
Just like "samtools index" can allow us to create an index to retrieve portions of a BAM file, tabix is another utility that allows us to retrieve portions of a GVCF file.
Get acquainted with VCF files
Q1
Using:
bcftools stats [input vcf]
How many SNPs do you have? (hint: find the line with: "SN 0 number of SNPs")
tabix -f -p vcf [VCF to index]
The index will be stored as a "tbi" file. Then, the VCF file can be queried:
tabix [VCF to index] [CHROMOSOME]:[START]-[END]
where [CHROMOSOME] is the chromosome name and [START] [END] are the start and end coordinates respectively. Or for a single coordinate:
tabix [VCF to index] [CHROMOSOME]:[COORDINATE]-[COORDINATE]
Q2
Use the command above, determine how many total variants are in the 1 million base pair region "chr20:32000000-33000000"? (hint, remember "wc -l" to count lines).
Q3
bcftools is a nifty utility that allows us to do various operations on VCF files. Type:
bcftools view
It contains several options. Try to determine how many SNPs (excluding indels and multi-allelic sites) are there in the same region as above? (hint: you need to filter for a certain type of variant, hint2: be careful not to include the header in the number of lines (option: -H)).
Q4
Use either tabix or bcftools to retrieve the SNP located on chromosome 20 at coordinate 32011209. Tell me in your own words what is the genotype? What about chromosome 20, coordinate 32044279. Become familiar with the different fields in the VCF specifications. Pay attention to "1.4 Data lines". For the genotype fields (column 9 and 10) there are more info on the GATK fields: GATK's VCF-Variant-Call-Format.
Also, answer for each site:
- what is the allele depth i.e. how many bases are there of each type?
- what is the depth of coverage i.e. how many bases cover this site?
- What is your genotype quality?
- What are your genotype likelihoods?
Q5
Inspect the SNPS at positions:
chr20 32974911 chr20 64291638
One SNP has poor quality the other has good quality. Which is which? Why do you think this is? (hint: remember the class, the more data you have, the more you are sure).
Q6
Using the same region, 32000000-33000000, how many are novel (i.e. does not exist in databases of a large number of sampled genomed) SNPs? Note that the third column is the ID of a known SNP in dbSNP. You can use the cut command to get a specific column. Also, every ID in dbSNP starts with "rs" and you could use grep to either retain such lines or filter them out (see option -v).
Q7
Contrast the number you obtained for questions 6 to the one you obtained for question 3. Do you think it's expected?
Congratulations you finished the exercise!
Please find answers here