Longread exercise
Overview
First:
- Navigate to your home directory:
- Create a directory called "longread"
- Navigate to the directory you just created.
We will phase some variants using WhatsHap (no not the messaging app).
First, what is phasing?
Phasing means that we determine which base is on the same chromosome as another base for neighboring variants. Let's consider a small example with just two variants (single nucleotide polymorphisms or SNPs) to illustrate phasing:
- SNP1: Located on chromosome 1 at position 1000. The individual is heterozygous A or G.
- SNP2: Located on chromosome 1 at position 2000. The individual is heterozygous C or T.
Great! but do we have:
- A and C on the same chromosome and G and T on the other chromosome
- A and T on the same chromosome and G and C on the other chromosome
Without phasing we don't have this information. This is important because phasing informs us about the phenotypic (therefore for health and reaction to drugs/treatments) consequences of the different bases.
In a VCF, unphased variants will appear like this:
chr1 1000 rs123 A G 29 PASS INFO GT 0/1 chr1 2000 rs456 C T 29 PASS INFO GT 0/1
0/1 means heterozygous reference+alternative.
Now, if A and C are on the same chromosome, phased variants can appear as:
chr1 1000 rs123 A G 29 PASS INFO GT 0|1 chr1 2000 rs456 C T 29 PASS INFO GT 0|1
but if A and C are on different chromosomes, phased variants can appear as:
chr1 1000 rs123 A G 29 PASS INFO GT 0|1 chr1 2000 rs456 C T 29 PASS INFO GT 1|0
In this exercise, we will:
- Do standard genotyping using BGI sequencing from an Ashkenazi individual
- Align long reads from PacBio
- Learn how to install software using bioconda
- Use the long reads to phase our variants
Genotyping with BGI reads
The reads are here:
/home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz /home/projects/22126_NGS/exercises/long_reads/BGI2.fq.gz
They do not have adapters. As we have previously covered aligning and genotyping, you can copy paste the commands, just make sure you understand what you are doing. First, let's have a look at BGI-Seq data:
zcat /home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz |head
You will notice it is very much like Illumina in terms of read length and encoding.
Just go ahead and align them using bwa mem and sort them:
bwa mem -R "@RG\tID:HG002\tSM:HG002" -t 10 /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa BGI1.fq.gz BGI2.fq.gz |samtools view -uS - |samtools sort /dev/stdin > BGI_hg38.bam
Let's remove duplicates:
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates -I BGI_hg38.bam -M BGI_hg38_metrics.txt -O BGI_hg38_rmdup.bam
then index:
samtools index BGI_hg38_rmdup.bam
Let's genotype:
gatk --java-options "-Xmx10g" HaplotypeCaller -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I BGI_hg38_rmdup.bam -L chr20:2000000-3000000 -O BGI_hg38_chr20.gvcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF gatk IndexFeatureFile -I BGI_hg38_chr20.gvcf.gz gatk GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V BGI_hg38_chr20.gvcf.gz -O BGI_hg38_chr20.vcf.gz -L chr20:2000000-3000000 --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
Notice that GATK will phase variants on the same read (or pairs). Q1: How many variants are there? (hint: do not forget to remove the header) Q2: How many variants are phased? (hint: remove the header and look at the 10th column (the genotype info) using cut).
Q3: Consider rs4364082 and rs6051444 (hint search using grep). Why are these variants phased?
Align PacBio reads
First, let's have a look at PacBio data:
/home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz
Q4: What do you notice?
Let's align to hg38+sort:
/home/ctools/minimap2/minimap2-2.26_x64-linux/minimap2 -a /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.mini [input fastq here] |samtools view -uS - |samtools sort /dev/stdin > [output BAM here]
-a forces sam output which is converted to bam and sorted. Let's index:
samtools index [bam file]
Q5: What is the average read length? (hint: awk '{print length($10)}' prints the length of the 10th field, the sequences. hint2: to compute the average of the first column of numbers use: awk '{sum+=$1; n++} END {if(n>0) print sum/n})
Use bioconda to install software
Bioconda is a game-changer for anyone starting bioinformatics. It allows you to install software very easily and offers a vast repository of bioinformatics tools. It solves the problem of you needing library A to install software B and needing C to build A etc. All you need is to setup an "environment" where your software will be installed. An "environment" is a directory in your home dir where the software and its depencies will be installed.
Beware! The behavior of several commands like python will not be the same as it will use the python from your environment.
Let's install WhatsHap through bioconda. First, let's create an environment called whatshap-env and install whatshap:
/home/ctools/bin/conda create -n whatshap-env bioconda::whatshap
Then init the environment:
/home/ctools/bin/conda init bash
Log out and log back in
Activate the environment:
conda activate whatshap-env
Check the installation:
whatshap --help
Q6: Was that easy?
Phase variants using WhatsHap
Then let's phase our variants:
whatshap phase --ignore-read-groups --reference=/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -o [output vcf] [input vcf] [long reads bam]
Q7 How many extra variants are phased?
To deactivate conda write:
conda deactivate
You did not like conda? Do not forget to remove the following from your ~/.bashrc:
# >>> conda initialize >>> # !! Contents within this block are managed by 'conda init' !! __conda_setup="$('/home/ctools/anaconda3_2021.11/bin/conda' 'shell.bash' 'hook' 2> /dev/null)" if [ $? -eq 0 ]; then eval "$__conda_setup" else if [ -f "/home/ctools/anaconda3_2021.11/etc/profile.d/conda.sh" ]; then . "/home/ctools/anaconda3_2021.11/etc/profile.d/conda.sh" else export PATH="/home/ctools/anaconda3_2021.11/bin:$PATH" fi fi
and remove ~/.conda/:
rm -rfv ~/.conda/
Please find the answers here
Congratulations you finished the exercise!