WikiSysop: Created page

2024-03-19T15:41:18Z

<H2>Overview</H2>

First:
<OL>
<LI>Navigate to your home directory.
<LI>Create a directory called "longread".
<LI>Navigate to the directory you just created.
</OL>
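
One possible way to carry out these three steps from the command line is sketched below. It assumes your home directory is available as $HOME; the -p option is only there so that the command does not fail if the directory already exists.

```shell
cd "$HOME"            # 1. go to your home directory
mkdir -p longread     # 2. create the working directory (no error if it already exists)
cd longread           # 3. move into it
pwd                   # print where we ended up
```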

We will phase some variants using [https://www.biorxiv.org/content/10.1101/085050v2 WhatsHap] (no, not the messaging app).

First, what is phasing?

Phasing means determining, for neighboring variants, which alleles lie on the same copy of a chromosome.
Let's consider a small example with just two variants (single nucleotide polymorphisms, or SNPs) to illustrate phasing:

<OL>
<LI> SNP1: Located on chromosome 1 at position 1000. The individual is heterozygous, with alleles A and G.
<LI> SNP2: Located on chromosome 1 at position 2000. The individual is heterozygous, with alleles C and T.
</OL>

Great! But do we have:
<OL>
<LI> A and C on the same chromosome and G and T on the other chromosome, or
<LI> A and T on the same chromosome and G and C on the other chromosome?
</OL>

Without phasing we don't have this information. This matters because the phenotypic consequences of variants, and therefore their effect on health and on the reaction to drugs or treatments, can depend on whether the variants sit on the same copy of the chromosome.

In a VCF, unphased variants will appear like this:
<pre>
chr1 1000 rs123 A G 29 PASS INFO GT 0/1
chr1 2000 rs456 C T 29 PASS INFO GT 0/1
</pre>

'''0/1''' means heterozygous: one reference allele (0) and one alternative allele (1). The slash indicates that the genotype is unphased.

Now, if A and C are on the same chromosome, phased variants can appear as:
<pre>
chr1 1000 rs123 A G 29 PASS INFO GT 0|1
chr1 2000 rs456 C T 29 PASS INFO GT 0|1
</pre>

But if A and C are on different chromosomes, phased variants can appear as:
<pre>
chr1 1000 rs123 A G 29 PASS INFO GT 0|1
chr1 2000 rs456 C T 29 PASS INFO GT 1|0
</pre>
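
As a rough sketch of how phased genotypes translate into haplotypes, the snippet below rebuilds the two chromosome copies from the last example. The file name toy_phased.txt is made up for illustration, the records are tab-separated stand-ins rather than a complete VCF, and the snippet assumes that both records belong to the same phase set, which a real VCF would have to confirm.

```shell
# Toy phased records (tab-separated, VCF-like; not a complete VCF file).
printf 'chr1\t1000\trs123\tA\tG\t29\tPASS\t.\tGT\t0|1\n' >  toy_phased.txt
printf 'chr1\t2000\trs456\tC\tT\t29\tPASS\t.\tGT\t1|0\n' >> toy_phased.txt

# The allele index left of "|" goes to haplotype 1, the one right of it to haplotype 2.
# Index 0 selects REF (column 4), index 1 selects ALT (column 5).
haplotypes=$(awk -F'\t' '{
    split($10, gt, "|")
    h1 = h1 (gt[1] == 0 ? $4 : $5)
    h2 = h2 (gt[2] == 0 ? $4 : $5)
} END { print h1, h2 }' toy_phased.txt)

echo "$haplotypes"
```

Here A ends up with T on one haplotype and G with C on the other, which is the second of the two possibilities listed earlier.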

In this exercise, we will:
<OL>
<LI> Do standard genotyping using BGI sequencing reads from an [https://en.wikipedia.org/wiki/Ashkenazi_Jews Ashkenazi] individual
<LI> Align long reads from PacBio
<LI> Learn how to install software using [https://bioconda.github.io/ bioconda]
<LI> Use the long reads to phase our variants
</OL>

<H2>Genotyping with BGI reads</H2>

The reads are here:
<pre>
/home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz
/home/projects/22126_NGS/exercises/long_reads/BGI2.fq.gz
</pre>

They do not have adapters. As we have previously covered aligning and genotyping, you can copy and paste the commands; just make sure you understand what you are doing. First, let's have a look at the BGI-Seq data:
<pre>
zcat /home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz | head
</pre>

You will notice it is very much like Illumina in terms of read length and encoding.
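
To go beyond eyeballing the first few records, read count and read lengths can be pulled out with awk. The sketch below runs on a made-up two-read file (toy.fq is a hypothetical name) and assumes the usual layout of exactly four lines per read; for the real data you would pipe the output of zcat into the same awk commands.

```shell
# Two made-up FASTQ records (4 lines each: header, sequence, "+", qualities).
printf '@read1\nACGTACGTAC\n+\nFFFFFFFFFF\n@read2\nACGTAC\n+\nFFFFFF\n' > toy.fq

# Number of reads = number of lines divided by 4.
n_reads=$(awk 'END {print NR / 4}' toy.fq)

# The sequence is the 2nd line of every 4-line record.
lengths=$(awk 'NR % 4 == 2 {print length($0)}' toy.fq | paste -s -d ' ' -)

echo "reads: $n_reads; lengths: $lengths"
```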

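Just go ahead and align them using bwa mem and sort them (the reads are given with their full paths, since they are not in your working directory):

<pre>
bwa mem -R "@RG\tID:HG002\tSM:HG002" -t 10 \
    /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    /home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz \
    /home/projects/22126_NGS/exercises/long_reads/BGI2.fq.gz \
    | samtools view -uS - | samtools sort /dev/stdin > BGI_hg38.bam
</pre>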

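Let's mark duplicates (by default Picard flags duplicate reads rather than deleting them, and GATK then ignores the flagged reads):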
<pre>
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates -I BGI_hg38.bam -M BGI_hg38_metrics.txt -O BGI_hg38_rmdup.bam
</pre>

then index:
<pre>
samtools index BGI_hg38_rmdup.bam
</pre>

Let's genotype:
<pre>
gatk --java-options "-Xmx10g" HaplotypeCaller -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I BGI_hg38_rmdup.bam -L chr20:2000000-3000000 -O BGI_hg38_chr20.gvcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF
gatk IndexFeatureFile -I BGI_hg38_chr20.gvcf.gz
gatk GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V BGI_hg38_chr20.gvcf.gz -O BGI_hg38_chr20.vcf.gz -L chr20:2000000-3000000 --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
</pre>

Notice that GATK will already phase nearby variants that are covered by the same read (or read pair).

'''Q1''': How many variants are there? (hint: do not forget to remove the header)

'''Q2''': How many variants are phased? (hint: remove the header and look at the 10th column (the genotype info) using cut).

'''Q3''': Consider rs4364082 and rs6051444 (hint: search using grep). Why are these variants phased?
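
The hints for Q1 and Q2 can be tried out on a tiny made-up VCF first. In the sketch below, toy.vcf and its three records are invented for illustration. It assumes that GT is the first colon-separated subfield of column 10, so that a "|" appearing in another subfield is not counted by mistake. For a gzipped VCF, feed the output of zcat into the same pipeline.

```shell
# A made-up VCF with two header lines and three records.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG002\n'
  printf 'chr20\t2000100\t.\tA\tG\t50\t.\t.\tGT\t0/1\n'
  printf 'chr20\t2000200\t.\tC\tT\t50\t.\t.\tGT\t0|1\n'
  printf 'chr20\t2000300\t.\tG\tA\t50\t.\t.\tGT\t1|0\n'
} > toy.vcf

# Count records: every line that does not start with "#".
n_variants=$(grep -vc '^#' toy.vcf)

# Count phased records: keep column 10, then its GT subfield, then look for "|".
n_phased=$(grep -v '^#' toy.vcf | cut -f10 | cut -d: -f1 | grep -c '|')

echo "variants: $n_variants, phased: $n_phased"
```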

<H2>Align PacBio reads</H2>

First, let's have a look at the PacBio data:

<pre>
zcat /home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz | head
</pre>

'''Q4''': What do you notice?

Let's align to hg38 and sort:
<pre>
/home/ctools/minimap2/minimap2-2.26_x64-linux/minimap2 -a /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.mini [input fastq here] |samtools view -uS - |samtools sort /dev/stdin > [output BAM here]
</pre>

The -a option makes minimap2 write SAM (its default output format is PAF); the SAM stream is then converted to BAM and sorted.

Let's index:
<pre>
samtools index [bam file]
</pre>

'''Q5''': What is the average read length? (hint 1: awk '{print length($10)}' prints the length of the 10th field, the read sequence. hint 2: to compute the average of the first column of numbers, use: awk '{sum+=$1; n++} END {if(n>0) print sum/n}')
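
The two hints can be chained with a pipe. The sketch below does so on three made-up SAM-like records (toy.sam is a hypothetical file) so that the result is easy to check by hand: the reads are 4, 8 and 6 bases long. On your real BAM, the records would come from samtools view instead of a text file.

```shell
# Three made-up SAM-like records; only field 10 (the read sequence) matters here.
printf 'r1\t0\tchr20\t100\t60\t4M\t*\t0\t0\tACGT\t*\n'     >  toy.sam
printf 'r2\t0\tchr20\t200\t60\t8M\t*\t0\t0\tACGTACGT\t*\n' >> toy.sam
printf 'r3\t0\tchr20\t300\t60\t6M\t*\t0\t0\tACGTAC\t*\n'   >> toy.sam

# First awk: one read length per line. Second awk: average of those numbers.
avg=$(awk -F'\t' '{print length($10)}' toy.sam \
      | awk '{sum+=$1; n++} END {if(n>0) print sum/n}')

echo "$avg"
```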

<H2>Use bioconda to install software</H2>

Bioconda is a game-changer for anyone starting out in bioinformatics. It offers a vast repository of bioinformatics tools and makes installing them very easy. It solves the problem of needing library A to install software B, needing C to build A, and so on. All you need is to set up an "environment" where your software will be installed. An "environment" is a directory in your home directory where the software and its dependencies will be installed.

Beware! While an environment is active, commands such as python may behave differently, because the shell will run the version installed in your environment instead of the system one.
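
Why a command can change under your feet is easiest to see with a small simulation. Among other things, activating an environment puts the environment's own bin directory at the front of PATH, so the shell finds that copy first. The sketch below imitates this with a temporary directory and a dummy python script; both are stand-ins invented for illustration, not what conda actually creates.

```shell
# Hypothetical stand-in for an environment's bin directory.
fake_env=$(mktemp -d)
printf '#!/bin/sh\necho "python from the fake environment"\n' > "$fake_env/python"
chmod +x "$fake_env/python"

# Put that directory first on PATH, as activation would do for the real one.
PATH="$fake_env:$PATH"
hash -r   # forget any previously cached command locations

# Ask the shell which python it would run now.
resolved=$(command -v python)
echo "$resolved"
```

Running command -v python (or which python) before and after conda activate is a quick way to check which copy you are about to use.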

Let's install WhatsHap through bioconda. First, let's create an environment called whatshap-env and install whatshap:

<pre>
/home/ctools/bin/conda create -n whatshap-env bioconda::whatshap
</pre>

Then initialize conda for bash (this adds a block to your ~/.bashrc):
<pre>
/home/ctools/bin/conda init bash
</pre>

'''Log out and log back in'''

Activate the environment:
<pre>
conda activate whatshap-env
</pre>

Check the installation:
<pre>
whatshap --help
</pre>

'''Q6''': Was that easy?

<H2>Phase variants using WhatsHap</H2>

Then let's phase our variants:
<pre>
whatshap phase --ignore-read-groups --reference=/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -o [output vcf] [input vcf] [long reads bam]
</pre>

'''Q7''': How many extra variants are phased?
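
One way to approach Q7 is to count phased genotypes in the VCF before and after phasing and take the difference. The sketch below does this on two made-up files (toy_before.vcf and toy_after.vcf are hypothetical names, and count_phased is a helper defined only for this example). It assumes that GT is the first colon-separated subfield of column 10. For gzipped VCFs, replace the first grep with zcat piped into grep.

```shell
# Count records whose GT subfield (first colon-separated part of column 10) is phased.
count_phased() {
    grep -v '^#' "$1" | cut -f10 | cut -d: -f1 | grep -c '|'
}

# Made-up "before" and "after" records; the real files come from GATK and WhatsHap.
printf 'chr20\t100\t.\tA\tG\t50\t.\t.\tGT\t0/1\n'         >  toy_before.vcf
printf 'chr20\t200\t.\tC\tT\t50\t.\t.\tGT\t0|1\n'         >> toy_before.vcf
printf 'chr20\t100\t.\tA\tG\t50\t.\t.\tGT:PS\t0|1:100\n'  >  toy_after.vcf
printf 'chr20\t200\t.\tC\tT\t50\t.\t.\tGT:PS\t1|0:100\n'  >> toy_after.vcf

extra=$(( $(count_phased toy_after.vcf) - $(count_phased toy_before.vcf) ))
echo "extra phased variants: $extra"
```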

To deactivate conda write:
<pre>
conda deactivate
</pre>

You did not like conda? Do not forget to remove the following from your ~/.bashrc:

<pre>
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/ctools/anaconda3_2021.11/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/ctools/anaconda3_2021.11/etc/profile.d/conda.sh" ]; then
        . "/home/ctools/anaconda3_2021.11/etc/profile.d/conda.sh"
    else
        export PATH="/home/ctools/anaconda3_2021.11/bin:$PATH"
    fi
fi
unset __conda_setup
# <<< conda initialize <<<
</pre>

and remove ~/.conda/:

<pre>
rm -rfv ~/.conda/
</pre>

Please find the answers [[Longread_exercise_answers|here]].

'''Congratulations, you finished the exercise!'''
