Longread exercise

From 22126
Revision as of 16:41, 19 March 2024 by WikiSysop (talk | contribs) (Created page with "<H2>Overview</H2> First: <OL> <LI>Navigate to your home directory: <LI>Create a directory called "longread" <LI>Navigate to the directory you just created. </OL> We will phase some variants using [https://www.biorxiv.org/content/10.1101/085050v2 WhatsHap] (no not the messaging app). First, what is phasing? Phasing means that we determine which base is on the same chromosome as another base for neighboring variants. Let's consider a small example with just two varia...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Overview

First:

  1. Navigate to your home directory:
  2. Create a directory called "longread"
  3. Navigate to the directory you just created.

We will phase some variants using WhatsHap (no not the messaging app).

First, what is phasing?

Phasing means that we determine which base is on the same chromosome as another base for neighboring variants. Let's consider a small example with just two variants (single nucleotide polymorphisms or SNPs) to illustrate phasing:

  1. SNP1: Located on chromosome 1 at position 1000. The individual is heterozygous A or G.
  2. SNP2: Located on chromosome 1 at position 2000. The individual is heterozygous C or T.

Great! but do we have:

  1. A and C on the same chromosome and G and T on the other chromosome
  2. A and T on the same chromosome and G and C on the other chromosome

Without phasing we don't have this information. This is important because phasing informs us about the phenotypic (therefore for health and reaction to drugs/treatments) consequences of the different bases.

In a VCF, unphased variants will appear like this:

chr1     1000    rs123  A     G      29    PASS    INFO           GT  0/1
chr1     2000    rs456  C     T      29    PASS    INFO           GT  0/1

0/1 means heterozygous reference+alternative.

Now, if A and C are on the same chromosome, phased variants can appear as:

chr1     1000    rs123  A     G      29    PASS    INFO           GT  0|1
chr1     2000    rs456  C     T      29    PASS    INFO           GT  0|1

but if A and C are on different chromosomes, phased variants can appear as:

chr1     1000    rs123  A     G      29    PASS    INFO           GT  0|1
chr1     2000    rs456  C     T      29    PASS    INFO           GT  1|0


In this exercise, we will:

  1. Do standard genotyping using BGI sequencing from an Ashkenazi individual
  2. Align long reads from PacBio
  3. Learn how to install software using bioconda
  4. Use the long reads to phase our variants


Genotyping with BGI reads

The reads are here:

/home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz
/home/projects/22126_NGS/exercises/long_reads/BGI2.fq.gz

They do not have adapters. As we have previously covered aligning and genotyping, you can copy paste the commands, just make sure you understand what you are doing. First, let's have a look at BGI-Seq data:

zcat /home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz |head

You will notice it is very much like Illumina in terms of read length and encoding.

Just go ahead and align them using bwa mem and sort them:

bwa mem -R "@RG\tID:HG002\tSM:HG002" -t 10 /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa  BGI1.fq.gz BGI2.fq.gz |samtools view -uS - |samtools sort /dev/stdin > BGI_hg38.bam


Let's remove duplicates:

java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates -I BGI_hg38.bam -M BGI_hg38_metrics.txt -O BGI_hg38_rmdup.bam

then index:

samtools index BGI_hg38_rmdup.bam

Let's genotype:

gatk --java-options "-Xmx10g" HaplotypeCaller    -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa    -I  BGI_hg38_rmdup.bam  -L chr20:2000000-3000000  -O  BGI_hg38_chr20.gvcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz  -ERC GVCF
gatk  IndexFeatureFile -I BGI_hg38_chr20.gvcf.gz
gatk  GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V BGI_hg38_chr20.gvcf.gz  -O BGI_hg38_chr20.vcf.gz  -L chr20:2000000-3000000 --dbsnp  /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz 

Notice that GATK will phase variants on the same read (or pairs). Q1: How many variants are there? (hint: do not forget to remove the header) Q2: How many variants are phased? (hint: remove the header and look at the 10th column (the genotype info) using cut).

Q3: Consider rs4364082 and rs6051444 (hint search using grep). Why are these variants phased?


Align PacBio reads

First, let's have a look at PacBio data:

/home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz

Q4: What do you notice?

Let's align to hg38+sort:

/home/ctools/minimap2/minimap2-2.26_x64-linux/minimap2 -a   /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.mini  [input fastq here] |samtools view -uS - |samtools sort /dev/stdin > [output BAM here]

-a forces sam output which is converted to bam and sorted. Let's index:

samtools index [bam file]

Q5: What is the average read length? (hint: awk '{print length($10)}' prints the length of the 10th field, the sequences. hint2: to compute the average of the first column of numbers use: awk '{sum+=$1; n++} END {if(n>0) print sum/n})

Use bioconda to install software

Bioconda is a game-changer for anyone starting bioinformatics. It allows you to install software very easily and offers a vast repository of bioinformatics tools. It solves the problem of you needing library A to install software B and needing C to build A etc. All you need is to setup an "environment" where your software will be installed. An "environment" is a directory in your home dir where the software and its depencies will be installed.

Beware! The behavior of several commands like python will not be the same as it will use the python from your environment.

Let's install WhatsHap through bioconda. First, let's create an environment called whatshap-env and install whatshap:

/home/ctools/bin/conda create -n whatshap-env bioconda::whatshap

Then init the environment:

/home/ctools/bin/conda init bash

Log out and log back in

Activate the environment:

conda activate whatshap-env

Check the installation:

whatshap --help

Q6: Was that easy?


Phase variants using WhatsHap

Then let's phase our variants:

whatshap phase  --ignore-read-groups  --reference=/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -o [output vcf] [input vcf] [long reads bam] 

Q7 How many extra variants are phased?


To deactivate conda write:

conda deactivate

You did not like conda? Do not forget to remove the following from your ~/.bashrc:

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/ctools/anaconda3_2021.11/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/ctools/anaconda3_2021.11/etc/profile.d/conda.sh" ]; then
        . "/home/ctools/anaconda3_2021.11/etc/profile.d/conda.sh"
    else
        export PATH="/home/ctools/anaconda3_2021.11/bin:$PATH"
    fi
fi

and remove ~/.conda/:

rm -rfv ~/.conda/



Please find the answers here

Congratulations you finished the exercise!