<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://teaching.healthtech.dtu.dk/22126/index.php?action=history&amp;feed=atom&amp;title=Longread_exercise</id>
	<title>Longread exercise - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://teaching.healthtech.dtu.dk/22126/index.php?action=history&amp;feed=atom&amp;title=Longread_exercise"/>
	<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22126/index.php?title=Longread_exercise&amp;action=history"/>
	<updated>2026-04-15T02:39:35Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22126/index.php?title=Longread_exercise&amp;diff=39&amp;oldid=prev</id>
		<title>WikiSysop: Created page with &quot;&lt;H2&gt;Overview&lt;/H2&gt;  First: &lt;OL&gt; &lt;LI&gt;Navigate to your home directory: &lt;LI&gt;Create a directory called &quot;longread&quot;  &lt;LI&gt;Navigate to the directory you just created. &lt;/OL&gt;  We will phase some variants using [https://www.biorxiv.org/content/10.1101/085050v2 WhatsHap] (no not the messaging app).  First, what is phasing?   Phasing means that we determine which base is on the same chromosome as another base for neighboring variants. Let&#039;s consider a small example with just two varia...&quot;</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22126/index.php?title=Longread_exercise&amp;diff=39&amp;oldid=prev"/>
		<updated>2024-03-19T15:41:18Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;&amp;lt;H2&amp;gt;Overview&amp;lt;/H2&amp;gt;  First: &amp;lt;OL&amp;gt; &amp;lt;LI&amp;gt;Navigate to your home directory: &amp;lt;LI&amp;gt;Create a directory called &amp;quot;longread&amp;quot;  &amp;lt;LI&amp;gt;Navigate to the directory you just created. &amp;lt;/OL&amp;gt;  We will phase some variants using [https://www.biorxiv.org/content/10.1101/085050v2 WhatsHap] (no not the messaging app).  First, what is phasing?   Phasing means that we determine which base is on the same chromosome as another base for neighboring variants. Let&amp;#039;s consider a small example with just two varia...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;lt;H2&amp;gt;Overview&amp;lt;/H2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
First:&lt;br /&gt;
&amp;lt;OL&amp;gt;&lt;br /&gt;
&amp;lt;LI&amp;gt;Navigate to your home directory:&lt;br /&gt;
&amp;lt;LI&amp;gt;Create a directory called &amp;quot;longread&amp;quot; &lt;br /&gt;
&amp;lt;LI&amp;gt;Navigate to the directory you just created.&lt;br /&gt;
&amp;lt;/OL&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will phase some variants using [https://www.biorxiv.org/content/10.1101/085050v2 WhatsHap] (no not the messaging app).&lt;br /&gt;
&lt;br /&gt;
First, what is phasing? &lt;br /&gt;
&lt;br /&gt;
Phasing means that we determine which base is on the same chromosome as another base for neighboring variants.&lt;br /&gt;
Let&amp;#039;s consider a small example with just two variants (single nucleotide polymorphisms or SNPs) to illustrate phasing:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;OL&amp;gt;&lt;br /&gt;
&amp;lt;LI&amp;gt; SNP1: Located on chromosome 1 at position 1000. The individual is heterozygous A or G. &lt;br /&gt;
&amp;lt;LI&amp;gt; SNP2: Located on chromosome 1 at position 2000. The individual is heterozygous C or T.&lt;br /&gt;
&amp;lt;/OL&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Great! but do we have: &lt;br /&gt;
&amp;lt;OL&amp;gt;&lt;br /&gt;
&amp;lt;LI&amp;gt; A and C on the same chromosome and G and T on the other chromosome&lt;br /&gt;
&amp;lt;LI&amp;gt; A and T on the same chromosome and G and C on the other chromosome&lt;br /&gt;
&amp;lt;/OL&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Without phasing we don&amp;#039;t have this information. This is important because phasing informs us about the phenotypic (therefore for health and reaction to drugs/treatments) consequences of the different bases.&lt;br /&gt;
&lt;br /&gt;
In a VCF, unphased variants will appear like this:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chr1     1000    rs123  A     G      29    PASS    INFO           GT  0/1&lt;br /&gt;
chr1     2000    rs456  C     T      29    PASS    INFO           GT  0/1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;0/1&amp;#039;&amp;#039;&amp;#039; means heterozygous reference+alternative.&lt;br /&gt;
&lt;br /&gt;
Now, if A and C are on the same chromosome, phased variants can appear as:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chr1     1000    rs123  A     G      29    PASS    INFO           GT  0|1&lt;br /&gt;
chr1     2000    rs456  C     T      29    PASS    INFO           GT  0|1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
but if A and C are on different chromosomes, phased variants can appear as:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chr1     1000    rs123  A     G      29    PASS    INFO           GT  0|1&lt;br /&gt;
chr1     2000    rs456  C     T      29    PASS    INFO           GT  1|0&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this exercise, we will:&lt;br /&gt;
&amp;lt;OL&amp;gt;&lt;br /&gt;
&amp;lt;LI&amp;gt; Do standard genotyping using BGI sequencing from an [https://en.wikipedia.org/wiki/Ashkenazi_Jews Ashkenazi] individual&lt;br /&gt;
&amp;lt;LI&amp;gt; Align long reads from PacBio&lt;br /&gt;
&amp;lt;LI&amp;gt; Learn how to install software using [https://bioconda.github.io/ bioconda]&lt;br /&gt;
&amp;lt;LI&amp;gt; Use the long reads to phase our variants&lt;br /&gt;
&amp;lt;/OL&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;H2&amp;gt;Genotyping with BGI reads&amp;lt;/H2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The reads are here:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz&lt;br /&gt;
/home/projects/22126_NGS/exercises/long_reads/BGI2.fq.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
They do not have adapters. As we have previously covered aligning and genotyping, you can copy paste the commands, just make sure you understand what you are doing. First, let&amp;#039;s have a look at BGI-Seq data:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
zcat /home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz |head&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You will notice it is very much like Illumina in terms of read length and encoding.&lt;br /&gt;
&lt;br /&gt;
Just go ahead and align them using bwa mem and sort them:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bwa mem -R &amp;quot;@RG\tID:HG002\tSM:HG002&amp;quot; -t 10 /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa  BGI1.fq.gz BGI2.fq.gz |samtools view -uS - |samtools sort /dev/stdin &amp;gt; BGI_hg38.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Let&amp;#039;s remove duplicates:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates -I BGI_hg38.bam -M BGI_hg38_metrics.txt -O BGI_hg38_rmdup.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
then index:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
samtools index BGI_hg38_rmdup.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Let&amp;#039;s genotype:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
gatk --java-options &amp;quot;-Xmx10g&amp;quot; HaplotypeCaller    -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa    -I  BGI_hg38_rmdup.bam  -L chr20:2000000-3000000  -O  BGI_hg38_chr20.gvcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz  -ERC GVCF&lt;br /&gt;
gatk  IndexFeatureFile -I BGI_hg38_chr20.gvcf.gz&lt;br /&gt;
gatk  GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V BGI_hg38_chr20.gvcf.gz  -O BGI_hg38_chr20.vcf.gz  -L chr20:2000000-3000000 --dbsnp  /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Notice that GATK will phase variants on the same read (or pairs). &lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q1&amp;#039;&amp;#039;&amp;#039;: How many variants are there? (hint: do not forget to remove the header)&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q2&amp;#039;&amp;#039;&amp;#039;: How many variants are phased? (hint: remove the header and look at the 10th column (the genotype info) using cut).&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q3&amp;#039;&amp;#039;&amp;#039;: Consider rs4364082 and rs6051444 (hint search using grep). Why are these variants phased?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;H2&amp;gt;Align PacBio reads&amp;lt;/H2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
First, let&amp;#039;s have a look at PacBio data:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q4&amp;#039;&amp;#039;&amp;#039;: What do you notice?&lt;br /&gt;
&lt;br /&gt;
Let&amp;#039;s align to hg38+sort:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/home/ctools/minimap2/minimap2-2.26_x64-linux/minimap2 -a   /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.mini  [input fastq here] |samtools view -uS - |samtools sort /dev/stdin &amp;gt; [output BAM here]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
-a forces sam output which is converted to bam and sorted.&lt;br /&gt;
Let&amp;#039;s index:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
samtools index [bam file]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q5&amp;#039;&amp;#039;&amp;#039;: What is the average read length? (hint: awk &amp;#039;{print length($10)}&amp;#039; prints the length of the 10th field, the sequences.  hint2: to compute the average of the first column of numbers use: awk &amp;#039;{sum+=$1; n++} END {if(n&amp;gt;0) print sum/n})&lt;br /&gt;
&lt;br /&gt;
&amp;lt;H2&amp;gt;Use bioconda to install software&amp;lt;/H2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Bioconda is a game-changer for anyone starting bioinformatics. It allows you to install software very easily and offers a vast repository of bioinformatics tools. It solves the problem of you needing library A to install software B and needing C to build A etc. All you need is to setup an &amp;quot;environment&amp;quot; where your software will be installed. An &amp;quot;environment&amp;quot; is a directory in your home dir where the software and its depencies will be installed.&lt;br /&gt;
&lt;br /&gt;
Beware! The behavior of several commands like python will not be the same as it will use the python from your environment.&lt;br /&gt;
&lt;br /&gt;
Let&amp;#039;s install WhatsHap through bioconda. First, let&amp;#039;s create an environment called whatshap-env and install whatshap:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/home/ctools/bin/conda create -n whatshap-env bioconda::whatshap&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then init the environment:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/home/ctools/bin/conda init bash&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Log out and log back in&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Activate the environment:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
conda activate whatshap-env&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Check the installation:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
whatshap --help&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q6&amp;#039;&amp;#039;&amp;#039;: Was that easy?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;H2&amp;gt;Phase variants using WhatsHap&amp;lt;/H2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then let&amp;#039;s phase our variants:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
whatshap phase  --ignore-read-groups  --reference=/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -o [output vcf] [input vcf] [long reads bam] &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q7&amp;#039;&amp;#039;&amp;#039; How many extra variants are phased?&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
To deactivate conda write:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
conda deactivate&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You did not like conda? Do not forget to remove the following from your ~/.bashrc:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# &amp;gt;&amp;gt;&amp;gt; conda initialize &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
# !! Contents within this block are managed by &amp;#039;conda init&amp;#039; !!&lt;br /&gt;
__conda_setup=&amp;quot;$(&amp;#039;/home/ctools/anaconda3_2021.11/bin/conda&amp;#039; &amp;#039;shell.bash&amp;#039; &amp;#039;hook&amp;#039; 2&amp;gt; /dev/null)&amp;quot;&lt;br /&gt;
if [ $? -eq 0 ]; then&lt;br /&gt;
    eval &amp;quot;$__conda_setup&amp;quot;&lt;br /&gt;
else&lt;br /&gt;
    if [ -f &amp;quot;/home/ctools/anaconda3_2021.11/etc/profile.d/conda.sh&amp;quot; ]; then&lt;br /&gt;
        . &amp;quot;/home/ctools/anaconda3_2021.11/etc/profile.d/conda.sh&amp;quot;&lt;br /&gt;
    else&lt;br /&gt;
        export PATH=&amp;quot;/home/ctools/anaconda3_2021.11/bin:$PATH&amp;quot;&lt;br /&gt;
    fi&lt;br /&gt;
fi&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
and remove ~/.conda/:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
rm -rfv ~/.conda/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Please find the answers [[Longread_exercise_answers|here]] &lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Congratulations you finished the exercise!&amp;#039;&amp;#039;&amp;#039;&lt;/div&gt;</summary>
		<author><name>WikiSysop</name></author>
	</entry>
</feed>