22126 - User contributions [en]

Program 2019

2024-03-19T16:10:44Z

WikiSysop: Created page with "'''REMEMBER TO BRING A LAPTOP COMPUTER FOR EXERCISES''' Lectures and exercises will take place at the Section for Bioinformatics at the Technical University of Denmark, '''Building 208, Room 062''' in Lyngby ([https://www.google.com/maps/place/DTU+bygning+208%2F+DTU+Building+208/@55.7863894,12.520095,357m/data=!3m1!1e3!4m5!3m4!1s0x46524e62f280d429:0x1bbb0474519984b5!8m2!3d55.787537!4d12.5182208 map]). The courses has two main parts, the first half is lectures and exerc..."

'''REMEMBER TO BRING A LAPTOP COMPUTER FOR EXERCISES'''

Lectures and exercises will take place at the Section for Bioinformatics at the Technical University of Denmark, '''Building 208, Room 062''' in Lyngby ([https://www.google.com/maps/place/DTU+bygning+208%2F+DTU+Building+208/@55.7863894,12.520095,357m/data=!3m1!1e3!4m5!3m4!1s0x46524e62f280d429:0x1bbb0474519984b5!8m2!3d55.787537!4d12.5182208 map]).

The course has two main parts: the first half consists of lectures and exercises, and the second half is project work, ending with the exam on Thursday, June 27.

This term we will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the TA, and myself. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza [https://piazza.com/danish_technical_university/summer2019/1 here].

=== Course Program - June 2019 ===

<hr>
'''Thursday, June 6'''
<hr>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9.00-9.30
<DD>Introduction to course & Pre-test
([http://teaching.healthtech.dtu.dk/material/22126/introduction_to_course.pdf Lecture slides])
([http://teaching.healthtech.dtu.dk/material/22126/pre-test_2019.pdf Pre-test])</dd>
<DD>Gisle Vestergaard

<DT>9.30-10.00
<dd>Introduction to NGS
([http://teaching.healthtech.dtu.dk/material/22126/Introduction_to_NGS.pdf Lecture slides]) </dd>
<DD>Gisle Vestergaard

<DT>10.00-10.45
<dd>2nd and 3rd generation NGS Technologies
([http://teaching.healthtech.dtu.dk/material/22126/Introduction_to_NGS_technology.pdf Lecture slides])
<DD>Gisle Vestergaard

<DT>10.45-11.00
<DD>''Break''

<DT>11.00-12.00
<DD>Tech talk group formation and group work
([http://teaching.healthtech.dtu.dk/material/22126/Tech_Talks.pdf Lecture slides])
([https://docs.google.com/spreadsheets/d/1q8wUa-Imiig6H6TloSSjdTuCiCvU7lAmvf972VWl0vs/edit#gid=0 Student Groups]) </dd>
<DD>Gisle Vestergaard

<DT>12.00-13.00
<DD>''Lunch Break''

<DT>13.00-13.30
<DD>Exercise: Logging on to our Computerome cloud
([[Logging on to Cloud system]])</dd>
<DD>Gisle Vestergaard, Katrine Højholt

<DT>13.30-14.15
<DD>Introduction to UNIX
([http://teaching.healthtech.dtu.dk/material/36610/UnixIntroduction.ppt Lecture slides])
([http://teaching.healthtech.dtu.dk/36610/index.php/UNIX Exercise])
([http://teaching.healthtech.dtu.dk/material/36610/UnixInstructions36610.pdf Unix Notes])
<DD>Peter Wad Sackett, Gisle Vestergaard

<DT>14.15-14.30
<DD>''Break''

<DT>14.30-15.30
<DD>Introduction to UNIX
([http://teaching.healthtech.dtu.dk/material/36610/UnixIntroduction.ppt Lecture slides])
([http://teaching.healthtech.dtu.dk/36610/index.php/UNIX Exercise])
([http://teaching.healthtech.dtu.dk/material/36610/UnixInstructions36610.pdf Unix Notes])
<DD>Peter Wad Sackett, Gisle Vestergaard, Katrine Højholt

<DT>15.30-16.00
<DD>First look at data
([[First look exercise]])
<DD>Gisle Vestergaard, Katrine Højholt

</DD>
<BR>
</DL>
<HR>
'''Friday, June 7'''
<HR>
''Data pre-processing & Genomic Epidemiology''
<DL>
<DT>9.00-10.45
<DD>Data basics
([http://teaching.healthtech.dtu.dk/material/22126/Data_Basics.pdf Lecture slides])
([[Data basics exercise]])
([[Data basics solution]])
<DD>Gisle Vestergaard, Katrine Højholt

<DT>10.45-11.00
<DD>''Break''

<DT>11.00-12.00
<DD>Data pre-processing
([http://teaching.healthtech.dtu.dk/material/22126/Data_preprocessing.pdf Lecture slides])
([[Data Preprocess exercise]])
([[Data Preprocess solution]])
<DD>Katrine Højholt

<DT>12.00-13.00
<DD>''Lunch Break''

<DT>13.00-14.00
<DD>Exercise: Genomic Epidemiology
([[Genomic epidemiology exercise]])
([[Genomic epidemiology solution]])
<DD>Gisle Vestergaard, Katrine Højholt

<DT>14.00-16.00
<DD>Case story: Genomic Epidemiology
([http://teaching.healthtech.dtu.dk/material/22126/Genomic_epidemiology_NGScourse_Pimlapas_June2018.pdf Lecture])
<DD>Pimlapas Leekitecharoenphon (Shinny)
</DD>
<BR>
</DL>
<HR>
'''Monday, June 10'''
<HR>
''Whit Monday - No teaching!''
<br><br>
<HR>
<B>Tuesday, June 11</B> <BR>
<HR>
''Alignment & Genotyping''

<DL>

<DT>9.00-9.45
<DD>Alignment
([http://teaching.healthtech.dtu.dk/material/22126/Alignment.pdf Lecture slides])
<DD>Gisle Vestergaard

<DT>9.45-10.00
<DD>''Break''

<DT>10.00-12.00
<DD>Exercise: Alignment
([[Alignment exercise]])
([[Alignment solution]])
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt

<DT>12.00-13.00
<DD>''Lunch Break''

<DT>13.00-13.25
<DD>Functional Human Variation
<DD>Jose Izarzugaza (Txema) ([http://teaching.healthtech.dtu.dk/material/22126/Intro_to_variation.pdf Lecture slides])

<DT>13.25-14.00
<DD>Alignment postprocessing & variant calling
([http://teaching.healthtech.dtu.dk/material/22126/post_alignment_variantcalling.pdf Lecture slides])

<DD>Gisle Vestergaard

<DT>14.00-14.15
<DD>''Break''

<DT>14.15-16.00
<DD>Exercise: Postprocessing & variant calling
([[Postprocess exercise]])
([[Postprocess solution]])
([[SNP calling exercise]])
([[SNP calling solution]]) </dd>
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt
</DD>

</DL>
<HR>
'''Wednesday, June 12'''
<HR>
''de novo assembly & Metagenomics''
<DL>

<DT>9.00-9.45
<DD>Lecture: de novo assembly
([http://teaching.healthtech.dtu.dk/material/22126/de_novo_assembly_course.pdf Lecture slides])
([http://teaching.healthtech.dtu.dk/material/22126/debruijn_handout.pdf Handout]) </dd>
<DD>Gisle Vestergaard

<DT>9.45-10.00
<DD>''Break''

<DT>10.00-12.00
<DD>Exercise: de novo assembly
([[denovo exercise]])
([[denovo solution]]) </dd>
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt

<DT>12.00-13.00
<DD>''Lunch Break''

<DT>13.00-13.45
<DD>Metagenomics & Binning
([http://teaching.healthtech.dtu.dk/material/22126/Metagenomics_assembly.pdf Lecture slides])
</dd>
<DD>Jakob Nissen

<DT>14.00-16.00
<DD>Exercise: Metagenomic de novo assembly
([[Metagenomic assembly exercise]])
([[Metagenomic assembly solution]]) </dd>
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt

</DD>
<BR>
</DL>
<HR>
'''Thursday, June 13'''
<HR>
''Quantitative Metagenomics and Test''
<DL>

<DT>9.00-9.45
<DD>Lecture: Quantitative Metagenomics
([http://teaching.healthtech.dtu.dk/material/22126/quantitative_metagenomics.pdf Lecture slides])</dd>
<DD>Gisle Vestergaard

<DT>9.45-10.00
<DD>''Break''

<DT>10.00-12.00
<DD>Exercise: Quantitative Metagenomics
([http://teaching.healthtech.dtu.dk/22126/index.php/QuantitativeMetagenomics Exercise])
([http://teaching.healthtech.dtu.dk/22126/index.php/QuantitativeMetagenomicsSolution Solution]) </dd>
<DD>Gisle Vestergaard, Jakob Nissen

<DT>12.00-13.00
<DD>''Lunch Break''

<DT>13.00-13.45
<DD>Recap Test
([http://teaching.healthtech.dtu.dk/material/22126/Recap_test_2019.pdf Test])

<DT>13.45-14.00
<DD>''Break''

<DT>14.00-16.00
<DD>Time to work on this week's exercises
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt

</DD>
<BR>
</DL>
<HR>
'''Friday, June 14'''
<HR>
''RNA-seq and Cancer-seq''
<DL>
<DT>9.00-9.45
<DD>Lecture: RNAseq
([http://teaching.healthtech.dtu.dk/material/22126/NGS_RNA-seq_2019.pdf Lecture slides])</dd>
<DD>Francesca Bertolini

<DT>9.45-10.00
<DD>''Break''

<DT>10.00-12.00
<DD>Exercise: RNAseq
([http://teaching.healthtech.dtu.dk/material/22126/RNAseq_exercise_2019_new.docx Exercise])
([http://teaching.healthtech.dtu.dk/material/22126/RNAseq_answers_2019_new.txt Solution]) </dd>
<DD>Francesca Bertolini, Gisle Vestergaard, Katrine Højholt, Jakob Nissen

<DT>12.00-13.00
<DD>''Lunch Break''

<DT>13.00-13.45
<DD>Lecture: Cancer-seq
([http://teaching.healthtech.dtu.dk/material/22126/CancerGenomics_Izarzugaza.pdf Lecture slides])
<DD>Jose Izarzugaza (Txema)

<DT>14.00-16.00
<DD>Exercise: Cancer-seq
([https://github.com/aroneklund/DTU-27626-cancer Exercise])
<DD>Jose Izarzugaza (Txema), Gisle Vestergaard

</DD>
<BR>
</DL>
<HR>
'''Monday, June 17'''
<HR>
''Ancient DNA & Tech talks''
<DL>
<DT>9.00-10.00
<DD>Ancient DNA
([http://teaching.healthtech.dtu.dk/material/22126/dtu_adna.pdf Lecture slides])
<DD>Martin Sikora

<DT>10.00-12.00
<DD>Exercise: Ancient DNA
([http://teaching.healthtech.dtu.dk/material/22126/adna_practical.txt Exercise])

<DD>Martin Sikora, Gisle Vestergaard

<DT>12.00-13.00
<DD>''Lunch Break''

<DT>13.00-16.00
<DD>Tech talk work & [http://teaching.healthtech.dtu.dk/material/22126/TechTalks.tar Presentations]
<DD>Gisle Vestergaard

</DD>
<BR>
</DL><HR>
'''Tuesday, June 18'''
<HR>
''Project work''
<DL>

<DT>9.00-9.45
<DD>Projects & Group formation
([http://teaching.healthtech.dtu.dk/material/22126/Projects.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/posters.tar.gz Examples from previous courses])
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt

<DT>9.45-10.45
<DD>Projects & Group formation
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt

<DT>10.45-11.00
<DD>''Break''

<DT>11.00-12.00
<DD>Project work
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt

<DT>12.00-13.00
<DD>''Lunch Break''

<DT>13.00-16.00
<DD>Project work/Prepare presentations for tomorrow

</DD>
<BR>
</DL>
<HR>
'''Wednesday, June 19'''
<HR>
''Project work & Project Presentations''
<DL>

<DT>9.00-12.00
<DD>Project work
<DD>Gisle Vestergaard, Katrine Højholt

<DT>12.00-13.00
<DD>''Lunch Break''

<DT>13.00-14.00
<DD>Project Presentations
<DD>Gisle Vestergaard

</DD>
<BR>
</DL>
<HR>
'''Thursday, June 20'''
<HR>
''Project work''
<DL>

<DT>9.00-16.00
<DD>Project work
<DD>

<DT>13.00-15.00
<DD>Project work/Office hours
<DD>Gisle Vestergaard, Katrine Højholt

</DD>
<BR>
</DL>
<HR>
'''Friday, June 21'''
<HR>
''Project work''
<DL>

<DT>9.00-16.00
<DD>Project work
<DD>

<DT>13.00-15.00
<DD>Project work/Office hours
<DD>Jose Izarzugaza (Txema), Katrine Højholt

</DD>
<BR>
</DL>
<HR>
'''Monday, June 24'''
<HR>
''Project work''
<DL>

<DT>9.00-16.00
<DD>Project work
<DD>

<DT>13.00-15.00
<DD>Project work/Office hours
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt

</DD>
<BR>
</DL>
<HR>
'''Tuesday, June 25'''
<HR>
''Project work''
<DL>

<DT>9.00-16.00
<DD>Project work
<DD>

<DT>13.00-15.00
<DD>Project work/Office hours
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt

</DD>
<BR>
</DL>
<HR>
'''Wednesday, June 26'''
<HR>
''Project Work & Poster Printing''
<DL>

<DD>Print your poster. It is possible to print at the DTU library for 16-30 kr.
<DD>[http://teaching.healthtech.dtu.dk/material/22126/Posters.pdf Poster guide & requirements]

<DT>10.00-12.00
<DD>Q&A: Practical information about the [http://teaching.healthtech.dtu.dk/material/22126/exam.pdf Exam]
<DD>Project work/Office hours
<DD>Gisle Vestergaard, Jakob Nissen, Katrine Højholt

</DL>
<br>
<HR>
<B>Thursday, June 27</B> <BR>
<HR>
''Poster session - in the hall in front of the Section for Bioinformatics (building 208)''
<DL>

<DT>10.00-14.00
<DD>Poster session part ('''exam''')

</DD>
<BR>
</DL><HR>

Metagenomic assembly solution

2024-03-19T16:09:45Z

WikiSysop: Created page with "Q1. We cant really find the bell shaped distributions in our samples - except for in MH0032 where there are two very small bell shaped coverage distributions. Q2. This is because we have many organisms with relative low abundance. This makes it very hard to distinguish them using coverage information. Q3. It will probably perform much like SOAPdenovo. Q4. There arent really any major differences in the coverage distributions between the assemblers, the metagenome asse..."

Q1. We can't really find the bell-shaped distributions in our samples, except in MH0032, where there are two very small bell-shaped coverage distributions.

Q2. This is because we have many organisms with relatively low abundance. This makes it very hard to distinguish them using coverage information.

Q3. It will probably perform much like SOAPdenovo.

Q4. There aren't really any major differences in the coverage distributions between the assemblers; the metagenome assemblers are having as many problems as the standard assembler. The coverage distribution tells us that by far most contigs have fairly low coverage in the assembly, which is also what we expect.

Q5. There aren't really any major differences between the assemblies from MetaVelvet and SOAPdenovo. However, the Megahit assembly seems to be as long as the other assemblies while having a larger mean scaffold size and a longer N50, meaning that it was able to assemble the metagenome into longer pieces.

Q6. MH0032: 84606; MH0047: around 70,000. That is 2-4 times the number of human genes.

Q7. We only count it once because both reads come from the same DNA fragment, i.e. it was only present once.

Q8. If the two reads of a pair map to different genes, then we count it as one hit to each gene, because we have seen both genes once (our DNA fragment just happened to span both).
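The counting rule from Q7 and Q8 can be illustrated with a small sketch. This is not the read_count_bam.pl script used in the exercise; the function name count_pair_hits and the three-column input (pair id, gene hit by read 1, gene hit by read 2) are assumptions made purely for illustration.

```shell
# count_pair_hits: hypothetical helper illustrating the Q7/Q8 counting rule.
# Input (stdin): tab-separated lines "pair_id<TAB>gene_read1<TAB>gene_read2".
# Output: "gene<TAB>count", sorted by gene name.
count_pair_hits() {
    awk -F '\t' '
        {
            count[$2]++                  # the fragment hits the gene of read 1
            if ($3 != $2) count[$3]++    # a second gene is counted only if it differs
        }
        END { for (g in count) print g "\t" count[g] }
    ' | LC_ALL=C sort
}
```

A pair where both reads hit geneA adds one to geneA, while a pair spanning geneA and geneB adds one to each.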

Q9. There are both genes in common and genes specific for each sample. Most of the genes have very low abundance (the blue field near 0) - this is also what we expected from the k-mer distributions (Q1).

Q10. Many of the species have very few genes, so we could probably not really trust all of them. We need to have a better reference genome set that covers more of the genomes in our samples (human gut). We could blast vs. human gut species instead.

Q11. Yes, there are!

Q12. We can see that several Prevotella species are very abundant in the MH0032 individual compared to MH0047. Probably the MH0032 individual has the Prevotella enterotype.

Q13. 36 bins were identified. You can think of the bins as a clustering of the contigs into what we believe are genomes. Importantly, there will be errors and the list will not be complete.

Q14. The lengths of the bins vary from 200 kb to 3.5 Mb. Why do you think this is?

Q15. We have both very nice bins (in the top) with high completeness and low contamination and bins that are less complete with higher contamination. The bins without marker genes could be incomplete bins or perhaps something that is not well known?

Metagenomic assembly exercise

2024-03-19T16:09:05Z

WikiSysop: Created page with "<H2>Overview</H2> <p>In this exercise we will try <i>de novo</i> assemble a metagenomic dataset of Illumina paired end reads from a stool sample. The data is part of the MetaHit project and was published [http://www.nature.com/nature/journal/v464/n7285/full/nature08821.html here]. These are just two samples of 396, in the next exercise (tomorrow) you will analyze data from 124 samples. Today, you will try to: <OL> <LI>Analyze k-mer distributions <LI>Assemble using SOA..."

<H2>Overview</H2>

<p>In this exercise we will try to <i>de novo</i> assemble a metagenomic dataset of Illumina paired-end reads from a stool sample. The data is part of the MetaHit project and was published [http://www.nature.com/nature/journal/v464/n7285/full/nature08821.html here]. These are just two samples of 396; in the next exercise (tomorrow) you will analyze data from 124 samples. Today, you will try to:
<OL>
<LI>Analyze k-mer distributions
<LI>Assemble using SOAPdenovo and MetaVelvet (well, it is already done)
<LI>Analyze output from metagenomic assemblies
<LI>Gene prediction and clustering on metagenomic data
<LI>Create a gene abundance matrix

</OL>

<HR>

<H2><i>de novo</i> assembly of metagenomic data</H2>

<HR>

<H3>Analyze k-mer distributions</H3>

Assembly of metagenomic data is much harder than assembly of single-organism datasets. Here we will use SOAPdenovo, as we did in the <i>de novo</i> assembly exercise, and MetaVelvet. Let us first look at the k-mer coverage of our data. We have already counted the k-mers for you (otherwise you could use jellyfish, as we did in the <i>V. cholerae</i> exercise), so let's plot it:</p>

<pre>
mkdir meta
cd meta
</pre>

<pre>
cp /home/projects/22126_NGS/exercises/metagenomics/kmer_counts/MH0047.histo .
cp /home/projects/22126_NGS/exercises/metagenomics/kmer_counts/MH0032.histo .

R
pdf(file="kmerFreq.pdf", width=12, height=12)
par(mfrow=c(2,1))
d = read.table("MH0047.histo")
barplot(d[,2], main="MH0047 - Kmer Frequencies", xlab="K-mer coverage", xlim=c(1,400), ylim=c(0,1e6))

d = read.table("MH0032.histo")
barplot(d[,2], main="MH0032 - Kmer Frequencies", xlab="K-mer coverage", xlim=c(1,400), ylim=c(0,1e6))
dev.off()
q()
</pre>

Open the plot by downloading it and viewing it locally, or by using evince:

<pre>
evince kmerFreq.pdf &
</pre>

<b>Q1. Can you identify the peak(s) in the Gaussian (bell-shaped) distributions in the two plots?</b>

<p>You should see that, compared to the <i>V. cholerae</i> sample that we used in the <i>de novo</i> exercise, we do not get the same nice peak. The x-axis in the plots goes from 1 to 400, so there seems to be something with quite high coverage in the MH0032 sample; however, the vast majority of the sequences cannot be separated from each other by abundance.</p>
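If you prefer to locate a peak numerically rather than by eye, the following is a minimal sketch. It assumes the .histo files have two whitespace-separated columns (k-mer coverage, number of k-mers), which is what the R code above reads; the function name kmer_peak and the choice of cut-off are hypothetical.

```shell
# kmer_peak MIN_COV: hypothetical helper. Reads a two-column histogram
# (coverage, count) on stdin and prints the coverage with the highest count,
# ignoring coverage values below MIN_COV, where error k-mers dominate.
kmer_peak() {
    awk -v min="$1" '
        $1 >= min && $2 > best { best = $2; peak = $1 }
        END { if (best > 0) print peak; else print "NA" }
    '
}
```

For example, `kmer_peak 5 < MH0032.histo` would report the most frequent coverage at or above 5. For a metagenome the answer depends strongly on the cut-off, which is exactly the problem discussed above.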

<b>Q2. Can you think of why the coverage distributions look like this, with many sequences with "lower" coverage?</b>
<br>
<b>Q3. Given that MetaVelvet works by dividing the de Bruijn graph using coverage peaks, do you think it will perform better or worse than SOAPdenovo?</b>

<HR>

<H3><i>de novo</i> assembly</H3>

<p>Unfortunately, because the assemblies take 30-45 minutes and use 10-25 GB of RAM each, you will not run them in this exercise. You can see the code that I used to run them here: [http://teaching.healthtech.dtu.dk/22126/index.php/Denovo_code code], in case you need to assemble some genomes for the projects.</p>

<p>Instead, copy the contigs and scaffolds to your folder and filter for a minimum length of 100 bp:</p>

<pre>
cp /home/projects/22126_NGS/exercises/metagenomics/assemblies/MH0047.soap.scafSeq MH0047.soap.fa
cp /home/projects/22126_NGS/exercises/metagenomics/assemblies/MH0047.metavelvet.contigs.fa MH0047.velvet.fa
cp /home/projects/22126_NGS/exercises/metagenomics/assemblies/MH0047.megahit.contigs.fa MH0047.megahit.fa
cp /home/projects/22126_NGS/exercises/metagenomics/assemblies/MH0047.spades.fa MH0047.spades.fa

fastx_filterfasta.py --i MH0047.soap.fa --min 100
fastx_filterfasta.py --i MH0047.velvet.fa --min 100
fastx_filterfasta.py --i MH0047.megahit.fa --min 100
fastx_filterfasta.py --i MH0047.spades.fa --min 100
</pre>
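fastx_filterfasta.py is a course script whose source is not shown here. As a rough idea of what a minimum-length filter does, here is a sketch in awk; the function name filter_fasta_minlen is hypothetical, and treating the threshold as inclusive (at least MIN bp) is an assumption.

```shell
# filter_fasta_minlen MIN: hypothetical stand-in for a minimum-length filter.
# Reads FASTA on stdin and writes only records with at least MIN bases.
# Sequences wrapped over several lines are joined onto a single line.
filter_fasta_minlen() {
    awk -v min="$1" '
        function emit() {
            if (header != "" && length(seq) >= min) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    '
}
```

Usage would look like `filter_fasta_minlen 100 < MH0047.soap.fa > MH0047.soap.min100.fa`, where the output name is only an example.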

<p>OK, now let's calculate the coverage of the MetaVelvet and SOAPdenovo assemblies and plot them (the Megahit assembly does not have that information readily available, so we will skip it).</p>

<pre>
fastx_soapcov.py --i MH0047.soap.fa.filtered_100.fa > MH0047.soap.cov
fastx_velvetcov.py --i MH0047.velvet.fa.filtered_100.fa > MH0047.velvet.cov

R
library(plotrix)
par(mfrow=c(1,2))
dat=read.table("MH0047.soap.cov", sep="\t")
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="SOAPdenovo - Weighted coverage", xlab="Contig coverage")
dat=read.table("MH0047.velvet.cov", sep="\t")
weighted.hist(w=dat[,2], x=dat[,1], breaks=seq(0,100, 1), main="MetaVelvet - Weighted coverage", xlab="Contig coverage")
dev.print("MH0047.coverage.pdf", device=pdf)
q()
</pre>

Open the plot by downloading it and viewing it locally, or by using evince:

<pre>
evince MH0047.coverage.pdf &
</pre>
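The weighted histogram plotted above can also be tabulated on the command line. This sketch assumes the .cov files have two tab-separated columns, coverage first and the weight (presumably contig length) second, matching how the R code passes them to weighted.hist; the function name weighted_cov_hist is hypothetical and the binning (rounding coverage down to an integer) is a simplification.

```shell
# weighted_cov_hist: hypothetical helper. Reads "coverage<TAB>weight" lines on
# stdin, bins coverage by its integer part and sums the weights per bin.
# Output: "bin<TAB>summed_weight", sorted by bin.
weighted_cov_hist() {
    awk -F '\t' '
        { bin = int($1); w[bin] += $2 }
        END { for (b in w) print b "\t" w[b] }
    ' | sort -n
}
```

Weighting by length means a single long contig counts for more than many short ones, so the table shows how many assembled bases sit at each coverage level rather than how many contigs.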

<b>Q4. Are there any differences between the two assemblies and what can you tell about the assembly from the coverage distributions?</b>

<p>Finally, let's look at the stats for the four assemblies. We could use paste to compare the files (paste file1 file2 file3 file4), but this becomes cumbersome when there are more than two files. Instead we will use R to create a table (assembly.stats.tab) and plot some of the key stats (assembly.stats.pdf).</p>

<pre>
assemblathon_stats.pl -csv MH0047.soap.fa.filtered_100.fa > MH0047.soap.fa.filtered_100.asm
assemblathon_stats.pl -csv MH0047.velvet.fa.filtered_100.fa > MH0047.velvet.fa.filtered_100.asm
assemblathon_stats.pl -csv MH0047.megahit.fa.filtered_100.fa > MH0047.megahit.fa.filtered_100.asm
assemblathon_stats.pl -csv MH0047.spades.fa.filtered_100.fa > MH0047.spades.fa.filtered_100.asm

R
soap=read.csv("MH0047.soap.fa.filtered_100.csv")
velvet=read.csv("MH0047.velvet.fa.filtered_100.csv")
megahit=read.csv("MH0047.megahit.fa.filtered_100.csv")
spades=read.csv("MH0047.spades.fa.filtered_100.csv")
df = t(rbind(soap, velvet, megahit, spades)[,-1])
options(scipen=999)
colnames(df) = c("Soap", "Velvet", "Megahit", "Spades")
write.table(df, "assembly.stats.tab", quote=FALSE, sep="\t")
par(mfrow=c(2,2))
barplot(df[1,], main=rownames(df)[1], ylab="Count")
barplot(df[2,], main=rownames(df)[2], ylab="Base pairs")
barplot(df[16,], main=rownames(df)[16], ylab="Base pairs")
barplot(df[18,], main=rownames(df)[18], ylab="Base pairs")
dev.print(file="assembly.stats.pdf", device=pdf)
</pre>
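One of the key stats is N50: the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembly length. The sketch below computes it directly; the function names fasta_lengths and n50 are hypothetical and this is not how assemblathon_stats.pl is implemented.

```shell
# fasta_lengths: print the length of each sequence in a FASTA stream (stdin).
fasta_lengths() {
    awk '
        /^>/ { if (seen) print n; seen = 1; n = 0; next }
        { n += length($0) }
        END { if (seen) print n }
    '
}

# n50: read one sequence length per line on stdin and print the N50.
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                cum += len[i]                        # running total, longest first
                if (cum >= half) { print len[i]; exit }
            }
        }
    '
}
```

For example, `fasta_lengths < MH0047.megahit.fa.filtered_100.fa | n50` should give a value comparable to the N50 reported in the table.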

<b>Q5. Are there any large differences in the stats of the four assemblies?</b>

<HR>

<H2>Gene predictions and clustering</H2>

<p>Let's try to predict genes using [https://github.com/hyattpd/Prodigal Prodigal]. Together with [http://exon.gatech.edu/meta_gmhmmp.cgi MetaGeneMark], it is one of the best and fastest prokaryotic gene finders available. We use it in the metagenomic setting by setting "-p meta", and we output the predictions as DNA sequences and in GFF format. Let's use the Spades assembly and, for the sake of time, only use the contigs with a size >500 bp (normally you could go down to 100 bp). NB: Each gene prediction takes 3-4 minutes and we run each in the background (the "&"); wait for the commands to finish.</p>

<pre>
cp /home/projects/22126_NGS/exercises/metagenomics/assemblies/MH0032.spades.fa MH0032.spades.fa
fastx_filterfasta.py --i MH0032.spades.fa --min 500
fastx_filterfasta.py --i MH0047.spades.fa --min 500

prodigal -p meta -f gff -d MH0047.prodigal.fna -i MH0047.spades.fa.filtered_500.fa -o MH0047.prodigal.gff &
prodigal -p meta -f gff -d MH0032.prodigal.fna -i MH0032.spades.fa.filtered_500.fa -o MH0032.prodigal.gff &
</pre>

<p>Let's look at how many genes were predicted:</p>

<pre>
grep ">" -c MH00*.prodigal.fna
</pre>

<b>Q6. How many genes are there in the samples - consider how many genes humans have (22k)?</b>

<p>To be able to create a count matrix of the gene abundances, we must create a common set of genes across all the samples we want to compare (in our case only two). To do this we can use [http://weizhong-lab.ucsd.edu/cd-hit/ cd-hit], which clusters the genes based on sequence similarity and selects a representative gene from each cluster that we will then use. First we combine the two gene sets, rename the genes so that the names are unique, and then cluster. We use a sequence identity threshold of 95% (-c 0.95) and require that the alignment covers 90% of the shorter sequence (-aS 0.9), as shown below. However, the clustering takes 15 minutes, so you can copy the file I made instead:</p>

<pre>
# code for making it yourself
cat MH0047.prodigal.fna MH0032.prodigal.fna | rename_genes.sh humangut > combined.prodigal.fna
cd-hit-est -i combined.prodigal.fna -o combined.prodigal.cdhit.fna -c 0.95 -n 8 -l 100 -aS 0.9 -d 0 -B 0 -T 2 -M 10000

# code to copy
cp /home/27626/exercises/metagenomics/combined.prodigal.cdhit.fna .
</pre>
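rename_genes.sh is a course script that is not shown here. To illustrate the renaming step only, the sketch below gives every FASTA record a unique name built from a prefix and a running number. The function name rename_fasta and the "prefix_N" naming scheme are assumptions and may differ from what rename_genes.sh actually produces.

```shell
# rename_fasta PREFIX: hypothetical helper. Reads FASTA on stdin and replaces
# each header with ">PREFIX_N", where N counts records from 1, so that names
# stay unique after concatenating gene sets from several samples.
rename_fasta() {
    awk -v prefix="$1" '
        /^>/ { n++; print ">" prefix "_" n; next }
        { print }
    '
}
```

It would be used in the same position as the course script, for example `cat MH0047.prodigal.fna MH0032.prodigal.fna | rename_fasta humangut`.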

<HR>

<H2>Create gene abundance matrix</H2>

<p>OK, now that we have our common set of genes we can determine the abundance of each gene in our samples. We do that by mapping the reads from our samples to the common gene set using bwa. For this exercise, we will only use a subset of 0.5 million reads from each of the libraries (<b>NB: this is only 1% of the total number of reads</b>):</p>

<pre>
bwa index combined.prodigal.cdhit.fna

ln -s /home/27626/exercises/metagenomics/sub/* .

bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0032_081224.1.fq.gz MH0032_081224.2.fq.gz | samtools view -Sb - > MH0032_081224.bam
bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0032_091021.1.fq.gz MH0032_091021.2.fq.gz | samtools view -Sb - > MH0032_091021.bam

bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0047_081223.1.fq.gz MH0047_081223.2.fq.gz | samtools view -Sb - > MH0047_081223.bam
bwa mem -t 2 -M combined.prodigal.cdhit.fna MH0047_090201.1.fq.gz MH0047_090201.2.fq.gz | samtools view -Sb - > MH0047_090201.bam
</pre>

<p>Now that we have mapped the reads back to the gene set we can start counting. However, before we start we need to consider what to do with paired-end reads where both reads of a pair map to the same gene.</p>

<b>Q7. If both reads of a pair map to the same gene we only count it as one hit - can you think of why?</b><br>
<b>Q8. What should we do if the two reads of a pair map to different genes?</b>

<p>We filter pairs mapping to the same gene using the read_count_bam.pl script below and then keep only reads with a mapping quality of at least 30 (-q30 below). After that we sort the bam-files:</p>

<pre>
samtools view -h MH0032_081224.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort -O BAM -o MH0032_081224.sort.bam -
samtools view -h MH0032_091021.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort -O BAM -o MH0032_091021.sort.bam -
samtools view -h MH0047_081223.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort -O BAM -o MH0047_081223.sort.bam -
samtools view -h MH0047_090201.bam | read_count_bam.pl | samtools view -Su -q30 - | samtools sort -O BAM -o MH0047_090201.sort.bam -
</pre>

<p>Now let's merge them into one bam-file per sample and index them:</p>

<pre>
samtools merge MH0032.sort.bam MH0032_081224.sort.bam MH0032_091021.sort.bam
samtools merge MH0047.sort.bam MH0047_081223.sort.bam MH0047_090201.sort.bam
samtools index MH0032.sort.bam
samtools index MH0047.sort.bam
</pre>

<p>Now it is very easy to do the counting using samtools idxstats. It outputs four columns: gene name, gene length, number of mapped reads, and number of unmapped reads. Try it out:</p>

<pre>
samtools idxstats MH0032.sort.bam | less
</pre>

<p>Let's create a count matrix. First we create a header line using "echo", then we get all of the gene names, then the counts (grep -v "*" drops the final idxstats line, which summarises the unmapped reads), and finally we combine everything into one file:</p>

<pre>
echo -e "Gene\tMH0032\tMH0047" > header
samtools idxstats MH0032.sort.bam | grep -v "*" | cut -f1 > gene_names
samtools idxstats MH0032.sort.bam | grep -v "*" | cut -f3 > counts1
samtools idxstats MH0047.sort.bam | grep -v "*" | cut -f3 > counts2
paste gene_names counts1 counts2 | cat header - > count_matrix.sub.tab
</pre>
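The paste approach above relies on both idxstats outputs listing the genes in the same order, which holds here because both bam-files were mapped against the same reference. As a sketch of an alternative that matches rows by gene name instead of by position, assuming the idxstats output of each sample has first been saved to a file (the function name count_matrix_by_name and the file names in the usage note are hypothetical):

```shell
# count_matrix_by_name FILE1 FILE2: hypothetical helper. Each input file holds
# saved "samtools idxstats" output (gene, length, mapped, unmapped).
# Prints a tab-separated count matrix with rows matched by gene name.
count_matrix_by_name() {
    awk -F '\t' -v OFS='\t' '
        BEGIN { print "Gene", "MH0032", "MH0047" }
        $1 == "*" { next }                                # skip the unmapped summary line
        NR == FNR { first[$1] = $3; order[++n] = $1; next }
        { second[$1] = $3 }
        END {
            for (i = 1; i <= n; i++) {
                g = order[i]
                print g, first[g], ((g in second) ? second[g] : 0)
            }
        }
    ' "$1" "$2"
}
```

For example: `samtools idxstats MH0032.sort.bam > MH0032.idx`, the same for MH0047, then `count_matrix_by_name MH0032.idx MH0047.idx > count_matrix.sub.tab`.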

<p>Take a look at the count matrix using less. Let's also load it into R and look at the count distribution between the samples:</p>

<pre>
less count_matrix.sub.tab

R
d = read.table("count_matrix.sub.tab", sep="\t", header=TRUE, as.is=TRUE)
library(ggplot2)
p = ggplot(d, aes(x=MH0032, y=MH0047)) + stat_binhex(bins=50) + scale_fill_continuous(low="grey50", high="blue")
p + geom_abline(slope=1, col="white") + labs(title="Sub count matrix")
ggsave("count.sub.hex.pdf")
quit("no")
</pre>


<p>You see that the counts per gene are quite low (in previous course years we actually used this data). Let's instead use a version of the count matrix that I created using the full data. By now you have probably realised that metagenomics involves <b>lots of data</b>, and we are only using two samples with not that many reads - our [http://www.nature.com/nature/journal/v464/n7285/full/nature08821.html paper] has 396 samples in total! We will copy it here and make the same plot of counts as before. Finally, we open both plots in evince.</p>

<pre>
cp /home/27626/exercises/metagenomics/full_countmatrix/count_matrix.tab .

R
d = read.table("count_matrix.tab", sep="\t", header=TRUE, as.is=TRUE)
library(ggplot2)
p = ggplot(d, aes(x=MH0032, y=MH0047)) + stat_binhex(bins=50) + scale_fill_continuous(low="grey50", high="blue")
p + geom_abline(slope=1, col="white") + labs(title="Full count matrix")
ggsave("count.hex.pdf")
quit("no")

evince count.*hex.pdf &
</pre>

<b>Q9. What does the overlap between the samples look like? Are there genes in common and/or genes specific to each sample?</b>

<HR>

<H2>Annotation of the count matrix</H2>

<p>Now that we have our count matrix we can annotate the genes according to species or function. Let's try to annotate them according to species.</p>

<p>To do that we need to blast all of the genes against a database of organisms that we expect to be present. In our case, let's blast against all fully sequenced bacteria at NCBI (>11000 bacterial chromosomes and plasmids) and 373 bacterial genomes from the human gut that we have published. The blast of our ~140k genes against this database takes 10 minutes using 5 cores, so you will not run it (you can if you really want to); instead you can use the file I made. If you were to run the blast yourself, the command is shown below:</p>

<pre>
# copy this file to your folder, it is the blast output #
cp /home/27626/exercises/metagenomics/combined.prodigal.cdhit.m8 .

# command to blast genes vs. bacteria - do not run it in the exercise (unless you really want to) #
# the -num_threads tells blast how many cores to use, eg. here we use 5 of the 28 cores on the machine #
# -max_target_seqs tells blast to only output the five best hits instead of the best 500 hits! #
blastn -query combined.prodigal.cdhit.fna -db /home/27626/exercises/metagenomics/blastdb/Bacteria_MGS.20160531 -evalue 1e-2 -outfmt 6 \
-max_target_seqs 5 -num_threads 5 -out combined.prodigal.cdhit.m8
</pre>

<p>Take a look at the blast report (<b>combined.prodigal.cdhit.m8</b>). The output is in blast-tabular format (also known as m8); the fields are: query name, subject name, percent identity, aligned length, no. mismatches, no. gaps, query start, query end, subject start, subject end, e-value, bitscore. You should see that there are often several hits per gene (query). We will select only the best hit and require that it has >80% identity over >100 bp before we annotate the gene to a species. This is achieved using this Perl one-liner:</p>

<pre>
perl -ane 'BEGIN{$prev=""}; if ($F[0] eq $prev) { next } else { if ($F[2] > 80 && $F[3] > 100) { print $_}; $prev = $F[0];}' \
combined.prodigal.cdhit.m8 > combined.prodigal.cdhit.best.accepted.m8
</pre>
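The same selection can be written in awk, which some may find easier to read. This sketch assumes, like the one-liner, that the hits for each query are listed consecutively with the best hit first; the function name best_hits is hypothetical.

```shell
# best_hits: hypothetical awk version of the best-hit filter. Reads m8 lines on
# stdin and, for each query, looks only at its first (best) hit. The hit is
# printed if it has >80% identity (column 3) over >100 bp (column 4).
best_hits() {
    awk -F '\t' '
        ($1 "") != (prev "") {        # first line of a new query
            prev = $1
            if ($3 > 80 && $4 > 100) print
        }
    '
}
```

Note that a query whose best hit fails the thresholds is dropped entirely, even if a lower-ranked hit would pass. Usage: `best_hits < combined.prodigal.cdhit.m8`.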

<p>Now let's take just columns 1 and 2, because these hold the information on which gene is annotated to which organism, and copy an annotation file here. Take a look at the <b>Bacteria_MGS.20160531.tab</b> file (after you have copied it - see below); you can see that it contains information on each sequence in the blast database and the phylogenetic information for that sequence (e.g. phylum, genus, species). This is what we will use to figure out which gene comes from which species.</p>

<pre>
cut -f1,2 combined.prodigal.cdhit.best.accepted.m8 > combined.prodigal.ann
cp /home/27626/exercises/metagenomics/blastdb/Bacteria_MGS.20160531.tab .
</pre>

<p>Now we have:
<OL>
<LI>Count matrix with genes (count_matrix.tab)
<LI>Annotation of genes to species-ids (combined.prodigal.ann)
<LI>Annotation of species-ids to species names (Bacteria_MGS.20160531.tab)
</OL>
This we will use to create an abundance measure of the species in our sample using an R-script. For each species we determine the abundance of it as the sum of all genes annotated to that particular species:</p>

<pre>
R --vanilla count_matrix.tab combined.prodigal.ann Bacteria_MGS.20160531.tab species_matrix.tab < /home/27626/bin/create_species_matrix_sum.R
</pre>
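The summing step can be sketched without R. This is not create_species_matrix_sum.R; the function name species_sum is hypothetical, and the species table is assumed to be reduced to two tab-separated columns (sequence id, species name), which is simpler than the real Bacteria_MGS.20160531.tab file.

```shell
# species_sum COUNTS ANN TAXA: hypothetical sketch of the summing step.
#   COUNTS: count matrix with a header line (Gene, sample 1, sample 2)
#   ANN:    gene<TAB>sequence id
#   TAXA:   sequence id<TAB>species name (assumed two-column format)
# Prints, per species, the sum of the counts of all genes annotated to it.
species_sum() {
    head -n 1 "$1" | awk -F '\t' -v OFS='\t' '{ print "Species", $2, $3 }'
    awk -F '\t' -v OFS='\t' '
        FILENAME == ARGV[1] { species[$1] = $2; next }
        FILENAME == ARGV[2] { if ($2 in species) gene2sp[$1] = species[$2]; next }
        FNR == 1 { next }                               # skip the count matrix header
        ($1 in gene2sp) { s = gene2sp[$1]; a[s] += $2; b[s] += $3 }
        END { for (s in a) print s, a[s], b[s] }
    ' "$3" "$2" "$1" | LC_ALL=C sort
}
```

Genes without an accepted blast hit are simply left out, so the species totals only reflect the annotated part of the count matrix.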

<p>The script outputs the species matrix and a plot of how many genes were annotated to each species. Take a look at the species matrix and at the plot:</p>

<pre>
less species_matrix.tab

evince genes_pr_species.pdf &
</pre>

<b>Q10. When looking at the plot, were there many species for which we could identify most genes? Can you think of how to improve this?</b><br>

<p>Let's plot the species abundances in our two samples against each other.</p>

<pre>
R
library(ggplot2)
d = read.table("species_matrix.tab", header=TRUE, sep="\t")
p = ggplot(d, aes(x=MH0032, y=MH0047)) + stat_binhex(bins=50) + scale_fill_continuous(name="No. species", low="grey50", high="blue")
p + geom_abline(slope=1, col="white") + labs(title="Species counts", x="Abundance in MH0032", y="Abundance in MH0047")
ggsave("species.count.hex.pdf")
</pre>

<b>Q11. Are there some species with differential abundance between the two samples?</b>

<p>Let us plot the species with the largest differences in abundance between the two samples:</p>

<pre>
R
sm=read.table("species_matrix.tab", sep="\t", header=TRUE, as.is=TRUE)
sm$diff = sm[,1]-sm[,2]
min_diff = sm[order(abs(sm$diff)),][1:20,]
max_diff = sm[rev(order(abs(sm$diff))),][1:20,]
# plot differences
par(mar=c(9.5,4.1,4.1,2.1)) # increase the bottom margin to 9.5 lines of text
barplot(as.matrix(t(max_diff[,1:2])), beside=TRUE, las=2, cex.names=0.75, col=c("grey20", "grey80"), ylab="Species abundance", main="Species with most difference")
legend("topright", legend=c("MH0032", "MH0047"), bty="n", fill=c("grey20", "grey80"))
dev.print(file="species.differences.pdf", device=pdf)
</pre>
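<p>The ranking step in the R code above (ordering species by the absolute difference between the two samples) can be sketched in Python as follows. The species names and abundances are invented for illustration, and <code>most_different</code> is a hypothetical helper, not part of the course material.</p>

```python
def most_different(abundance, top_n=20):
    """Return (species, (sample1, sample2)) pairs with the largest absolute difference first."""
    ranked = sorted(
        abundance.items(),
        key=lambda item: abs(item[1][0] - item[1][1]),
        reverse=True,
    )
    return ranked[:top_n]

# Invented abundances: (sample 1, sample 2)
abundance = {"sp_a": (10, 2), "sp_b": (5, 50), "sp_c": (7, 7)}
top = most_different(abundance, top_n=2)
print(top)
```

<p>An absolute difference favours abundant species; a ratio or log-ratio would highlight relative changes instead, which may be worth keeping in mind when interpreting the bar plot.</p>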

<b>Q12. Which genus seems to have the highest number of different species with most difference in abundance?</b>

<p>This is actually a very important genus in the human gut, and one of the key drivers in determining which enterotype a person is associated with. You can read more [http://www.nature.com/nature/journal/v473/n7346/full/nature09944.html here]. In the exercise on quantitative metagenomics we will analyze a species matrix with data from 120 individuals.</p>

<HR>

<H2>Binning using metabat (optional)</H2>

<p>Let's try unsupervised binning of the contigs using [https://bitbucket.org/berkeleylab/metabat Metabat]. To do that we first combine the contigs from both samples and cluster them with cd-hit-est, so that a contig assembled in both samples is only written once (this is actually quite tricky, and if you have many samples you might want to do a co-assembly instead). First we cat the assemblies together and run the clustering, then we map the reads to the combined assembly (as above). <b>NB: All commands are commented out ("#") because you should not run them now; they take a long time to run.</b></p>

<pre>
# ln -s /home/27626/exercises/metagenomics/MH0032/concat/MH0032.1.fq.gz .
# ln -s /home/27626/exercises/metagenomics/MH0032/concat/MH0032.2.fq.gz .
# ln -s /home/27626/exercises/metagenomics/MH0047/concat/MH0047.1.fq.gz .
# ln -s /home/27626/exercises/metagenomics/MH0047/concat/MH0047.2.fq.gz .
# cat MH0032.spades.fa MH0047.spades.fa | rename_contigs.sh humangut > both.spades.fa
# cd-hit-est -i both.spades.fa -o both.spades.cdhit.fna -c 0.95 -n 8 -l 100 -aS 0.9 -d 0 -B 0 -T 8 -M 30000
# bwa index both.spades.cdhit.fna
# bwa mem -t 10 both.spades.cdhit.fna MH0032.1.fq.gz MH0032.2.fq.gz | samtools view -Sb - > MH0032.bam
# bwa mem -t 10 both.spades.cdhit.fna MH0047.1.fq.gz MH0047.2.fq.gz | samtools view -Sb - > MH0047.bam
# samtools sort -O BAM -o MH0032.metabat.sort.bam MH0032.bam
# samtools sort -O BAM -o MH0047.metabat.sort.bam MH0047.bam
# samtools index MH0032.metabat.sort.bam
# samtools index MH0047.metabat.sort.bam
</pre>

<p>Instead of running all of that yourself, you can link to my files and start directly with the metabat command.</p>

<pre>
ln -s /home/27626/exercises/metagenomics/metabat_on_spades/MH*.metabat.sort.bam .
ln -s /home/27626/exercises/metagenomics/metabat_on_spades/MH*.metabat.sort.bam.bai .
ln -s /home/27626/exercises/metagenomics/metabat_on_spades/both.spades.cdhit.fna .
runMetaBat.sh both.spades.cdhit.fna MH0032.metabat.sort.bam MH0047.metabat.sort.bam
</pre>

<p>Take a look at the output folder; here you should see the bins it identified. These are the "genomes" that it has learnt from the data using tetranucleotide frequencies. If there are more than 10 samples, it also uses co-abundance.<br>
<b>Q13. How many bins were identified?</b></p>
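<p>To make the tetranucleotide-frequency idea concrete, here is a minimal Python sketch that computes the frequency of every 4-mer in a sequence. It is a simplification for illustration only: it does not collapse reverse-complement 4-mers, which binning tools commonly do, and it is not Metabat's implementation.</p>

```python
from collections import Counter
from itertools import product

def tetranucleotide_frequencies(sequence):
    """Return the relative frequency of each of the 256 possible 4-mers."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    # 4-mers containing other characters (e.g. N) are left out of the total.
    total = sum(counts[k] for k in kmers)
    return {k: (counts[k] / total if total else 0.0) for k in kmers}

tnf = tetranucleotide_frequencies("ACGTACGTACGT")
print(tnf["ACGT"], tnf["CGTA"], tnf["AAAA"])
```

<p>Contigs from the same genome tend to have similar frequency profiles, so contigs can be grouped by comparing these vectors.</p>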

<p>Try to look at the length of the bins:</p>

<pre>
fastx_sumofsequence.py -f fasta both.spades.cdhit.fna.metabat-bins/*.fa | sort -k2 -n
</pre>
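<p>If the helper script is not available, the total number of bases in a FASTA file is easy to compute yourself. The Python sketch below shows one way, assuming a plain (uncompressed) FASTA file; the example sequences are made up and <code>total_bases</code> is a hypothetical name.</p>

```python
import io

def total_bases(fasta_handle):
    """Sum the lengths of all sequence lines, skipping headers and blank lines."""
    total = 0
    for line in fasta_handle:
        line = line.strip()
        if line and not line.startswith(">"):
            total += len(line)
    return total

# A made-up bin with two contigs (8 + 3 + 4 bases).
example_bin = io.StringIO(">contig1\nACGTACGT\nACG\n>contig2\nGGCC\n")
bin_size = total_bases(example_bin)
print(bin_size)
```

<p>For a real bin you would pass an open file handle instead, for example <code>open("bin.1.fa")</code>, where the file name is a placeholder.</p>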

<p><b>Q14. How many bases are there in the bins? Do you think they are all complete?</b></p>

<p>Let's investigate the bins. This can be done by blasting the contigs in each bin and analyzing the output, or alternatively by using [https://github.com/Ecogenomics/CheckM CheckM]. We will use CheckM, an automated pipeline that uses single-copy marker genes to assess how complete each bin is and whether it is potentially contaminated.</p>

<pre>
checkm lineage_wf both.spades.cdhit.fna.metabat-bins checkm -x fa -t 2 -f checkm.output
# cp /home/27626/exercises/metagenomics/checkm.output .
</pre>

<p>If checkm is taking a long time to run, you can copy my file instead by removing the hash sign from the last line above. Now let's take a look at the "checkm.output" file.<br>
<b>Q15. Does it look like our bins are good or bad (or both?). You may notice some bins without any marker genes, does this mean they are not real?</b><br>
<b>Q16. Which next steps could one do?</b></p>
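<p>When reading the CheckM output it can help to have a rough mental model of the two main numbers. The Python sketch below is a deliberately simplified illustration, not CheckM's actual algorithm (CheckM works with lineage-specific sets of co-located markers). It assumes we simply know how many copies of each expected single-copy marker gene were found in a bin; the marker names are invented.</p>

```python
def completeness_and_contamination(marker_copies):
    """Estimate completeness and contamination (both in percent).

    marker_copies maps each expected single-copy marker gene to the
    number of copies found in the bin.
    """
    n_markers = len(marker_copies)
    if n_markers == 0:
        return 0.0, 0.0
    # Completeness: share of expected markers seen at least once.
    found = sum(1 for copies in marker_copies.values() if copies >= 1)
    # Contamination: extra copies of markers that should occur only once.
    extra = sum(copies - 1 for copies in marker_copies.values() if copies > 1)
    return 100.0 * found / n_markers, 100.0 * extra / n_markers

completeness, contamination = completeness_and_contamination(
    {"marker1": 1, "marker2": 1, "marker3": 0, "marker4": 2}
)
print(completeness, contamination)
```

<p>In this toy model a bin with no detected marker genes gets 0% completeness, which says that the markers are missing, not that the sequence itself is an artifact.</p>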
<HR>

<H2>Additional analyses that can be done</H2>

<p>We can investigate the abundance of different functional characteristics. This can be achieved, for example, by blasting all of our genes against databases such as NCBI nt, KEGG, or eggNOG.</p>

<HR>

<p>Congratulations, you have finished the exercise!</p>

<HR>
<H2> EXTRA COMMANDS FOR THE PROJECT </H2>
<HR>

<H3>Using linclust instead of cd-hit</H3>
<p>cd-hit for clustering of genes can be a bit slow. Instead you can run linclust which is <i>much</i> faster. The output is the combined.prodigal.linclust.fna file.</p>

<pre>
module load mmseqs2/release_6-f5a1c
mmseqs createdb combined.prodigal.cdhit.fna combined
mkdir tmp
mmseqs linclust combined combined_cluster tmp -c 0.9 --min-seq-id 0.95 --threads 8
mmseqs result2repseq combined combined_cluster combined_clu_rep
mmseqs result2flat combined combined combined_clu_rep combined.prodigal.linclust.fna --use-fasta-header
</pre>

<HR>

Program 2020

2024-03-19T16:07:17Z

WikiSysop: Created page.

'''REMEMBER TO BRING A LAPTOP COMPUTER FOR EXERCISES'''

Lectures and exercises will take place at the Section for Bioinformatics at the Technical University of Denmark, '''Building 210, Room H162''' in Lyngby ([https://goo.gl/maps/hyifvFrZ5cPLroCv8 map]).

The course has two main parts: the first half is lectures and exercises, and the second half is project work ending with the exam on Friday the 24th of January.

This term we will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the TA, and myself. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza [https://piazza.com/danish_technical_university/summer2019/1 here].

=== Course Program - January 2020 ===

<HR>
'''Monday, January 6'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am
<DD>Introduction to course & Pre-test
([http://teaching.healthtech.dtu.dk/material/22126/2020/1_Introduction_to_course_GR.pdf Lecture slides])
([http://teaching.healthtech.dtu.dk/material/22126/2020/1_pre-test_2020.pdf Pre-test])</dd>
<DD>Gabriel Renaud

<DT>9:30am-10:00am
<dd>Introduction to NGS
([http://teaching.healthtech.dtu.dk/material/22126/2020/1_Introduction_to_NGS_GR.pdf Lecture slides]) </dd>
<DD>Gabriel Renaud

<DT>10:00am-10:45am
<dd>2nd and 3rd generation NGS Technologies
([http://teaching.healthtech.dtu.dk/material/22126/2020/1_Introduction_to_NGS_technology_GR_2.pdf Lecture slides])
<DD>Gabriel Renaud

<DT>10:45am-11:00am
<DD>''Break''

<DT>11:00am-12:00pm
<DD>Tech talk group formation and group work
([http://teaching.healthtech.dtu.dk/material/22126/2020/1_Tech_Talks_GR.pdf Lecture slides])
([https://docs.google.com/spreadsheets/d/1awQ_-KiYC7r3IZXevJSCJ-gDMD8BNye4VOq98zy3z3U/edit?usp=sharing Student Groups]) </dd>
<DD>Gabriel Renaud

<DT>12:00pm-1:00pm
<DD>''Lunch Break''

<DT>1:00pm-1:30pm
<DD>Exercise: Logging on to our Computerome cloud
([[Logging on to Cloud system]])</dd>
<DD>Peter Wad Sackett, Bernadette Kofoed Christiansen, Nanna Møller Barnkob

<DT>1:30pm-2:15pm
<DD>Introduction to UNIX
([http://teaching.healthtech.dtu.dk/material/36610/UnixIntroduction.ppt Lecture slides])
([http://teaching.healthtech.dtu.dk/36610/index.php/UNIX Exercise])
([http://teaching.healthtech.dtu.dk/material/36610/UnixInstructions36610.pdf Unix Notes])
<DD>Peter Wad Sackett, Bernadette Kofoed Christiansen, Nanna Møller Barnkob

<DT>2:15pm-2:30pm
<DD>''Break''

<DT>2:30pm-3:30pm
<DD>Introduction to UNIX
([http://teaching.healthtech.dtu.dk/material/36610/UnixIntroduction.ppt Lecture slides])
([http://teaching.healthtech.dtu.dk/36610/index.php/UNIX Exercise])
([http://teaching.healthtech.dtu.dk/material/36610/UnixInstructions36610.pdf Unix Notes])
<DD>Peter Wad Sackett, Bernadette Kofoed Christiansen, Nanna Møller Barnkob

<DT>3:30pm-4:00pm
<DD>First look at data
([[First look exercise]])
<DD>Peter Wad Sackett, Bernadette Kofoed Christiansen, Nanna Møller Barnkob

</DD>
<BR>
</DL>

<HR>
'''Tuesday, January 7'''
<HR>
''Data pre-processing & Genomic Epidemiology''
<DL>
<DT>9:00am-10:45am
<DD>Data basics
([http://teaching.healthtech.dtu.dk/material/22126/Data_Basics.pdf Lecture slides])
([[Data basics exercise]])
([[Data basics solution]])
<DD>Shyam Gopalakrishnan, Bernadette Kofoed Christiansen, Freja Dahl Hede

<DT>10:45am-11:00am
<DD>''Break''

<DT>11:00am-12:00pm
<DD>Data pre-processing
([http://teaching.healthtech.dtu.dk/material/22126/Data_preprocessing.pdf Lecture slides])
([[Data Preprocess exercise]])
([[Data Preprocess solution]])
<DD>Shyam Gopalakrishnan, Bernadette Kofoed Christiansen, Freja Dahl Hede

<DT>12:00pm-1:00pm
<DD>''Lunch Break''

<DT>1:00pm-2:15pm
<DD>Alignment
([http://teaching.healthtech.dtu.dk/material/22126/2020/2_Alignment_GR.pdf Lecture slides])
<DD>Gabriel Renaud

<DT>2:15pm-2:30pm
<DD>''Break''

<DT>2:30pm-4:00pm
<DD>Exercise: Alignment
([[Alignment exercise]])
([[Alignment solution]])
<DD>Gabriel Renaud, Bernadette Kofoed Christiansen, Freja Dahl Hede
</DD>
<BR>
</DL>

<HR>
'''Wednesday, January 8'''
<HR>

''Alignment & Genotyping''
<DL>
<DT>9:00am-9:30am
<DD>Functional Human Variation
<DD>Adrian Otamendi Laspiur, ([http://teaching.healthtech.dtu.dk/material/22126/2020/Intro_to_variation.pdf Lecture slides])

<DT>9:30am-10:00am
<DD>Alignment postprocessing & variant calling
([http://teaching.healthtech.dtu.dk/material/22126/2020/3_post_alignment_variantcalling_GR.pdf Lecture slides])

<DD>Gabriel Renaud

<DT>10:00am-10:15am
<DD>''Break''

<DT>10:15am-12:00pm
<DD>Exercise: Postprocessing & variant calling
([[Postprocess exercise]])
([[Postprocess solution]])
([[SNP calling exercise]])
([[SNP calling solution]]) </dd>
<DD>Gabriel Renaud, TBA

<DT>12:00pm-1:00pm
<DD>''Lunch Break''

<DT>1:00pm-2:15pm
<DD>Ancient DNA
([http://teaching.healthtech.dtu.dk/material/22126/2020/3_ancientDNA2020.pdf Lecture slides])
<DD>Gabriel Renaud

<DT>2:15pm-2:30pm
<DD>''Break''

<DT>2:30pm-4:00pm
<DD>Exercise: Ancient DNA
([[Ancient DNA exercise]])

<DD>Gabriel Renaud, TBA
</DD>
<BR>
</DL>

<HR>
'''Thursday, January 9'''
<HR>
''de novo assembly & Metagenomics''
<DL>

<DT>9:00am-9:45am
<DD>Lecture: de novo assembly
([http://teaching.healthtech.dtu.dk/material/22126/de_novo_assembly_course.pdf Lecture slides])
([http://teaching.healthtech.dtu.dk/material/22126/debruijn_handout.pdf Handout]) </dd>
<DD>Shyam Gopalakrishnan

<DT>9:45am-10:00am
<DD>''Break''

<DT>10:00am-12:00pm
<DD>Exercise: de novo assembly
([[denovo exercise]])
([[denovo solution]]) </dd>
<DD>Shyam Gopalakrishnan, Nanna Møller Barnkob

<DT>12:00pm-1:00pm
<DD>''Lunch Break''

<DT>1:00pm-1:45pm
<DD>Metagenomics & Binning
([http://teaching.healthtech.dtu.dk/material/22126/Metagenomics_assembly.pdf Lecture slides])
</dd>
<DD>Gisle Vestergaard

<DT>1:45pm-2:00pm
<DD>''Break''

<DT>2:00pm-4:00pm
<DD>Exercise: Metagenomic de novo assembly
([[Metagenomic assembly exercise]])
([[Metagenomic assembly solution]]) </dd>
<DD>Gisle Vestergaard

</DD>
<BR>
</DL>

<HR>
'''Friday, January 10'''
<HR>
''Quantitative Metagenomics and Test''
<DL>

<DT>9:00am-9:45am
<DD>Lecture: Quantitative Metagenomics
([http://teaching.healthtech.dtu.dk/material/22126/quantitative_metagenomics.pdf Lecture slides])</dd>
<DD>Gisle Vestergaard

<DT>9:45am-10:00am
<DD>''Break''

<DT>10:00am-12:00pm
<DD>Exercise: Quantitative Metagenomics
([http://teaching.healthtech.dtu.dk/22126/index.php/QuantitativeMetagenomics Exercise])
([http://teaching.healthtech.dtu.dk/22126/index.php/QuantitativeMetagenomicsSolution Solution]) </dd>
<DD>Gisle Vestergaard

<DT>12:00pm-1:00pm
<DD>''Lunch Break''

<DT>1:00pm-1:45pm
<DD>Recap Test
([http://teaching.healthtech.dtu.dk/material/22126/Recap_test_2019.pdf Test])
<DD>Gabriel Renaud, Line Egerod Lund

<DT>1:45pm-2:00pm
<DD>''Break''

<DT>2:00pm-4:00pm
<DD>Time to work on this week's exercises
<DD>Gabriel Renaud, Line Egerod Lund

</DD>
<BR>
</DL>

<HR>
'''Monday, January 13'''
<HR>
''RNA-seq and Cancer-seq''
<DL>
<DT>9:00am-9:45am
<DD>Lecture: RNAseq
([http://teaching.healthtech.dtu.dk/material/22126/2020/RNA-seq%2013-01-2020.pdf Lecture slides])</dd>
<DD>Francesca Bertolini

<DT>9:45am-10:00am
<DD>''Break''

<DT>10:00am-12:00pm
<DD>Exercise: RNAseq
([http://teaching.healthtech.dtu.dk/material/22126/2020/RNAseq_excercise_2020.docx Exercise])
([http://teaching.healthtech.dtu.dk/material/22126/2020/RNAseq_answers_2020.txt Solution]) </dd>
<DD>Francesca Bertolini, Nanna Møller Barnkob, Bernadette Kofoed Christiansen

<DT>12:00pm-1:00pm
<DD>''Lunch Break''

<DT>1:00pm-1:45pm
<DD>Lecture: Cancer-seq
([http://teaching.healthtech.dtu.dk/material/22126/2020/CancerGenomics_Izarzugaza.pdf Lecture slides])
<DD>Adrian Otamendi Laspiur, Nanna Møller Barnkob, Bernadette Kofoed Christiansen

<DT>2:00pm-4:00pm
<DD>Exercise: Cancer-seq
([https://github.com/nannabarnkob/dtu-cancer/blob/master/cancer_seq_exercise.md Exercise])
<DD>Adrian Otamendi Laspiur, Nanna Møller Barnkob, Bernadette Kofoed Christiansen

</DD>
<BR>
</DL>

<HR>
'''Tuesday, January 14'''
<HR>
<DL>
<DT>9:00am-9:55am
<DD>Exercise: Genomic Epidemiology
([[Genomic epidemiology exercise]])
([[Genomic epidemiology solution]])
<DD>Shyam Gopalakrishnan
<DT>9:55am-10:10am
<DD>''Break''
<DT>10:10am-12:00pm
<DD>Case story: Genomic Epidemiology
([http://teaching.healthtech.dtu.dk/material/22126/2020/Genomic_epidemiology_NGScourse_Jan2020.pdf Lecture])
<DD>Pimlapas Leekitecharoenphon (Shinny)
</DD>
<DT>12:00pm-1:00pm
<DD>''Lunch Break''
<DT>1:00pm-4:00pm
<DD>Tech talk work & [http://teaching.healthtech.dtu.dk/material/22126/TechTalks.tar Presentations]
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Line Egerod Lund
</DD>
<BR>
</DL>

<HR>
'''Wednesday, January 15'''
<HR>
<P>''Project work''<BR>
<DL>

<DT>9:00am-9:45am
<DD>Projects & Group formation
([http://teaching.healthtech.dtu.dk/material/22126/2020/9_Projects_GR.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/posters.tar.gz Examples from previous courses])
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Line Egerod Lund

<DT>9:45am-10:45am
<DD>Projects & Group formation, please write group names in the [https://docs.google.com/document/d/1MUkqNE9GSzTupR3ro4EQEybF5UR1TKQ4Dw5usi22tz4/edit?usp=sharing document]
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Line Egerod Lund

<DT>10:45am-11:00am
<DD>''Break''

<DT>11:00am-12:00pm
<DD>Project work
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Line Egerod Lund

<DT>12:00pm-1:00pm
<DD>''Lunch Break''

<DT>1:00pm-4:00pm
<DD>Project work/Prepare presentations for tomorrow
How to get to the offices of Gabriel, Shyam and Gisle: first go to building 202, go up to the 3rd floor, and cross to building 204 via the skyway. We are in 252/251.
</DD>
<BR>
</DL>
<HR>
'''Thursday, January 16'''
<HR>
''Project work & Project Presentations''
<DL>

<DT>9:00am-12:00pm
<DD>Project work
<DD>Gabriel Renaud, Shyam Gopalakrishnan, TBA

<DT>12:00pm-1:00pm
<DD>''Lunch Break''

<DT>1:00pm-2:00pm
<DD>Project Presentations
<DD>Gabriel Renaud, Shyam Gopalakrishnan, TBA

</DD>
<BR>
</DL>
<HR>
'''Friday, January 17'''
<HR>
''Project work''
<DL>

<DT>9:00am-4:00pm
<DD>Project work
<DD>

<DT>1:00pm-3:00pm
<DD>Project work/Office hours
<DD>Gabriel Renaud, Shyam Gopalakrishnan, TBA

</DD>
<BR>
</DL>
<HR>
'''Monday, January 20'''
<HR>
''Project work''
<DL>

<DT>9:00am-4:00pm
<DD>Project work
<DD>

<DT>1:00pm-3:00pm
<DD>Project work/Office hours
<DD>Gabriel Renaud, Shyam Gopalakrishnan, TBA

</DD>
<BR>
</DL>
<HR>
'''Tuesday, January 21'''
<HR>
''Project work''
<DL>

<DT>9:00am-4:00pm
<DD>Project work
<DD>

<DT>1:00pm-3:00pm
<DD>Project work/Office hours
<DD>Gabriel Renaud, Shyam Gopalakrishnan, TBA

</DD>
<BR>
</DL>
<HR>
'''Wednesday, January 22'''
<HR>
''Project work''
<DL>

<DT>9:00am-4:00pm
<DD>Project work
<DD>

<DT>1:00pm-3:00pm
<DD>Project work/Office hours
<DD>Gabriel Renaud, Shyam Gopalakrishnan, TBA

</DD>
<BR>
</DL>
<HR>
'''Thursday, January 23'''
<HR>
''Project Work & Poster Printing''
<DL>

<DD>Print your poster. It can be printed at the DTU library for 16-30 kr.
<DD>[http://teaching.healthtech.dtu.dk/material/22126/Posters.pdf Poster guide & requirements]

<DT>10:00am-12:00pm
<DD>Q&A: Practical information about the [http://teaching.healthtech.dtu.dk/material/22126/exam.pdf Exam]
<DD>Project work/Office hours
<DD>Gabriel Renaud, Shyam Gopalakrishnan, TBA

</DL>
<br>
<HR>
'''Friday, January 24'''
<HR>
''Poster session - in the hall of TBA''
<DL>

<DT>10:00am-2:00pm
<DD>Poster session part ('''exam''')

</DD>
<BR>
</DL><HR>

Program 2021

2024-03-19T16:06:46Z

WikiSysop: Created page.

Lectures and exercises will take place on Discord (https://discord.gg/JxV3pHyHgV). Please register with your full name. We will use Discord for online classes and for collaboration with your project partners.

The course has two main parts: the first half is lectures and exercises, and the second half is project work ending with the exam on Friday the 22nd of January.

This term, we will be using Piazza for class discussion. The system is designed to get you help quickly and efficiently from classmates, the TA, and myself. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Piazza [https://piazza.com/class/kjernxm6wre7ck here].

=== Course Program - January 2021 ===

<HR>
'''Monday, January 4'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course & Pre-test
([https://teaching.healthtech.dtu.dk/material/22126/2021/11_Introduction_to_course_GR.pdf Lecture slides])
([https://teaching.healthtech.dtu.dk/material/22126/2021/1_pre-test_2021.pdf Pre-test])</DD>
<DD>Gabriel Renaud</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2021/12_Introduction_to_NGS_GR.pdf Lecture slides]) </DD>
<DD>Gabriel Renaud</DD>

<DT>10:00am-10:45am</DT>
<DD>2nd and 3rd generation NGS Technologies
([https://teaching.healthtech.dtu.dk/material/22126/2021/13_Introduction_to_NGS_technology_GR.pdf Lecture slides])</DD>
<DD>Gabriel Renaud</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Tech talk group formation and group work
([https://teaching.healthtech.dtu.dk/material/22126/2021/14_Tech_Talks_GR.pdf Lecture slides])
([https://docs.google.com/spreadsheets/d/1Ul_sb43hdxCkenJM9rEPzudyXVWtZVxwIZrxgan9Z_g/edit?usp=sharing Student Groups 2021]) </DD>
<DD>Gabriel Renaud</DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our Computerome cloud ([[Logging on to Cloud system]])</DD>
<DD>Peter Wad Sackett, Trine Zachariasen, Gabriel Renaud</DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Video lectures to watch from "Unix intro.." to "Touching upon..."])
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Exercises] possible answers [[Unix_answers|here]])
([http://teaching.healthtech.dtu.dk/material/36610/UnixInstructions36610.pdf Unix Notes])</DD>
<DD>Peter Wad Sackett, Trine Zachariasen </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Video lectures to watch from "Unix intro.." to "Touching upon..."])
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Exercises] possible answers [[Unix_answers|here]])
([http://teaching.healthtech.dtu.dk/material/36610/UnixInstructions36610.pdf Unix Notes])</DD>
<DD>Peter Wad Sackett, Trine Zachariasen </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Peter Wad Sackett, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 5'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-10:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2021/2_Data_Basics_SG.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD>Shyam Gopalakrishnan, Trine Zachariasen</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2021/2_Data_preprocessing_SG.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD>Shyam Gopalakrishnan </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-2:15pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2021/23_Alignment_GR.pdf Lecture slides]) </DD>
<DD>Gabriel Renaud </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break'' </DD>

<DT>2:30pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD>Gabriel Renaud, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 6'''
<HR>

''Alignment & Genotyping''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Human Variation</DD>
<DD>Shyam Gopalakrishnan, ([https://teaching.healthtech.dtu.dk/material/22126/2021/FunctionalHumanVariation_SG.pdf Lecture slides])</DD>

<DT>9:30am-10:00am</DT>
<DD>Alignment postprocessing & variant calling ([https://teaching.healthtech.dtu.dk/material/22126/2021/41_post_alignment_variantcalling_GR.pdf Lecture slides])</DD>

<DD>Gabriel Renaud</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Postprocessing & variant calling ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]]) ([[SNP calling exercise]]) ([[SNP_calling_exercise_answers]])</DD>
<DD>Gabriel Renaud, Trine Zachariasen </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: de novo assembly ([https://teaching.healthtech.dtu.dk/material/22126/2021/4_de_novo_assembly_course_SG.pdf Lecture slides])([http://teaching.healthtech.dtu.dk/material/22126/debruijn_handout.pdf Handout (TODO)]) </DD>
<DD>Shyam Gopalakrishnan</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD>Shyam Gopalakrishnan, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Thursday, January 7'''
<HR>
''Metagenomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Metagenomics & Binning ([https://teaching.healthtech.dtu.dk/material/22126/2021/Metagenomics_binning.pdf Lecture slides])</DD>
<DD>Gisle Vestergaard</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: Kaiju: Taxonomic classification ([[Kaiju exercise]]) ([[Kaiju solution]]) </DD>
<DD>Gisle Vestergaard</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: Quantitative Metagenomics ([https://teaching.healthtech.dtu.dk/material/22126/2021/Quantitative_metagenomics.pdf Lecture slides])</DD>
<DD>Gisle Vestergaard</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Quantitative Metagenomics ([http://teaching.healthtech.dtu.dk/22126/index.php/QuantitativeMetagenomics Exercise]) ([http://teaching.healthtech.dtu.dk/22126/index.php/QuantitativeMetagenomicsSolution Solution]) </DD>
<DD>Gisle Vestergaard</DD>
</DL>

<BR>

<HR>
'''Friday, January 8'''
<HR>
''Cell-free DNA and recap test''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Cell-free DNA ([https://teaching.healthtech.dtu.dk/material/22126/2021/cfDNA_lecture_2020_SB.pdf Lecture slides])</DD>
<DD>Søren Besenbacher</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: Cell free DNA ([[cfDNA exercise]])([[CfDNA_exercise_answers]])</DD>
<DD>Søren Besenbacher, Trine Zachariasen</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2021/test_2021.pdf Test 2021])([https://teaching.healthtech.dtu.dk/material/22126/2021/test_2021_withA.pdf answers])</DD>
<DD>Gabriel Renaud, Trine Zachariasen </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Time to work on this week's exercises</DD>
<DD>Gabriel Renaud, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Monday, January 11'''
<HR>

''RNA-seq and Cancer-seq''
<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2021/RNA-seq%2011-01-2021.pdf Lecture slides])</DD>
<DD>Francesca Bertolini</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]]) ([[Rnaseq_exercise_answers]]) </DD>
<DD>Francesca Bertolini, Trine Zachariasen </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: Cancer-seq ([https://teaching.healthtech.dtu.dk/material/22126/2021/CancerGenomics_Izarzugaza%20copy.pdf Lecture slides]) </DD>
<DD>Adrian Otamendi, Trine Zachariasen </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Cancer-seq ([[Cancerseq_exercise]]) ([[Cancerseq_exercise_answers]])</DD>
<DD>Adrian Otamendi, Trine Zachariasen </DD>
</DL>

<BR>

<HR>
'''Tuesday, January 12'''
<HR>

''Genomic Epidemiology and tech talk''

<DL>
<DT>9:00am-9:55am</DT>
<DD>Exercise: Genomic Epidemiology ([[Genomic epidemiology exercise]]) ([[Genomic epidemiology solution]])</DD>
<DD>Shyam Gopalakrishnan</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Case story: Genomic Epidemiology ([https://teaching.healthtech.dtu.dk/material/22126/2021/Genomic_epidemiology_NGScourse_Jan2021.pdf Lecture])</DD>
<DD>Pimlapas Leekitecharoenphon (Shinny)</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-2:00pm </DT>
<DD>Tech talk work</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen </DD>

<DT>2:00pm-4:00pm </DT>
<DD>TechTalks Presentations</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen </DD>

</DL>

<BR>

<HR>
'''Wednesday, January 13'''
<HR>
''Ancient DNA & Project work''

<DL>
<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2021/dtu_adna_2021.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Trine Zachariasen</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2021/82_Projects_GR.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/posters.tar.gz Examples from previous courses]) </DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen</DD>

<DT>1:45pm-4:00pm </DT>
<DD>Projects & Group formation, prepare your presentation for tomorrow. please write group names in the [https://docs.google.com/document/d/1xVCMb6wQdqsX4Q5_d-X5FlxeeWcKxP_ilxqd5RmSBSg/edit?usp=sharing document for 2021]</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Thursday, January 14'''
<HR>
''Project presentation''
<DL>
<DT>9:00am-12:00pm</DT>
<DD>Project work/Prepare presentations for this afternoon</DD>
<DD>Please go to Discord for help, we will be available.</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-2:00pm </DT>
<DD>Project Presentations (what you will do) /Project work</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen</DD>

<DT>2:00pm-4:00pm </DT>
<DD>Project work</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Friday, January 15'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Monday, January 18'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 19'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 20'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Thursday, January 21'''
<HR>
''Project Work & Poster Printing''
<DL>

<DD>Produce a PDF of your poster; the presentation will be online this year.</DD>
<DD>[http://teaching.healthtech.dtu.dk/material/22126/Posters.pdf Poster guide & requirements]</DD>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the [http://teaching.healthtech.dtu.dk/material/22126/exam.pdf Exam]</DD>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Friday, January 22'''
<HR>
''Poster session - Online''
<DL>
<DT>10:00am-2:00pm</DT>
<DD>Poster session part ('''exam''')</DD>
</DL>

CfDNA exercise answers

2024-03-19T16:05:43Z

WikiSysop: Created page.

'''Q1'''

Using:

<pre>
samtools stat /data/shared/exercises/cfdna/patient_1.bam > patient1.stat
samtools stat /data/shared/exercises/cfdna/patient_2.bam > patient2.stat

plot-bamstats -p patient1 patient1.stat
plot-bamstats -p patient2 patient2.stat

firefox patient1.stat
firefox patient2.stat

</pre>

Both distributions peak at 168 bp; however, patient 1 clearly has the greater variance and an overabundance of short DNA fragments compared to patient 2.
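The same comparison can also be made numerically instead of by eye. The following Python sketch reads the insert-size histogram from the <code>IS</code> lines of a samtools stats file (columns: insert size, pairs total, inward, outward, other) and reports the mode, the variance and the fraction of short fragments. The 150 bp cutoff for "short" and the example numbers are assumptions made for this illustration only.

```python
def insert_size_summary(stat_lines, short_cutoff=150):
    """Return (mode, variance, short_fraction) from samtools stats 'IS' lines."""
    sizes = {}
    for line in stat_lines:
        if line.startswith("IS\t"):
            fields = line.rstrip("\n").split("\t")
            # fields[1] = insert size, fields[2] = total number of pairs
            sizes[int(fields[1])] = int(fields[2])
    total = sum(sizes.values())
    if total == 0:
        raise ValueError("no insert-size records found")
    mode = max(sizes, key=sizes.get)
    mean = sum(size * n for size, n in sizes.items()) / total
    variance = sum(n * (size - mean) ** 2 for size, n in sizes.items()) / total
    short_fraction = sum(n for size, n in sizes.items() if size < short_cutoff) / total
    return mode, variance, short_fraction

# Invented histogram with three insert sizes.
example_lines = [
    "# other samtools stats sections are ignored\n",
    "IS\t100\t10\t10\t0\t0\n",
    "IS\t168\t80\t80\t0\t0\n",
    "IS\t200\t10\t10\t0\t0\n",
]
mode, variance, short_fraction = insert_size_summary(example_lines)
print(mode, round(variance, 2), short_fraction)
```

With real data you would pass the lines of <code>patient1.stat</code> and <code>patient2.stat</code> and compare the resulting values between the two patients.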

'''Q2'''

We first run:
<pre>
readCounter --window 100000 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 /data/shared/exercises/cfdna/patient_1.bam > patient_1.wig
readCounter --window 100000 -c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 /data/shared/exercises/cfdna/patient_2.bam > patient_2.wig
</pre>

to get the fragment counts per 100kb window. Then:

<pre>
grep -v fixed patient_1.wig |wc -l
28823
</pre>
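The <code>grep -v fixed</code> filter works because the WIG output contains one <code>fixedStep</code> declaration line per chromosome, followed by one count per window. A minimal sketch on a toy file, assuming that fixedStep layout (the counts below are invented for illustration):

```shell
# Toy WIG file in fixedStep layout: one declaration line per chromosome,
# then one fragment count per window (values are invented).
toy_wig=$(mktemp)
cat > "$toy_wig" <<'EOF'
fixedStep chrom=1 start=1 step=100000 span=100000
12
15
9
fixedStep chrom=2 start=1 step=100000 span=100000
7
11
EOF

# Windows are all lines that are not fixedStep declarations.
n_windows=$(grep -vc '^fixedStep' "$toy_wig")
# Chromosomes are counted from the declaration lines themselves.
n_chroms=$(grep -c '^fixedStep' "$toy_wig")
echo "windows=$n_windows chromosomes=$n_chroms"
rm -f "$toy_wig"
```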

'''Q3'''

Running:
<pre>
/data/shared/exercises/cfdna/plotCNV.R patient_1.wig
/data/shared/exercises/cfdna/plotCNV.R patient_2.wig
evince patient_1_CNV.pdf
evince patient_2_CNV.pdf
</pre>

This clearly shows that patient 1 has many copy number alterations.

'''Q4'''

Based on the plots for the fragment size distribution and copy number variations, patient 1 had lung squamous cell carcinoma and patient 2 was a healthy control.

CfDNA exercise

2024-03-19T16:05:10Z

WikiSysop: Created page with "<H2>Overview</H2> First: <OL> <LI>Navigate to your home directory: <LI>Create a directory called "cfdna" <LI>Navigate to the directory you just created. </OL> Blood was drawn from 2 patients. One of those is a healthy patient, the other one has lung squamous cell carcinoma. It was previously [https://www.cell.com/fulltext/S0092-8674(15)01569-X reported] that tumors can leave a signature in the blood plasma via cell-free DNA, namely: # The distribution of fragment l..."

<H2>Overview</H2>

First:
<OL>
<LI>Navigate to your home directory:
<LI>Create a directory called "cfdna"
<LI>Navigate to the directory you just created.
</OL>

Blood was drawn from 2 patients. One of those is a healthy patient, the other one has lung squamous cell carcinoma.

It was previously [https://www.cell.com/fulltext/S0092-8674(15)01569-X reported] that tumors can leave a signature in the blood plasma via cell-free DNA, namely:

# The distribution of fragment lengths will have a higher variance, in particular an excess of very short fragments.
# Tumor cfDNA (or ctDNA) tends to have more [https://en.wikipedia.org/wiki/Copy_number_variation copy number variations (CNVs)] than normal. Please also note that in the context of cancer, when these CNVs are caused by the tumor, they are sometimes called copy number alterations (CNAs).

Your goal is to determine which patient is which.

<H2>Insert size distribution</H2>

To speed up things, the data has already been trimmed and aligned:

<pre>
/data/shared/exercises/cfdna/patient_1.bam
/data/shared/exercises/cfdna/patient_2.bam
</pre>

We now have to determine which one has the greatest variance in terms of DNA fragment length. Use the commands that you have learned in the alignment exercises and plot the insert size distribution.

'''Q1'''

Which would you say has the greatest variance in terms of DNA fragment length?

<H2> Copy number variations</H2>

We will use a program called [https://bioconductor.org/packages/release/bioc/html/HMMcopy.html HMMcopy] to infer CNVs. This method relies on the number of DNA fragments that align within windows of a fixed size (e.g. how many fragments align in every 100,000 bp of the genome).

The following program:

<pre>
readCounter
</pre>

is a very simple utility to compute the number of fragments aligning in a series of genomic windows. Type it without any arguments to see the options. Use the following option:

<pre>
-c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22
</pre>

This specifies the order of the chromosomes. For now, we only care about chromosomes 1 to 22. Count the number of fragments that align in windows of 100,000 base pairs for both patients. For each BAM file, redirect the output to a file; you may choose the name, but please use the file extension ".wig", as the output will be in [https://genome.ucsc.edu/goldenPath/help/wiggle.html wiggle track format (WIG)].
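Typing the chromosome list by hand is error-prone. As a convenience that is not part of the original exercise, the list can be generated in the shell; the variable name <code>chroms</code> is an arbitrary choice.

```shell
# Build the comma-separated list "1,2,...,22" for the -c option.
chroms=$(seq 1 22 | paste -sd, -)
echo "$chroms"
```

The value can then be passed to the option as <code>-c "$chroms"</code>.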

'''Q2'''

Inspect the format and determine how many genomic windows you have in total.

We will now use the following custom R script to call copy number variations:

<pre>
/data/shared/exercises/cfdna/plotCNV.R [INPUT WIG]
</pre>

The input is the WIG file that you have previously generated. The script produces three files:

# [INPUT]_bias.pdf which shows you the fragment count as a function of [https://academic.oup.com/nar/article/40/10/e72/2411059 GC bias]
# [INPUT]_correction.pdf which shows you the fragment count once having corrected for GC bias
# [INPUT]_CNV.pdf shows you the annotated copy number variations.

For now, the script only produces a plot for chromosome 6.
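Judging from the file names used on the answers page (<code>patient_1.wig</code> gives <code>patient_1_CNV.pdf</code>), [INPUT] appears to be the WIG file name without its extension. Under that assumption, the expected output names can be listed as follows; the helper name <code>expected_outputs</code> is hypothetical.

```shell
# List the PDF files expected for a given WIG file, assuming [INPUT] is the
# file name without the ".wig" extension.
expected_outputs() {
    base=$(basename "$1" .wig)
    for suffix in bias correction CNV; do
        echo "${base}_${suffix}.pdf"
    done
}

outputs=$(expected_outputs patient_1.wig)
echo "$outputs"
```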

'''Q3'''

Which sample do you believe has a lot of copy number variations?

'''Q4'''

Based on your answers to questions 1 and 3, which patient has cancer?

Please find answers [[CfDNA_exercise_answers|here]]

'''Congratulations you finished the exercise!'''

Logging on to Cloud system

2024-03-19T16:04:25Z

WikiSysop: Created page with " <HR> <H2>Overview</H2> In this exercise, we will prepare our computers to log on to our very own cloud system on Computerome. This is done via a Virtual Private Network (VPN). <HR> <H3>Student accounts</H3> <p>To be able to perform the exercises and project you will need an account to log in to the cloud computers. Open the [https://docs.google.com/spreadsheets/d/1TAj59JQurp6ene6kOlrug9mRJSHpdsIS8U6KKvkcLN0/edit?usp=sharing google doc], find student id and determin..."

<HR>

<H2>Overview</H2>

In this exercise, we will prepare our computers to log on to our very own cloud system on Computerome. This is done via a Virtual Private Network (VPN).

<HR>

<H3>Student accounts</H3>

<p>To be able to perform the exercises and project you will need an account to log in to the cloud computers. Open the
[https://docs.google.com/spreadsheets/d/1TAj59JQurp6ene6kOlrug9mRJSHpdsIS8U6KKvkcLN0/edit?usp=sharing google doc], find your student ID and determine which server you need to use. You must use this server for the rest of the course.</p>

<H4>User ID/username</H4>

Your cloud <b>username</b> or '''user ID''' is the one entered in the '''username''' column (e.g. s123456).

<H4>IP address</H4>

Here are the IP addresses:

{| class="wikitable"
| '''server'''
| '''IP'''
|-
| server1
| 192.168.63.58
|-
| server2
| 192.168.63.59
|-
| server3
| 192.168.63.62
|}

<H4>Password</H4>

The password is initially the '''same''' as your DTU password. <b>Remember: if you lose the password, we cannot help you! Check with DTU.</b>

<H4>2 factor authentication</H4>

You will be prompted for a 2-factor authentication code. If you have not set it up with AIT, go to: https://account.activedirectory.windowsazure.com/proofup.aspx?proofup=1

<H2>Setting up your own computer/laptop</H2>
Make sure you are connected to the internet.<br>
Video: [https://video.dtu.dk/media/22126-1-VPNCloud/0_xvdqq6v6 How to download and install VPN. How to connect to the Metagenomic cloud servers].

<H3>Windows computer</H3>

<H4>VPN client</H4>

First, install an OpenVPN client. Download the necessary configuration file for our course cloud servers [https://teaching.healthtech.dtu.dk/material/22126/pfSense_DTU-UDP4-1195-metagenomic-02-config.ovpn here] and the VPN client program [https://openvpn.net/community-downloads/ here]

<H4>Terminal</H4>

Secondly, install MobaXterm. Download [https://mobaxterm.mobatek.net/download.html here]
Once installed, create an '''ssh session''' in MobaXterm and log on to your cloud server (look up your user ID in the Google doc to find your server, then use the corresponding IP above). '''If you cannot access the servers, you might need to run the OpenVPN client as administrator.'''

<HR>

<H3>Mac computer</H3>

<H4>VPN client</H4>

Download Tunnelblick (the stable version):
https://tunnelblick.net/downloads.html

Direct download link:
https://tunnelblick.net/release/Tunnelblick_3.7.9a_build_5321.dmg

Download the configuration file: [https://teaching.healthtech.dtu.dk/material/22126/pfSense_DTU-UDP4-1195-metagenomic-02-config.ovpn here]

Put the file in a directory of your own choice

Right-click on the file.
Choose "open with" --> "TunnelBlick"
Or
Choose "open with" --> "Other.."
Choose "Enable: All Applications"
Click on "TunnelBlick"

Click on the TunnelBlick sign in the top bar of your computer (the same bar that shows your WiFi/Battery/Clock)
Choose "Connect Cloud_590"

Now you should have access to the Cloud.

<H4>Terminal</H4>

Open your terminal. It is located in: /Applications/Utilities. Or search for "Terminal"

Depending on the server you will be using (see the google sheet), write:

<pre>
ssh -XC [USER ID]@[IP ADDRESS]
</pre>
Replace [USER ID] with the username you have been assigned and [IP ADDRESS] with the IP of your server (see above).

Enter the password.
<p>Open "XQuartz" and "Terminal" in the "Applications/Utilities" folder. You can close the window that the XQuartz program opens, but leave the program running in the background. If you don't have XQuartz installed, you can install
it from [https://dl.bintray.com/xquartz/downloads/XQuartz-2.7.11.dmg here].
In the terminal, write the ssh command above and enter your password when asked for it. Depending on what you have installed, you may have to install "Developer Tools" from Apple. You can also install the program "iTerm2", which is a very neat terminal.<br>
Log on to your cloud server using ssh and your cloud username (studXXX) and password (see above).</p>

<HR>

<H3>Ubuntu/Linux</H3>

<H4>VPN client</H4>

On Ubuntu, use:
<pre>
sudo apt update
sudo apt install openvpn
</pre>

For other distributions, please see the following: [https://openvpn.net/vpn-software-packages/ VPN Software Repository & Packages]. The config file is [https://teaching.healthtech.dtu.dk/material/22126/pfSense_DTU-UDP4-1195-metagenomic-02-config.ovpn here]. Then write:
<pre>
sudo openvpn pfSense_DTU-UDP4-1195-metagenomic-02-config.ovpn
</pre>

<H4>Terminal</H4>

Open your terminal and ssh using:

<pre>
ssh -XC [USER ID]@[IP ADDRESS]
</pre>
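The placeholders in the ssh command can be filled in from the IP table above. The sketch below is only an illustration: the helper name <code>server_ip</code> is hypothetical, <code>s123456</code> is the example username from the student accounts section, and the command is printed rather than executed.

```shell
# Map a server name from the IP table to its address.
server_ip() {
    case "$1" in
        server1) echo "192.168.63.58" ;;
        server2) echo "192.168.63.59" ;;
        server3) echo "192.168.63.62" ;;
        *) echo "unknown server: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) the ssh command for an example user on server2.
ssh_cmd="ssh -XC s123456@$(server_ip server2)"
echo "$ssh_cmd"
```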

Program 2022

2024-03-19T16:03:30Z

WikiSysop: Created page with " '''REMEMBER TO BRING A LAPTOP COMPUTER FOR EXERCISES''' Lectures and exercises will take place on Discord (https://discord.gg/FBb2edFW). Please register with your full name. Will use Discord for online classes and collaboration with your project partners. The course has two main parts, the first half is lectures and exercises and the last half is project work ending with the exam on '''Friday 21st of January'''. <!-- This term, we will be using Piazza for class disc..."

'''REMEMBER TO BRING A LAPTOP COMPUTER FOR EXERCISES'''

Lectures and exercises will take place on Discord (https://discord.gg/FBb2edFW). Please register with your full name. We will use Discord for online classes and collaboration with your project partners.

The course has two main parts: the first half is lectures and exercises, and the second half is project work ending with the exam on '''Friday 21st of January'''.



=== Course Program - January 2022 ===

<HR>
'''Monday, January 3'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course & Pre-test
([https://teaching.healthtech.dtu.dk/material/22126/2022/11_Introduction_to_course_GR.pdf Lecture slides])
([https://teaching.healthtech.dtu.dk/material/22126/2021/1_pre-test_2021.pdf Pre-test])</DD>
<DD>Gabriel Renaud</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2022/12_Introduction_to_NGS_GR.pdf Lecture slides]) </DD>
<DD>Gabriel Renaud</DD>

<DT>10:00am-10:45am</DT>
<DD>2nd and 3rd generation NGS Technologies
([https://teaching.healthtech.dtu.dk/material/22126/2022/13_Introduction_to_NGS_technology_GR.pdf Lecture slides])</DD>
<DD>Gabriel Renaud</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Tech talk group formation and group work
([https://teaching.healthtech.dtu.dk/material/22126/2022/14_Tech_Talks_GR.pdf Lecture slides])
([https://docs.google.com/spreadsheets/d/1yY6HH10z_OTOTHQUiZehCUrya4u-N2xyUWSeM-EE__o/edit?usp=sharing Student Groups 2022]) </DD>
<DD>Gabriel Renaud</DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our Computerome cloud ([[Logging on to Cloud system]])</DD>
<DD>Peter Wad Sackett, Josh Rubin, Nicola Vogel, Gabriel Renaud</DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Video lectures to watch from "Unix intro.." to "Touching upon..."])
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Exercises] possible answers [[Unix_answers|here]])
([http://teaching.healthtech.dtu.dk/material/36610/UnixInstructions36610.pdf Unix Notes])</DD>
<DD>Peter Wad Sackett, Josh Rubin, Nicola Vogel </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Video lectures to watch from "Unix intro.." to "Touching upon..."])
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Exercises] possible answers [[Unix_answers|here]])
([http://teaching.healthtech.dtu.dk/material/36610/UnixInstructions36610.pdf Unix Notes])</DD>
<DD>Peter Wad Sackett, Josh Rubin, Nicola Vogel </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Peter Wad Sackett, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 4'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-10:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2021/2_Data_Basics_SG.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD>Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2021/2_Data_preprocessing_SG.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD>Shyam Gopalakrishnan </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-2:15pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2022/23_Alignment_GR.pdf Lecture slides]) </DD>
<DD>Gabriel Renaud </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break'' </DD>

<DT>2:30pm-2:45pm</DT>
<DD>Brief reminder on probabilities and Bayesian theory ([https://teaching.healthtech.dtu.dk/material/22126/2022/24_Bayesian_reminder_GR.pdf Lecture slides]) </DD>
<DD>Gabriel Renaud</DD>

<DT>2:45pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD>Gabriel Renaud, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 5'''
<HR>

''Alignment & Genotyping''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Human Variation</DD>
<DD>Shyam Gopalakrishnan, ([https://teaching.healthtech.dtu.dk/material/22126/2021/FunctionalHumanVariation_SG.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Alignment postprocessing & variant calling ([https://teaching.healthtech.dtu.dk/material/22126/2022/41_post_alignment_variantcalling_GR.pdf Lecture slides])</DD>

<DD>Gabriel Renaud</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Postprocessing & variant calling ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]]) ([[SNP calling exercise]]) ([[SNP_calling_exercise_answers]])</DD>
<DD>Gabriel Renaud, Josh Rubin, Nicola Vogel </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: de novo assembly ([https://teaching.healthtech.dtu.dk/material/22126/2021/4_de_novo_assembly_course_SG.pdf Lecture slides])([http://teaching.healthtech.dtu.dk/material/22126/debruijn_handout.pdf Handout (TODO)]) </DD>
<DD>Shyam Gopalakrishnan</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD>Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

<HR>
'''Thursday, January 6'''
<HR>
''Metagenomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Metagenomics & Binning ([https://teaching.healthtech.dtu.dk/material/22126/2022/Metagenomics_binning.pdf Lecture slides])</DD>
<DD>Gisle Vestergaard</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: Kaiju: Taxonomic classification ([[Kaiju exercise]]) ([[Kaiju solution]]) </DD>
<DD>Gisle Vestergaard, Trine Zachariasen</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: Quantitative Metagenomics ([https://teaching.healthtech.dtu.dk/material/22126/2022/Quantitative_metagenomics_2.pdf Lecture slides])</DD>
<DD>Gisle Vestergaard</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Quantitative Metagenomics ([http://teaching.healthtech.dtu.dk/22126/index.php/QuantitativeMetagenomics Exercise]) ([http://teaching.healthtech.dtu.dk/22126/index.php/QuantitativeMetagenomicsSolution Solution]) </DD>
<DD>Gisle Vestergaard, Trine Zachariasen</DD>
</DL>

<BR>

<HR>
'''Friday, January 7'''
<HR>
''Cell-free DNA and recap test''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Cell-free DNA ([https://teaching.healthtech.dtu.dk/material/22126/2021/cfDNA_lecture_2020_SB.pdf Lecture slides])</DD>
<DD>Søren Besenbacher</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: Cell free DNA ([[cfDNA exercise]])([[CfDNA_exercise_answers]])</DD>
<DD>Søren Besenbacher, Josh Rubin, Nicola Vogel</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2021/test_2021.pdf Test 2021])([https://teaching.healthtech.dtu.dk/material/22126/2021/test_2021_withA.pdf answers])</DD>
<DD>Gabriel Renaud, Josh Rubin, Nicola Vogel </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Time to work on this week's exercises</DD>
<DD>Gabriel Renaud, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

<HR>
'''Monday, January 10'''
<HR>

''RNA-seq and Cancer-seq''
<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2022/RNA-seq_10-01-2022.pdf Lecture slides])</DD>
<DD>Francesca Bertolini</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]]) ([[Rnaseq_exercise_answers]]) </DD>
<DD>Francesca Bertolini, Josh Rubin, Nicola Vogel </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: Cancer-seq ([https://teaching.healthtech.dtu.dk/material/22126/2021/CancerGenomics_Izarzugaza%20copy.pdf Lecture slides]) </DD>
<DD>Elena Papaleo </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Cancer-seq ([[Cancerseq_exercise]]) ([[Cancerseq_exercise_answers]])</DD>
<DD>Adrian Otamendi, Josh Rubin, Nicola Vogel </DD>
</DL>

<BR>

<HR>
'''Tuesday, January 11'''
<HR>

''Genomic Epidemiology and tech talk''

<DL>
<DT>9:00am-9:55am</DT>
<DD>Exercise: Genomic Epidemiology ([[Genomic epidemiology exercise]]) ([[Genomic epidemiology solution]])</DD>
<DD>Shyam Gopalakrishnan</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Case story: Genomic Epidemiology ([https://teaching.healthtech.dtu.dk/material/22126/2021/Genomic_epidemiology_NGScourse_Jan2021.pdf Lecture])</DD>
<DD>Pimlapas Leekitecharoenphon (Shinny)</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-2:00pm </DT>
<DD>Tech talk work</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel </DD>

<DT>2:00pm-4:00pm </DT>
<DD>TechTalks Presentations</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel </DD>

</DL>

<BR>

<HR>
'''Wednesday, January 12'''
<HR>
''Ancient DNA & Project work''

<DL>
<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2021/dtu_adna_2021.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Josh Rubin, Nicola Vogel</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2021/82_Projects_GR.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2022/posters.tar.gz Examples from previous courses]) </DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>

<DT>1:45pm-4:00pm </DT>
<DD>Projects & Group formation, prepare your presentation for tomorrow. Please write group names in the [https://docs.google.com/document/d/11OudbQ1DGY9GwCoFXtU29rAZmhlBokmPify5b0uOb5Q/edit?usp=sharing document for 2022]</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

<HR>
'''Thursday, January 13'''
<HR>
''Project presentation''
<DL>
<DT>9:00am-12:00pm</DT>
<DD>Project work/Prepare presentations for this afternoon</DD>
<DD>Please go to Discord for help, we will be available.</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-2:00pm </DT>
<DD>Project Presentations (what you will do) /Project work</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>

<DT>2:00pm-4:00pm </DT>
<DD>Project work</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

'''Friday, January 14'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

'''Monday, January 17'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

'''Tuesday, January 18'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

'''Wednesday, January 19'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

'''Thursday, January 20'''
<HR>
''Project Work & Poster Printing''
<DL>

<DD>Produce a PDF of your poster; the presentation will be online this year.</DD>
<DD>[http://teaching.healthtech.dtu.dk/material/22126/Posters.pdf Poster guide & requirements]</DD>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the [http://teaching.healthtech.dtu.dk/material/22126/exam.pdf Exam]</DD>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Shyam Gopalakrishnan, Josh Rubin, Nicola Vogel</DD>
</DL>

<BR>

'''Friday, January 21'''
<HR>
''Poster session - Online''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Poster session part ('''exam''')</DD>
</DL>

Program 2023

2024-03-19T16:02:53Z

WikiSysop: Created page with "'''REMEMBER TO BRING A LAPTOP COMPUTER FOR EXERCISES''' Lectures will be in person in building [https://goo.gl/maps/k4wYkMjTJ2HLHuyN8 303A] in auditorium 45. Offline discussions will take place on Discord (https://discord.gg/4yVB6vMG2n). Please register with your '''full name'''. Will use Discord for online classes and collaboration with your project partners. <!-- Lectures and exercises will take place on Discord (https://discord.gg/FBb2edFW). Please register with..."

'''REMEMBER TO BRING A LAPTOP COMPUTER FOR EXERCISES'''

Lectures will be in person in building [https://goo.gl/maps/k4wYkMjTJ2HLHuyN8 303A] in auditorium 45. Offline discussions will take place on Discord (https://discord.gg/4yVB6vMG2n). Please register with your '''full name'''. We will use Discord for online classes and collaboration with your project partners.



The course has two main parts: the first half is lectures and exercises, and the second half is project work ending with the exam on '''Friday 20th of January 2023'''.



=== Course Program - January 2023 ===

<HR>
'''Monday, January 2'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2023/11_Introduction_to_course_GR.pdf Lecture slides])
</DD>
<DD>Gabriel Renaud</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2023/12_Introduction_to_NGS_GR.pdf Lecture slides]) </DD>
<DD>Gabriel Renaud</DD>

<DT>10:00am-10:45am</DT>
<DD>2nd and 3rd generation NGS Technologies
([https://teaching.healthtech.dtu.dk/material/22126/2023/13_Introduction_to_NGS_technology_GR.pdf Lecture slides])</DD>
<DD>Gabriel Renaud</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Tech talk group formation and group work
([https://teaching.healthtech.dtu.dk/material/22126/2023/14_Tech_Talks_GR.pdf Lecture slides])
([https://docs.google.com/spreadsheets/d/1-Add0-Zw1JvUaoPYMmHuxlkPxnF49jSffFZ6L2ajL4I/edit?usp=sharing Student Groups 2023]) </DD>
<DD>Gabriel Renaud</DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Peter Wad Sackett, Nicola Vogel, Louis Kraft, Gabriel Renaud </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Video lectures to watch from "Unix intro.." to "Touching upon..."])
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Exercises] possible answers [[Unix_answers|here]])
([http://teaching.healthtech.dtu.dk/material/36610/UnixInstructions36610.pdf Unix Notes])</DD>
<DD>Peter Wad Sackett, Nicola Vogel, Louis Kraft </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Video lectures to watch from "Unix intro.." to "Touching upon..."])
([https://teaching.healthtech.dtu.dk/36610/index.php/UNIX Exercises] possible answers [[Unix_answers|here]])
([http://teaching.healthtech.dtu.dk/material/36610/UnixInstructions36610.pdf Unix Notes])</DD>
<DD>Peter Wad Sackett, Nicola Vogel, Louis Kraft </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Peter Wad Sackett, Nicola Vogel, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 3'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-10:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2023/21_Data_Basics_GR.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD>Gabriel Renaud, Nicola Vogel, Louis Kraft</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2023/22_Data_Preprocessing_GR.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD>Gabriel Renaud </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-2:15pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2023/23_Alignment_GR.pdf Lecture slides]) </DD>
<DD>Gabriel Renaud </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break'' </DD>

<DT>2:30pm-2:45pm</DT>
<DD>Brief reminder on probabilities and Bayesian theory ([https://teaching.healthtech.dtu.dk/material/22126/2023/24_Bayesian_reminder_GR.pdf Lecture slides]) </DD>
<DD>Gabriel Renaud</DD>

<DT>2:45pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD>Gabriel Renaud, Nicola Vogel, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 4'''
<HR>

''Alignment & Genotyping''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Human Variation</DD>
<DD>Gabriel Renaud, ([https://teaching.healthtech.dtu.dk/material/22126/2023/41_Functional_Human_Variation_GR.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Alignment postprocessing & variant calling ([https://teaching.healthtech.dtu.dk/material/22126/2023/42_post_alignment_variantcalling_GR.pdf Lecture slides])</DD>

<DD>Gabriel Renaud</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Postprocessing & variant calling ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]]) ([[SNP calling exercise]]) ([[SNP_calling_exercise_answers]])</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: de novo assembly ([https://teaching.healthtech.dtu.dk/material/22126/2023/43_de_novo_assembly_course_GR.pdf Lecture slides])([http://teaching.healthtech.dtu.dk/material/22126/debruijn_handout.pdf Handout]) </DD>
<DD>Gabriel Renaud</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Thursday, January 5'''
<HR>
''Metagenomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Metagenomics & Binning ([https://teaching.healthtech.dtu.dk/material/22126/2023/Metagenomics_binning.pdf Lecture slides])</DD>
<DD>Asker Brejnrod</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: Kaiju: Taxonomic classification ([[Kaiju exercise]]) ([[Kaiju solution]]) </DD>
<DD>Asker Brejnrod, Louis Kraft</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: Quantitative Metagenomics ([https://teaching.healthtech.dtu.dk/material/22126/2023/Quantitative_metagenomics.pdf Lecture slides])</DD>
<DD>Asker Brejnrod</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Quantitative Metagenomics ([[QuantitativeMetagenomics]]) ([[QuantitativeMetagenomicsSolution]]) </DD>
<DD>Asker Brejnrod, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Friday, January 6'''
<HR>
''Long read tech and recap test''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Lecture: Long read sequencing ([https://teaching.healthtech.dtu.dk/material/22126/2021/cfDNA_lecture_2020_SB.pdf Lecture slides])</DD>
<DD>Shilpa Garg</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: Long read technology ([[longread exercise]])([[longread_exercise_answers]])</DD>
<DD>Shilpa Garg, Josh Rubin, Louis Kraft</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2023/test_2023.pdf Test 2023])([https://teaching.healthtech.dtu.dk/material/22126/2023/test_2023_withA.pdf answers])</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Time to work on this week's exercises</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Monday, January 9'''
<HR>

''RNA-seq and Ancient DNA''
<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2023/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>
<DD>Kristoffer Vitting-Seerup</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: RNAseq ([https://shorturl.at/nzBPQ Rnaseq_exercise]) ([[Rnaseq_exercise_answers]]) </DD>
<DD>Kristoffer Vitting-Seerup, Josh Rubin, Louis Kraft </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-2:00pm</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2021/dtu_adna_2021.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Josh Rubin, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 10'''
<HR>

''Genomic Epidemiology and tech talk''

<DL>
<DT>9:00am-9:55am</DT>
<DD>Exercise: Genomic Epidemiology ([[Genomic epidemiology exercise]]) ([[Genomic epidemiology solution]])</DD>
<DD>Gabriel Renaud</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Case story: Genomic Epidemiology ([https://teaching.healthtech.dtu.dk/material/22126/2023/Genomic_epidemiology_NGScourse_Jan2023.pdf Lecture])</DD>
<DD>Pimlapas Leekitecharoenphon (Shinny)</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-2:00pm </DT>
<DD>Tech talk work</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft </DD>

<DT>2:00pm-4:00pm </DT>
<DD>TechTalks Presentations</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft </DD>

</DL>

<BR>

<HR>
'''Wednesday, January 11'''
<HR>
''CancerSeq & Project work''

<DL>
<DT>9:00am-9:45am</DT>
<DD>Lecture: Cancer-seq ([https://teaching.healthtech.dtu.dk/material/22126/2023/Cancer_Genomics_EP_2023_2.pdf Lecture slides])</DD>
<DD>Elena Papaleo</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: Cancer-seq ([[Cancerseq_exercise]]) ([[Cancerseq_exercise_answers]])</DD>
<DD>Adrian Otamendi, Josh Rubin, Louis Kraft</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2023/82_Projects_GR.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>

<DT>1:45pm-4:00pm </DT>
<DD>Projects & Group formation: prepare an outline for tomorrow. Please write your group names in the [https://docs.google.com/document/d/1Flu6t3RRm44ExyGli6X_yOqoaK1_02yWZiOWeIPlgSc/edit?usp=sharing document for 2023].</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Thursday, January 12'''
<HR>
''Project presentation''
<DL>
<DT>9:00am-12:00pm</DT>
<DD>Project work/Prepare your project outline for this afternoon</DD>

<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-2:00pm</DT>
<DD>Project outlines (what you will do) /Project work</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>

<DT>2:00pm-4:00pm </DT>
<DD>Project work</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Friday, January 13'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Monday, January 16'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 17'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 18'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Thursday, January 19'''
<HR>
''Project Work & Poster Printing''
<DL>

<DD>Please print your poster.</DD>
<DD>[http://teaching.healthtech.dtu.dk/material/22126/Posters.pdf Poster guide & requirements]</DD>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD>Gabriel Renaud, Josh Rubin, Louis Kraft</DD>
</DL>

<BR>

<HR>
'''Friday, January 20'''
<HR>
''Poster session''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Poster session part ('''exam''')</DD>
</DL>

MediaWiki:Sidebar

2024-03-19T16:01:49Z

WikiSysop: Created page with " * navigation ** https://teaching.healthtech.dtu.dk/|Course List ** https://teaching.healthtech.dtu.dk/22126/|Course 22126 * TOOLBOX"

* navigation
** https://teaching.healthtech.dtu.dk/|Course List
** https://teaching.healthtech.dtu.dk/22126/|Course 22126
* TOOLBOX

Cancerseq exercise answers

2024-03-19T16:00:13Z

WikiSysop: Created page with "'''Q1''' We run: <pre> gatk Mutect2 -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam -normal TCRBOA2-N-WEX -L chr10:3100000-5100000 --germline-resource /home/databases/databases/GRCh38/somatic-hg38_af-only-gnomad.hg38.vcf.gz -O TCRBOA2.vcf.gz </pre> Then either: <pre>..."

'''Q1'''

We run:
<pre>
gatk Mutect2 -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam -normal TCRBOA2-N-WEX -L chr10:3100000-5100000 --germline-resource /home/databases/databases/GRCh38/somatic-hg38_af-only-gnomad.hg38.vcf.gz -O TCRBOA2.vcf.gz
</pre>

Then any of the following commands:
<pre>
bcftools view -H TCRBOA2.vcf.gz|wc -l
bcftools stats TCRBOA2.vcf.gz
zgrep -v "^#" TCRBOA2.vcf.gz|wc -l
zcat TCRBOA2.vcf.gz |grep -v "^#"|wc -l
</pre>

will give you 9 variants (for <code>bcftools stats</code>, read the "number of records" line).
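
The same count can be sketched in Python. This is only an illustration of what the commands above do (count the lines that do not start with "#"); the function names and the tiny example VCF are made up for this sketch.

```python
import gzip
import io


def count_variant_records(lines):
    """Count VCF data lines, i.e. non-empty lines that do not start with '#'."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_in_file(path):
    """Count records in a plain or gzip/bgzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_variant_records(handle)


# Made-up two-record VCF, used only to show the call.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr10\t3165513\t.\tG\tC\t.\t.\t.\n"
    "chr10\t4972935\t.\tA\tT\t.\t.\t.\n"
)
print(count_variant_records(example))
```

On the server, <code>count_in_file("TCRBOA2.vcf.gz")</code> would be the equivalent call.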

'''Q2'''

First we run:
<pre>
gatk HaplotypeCaller -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr10:3100000-5100000 -O TCRBOA2-T.vcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
gatk HaplotypeCaller -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr10:3100000-5100000 -O TCRBOA2-N.vcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
</pre>

Then counting the number of variants in the tumor:

<pre>
bcftools view -H TCRBOA2-T.vcf.gz | wc -l
</pre>
424

and normal sample:

<pre>
bcftools view -H TCRBOA2-N.vcf.gz | wc -l
</pre>
413

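The tumor has more variants (424 vs. 413), which is expected because it carries somatic variants in addition to the germline ones. The difference is 11, which is close to, but not identical to, the 9 raw Mutect2 calls from Q1.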

'''Q3'''

<pre>
gatk FilterMutectCalls -V TCRBOA2.vcf.gz -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -O TCRBOA2_filtered.vcf.gz
</pre>

We can inspect the calls visually:

<pre>
bcftools view -H TCRBOA2_filtered.vcf.gz |less -S
</pre>

Or to classify filters in a straightforward manner:

<pre>
bcftools view -H TCRBOA2_filtered.vcf.gz |cut -f 7 |tr ";" "\n" |sort |uniq -c |sort
</pre>

You get:
<pre>
1 slippage
2 haplotype
2 PASS
2 weak_evidence
3 clustered_events
3 map_qual
4 strand_bias
6 normal_artifact
</pre>

See some notes on the meaning of these filters [https://www.biorxiv.org/content/biorxiv/early/2019/12/02/861054/DC1/embed/media-1.pdf?download=true here].
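
For reference, the tally produced by the cut/tr/sort/uniq pipeline can be sketched in Python as follows. It assumes tab-separated VCF lines with the FILTER value in the 7th column; the two example records are made up.

```python
from collections import Counter


def tally_filters(records):
    """Count every filter name found in the FILTER column (7th field)."""
    counts = Counter()
    for line in records:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        # Several filters on one record are separated by ';'
        counts.update(fields[6].split(";"))
    return counts


# Made-up records, for illustration only.
example = [
    "chr10\t100\t.\tA\tT\t.\tmap_qual;strand_bias\t.\n",
    "chr10\t200\t.\tG\tC\t.\tPASS\t.\n",
]
for name, count in sorted(tally_filters(example).items()):
    print(count, name)
```

As in the shell version, a record that fails several filters contributes to several counts, so the counts can sum to more than the number of variants.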

'''Q4'''
Running
<pre>
java -jar /usr/local/bin/SnpSift.jar annotate /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz TCRBOA2_filtered.vcf.gz | bgzip -c > TCRBOA2_filtered_anno.vcf.gz
bcftools view -H -f PASS TCRBOA2_filtered_anno.vcf.gz | less -S
</pre>

should give you:
<pre>
chr10 3165513 rs9423502 G C . PASS AS_FilterStatus=SITE;AS_SB_TABLE=39,119|0,3;DP=168;ECNT=1;GERMQ=93;MBQ=38,39;MFRL=225,184;MMQ=60,60;MPOS=4;NALOD=1.86;NLOD=21.37;POPAF=1.32;TLOD=6.08;CAF=[0.9454,0.05464];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;NSM;OTHERKG;PH3;REF;RS=9423502;RSPOS=3207705;S3D;SAO=0;SLO;SSR=0;VC=SNV;VLD;VP=0x050300000a01150517000101;WGT=1;dbSNPBuildID=119 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:79,0:0.014:79:41,0:38,0:19,60,0,0 0/1:79,3:0.054:82:41,2:38,1:20,59,0,3
chr10 4972935 . A T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=8,84|0,4;DP=102;ECNT=1;GERMQ=93;MBQ=36,34;MFRL=302,495;MMQ=50,40;MPOS=20;NALOD=1.58;NLOD=11.08;POPAF=1.62;TLOD=8.82 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:37,0:0.026:37:16,0:20,0:1,36,0,0 0/1:55,4:0.083:59:26,2:28,2:7,48,0,4
</pre>

The ID "rs9423502" is a dbSNP ID, so the SNP at position 3165513 was previously reported, whereas the variant at position 4972935 was not.

'''Q5'''

The SNP can be found here: https://www.ncbi.nlm.nih.gov/snp/?term=rs9423502

Generally, the prevalence of the SNP is relatively low (2-5%), which leaves room for a potential role in driving cancer; a variant that is common in the population is less likely to be a driver.

'''Q6'''

The variant on chromosome 18 is a missense variant, potentially deleterious, and has the COSMIC ID COSV99493765. In the COSMIC database, it hits the DYM gene and is mostly found mutated in liver and prostate.

'''Q7'''

Often enough, around 6% in certain cases

'''Q8'''

Kidney, but the confidence is low as the prediction score is a virtual tie with liver.

Cancerseq exercise

2024-03-19T15:59:51Z

WikiSysop:

<H2>Overview</H2>

Adapted from an original exercise by Marcin Krzystanek and Aron Eklund.

First:
<OL>
<LI>Navigate to your home directory.
<LI>Create a directory called "cancerseq"
<LI>Navigate to the directory you just created.
</OL>

In this exercise, we learn the differences between standard genotyping and genotyping for cancer-seq:
<OL>
<LI>Somatic mutation calling
<LI>Interpretation of the resulting somatic mutations
</OL>

<H2>Somatic mutation calling</H2>

<H3> Mutect2 </H3>

We use Mutect2, a somatic mutation caller that identifies both SNVs and indels. It produces a VCF file, although the Mutect2 output carries some fields that are specific to somatic variants. See the [https://samtools.github.io/hts-specs/VCFv4.2.pdf VCF specification] for the file format.

A big difference in cancer-seq variant calling using Mutect2 is that there are no ploidy assumptions. This accommodates tumor data that can have many copy number variants (CNVs).

Mutect2 is computationally intensive, so we recommend parallelizing if possible. One way to achieve this is to split the work by chromosome (calling variants for each chromosome separately and then merging the VCF files).
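
As a rough sketch of the per-chromosome split, the snippet below only builds one Mutect2 command per chromosome, using the same options as in this exercise; it does not run them. The chromosome list (chr1 to chr22, chrX, chrY) and the output naming scheme are assumptions made for illustration, and the merging step is left out.

```python
def mutect2_command(reference, normal_bam, tumor_bam, normal_id, region, germline, output):
    """Return the Mutect2 call used in this exercise as an argument list."""
    return [
        "gatk", "Mutect2",
        "-R", reference,
        "-I", normal_bam,
        "-I", tumor_bam,
        "-normal", normal_id,
        "-L", region,
        "--germline-resource", germline,
        "-O", output,
    ]


# Assumption: autosomes plus X and Y, named with the "chr" prefix.
chromosomes = [f"chr{n}" for n in range(1, 23)] + ["chrX", "chrY"]

commands = [
    mutect2_command(
        reference="/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa",
        normal_bam="/home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam",
        tumor_bam="/home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam",
        normal_id="TCRBOA2-N-WEX",
        region=chrom,
        germline="/home/databases/databases/GRCh38/somatic-hg38_af-only-gnomad.hg38.vcf.gz",
        output=f"TCRBOA2.{chrom}.vcf.gz",  # hypothetical naming scheme
    )
    for chrom in chromosomes
]

# Show the command for chromosome 10 (index 9 in the list).
print(" ".join(commands[9]))
```

Each argument list could then be handed to a job scheduler or to <code>subprocess.run</code>, one process per chromosome.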

Since we neither have the time nor the capacity to process the entire genome during our exercises, we will call somatic mutations on a small part of chromosome 10, from the 3,100,000th to the 5,100,000th base pair, which is set with the -L option. We have 2 files, one for the normal tissue:
<pre>
/home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam
</pre>

and one for the tumor:
<pre>
/home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam
</pre>

The reference can be found at:
<pre>
/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa
</pre>

You can now run Mutect2:
<pre>
gatk Mutect2 -R [reference] -I [BAM NORMAL SAMPLE] -I [BAM TUMOR SAMPLE] -normal [ID of the NORMAL SAMPLE] -L [REGION] --germline-resource /home/databases/databases/GRCh38/somatic-hg38_af-only-gnomad.hg38.vcf.gz -O [OUTPUT VCF]
</pre>

Please remember that the region uses the following format:
<pre>
[CHR NAME]:[START COORD]-[END COORD]
</pre>
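
To make the region format concrete, here is a small Python sketch that splits such a string into its parts and reports the interval length. The function names are hypothetical, and the sketch assumes a simple contig name such as chr10 and inclusive coordinates.

```python
def parse_region(region):
    """Split '[CHR NAME]:[START COORD]-[END COORD]' into (chrom, start, end)."""
    chrom, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start, end = int(start_text), int(end_text)
    if not chrom or start > end:
        raise ValueError(f"malformed region: {region!r}")
    return chrom, start, end


def region_length(region):
    """Number of bases covered, assuming both coordinates are inclusive."""
    _, start, end = parse_region(region)
    return end - start + 1


print(parse_region("chr10:3100000-5100000"))
print(region_length("chr10:3100000-5100000"))
```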

The <code>--germline-resource</code> option supplies the population frequencies of known germline variants. You can call your output TCRBOA2.vcf.gz.

Take a look at the resulting VCF file and try to count the number of raw variants, using either bcftools or standard UNIX tools (zgrep or zcat+grep).

'''Q1'''

How many variants did you find?

<H3> Compare with calling all variants </H3>

Just for comparison, we call all variants in the interval for the normal (germline) sample and for the tumor sample.

<pre>
gatk HaplotypeCaller -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr10:3100000-5100000 -O TCRBOA2-T.vcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
gatk HaplotypeCaller -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr10:3100000-5100000 -O TCRBOA2-N.vcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
</pre>
Count the variant lines in each with the same command as above:

<pre>
bcftools view -H TCRBOA2-T.vcf.gz | wc -l
bcftools view -H TCRBOA2-N.vcf.gz | wc -l
</pre>

'''Q2'''
Where do you find the highest number of raw variants? Does that make biological sense? What is the difference between the two numbers and does it match above?

<H3> Filtering and annotating variants </H3>

Before continuing, we need to filter the raw vcf-output to only get confident variants:

<pre>
gatk FilterMutectCalls -V TCRBOA2.vcf.gz -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -O TCRBOA2_filtered.vcf.gz
</pre>

Try to look at the variants.

'''Q3''' What does it look like in the filter column (7th column)? What kind of filters were applied?

To add some extra information to the VCF file, we will also annotate it with the dbSNP IDs of known SNPs. HaplotypeCaller can do this as it calls variants, but with Mutect2 we need to do it ourselves:

<pre>
java -jar /home/ctools/snpEff/SnpSift.jar annotate /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz [INPUT VCF]
</pre>

The command above produces output on the STDOUT. The input is the VCF you produced with FilterMutectCalls. Try to produce a file called "TCRBOA2_filtered_anno.vcf.gz". Remember to use "bgzip -c". Note that this command may take up to 3 minutes. Now try to filter mutational calls by selecting those with Mutect "PASS" annotation.

<pre>
bcftools view -H -f PASS TCRBOA2_filtered_anno.vcf.gz
</pre>

You should see at least the variant line below. The command omits the #CHROM header line, which is shown here only for orientation. Don't forget you can scroll to the sides!

<pre>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TCRBOA2-N-WEX TCRBOA2-T-WEX
chr10 3165513 rs9423502 G C . PASS AS_FilterStatus=SITE;AS_SB_TABLE=39,119|0,3;DP=168;ECNT=1;GERMQ=93;MBQ=38,39;MFRL=225,184;MMQ=60,60;MPOS=4;NALOD=1.86;NLOD=21.37;POPAF=1.32;TLOD=6.08;CAF=[0.9454,0.05464];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;NSM;OTHERKG;PH3;REF;RS=9423502;RSPOS=3207705;S3D;SAO=0;SLO;SSR=0;VC=SNV;VLD;VP=0x050300000a01150517000101;WGT=1;dbSNPBuildID=119 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:79,0:0.014:79:41,0:38,0:19,60,0,0 0/1:79,3:0.054:82:41,2:38,1:20,59,0,3
</pre>

There is a lot of information in the line. A brief explanation of each part of the line above is in the header of the VCF file (pipe into "less -S" to look at it, but it is also difficult to interpret).

Let's focus on the FORMAT column and the two sample columns. The per-sample information is organized as GT:AD:AF:DP:F1R2:F2R1:SB, and the values are stored in the two following columns, also separated by colons. The first, starting with '''0/0''', refers to the normal sample, whereas the column beginning with '''0/1''' refers to the tumor. This means that the tumor is heterozygous (REF/ALT) for the mutation, which is not seen in the normal sample at all (it is REF/REF).

After the genotype (GT) we have the allelic depth (AD), which is "79,3" in the tumor (i.e. 79 reads support the reference allele and 3 the mutant allele) and "79,0" in the normal sample. Then come the allele fraction (AF), Mutect2's estimate of the fraction of the mutant allele among the reads at this position, and the depth (DP). We will skip the remaining values for F1R2, F2R1 and SB for now.
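
The pairing of the FORMAT keys with the per-sample values can be illustrated with a few lines of Python, using the values from the line shown above. The helper names are made up for this sketch. Note that the raw ALT read ratio computed from AD is not identical to the reported AF: the normal sample has 0 ALT reads but AF=0.014.

```python
def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the matching value of one sample column."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def alt_read_fraction(sample):
    """Raw fraction of reads supporting the ALT allele, computed from AD."""
    ref_reads, alt_reads = (int(value) for value in sample["AD"].split(","))
    return alt_reads / (ref_reads + alt_reads)


fmt = "GT:AD:AF:DP:F1R2:F2R1:SB"
normal = parse_sample(fmt, "0/0:79,0:0.014:79:41,0:38,0:19,60,0,0")
tumor = parse_sample(fmt, "0/1:79,3:0.054:82:41,2:38,1:20,59,0,3")

for name, sample in (("normal", normal), ("tumor", tumor)):
    print(name, sample["GT"], sample["AD"], sample["AF"], sample["DP"],
          round(alt_read_fraction(sample), 3))
```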

'''Q4'''

There should be 2 variants. Which one was already found in previous studies (i.e. is documented in dbSNP)?

<H2> Interpretation of the resulting somatic mutations </H2>

A list of chromosome coordinates is kind of hard to interpret. Here are some ways to approach the results.

<H3> Variant annotation with dbSNP, Variant effect predictor and COSMIC </H3>

Find the RS identifier from the cancer mutation with a dbSNP ID and look it up at [https://www.ncbi.nlm.nih.gov/snp/ dbSNP].

'''Q5'''

Find the frequency table tab. Is your mutation common in some populations? What does a high frequency tell you about its role in cancer?

So far you have processed and analyzed only a small section of chromosome 10.

Now, let us analyze a bigger portion of the genome. Pick your favorite chromosome and find the corresponding VCF file on the server. For example, if you choose chromosome 7, you would use this file:

<pre>
/home/projects/22126_NGS/exercises/cancer_seq/chr7.vcf
</pre>

Hint: your results will be more interesting if you pick chromosome 18!

Filter the VCF to retain only the lines marked as "PASS".

<pre>
grep "PASS" /home/projects/22126_NGS/exercises/cancer_seq/chr7.vcf > chr7_filtered.vcf
</pre>

Download the filtered VCF to your own computer using:
<pre>
scp [userID]@pupil1:/path/to/file .
</pre>
and submit it to the [http://www.ensembl.org/Tools/VEP VEP website] using default settings. When the results become available, look in the "Somatic status" column. Are there any known cancer mutations?

If you find a known cancer mutation, find its COSMIC identifier (COSM######, e.g. COSM4597270) in the "existing variant" column. Search for your COSMIC identifier in the [http://cancer.sanger.ac.uk/cosmic COSMIC database].

'''Q6'''

In which tissues is this mutation found?

<H3>cBioPortal</H3>

Go to [http://www.cbioportal.org/ cBioPortal], a website that provides tools to analyze several large cancer sequencing datasets. In Quick Select, choose "TCGA PanCancer Atlas studies". Then press "Query by Gene" and type in the name of the gene that was hit by this mutation. Choose "mutations", as we have not looked at Copy Number Alterations. Press "Submit Query". Look at the bar charts and play around with the options.

'''Q7'''

How often is this gene mutated in various cancer types?

<H3> Inference of tissue of origin </H3>

Next, we will do some analysis on a VCF file containing somatic mutations found throughout the entire genome:

<pre>
/home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2_filtered.vcf.gz
</pre>

Unlike VEP, TumorTracer requires VCF files to have the header information. Thus, we will filter this VCF file to retain: 1) header lines (which begin with "#"), and 2) data lines with a PASS call.
<pre>
zgrep -E "^#|PASS" /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2_filtered.vcf.gz > TCRBOA2_filtered_pass.vcf
</pre>
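
The pattern above keeps any data line that contains "PASS" anywhere on the line. A slightly stricter variant, sketched here in Python, keeps the header lines and only those records whose FILTER column (7th field) is exactly PASS. The function name and the example lines are made up for illustration.

```python
def keep_header_and_pass(lines):
    """Yield header lines and records whose FILTER column is exactly 'PASS'."""
    for line in lines:
        if line.startswith("#"):
            yield line
        elif line.rstrip("\n").split("\t")[6] == "PASS":
            yield line


# Made-up example: one header line, one PASS record, one filtered record.
example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr10\t100\t.\tA\tT\t.\tPASS\t.\n",
    "chr10\t200\t.\tG\tC\t.\tweak_evidence\t.\n",
]
print("".join(keep_header_and_pass(example)), end="")
```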

Download the VCF file and submit it to the [https://services.healthtech.dtu.dk/services/TumorTracer-1.1/ TumorTracer server]. Make sure to specify that this VCF was generated '''using GRCh38 coordinates'''.

'''Q8'''

What tissue does TumorTracer predict? Is it a confident prediction?

<p>Congratulations, you finished the exercise!</p>

<HR>

Please find the answers at [[Cancerseq_exercise_answers]].

Genomic epidemiology solution

2024-03-19T15:56:58Z

WikiSysop: Created page with "Q1. We can use the fastx tools to get this information <pre> fastx_readlength.py --i reads_R1.fastq.gz --gz fastx_readlength.py --i reads_R2.fastq.gz --gz </pre> This should give you the answer. Reads: 2*1408847 = 2,817,694; Bases: 284587094 Task1 - can be accomplished using a quality control tool like fastqc. <pre> mkdir fastqc fastqc -o fastqc reads_R*.fastq.gz firefox fastqc/reads_R1_fastqc.html fastqc/reads_R2_fastqc.html & </pre> Q2. There are no overrepresented..."

Q1. We can use the fastx tools to get this information
<pre>
fastx_readlength.py --i reads_R1.fastq.gz --gz
fastx_readlength.py --i reads_R2.fastq.gz --gz
</pre>
This should give you the answer:
Reads: 2 * 1,408,847 = 2,817,694; bases: 284,587,094
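
If the fastx script is not available, the two numbers can also be obtained with a short Python sketch like the one below. It assumes standard 4-line FASTQ records; the function name and the example reads are made up.

```python
import io


def fastq_stats(handle):
    """Return (number of reads, number of bases) for 4-line FASTQ records."""
    reads = 0
    bases = 0
    for index, line in enumerate(handle):
        if index % 4 == 1:  # the sequence is the 2nd line of every record
            reads += 1
            bases += len(line.strip())
    return reads, bases


# Made-up two-read FASTQ, for illustration only.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nIII\n")
print(fastq_stats(example))
```

For the gzipped files of the exercise, the handle could come from <code>gzip.open("reads_R1.fastq.gz", "rt")</code>.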

Task1 - can be accomplished using a quality control tool like fastqc.
<pre>
mkdir fastqc
fastqc -o fastqc reads_R*.fastq.gz
firefox fastqc/reads_R1_fastqc.html fastqc/reads_R2_fastqc.html &
</pre>

Q2. There are no overrepresented sequences, but to ensure that there are no stray adapters at all, we will still perform adapter removal using AdapterRemoval - you can use any other tool to perform this task, e.g. cutadapt, leeHom etc.

Task2 - can be accomplished using AdapterRemoval
<pre>
AdapterRemoval --file1 reads_R1.fastq.gz --file2 reads_R2.fastq.gz --adapter1 ATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCC --adapter2 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGCATCTCGTATGC --basename reads --gzip --trimqualities --minquality 20 --minlength 40
</pre>

Q3. You can use this command to generate the histogram of the kmer coverages.
<pre>
gzip -dc reads.pair*.truncated.gz | jellyfish count -t 2 -m 15 -s 1000000000 -o reads_jellyfish -C /dev/fd/0
jellyfish histo reads_jellyfish > reads.histo
</pre>
Now we can use R to plot it. Start R and use these commands to plot the histogram and save it as a pdf.
<pre>
dat=read.table("reads.histo")
barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8)
dev.print("reads.histo.pdf", device=pdf)
</pre>
You can see from the plot that the average k-mer coverage is about 41, so yes, the depth is high enough to do error correction.
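
Instead of reading the peak off the plot, it can be estimated from the two-column histogram (k-mer multiplicity, number of k-mers). The Python sketch below uses one simple heuristic, which is an assumption of this example and not part of the exercise: skip the initial peak of low-count (error) k-mers by walking down to the first valley, then take the highest point after it. The example numbers are made up.

```python
def read_histogram(lines):
    """Parse 'multiplicity count' lines into a list of integer pairs."""
    pairs = []
    for line in lines:
        if line.strip():
            multiplicity, count = line.split()
            pairs.append((int(multiplicity), int(count)))
    return pairs


def coverage_peak(histogram):
    """Return the multiplicity of the main peak after the first valley."""
    valley = 0
    # Walk down the error peak while the counts keep decreasing.
    while valley + 1 < len(histogram) and histogram[valley + 1][1] < histogram[valley][1]:
        valley += 1
    return max(histogram[valley:], key=lambda pair: pair[1])[0]


# Made-up histogram, for illustration only.
example = ["1 1000", "2 300", "3 50", "4 80", "5 200", "6 150"]
print(coverage_peak(read_histogram(example)))
```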

Task 3: Use jellyfish and musket, as in the de novo exercise, to do error correction.
<pre>
jellyfish stats reads_jellyfish

# XYZ is a placeholder (assumed here to be the number of distinct k-mers reported by jellyfish stats)
musket -k 15 XYZ -p 1 -omulti reads.cor -inorder reads.pair1.truncated.gz reads.pair2.truncated.gz -zlib 1
mv reads.cor.0 reads.pair1.cor.truncated.gz
mv reads.cor.1 reads.pair2.cor.truncated.gz

# If Musket won't run, copy the data to your directory:
# cp /home/projects/22126_NGS/exercises/genomic_epi/reads.pair*.cor.truncated.gz .
</pre>

Task 4: Make an assembly using SOAPdenovo
<pre>
SOAPdenovo-127mer pregraph -s unknown.soap.conf -K 35 -p 2 -o initial
SOAPdenovo-127mer contig -g initial
</pre>
<p>Now you should have a file called "initial.contig". We need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Let's map only the first 100,000 reads; this should be enough.</p>

<pre>
# A FASTQ record spans 4 lines, so 100,000 reads correspond to 400,000 lines
zcat reads.pair1.cor.truncated.gz | head -n 400000 > reads_sample_1.fastq
zcat reads.pair2.cor.truncated.gz | head -n 400000 > reads_sample_2.fastq

bwa index initial.contig
bwa mem initial.contig reads_sample_1.fastq reads_sample_2.fastq | samtools view -Sb - > initial.sample.bam

samtools view initial.sample.bam | cut -f9 > initial.insertsizes.txt

R
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx]) # mean
sd(a.v[a.v >= mn & a.v <= mx]) # sd
</pre>


<p>Update the unknown.soap.conf with the correct insert size and run de novo assembly of the reads using K=65 (this is a fairly good K for this data).</p>
<pre>
SOAPdenovo-127mer all -s unknown.soap.conf -K 65 -p 2 -o asmK65
</pre>

Q4. Scaffolds: 150, Contigs: 416, Scaffold N50: 376655, Contig N50: 70749

Q5. Yes, ST-313.

Task 5: Alignment to the sequence type reference genome.
<pre>
bwa index reference.fa
bwa mem -t 2 -R "@RG\tID:ST313\tSM:ST313\tPL:ILLUMINA" reference.fa reads.pair1.truncated.gz \
reads.pair2.truncated.gz | samtools view -Sb - > aln.bam
</pre>

Q6.
<pre>
samtools view -u -q 30 aln.bam | samtools sort -O BAM -o aln.sort.bam -
samtools index aln.sort.bam

bedtools genomecov -ibam aln.sort.bam > aln.cov
R --vanilla aln.cov aln.cov genome < /home/projects/22126_NGS/exercises/genomic_epi/programs/plotcov.R
evince aln.cov.pdf &
</pre>

Around 99% of the genome is at 10x or better coverage.

Task 6: Remove duplicates, sort the resulting bam and index it.
<pre>
java -Xmx2g -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates INPUT=aln.sort.bam OUTPUT=aln.sort.rmdup.bam ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
samtools index aln.sort.rmdup.bam
</pre>

Q7. 104517 reads are duplicates

Q8. 62 variants were called, 13 are indels.

Q9. 49 SNPs

Q10. Our strain is the closest to "C". A Congo 2002 strain of Salmonella Typhimurium ST313.

Q11. Aminoglycoside, Beta-lactam, Sulphonamide and Trimethoprim

Genomic epidemiology exercise

2024-03-19T15:56:21Z

WikiSysop: Created page with "<H3>Overview</H3> You are a bioinformatician working with epidemiology in Copenhagen. In the last couple of days more and more people have been admitted to the Hospitals in the region with severe bacteremia (bacteria in the bloodstream) and several patients, especially infants and young children, have had fatal outcomes. Luckily the probable causative infectious bacteria has been isolated and sequenced on a desktop sequencer and now it is your job to find out: <OL> <LI..."

<H3>Overview</H3>

You are a bioinformatician working with epidemiology in Copenhagen. In the last couple of days more and more people have been admitted to the hospitals in the region with severe bacteremia (bacteria in the bloodstream) and several patients, especially infants and young children, have had fatal outcomes. Luckily the probable causative bacterium has been isolated and sequenced on a desktop sequencer and now it is your job to find out:

<OL>
<LI>What is it?
<LI>Have we seen it before?
<LI>How can we treat it (is it drug-resistant)?
</OL>

<HR>

<H3>Preparation</H3>

Make a copy of the data

<pre>
mkdir genomic_epi
cd genomic_epi
cp /home/projects/22126_NGS/exercises/genomic_epi/ref* .
cp /home/projects/22126_NGS/exercises/genomic_epi/unk* .
cp /home/projects/22126_NGS/exercises/genomic_epi/data/reads* .
</pre>

<HR>

<H3>Preprocessing</H3>

Let's look at the number of reads that we have and at the quality of the data. For the fastx_readlength.py command the output is "avg readlength, min length, max length, no. reads and no. bases".

<b>Q1. How many reads and bases do we have in total? (Hint: Use tools from previous lectures to figure this out) </b>
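The same two totals can also be counted directly in the shell. This is only a minimal sketch and not the course tool: the helper name fastq_stats is made up for this example, and it assumes plain 4-line FASTQ records (the sequence is on the second line of every record).

```shell
# fastq_stats: read an uncompressed FASTQ stream on stdin and print
# "<number of reads> <number of bases>".
# Assumption: every record is exactly 4 lines (no wrapped sequences).
fastq_stats() {
  awk 'NR % 4 == 2 { reads++; bases += length($0) }
       END { printf "%.0f %.0f\n", reads, bases }'
}

# Tiny in-line example: two reads of 8 and 6 bases.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n' | fastq_stats

# On real data (YOUR_READS.fastq.gz is a placeholder for your own file):
#   gzip -dc YOUR_READS.fastq.gz | fastq_stats
```

fastx_readlength.py gives the number of reads and bases as well, together with the read-length statistics.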

<b>Task1: Run quality control using fastqc to check for quality of data.</b>

<b>Q2. Are there any aspects of the data you would want to correct? Would you trim adapters, for example? </b>

<b>Task2: Let's trim the data. The sequencing adapters and primers used are given below (even though there are no overrepresented sequences in R2):</b>
<pre>
Adaptor for R1: ATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCC
Adaptor for R2: GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGCATCTCGTATGC
</pre>

<HR>

<H3>Error correction</H3>

Let's count k-mers and plot the k-mer distribution - again we count 15-mers because we are looking at a bacterial genome.

<b>Q3. Where is the peak? Do we have enough coverage of the genome to correct the reads (e.g. above 15X)? Hint: use jellyfish count and jellyfish histo to generate a histogram of k-mer depths</b>
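The peak can also be read off the histogram table without plotting it. A minimal sketch, assuming the two-column output of jellyfish histo (k-mer multiplicity, number of distinct k-mers with that multiplicity). The function name kmer_peak and the default cutoff of 5, used to skip the low-multiplicity error k-mers, are choices made for this illustration only.

```shell
# kmer_peak [MIN]: read a two-column histogram (multiplicity, number of
# distinct k-mers) on stdin and print the multiplicity with the most k-mers,
# ignoring multiplicities below MIN (assumed default: 5).
kmer_peak() {
  awk -v min="${1:-5}" '$1 >= min && $2 > best { best = $2; peak = $1 }
                        END { print peak }'
}

# Made-up histogram: many error k-mers seen once or twice, real peak at 41.
printf '1 1000\n2 300\n3 50\n10 80\n40 200\n41 250\n42 240\n' | kmer_peak 5

# On the exercise data:
#   kmer_peak 5 < reads.histo
```

Without the cutoff the answer would always be multiplicity 1, because sequencing errors create a very large number of k-mers that are seen only once.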

<p> Let's correct the reads using Musket. Here we use the k-mer distribution from above to identify "true" and "error" k-mers and then pass through the reads, correcting the errors. First, get the number of distinct k-mers in the dataset and then run musket. Replace "XYZ" in the musket command with the number of distinct k-mers before running it. </p>

<b>Task 3: Use error correction command from previous exercises to do the error correction. If musket takes too long or won't run, use pre-generated files. </b>

<HR>

<H2>What is it?</H2>

<HR>

<H3>Identify species</H3>

Let's identify the species. Here we can use a program developed by the [http://www.genomicepidemiology.org/ Center for Genomic Epidemiology] to quickly identify species from either reads or an assembly. It maps the reads to a database of 16S rRNA genes. The reads that map to the database - enriched 16S rRNA reads - are then used to perform a ''denovo'' assembly, returning a full-length 16S rRNA sequence. This is then compared with BLAST to known 16S rRNAs and the output is written to ssu.out.<br><br>NB: 16S rRNA is not the best method to predict species - it has an accuracy of around 80%; we could use k-mers instead. If you are interested in this you can have a look at this [http://jcm.asm.org/content/52/5/1529.short paper]. You can try a k-mer method [http://cge.cbs.dtu.dk/services/KmerFinder/ here].

<pre>
speciesfinder.py reads.pair1.cor.truncated.gz --Mbases 20 --n 2 --tax --sample pathogenic_bacteria
</pre>

When it is done, open the output file (called ssu.out, inside the pathogenic_bacteria folder). The first line is the best match to our assembly in the database. The second column tells us whether we can trust the prediction or not. <b> Which species do we think it is?</b> This particular species is known to cause gastroenteritis, with more than 90 million cases and 155,000 deaths globally each year. Most human infections are self-limiting, although 5% of the patients will develop bacteremia. Our strain, however, seems to be more pathogenic.

<HR>

<H3><i>de novo</i> assembly</H3>

<p>Ok now we know what species it is - but this species is found everywhere on the planet, maybe we should see if we could identify it more precisely within the species. For this we can use MLST (Multi Locus Sequence Typing) which is a method used in clinical epidemiology to type different sub-types of bacterial species. Instead of using just one Locus (=gene) as is done with 16S rRNA, it uses multiple loci to identify the type. Luckily we have a server that can do that for us, but first, we need to assemble the genome.</p>

<p> Let's start by making a ''denovo'' assembly to identify the insert size for the final assembly. "unknown.soap.conf" is a template configuration file for SOAPdenovo2. Open it with your favorite text editor and insert the correct location of your preprocessed reads.</p>

<b>Task 4: Make a denovo assembly using SOAPdenovo as we did in the denovo exercises.
This task has many steps - refer back to the denovo exercises for all the details. In the second run of SOAPdenovo, use a k-mer size of 65. </b>

<p>Now, let us filter the assembly contigs and scaffolds for minimum length of 100. We can use fastx_filterfasta.py for this. </p>

<pre>
fastx_filterfasta.py --i asmK65.contig --min 100
fastx_filterfasta.py --i asmK65.scafSeq --min 100
assemblathon_stats.pl asmK65.contig.filtered_100.fa > asmK65.contig.stats
assemblathon_stats.pl asmK65.scafSeq.filtered_100.fa > asmK65.scafSeq.stats
</pre>

<p>Look in the <i>asmK65.contig.stats</i> file under the "contigs" section to find the stats for the contigs, also look in the <i>asmK65.scafSeq.stats</i> file under the "scaffold" section to find the stats</p>
<b>Q4. How many contigs and scaffolds do you get and what is the N50?</b>
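As a reminder of what N50 means: sort the sequences from longest to shortest and add up their lengths; N50 is the length of the sequence at which the running sum first reaches half of the total assembly length. Below is a small sketch of that definition. The helper names fasta_lengths and n50 are made up for this example, and assemblathon_stats.pl already reports the N50 for you, so this is only meant to make the number less of a black box.

```shell
# fasta_lengths: read FASTA on stdin, print one sequence length per line.
fasta_lengths() {
  awk '/^>/ { if (seen) print n; seen = 1; n = 0; next }
       { n += length($0) }
       END { if (seen) print n }'
}

# n50: read one length per line on stdin and print the N50.
n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      running = 0
      for (i = 1; i <= NR; i++) {
        running += len[i]
        # first sequence at which the running sum reaches half the total
        if (running >= total / 2) { print len[i]; exit }
      }
    }'
}

# Made-up lengths: total is 400, and 100 + 80 + 70 = 250 is the first
# running sum of at least 200.
printf '%s\n' 100 80 70 50 40 30 20 10 | n50

# On the exercise data:
#   fasta_lengths < asmK65.contig.filtered_100.fa | n50
```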

<HR>

<H3>MLST</H3>

<p>We have an assembly, let's see if we can identify a multilocus sequence type (MLST) for our strain. This works by finding the particular MLST alleles in our genome and comparing them to a table to see if there is any match with a known sequence type. Here we are going to use the command-line version but we could just as well have used the [http://cge.cbs.dtu.dk/services/MLST/ webserver]. <b>Figure out what to use as the species name (-d option) by running mlst.pl without any inputs and make sure to exchange the assembly.fa with the scaffold sequences from our assembly.</b> </p>
Remember to replace the ''speciesname'' with the appropriate species name from the list you get from mlst.pl:
<pre>
mlst.pl -d speciesname -i assembly.fa > mlst.res
</pre>


<p>The run takes around 5 minutes to complete. When it is complete, open the file and look at the output - there is an alignment of our assembly vs. the closest match in the MLST database. You see that for the particular species we are using there are 7 loci, and it is the combination of these that makes up the sequence type. Also, we require that there is a 100% match over the entire allele to identify a sequence type.</p>
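The last step, going from seven allele numbers to a sequence type, is just a table lookup that needs an exact match on every locus. A minimal sketch of the idea follows. The function name lookup_st is made up, and the profile table is invented for illustration: the allele numbers and the ST names A, B and C are not real database entries, and this is not how mlst.pl is implemented.

```shell
# lookup_st PROFILE: read a tab-separated profile table on stdin (one header
# row, then one row per sequence type: name followed by one allele number per
# locus) and print the type whose alleles all equal the comma-separated PROFILE.
lookup_st() {
  awk -F'\t' -v query="$1" '
    NR == 1 { next }                        # skip the header row
    { key = $2
      for (i = 3; i <= NF; i++) key = key "," $i
      if (key == query) { print "ST-" $1; found = 1 } }
    END { if (!found) print "no exact match" }'
}

# Invented table with seven loci (illustrative values only).
table='ST\taroC\tdnaN\themD\thisD\tpurE\tsucA\tthrA
A\t1\t1\t1\t1\t1\t1\t5
B\t1\t1\t2\t1\t1\t1\t5
C\t4\t7\t2\t9\t5\t9\t8'

printf '%b\n' "$table" | lookup_st "1,1,2,1,1,1,5"   # all 7 loci match type B
printf '%b\n' "$table" | lookup_st "1,1,2,1,1,1,6"   # one allele differs
```

The second call shows why the 100% requirement matters: a single differing allele means the profile is a different (possibly new) sequence type, not "almost B".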

<b>Q5. Do we have 100% matches to a sequence type? If yes, which sequence type (ST-XXX) is our pathogen?</b><br>

<p>This particular sequence type is known to be present in sub-Saharan Africa and to have increased mortality in children - in some countries with a higher death toll than malaria. This is a dangerous sequence type.</p>

<HR>

<H2>SNV based phylogeny</H2>

<p>We can now assume that our pathogen is coming from sub-Saharan Africa, but we do not know where the particular strain is from. To get a higher resolution we can build a phylogeny based on SNPs and compare our strain to other strains. First, let us map the reads to the reference genome of our sequence type.</p>

<H3>Alignment and BAM-processing</H3>

<b>Task 5: Do an alignment - use the reference.fa as the reference genome, and our paired end reads as input. Use bwa index and bwa mem.</b>

<p>Now let us extract all reads with a mapping quality of 30 or better, sort and index the alignments. Then answer the following question. Hint: Use samtools for the first part and use bedtools genomecov and the R script /home/projects/22126_NGS/exercises/genomic_epi/programs/plotcov.R for calculating the coverage. Use evince to open the pdf.</p>
Example command (substitute your filenames for MYBAM and MY_COVERAGE):
<pre>
bedtools genomecov -ibam MYBAM > MY_COVERAGE
R --vanilla MY_COVERAGE MY_COVERAGE genome < /home/projects/22126_NGS/exercises/genomic_epi/programs/plotcov.R
evince MY_COVERAGE.pdf &
</pre>
<b>Q6. How much of the genome is covered at minimum 10X? </b>
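The number asked for in Q6 can also be computed from the genomecov table itself. A minimal sketch, assuming the default histogram output of bedtools genomecov (tab-separated columns: sequence name, depth, number of bases at that depth, sequence size, fraction) with the whole-genome summary on the rows named "genome". The function name covered_at_least is made up for this example.

```shell
# covered_at_least [MIN]: read a genomecov histogram on stdin and print the
# percentage of the genome covered at depth MIN or more (assumed default: 10).
covered_at_least() {
  awk -F'\t' -v min="${1:-10}" '
    $1 == "genome" { size = $4; if ($2 >= min) bases += $3 }
    END { if (size > 0) printf "%.2f\n", 100 * bases / size }'
}

# Made-up table for a 100 bp genome: 30 + 50 bases are at depth 10 or more.
printf 'chr1\t0\t10\t100\t0.1\ngenome\t0\t10\t100\t0.1\ngenome\t5\t10\t100\t0.1\ngenome\t10\t30\t100\t0.3\ngenome\t20\t50\t100\t0.5\n' \
  | covered_at_least 10

# On the exercise data (MY_COVERAGE is your own file name):
#   covered_at_least 10 < MY_COVERAGE
```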

<p>Let's process the alignments - this will be explained in detail in the [https://teaching.healthtech.dtu.dk/material/22126/2021/41_post_alignment_variantcalling_GR.pdf lectures] and exercises for [https://teaching.healthtech.dtu.dk/22126/index.php/Postprocess_exercise alignment] post-processing and [http://teaching.healthtech.dtu.dk/22126/index.php/SNP_calling_exercise variant calling]. We will start with removing duplicates and indexing the de-duped bam file.</p>

<b>Task 6: Remove duplicates from the bam file, sort it and index it. Hint: Use postprocess exercise.</b>


<b>Q7. How many reads were removed as duplicates?</b>

<HR>
<H3>Calling gVCF using HaplotypeCaller</H3>

<p>Now we are ready to call SNPs. To call variants we will use HaplotypeCaller in GATK. We will set the <b>ploidy to 1 (because it is a haploid organism) </b>and this time output it as a gVCF. This means that we will also get information on sites that are non-variant (i.e. the same as the reference genome), written in a condensed form. We can then use this gVCF to merge with 4 other samples that I have already processed:</p>

<pre>
samtools index aln.sort.rmdup.bam
gatk --java-options "-Xmx10g" HaplotypeCaller -R reference.fa -I aln.sort.rmdup.bam -O var.raw.g.vcf -ERC GVCF -ploidy 1
</pre>


<p>Take a look at the output file (<b>var.raw.g.vcf</b>). You will see that it contains many sites that have "NON_REF" as the variant allele and "END=position" - these are intervals that are non-variant at a certain threshold (Genotype Quality (GQ)). We can now genotype our sample together with 4 other samples (named C-F, our sample is called ST313) that I have already processed and have as gVCFs - you might as well have processed them yourself, but this is to speed things up a bit.</p>

<pre>
cp /home/projects/22126_NGS/exercises/genomic_epi/gVCF/*.vcf .
gatk CombineGVCFs -R reference.fa -V var.raw.g.vcf -V C.raw.g.vcf -V D.raw.g.vcf -V E.raw.g.vcf -V F.raw.g.vcf -O merged.g.vcf
gatk GenotypeGVCFs -R reference.fa -V merged.g.vcf -O merged.vcf
</pre>

<p>Again take a look at the output (<b>merged.vcf</b>). You see this is a normal VCF file, but now we have information from all samples in the 5 last columns. Because we set -ploidy 1 when we ran HaplotypeCaller, the genotype is now written as "0" or "1" and not as e.g. "0/1", because we only have one chromosome copy.</p>

<b>Q8. How many variant calls did we get for all of our samples? How many of those were indels (Hint: look at the REF, ALT and QUAL columns)?</b>
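One way to follow the hint is to compare the lengths of the REF and ALT alleles: if any ALT allele has a different length than REF, the record is an indel. A rough sketch of that rule is shown here. The function name count_variant_types is made up, and the rule is a simplification: records where all alleles have equal length are counted as SNPs even if they span several bases.

```shell
# count_variant_types: read an uncompressed VCF on stdin and print the number
# of SNP-like records (all ALT alleles as long as REF) and indel records
# (at least one ALT allele with a different length than REF).
count_variant_types() {
  awk -F'\t' '
    /^#/ { next }                            # skip header lines
    { n = split($5, alt, ","); is_indel = 0
      for (i = 1; i <= n; i++)
        if (alt[i] != "*" && length(alt[i]) != length($4)) is_indel = 1
      if (is_indel) indels++; else snps++ }
    END { printf "snps=%d indels=%d\n", snps, indels }'
}

# Made-up records: one SNP (A>G), one deletion (AT>A) and one multi-allelic
# site that includes an insertion (C>CGG,T).
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t10\t.\tA\tG\nchr1\t20\t.\tAT\tA\nchr1\t30\t.\tC\tCGG,T\n' \
  | count_variant_types

# On the exercise data:
#   count_variant_types < merged.vcf
```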

<p>Now we need to filter the SNPs. Because we don't have a catalog of known variation, we can't do it using the soft-filtering approach (Variant Quality Score Recalibration). Let us also remove the indel calls and then use the SNPs for the phylogeny:</p>

<pre>
gzip merged.vcf
select_snvs.sh merged.vcf.gz merged.snps.vcf.gz
</pre>

<b>Q9. How many SNPs do we have in our final vcf file?</b><br>

<HR>
<H3>Phylogenetic reconstruction</H3>

<p>Now we have our SNPs for our outbreak strain and for several other strains. Let's create a phylogenetic tree using [http://www.atgc-montpellier.fr phyml]. Here we extract all SNPs from our samples and convert them into fasta and hereafter phylip alignment that can be input to phyml. We will run 100 bootstraps.</p>

<pre>
zcat merged.snps.vcf.gz | vcf-to-tab > merged.snps.tab
vcf_tab_to_fasta_alignment.pl -i merged.snps.tab > merged.snps.fasta
trimal -in merged.snps.fasta -phylip > merged.snps.phylip
phyml -i merged.snps.phylip -b 100
</pre>

<p> Let's look at the tree in FigTree. Open the program (below), choose "File", "Open" and select "merged.snps.phylip_phyml_tree", write bootstrap in the pop-up menu and re-root on the branch going to "F". Click "Node Labels" and select "Display -> bootstrap", which will show the bootstrap values on each node in the tree. Try to choose a different layout and see how it looks. You can load annotations by selecting "File -> Import Annotations". Then go to "/home/projects/22126_NGS/exercises/genomic_epi/strain.info", and after that click "Tip Labels -> Display" and select "info". You will now see a description of the different strains.</p>

<pre>
java -Xmx512m -jar /home/ctools/FigTree_v1.4.4/lib/figtree.jar &
</pre>


<b>Q10. Which strain is the closest to ours?</b>

<p>How would you use this information to help track down the source of the outbreak?</p>

<HR>
<H2>How can we treat it?</H2>
<HR>

<p>Now that we have the de novo assembly of our strain, let's look for resistance genes. If we find any of these then we know what <i>not</i> to treat patients with. Here we will use another webserver from the CGE project, the [http://cge.cbs.dtu.dk/services/ResFinder/ ResFinder]. Download the assembly to your own computer and upload it to the server. Choose a threshold of 98% and submit the job.</p>

<b>Q11. Are there any resistance genes in our outbreak strain?</b>

<p>The next step would be to investigate the differences between these and similar but non-pathogenic strains to see if we could find the cause of the increased virulence, but this will be for another time.</p>

<p>Congratulations, you finished the exercise!</p>

Please find answers [[Genomic_epidemiology_solution|here]]

File:Pca.png

2024-03-19T15:54:57Z

WikiSysop:

File:Bc heatmap.png

2024-03-19T15:54:19Z

WikiSysop:

File:Wilcoxon significant.png

2024-03-19T15:53:59Z

WikiSysop:

File:Sample depth.png

2024-03-19T15:53:35Z

WikiSysop:

QuantitativeMetagenomicsSolution

2024-03-19T15:53:03Z

WikiSysop: Created page with "'''Q1. How many samples do we have and how many genes?''' <pre>> str(Counts) int [1:251436, 1:401] </pre> 251436 genes and 401 individuals '''Q2. What's the sample depth range?''' <pre>> range(sampleDepth) [1] 1533776 41391478 </pre> 1533776 to 41391478 File:Sample_depth.png The figure shows the number of samples (persons) on the y-axis containing the displayed number of reads on the x-axis '''Q3. How many species are there in total?''' <pre> str(taxCounts)..."

'''Q1. How many samples do we have and how many genes?'''
<pre>> str(Counts)
int [1:251436, 1:401] </pre>
251436 genes and 401 individuals

'''Q2. What's the sample depth range?'''
<pre>> range(sampleDepth)
[1] 1533776 41391478 </pre>

1533776 to 41391478

[[File:Sample_depth.png]]

The figure shows the number of samples (persons) on the y-axis containing the displayed number of reads on the x-axis

'''Q3. How many species are there in total?'''
<pre> str(taxCounts)
int [1:120, 1:401] </pre>
120 species

'''Q4. What does a high Shannon diversity index mean?'''

A high Shannon diversity index means that there are many species present with equal abundance. Typical values are generally between 1.5 and 3.5 in most ecological studies, and the index is rarely greater than 4. The Shannon index increases as both the richness and the evenness of the community increase.
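For reference, the index is H = -sum(p_i * ln(p_i)), where p_i is the proportion of the counts that belongs to species i. A minimal sketch of the formula, using the natural logarithm as vegan's diversity() does by default. The function name shannon and the example counts are made up for this illustration.

```shell
# shannon: read one species count per line (one sample) on stdin and print
# the Shannon index H = -sum(p * ln(p)) over the species with count > 0.
shannon() {
  awk '$1 > 0 { c[++n] = $1; total += $1 }
       END { for (i = 1; i <= n; i++) { p = c[i] / total; h -= p * log(p) }
             printf "%.4f\n", h }'
}

# Four equally abundant species: H = ln(4), the maximum for 4 species.
printf '%s\n' 25 25 25 25 | shannon

# Same richness (4 species) but dominated by one species: H is much lower.
printf '%s\n' 97 1 1 1 | shannon
```

Both samples have the same richness, so the difference between the two values comes from evenness alone.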

'''Q5. Which threshold did you choose and why? How many samples did you lose?'''

A good guesstimate is 3e6, where we remove 8 samples.

'''Q6. What is the effect of downsizing on the richness?'''

Downsizing lowers the richness.

'''Q7. What is the effect of downsizing on diversity (Shannon)?'''

No effect. The Shannon index is weighted more toward evenness than richness: richness counts a rare species just the same as an abundant one, whereas the Shannon index gives more weight to common species, which are hardly affected by downsizing.

'''Q8. Is there any significant difference in abundance of E. coli between the different BMI groups?'''

No, the p-value is 0.146

'''Q9. How many species are significant with an FDR < 0.05?'''

[[File:Wilcoxon_significant.png]]

Only 1

'''Q10. Can you see any differences in the abundances - which species have large differences, what are their p-values?'''

Yes, there are differences; in particular, Ruminococcus torques is significantly different.

'''Q11. What type of bacteria is the most significant one? [try google]'''

Ruminococcus torques is a fairly common gastrointestinal bacterium.

'''Q12. Can you see some clusters of samples?'''

[[File:bc_heatmap.png]]

Yes. A Bray-Curtis dissimilarity of zero means that two samples have identical composition, and we do see clusters of blue.

'''Q13. Can you see which species that seems to be driving the differences between the samples?'''

[[File:pca.png]]

Yes, they are indicated in red vectors.

'''Q14. Which are the most significant species? Is there an overlap between these and using the downsizing+wilcoxon test (what you did above)?'''

We get much higher significance and the literature suggests actual relevance for obesity.

Please find answers [[Data_Preprocess_exercise_answers|here]]

File:Log fold diff sign.png

2024-03-19T15:52:26Z

WikiSysop:

File:Bmi class.png

2024-03-19T15:52:03Z

WikiSysop:

File:Comparing species abundance.png

2024-03-19T15:51:41Z

WikiSysop:

File:Comparing shannon.png

2024-03-19T15:51:12Z

WikiSysop:

File:Comparing richness.png

2024-03-19T15:50:38Z

WikiSysop:

File:Downsized shannon.png

2024-03-19T15:50:17Z

WikiSysop:

File:Downsized richness.png

2024-03-19T15:49:00Z

WikiSysop:

File:Downsized sampledepth.png

2024-03-19T15:48:31Z

WikiSysop:

File:Raw sampledepth.png

2024-03-19T15:48:11Z

WikiSysop:

File:Raw richnessVSshannonZoom.png

2024-03-19T15:47:46Z

WikiSysop:

File:Raw shannon.png

2024-03-19T15:47:13Z

WikiSysop:

File:Raw richness.png

2024-03-19T15:46:53Z

WikiSysop:

QuantitativeMetagenomics

2024-03-19T15:45:53Z

WikiSysop: Created page with " <H3>Overview</H3> If you need to use metagenomics for your final project, we have a more thorough workflow that you will need to use https://teaching.healthtech.dtu.dk/22136/index.php/22136:Course_plan_autumn_2020 here. Since metagenomics data is often very large, it requires a lot of computational resources and time, we have cheated a little bit and prepared some data for you in advance! In this exercise we have done the assembly and counting across a cohort of..."

<H3>Overview</H3>
If you need to use metagenomics for your final project, we have a more thorough workflow that you will need to use [https://teaching.healthtech.dtu.dk/22136/index.php/22136:Course_plan_autumn_2020 here].

Since metagenomics data is often very large and requires a lot of computational resources and time, we have cheated a little bit and prepared some data for you in advance!

In this exercise we have done the assembly and counting across a cohort of hundreds of human fecal
samples in advance and in addition provide the gene-wise taxonomy and the BMI of the
human donors.
From this data we shall estimate the species richness, diversity and look at the effect of
downsizing. Furthermore we shall see if we can identify any differences between the
microbiome of lean and obese individuals.

<H3>Becoming a pirate</H3>
This exercise uses R either locally (install RStudio on your own machine) or on the server by typing
<pre>R</pre>
First, IF you are running RStudio locally you will need to install a package called "vegan"
<pre>install.packages("vegan")</pre>
Now, let’s load the "vegan" package and thereafter load the read count data from a series of stool samples.
<pre>library("vegan")
load(url("http://teaching.healthtech.dtu.dk/material/22126/Counts_NGS.RData"))
head(Counts)
str(Counts)</pre>
'''Q1. How many samples do we have and how many genes?'''

The different samples may have been sequenced to different depths. Try to count the reads per sample
<pre>
sampleDepth<-(colSums(Counts))
hist(sampleDepth, breaks=100, ylab="Number of samples", xlab="Number of reads", main="Sample depth")
range(sampleDepth)
</pre>

'''Q2. What's the sample depth range?'''
<H3>Species</H3>
Let's get the genes associated with species. Here is the gene-wise species taxonomy
<pre>load(url("http://teaching.healthtech.dtu.dk/material/22126//taxonomy_species.RData"))
head(taxonomy_species)</pre>
We then combine (by summing) the read counts per gene to read counts per species.
<pre>taxCounts<-apply(Counts, 2, tapply, INDEX=taxonomy_species, sum)</pre>
Try looking at the taxCounts matrix
<pre>str(taxCounts)
head(taxCounts)</pre>
'''Q3. How many species are there in total?'''
<H3>Richness and Diversity</H3>
What is the species richness and diversity (Shannon) for the different samples?

'''Q4. What does a high Shannon diversity index mean?'''

OK, let's see it for our samples

<pre>
species_richness<-(colSums(taxCounts>0))
names(species_richness)<-NULL
require(vegan)
speciesDiversity<-diversity(t(taxCounts), index = "shannon")
names(speciesDiversity)<-NULL
par(mfrow=c(1,1))
barplot(sort(species_richness), las=3, main="Species richness", xlab="Samples", ylab="Richness")
barplot(sort(speciesDiversity), xlab="Samples", las=3, main="Diversity (Shannon)")
plot(species_richness,speciesDiversity,xlab="Richness", ylab="Shannon diversity index")
</pre>
[[File:raw_richness.png]][[File:raw_shannon.png]][[File:raw_richnessVSshannonZoom.png]]

Each sample's (person's) richness and diversity are shown in the first two plots, and the third plot shows each sample's richness and diversity as a dot.
<H3>Downsizing or rarefying</H3>
But this was on the raw count data with different sampling depth (number of counts) per sample. We should downsize so that we get fair comparisons.

First suggest the number of reads per sample that we should downsize to [target]. If we choose a low target we will lose abundance resolution and detection sensitivity. If we choose it higher we will lose samples.
<pre>plot(sampleDepth, pch=20, log="y", xlab="Samples", ylab="Number of reads")</pre>
[[File:raw_sampledepth.png]]

There is no right answer (but there are less good suggestions). Insert the number you want to downsize to below and plot it again - the samples above the horizontal line we will keep and the samples below the line we will throw out.

<pre>
downsizeTarget <- INSERT NUMBER
plot(sampleDepth, pch=20, log="y", xlab="Samples", ylab="Number of reads"); abline(h=downsizeTarget)
</pre>
[[File:downsized_sampledepth.png]]

'''Q5. Which threshold did you choose and why? How many samples did you lose?'''

OK, let's downsize
<pre>
dz_Counts<-round(t(t(Counts)*downsizeTarget/sampleDepth))
weak_samples<-sampleDepth<downsizeTarget
dz_Counts[,weak_samples]<-NA # samples that did not make the cut
</pre>
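To see what this scaling does to a single sample, here is a toy version of the same arithmetic: every count is multiplied by target / depth and rounded. It is only a sketch. The function name downsize and the three counts are made up, and ties at exactly .5 are rounded up here, whereas R's round() rounds them to the nearest even number.

```shell
# downsize TARGET: read one count per line (one sample) on stdin and print
# each count rescaled by TARGET / (sum of all counts), rounded to an integer.
downsize() {
  awk -v target="$1" '
    { c[NR] = $1; depth += $1 }
    END { for (i = 1; i <= NR; i++)
            printf "%d\n", int(c[i] * target / depth + 0.5) }'
}

# A sample with depth 1000 scaled down to 100 reads:
# 900 -> 90, 96 -> 9.6 -> 10, and the rare species with 4 reads -> 0.4 -> 0.
printf '%s\n' 900 96 4 | downsize 100
```

The abundant species keep their proportions, while the rare one drops to zero, which is worth keeping in mind when comparing richness before and after downsizing.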

This is a quick and dirty downsizing (ideally one would resample the reads to a given depth, but that would take days).
Count the species again, now on the downsized data.

<pre>
dz_taxCounts<-apply(dz_Counts, 2, tapply, INDEX=taxonomy_species, sum); gc()
</pre>

And the richness and diversity again, now on downsized data

<pre>
dz_species_richness<-(colSums(dz_taxCounts>0))
names(dz_species_richness)<-NULL
require(vegan)
dz_speciesDiversity<-diversity(t(dz_taxCounts), index = "shannon")
dz_speciesDiversity[weak_samples]<-NA
names(dz_speciesDiversity)<-NULL
</pre>

Now plot the richness and diversity with downsized data

<pre>
par(mfrow=c(1,1), pch=1)
barplot(sort(dz_species_richness), las=3, main="Species richness (Downsized)", xlab="Samples", ylab="Richness")
</pre>
[[File:downsized_richness.png]]
<pre>
barplot(sort(dz_speciesDiversity), las=3,main="Shannon's diversity index (downsized)", xlab="Samples", ylab="Shannon diversity")
</pre>
[[File:downsized_shannon.png]]

And compare to the raw data

<pre>
plot(dz_species_richness,species_richness, xlab="downsized richness", ylab="raw richness", main="Richness")
</pre>
[[File:Comparing_richness.png]]
<pre>
plot(dz_speciesDiversity,speciesDiversity,xlab="downsized species diversity", ylab="raw species diversity",main="Diversity (Shannon)")
</pre>
[[File:Comparing_shannon.png]]

'''Q6. What is the effect of the downsizing on richness?'''

'''Q7. What is the effect of the downsizing on diversity (Shannon)?'''

Let's plot the abundance of each species in a sample with low diversity and a sample with high diversity. You should be able to see a clear difference between the two samples!

<pre>
par(mfrow=c(1,2))
barplot(taxCounts[,4], main="Person 4, SD > 3", xaxt="n", xlab="Species", ylab="Normalized abundance")
barplot(taxCounts[,240], main="Person 240, SD < 0.5", xaxt="n", xlab="Species", ylab="Normalized abundance")
par(mfrow=c(1,1))
</pre>

[[File:comparing_species_abundance.png]]

<H3>Comparisons</H3>

Now let's see if there is a difference between the microbiome of lean and obese humans. But first load some more sample information: BMI and Class.

<pre>
load(url("http://teaching.healthtech.dtu.dk/material/22126/BMI.RData"))
boxplot(BMI$BMI.kg.m2 ~ BMI$Class, col=c("red", "gray","blue"), ylab="BMI")
</pre>
[[File:bmi_class.png]]

The classes are: le = Lean; ow = Overweight; ob = Obese

First let us see if the abundance of E. coli differs between obese and lean individuals using a Wilcoxon rank sum test (look for the p-value in the output). Let's also get the mean abundance of E. coli in the three groups:

<pre>
wilcox.test(x=dz_taxCounts["Escherichia coli",BMI$Classification=="ob"], y=dz_taxCounts["Escherichia coli",BMI$Classification=="le"] )
tapply(dz_taxCounts["Escherichia coli",], BMI$Classification, mean, na.rm=TRUE)
</pre>

'''Q8. Is there any significant difference in abundance of E. coli between the different BMI groups?'''

Let's test all species correcting for multiple testing using Benjamini-Hochberg (False Discovery Rate) (we are testing 120 species) and plot them:

<pre>
pval<-apply(dz_taxCounts, 1, function(V){wilcox.test(x=V[BMI$Classification=="ob"],y=V[BMI$Classification=="le"])$p.value})
Abundance_ratio<-log2(apply(dz_taxCounts, 1,function(V){mean(x=V[BMI$Classification=="ob"], na.rm=TRUE)/mean(V[BMI$Classification=="le"], na.rm=TRUE)}))
pval.adjust = p.adjust(pval, method="BH")
plot(sort(pval.adjust), log="y", pch=16, xlab="Species", ylab="p-values")
abline(h=0.05, col="grey", lty=2)
</pre>
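The Benjamini-Hochberg adjustment itself is short: sort the m p-values, multiply the i-th smallest by m / i, and then make sure an adjusted value is never larger than the one that follows it in the sorted list. A minimal sketch of that procedure follows. The function name bh_adjust and the four p-values are made up, and sort -g (general numeric sort, as in GNU coreutils) is assumed to be available.

```shell
# bh_adjust: read one p-value per line on stdin and print
# "<p-value><TAB><BH-adjusted p-value>", sorted by p-value.
bh_adjust() {
  sort -g | awk '
    { p[NR] = $1 }
    END {
      m = NR; running = 1                  # adjusted values are capped at 1
      for (i = m; i >= 1; i--) {           # walk from largest to smallest
        a = p[i] * m / i
        if (a < running) running = a       # running minimum keeps the order
        adj[i] = running
      }
      for (i = 1; i <= m; i++) printf "%s\t%.4f\n", p[i], adj[i]
    }'
}

# Four made-up p-values (m = 4):
#   0.01 -> 0.01 * 4/1 = 0.04
#   0.03 -> 0.03 * 4/2 = 0.06, lowered to 0.0533 by the next value
#   0.04 -> 0.04 * 4/3 = 0.0533
#   0.20 -> 0.20 * 4/4 = 0.20
printf '%s\n' 0.01 0.04 0.03 0.20 | bh_adjust
```

In this toy example only the smallest p-value stays below 0.05 after adjustment, even though three of the raw p-values were below 0.05.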

'''Q9. How many species are significant with a false discovery rate < 0.05?'''

Let us look at the top 10 most significant species abundance.

<pre>
o<-order(pval)
BMIstat<-data.frame(pval,pval.adjust, Abundance_ratio)[o,]
BMIstat[1:10,]
par(mar=c(5,18,5,5))
barplot(BMIstat[1:10,3], names.arg=rownames(BMIstat)[1:10], las=1,xlab="log fold difference between lean and obese", horiz=TRUE)
</pre>

[[File:log_fold_diff_sign.png]]

'''Q10. Can you see any differences in the abundances - which species have large differences, what are their p-values?'''

'''Q11. What type of bacteria is the most significant one? [try google]'''

<H3>Beta-diversity and PCA</H3>

Plot the Bray-Curtis dissimilarity between samples as a heatmap.

<pre>
library(RColorBrewer)
library(gplots)
vdist = as.matrix(vegdist(t(taxCounts)))
rownames(vdist) = colnames(vdist)
hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
heatmap.2(vdist, trace='none', col=rev(hmcol))
</pre>
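For two samples a and b, the Bray-Curtis dissimilarity that vegdist computes by default is sum(|a_i - b_i|) / sum(a_i + b_i) over all species i: 0 means identical counts and 1 means no shared species. A small sketch of the formula for a single pair of samples follows. The function name bray_curtis and the counts are made up for this illustration.

```shell
# bray_curtis: read two columns (counts of each species in sample A and in
# sample B) on stdin and print sum(|a - b|) / sum(a + b).
bray_curtis() {
  awk '{ d = $1 - $2; if (d < 0) d = -d; diff += d; total += $1 + $2 }
       END { if (total > 0) printf "%.4f\n", diff / total }'
}

# Three species: differences 5 + 5 + 0 = 10, total counts 15 + 15 = 30.
printf '10 5\n0 5\n5 5\n' | bray_curtis

# Identical samples give 0.
printf '3 3\n7 7\n' | bray_curtis
```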

'''Q12. Can you see some clusters of samples?'''

Finally for the PCA:

<pre>
my.rda <- rda(t(taxCounts))
biplot(my.rda, display = c("sites", "species"), type = c("text", "points"))
</pre>

'''Q13. Can you see which species seem to be driving the differences between the samples?'''

<H3>Statistically modelling the variance using DESeq2</H3>

Now, we will see the power of statistically modelling the variance instead of downsizing.

<pre>
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
library(DESeq2)
cts <- taxCounts
coldata = BMI[,1]
coldata = matrix(NA, nrow=nrow(BMI), ncol=1)
coldata[,1] = as.vector(BMI[,1])
rownames(coldata) = rownames(BMI)
colnames(coldata) = "BMI"
</pre>

Take a look at coldata

<pre>
coldata
</pre>

Make sure that the individuals in coldata (the sample information) are the same, and in the same order, as the columns of the count data. The following should return TRUE:

<pre>
all(rownames(coldata) == colnames(cts))
</pre>

Load data into DESeq format, perform statistical analysis and get results

<pre>
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata, design = ~ BMI)
dds <- DESeq(dds)
res <- results(dds)
res
</pre>

Order the results according to the p-value and show the most significant

<pre>
resOrdered <- res[order(res$pvalue),]
head(resOrdered)
</pre>

'''Q14. Which are the most significant species (google)? Is there an overlap between these and using downsizing+wilcoxon test (what you did above)?'''

Please find answers [[QuantitativeMetagenomicsSolution|here]]

File:DDseq 350OTU.png

2024-03-19T15:44:52Z

WikiSysop:

File:DDseq 100OTU.png

2024-03-19T15:44:27Z

WikiSysop:

File:PrincipalComponents.png

2024-03-19T15:44:01Z

WikiSysop:

File:Rawkaijubar.png

2024-03-19T15:43:35Z

WikiSysop:

Kaiju solution

2024-03-19T15:43:08Z

WikiSysop: Created page with "<b> Q1: What is nr_euk? And how do the choice of database influence the results of Kaiju?</b> nr stands for non-redundant and indicates, that each entry is only found once within the database. BLASTS' nr database contains bacteria, archea and fungi. euk indicates that microbial eukaryotes and fungi are included. The database should be intended as the target of the search, so it is of importance that it contains the organisms you are searching for. <b> Q2: Explain the..."

<b> Q1: What is nr_euk? And how does the choice of database influence the results of Kaiju?</b>

nr stands for non-redundant and indicates that each entry is found only once within the database. Kaiju's nr database is the subset of the NCBI BLAST nr database that contains proteins from bacteria, archaea and viruses.
euk indicates that proteins from fungi and microbial eukaryotes are included as well.

The database is what the reads are searched against, so it is important that it contains the organisms you are looking for.

<b> Q2: Explain the terms precision and sensitivity in relation to testing. </b>

* Precision is how sure you are of your positives: the fraction of reported positives that are true positives.

* Sensitivity is how sure you are that you are not missing any positives: the fraction of all real positives that are reported.

Depending on whether you want to be confident in your reported positives or whether it is more important to find all real positives, you can tune the parameters towards precision or towards sensitivity.
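As a small worked example, the two terms can be computed from counts of true positives (TP), false positives (FP) and false negatives (FN): precision = TP / (TP + FP) and sensitivity = TP / (TP + FN). The sketch below is only an illustration; the function name and the counts are made up and do not come from the exercise data.

```shell
# Hypothetical helper: compute precision and sensitivity from raw counts.
# $1 = true positives, $2 = false positives, $3 = false negatives
precision_sensitivity() {
    awk -v tp="$1" -v fpos="$2" -v fneg="$3" 'BEGIN {
        printf "precision=%.2f sensitivity=%.2f\n", tp / (tp + fpos), tp / (tp + fneg)
    }'
}

# Made-up counts: 80 correct calls, 20 wrong calls, 40 missed positives
precision_sensitivity 80 20 40   # prints: precision=0.80 sensitivity=0.67
```

Here 80 of the 100 reported positives are correct (precision 0.80), but only 80 of the 120 real positives were found (sensitivity 0.67).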

<b> Q3: Take a look at pacu_kaiju.otu.tab and pacu_kaiju.tax.tab and explain what information the files contain. </b>

* pacu_kaiju.otu.tab: Contains the read count of all OTUs within all samples.
* pacu_kaiju.tax.tab: Contains information regarding the taxonomic classification of the OTUs, including Domain, Phylum, Class, Order, Family, Genus and Species

<b> Q4: Look at the plot. Which domains do you see in the samples?</b>

The most dominant domain is Bacteria.
However a lot of reads could not be assigned taxonomy and are thus assigned "Unknown".

Archaea and Eukaryota are seen only to a very limited extent.

[[File:Rawkaijubar.png|450px]]

<b> Q5: Can you think of domains or fields that could be relevant to investigate for other research questions?</b>

The data can be divided at any taxonomic level, such as Domain, Phylum, Class, Order, Family, Genus or even Species.

<b> Q6: What is PCA used for?</b>

PCA is used to reduce the dimensionality of the data in order to make it interpretable while minimising the information loss.

<b> Q7: What does the plot tell us about the principal components and their associated amount of information?</b>

[[File:PrincipalComponents.png|450px|]]

We see that the variation that can be explained by each PC gradually declines, indicating that the first components carry the most information.

<b> Q8: Do we see any significant pairs?</b>

Only the pair Post-antibiotic vs Antibiotic has a p-value below the usual threshold of 0.05.

<b> Q9: How many OTUs are significantly different between the treatments? Try to change the alpha to 0.01. How many OTUs are then significant? </b>

For the threshold 0.05 we see 245 OTUs.

For the threshold 0.01 we see 177 OTUs.

<b> Q10: What do the plots with 100 and 350 OTUs show? Are any phyla dominant? </b>

We see that the log2 fold change is negative for most OTUs, indicating that these OTUs are more abundant in the control than in the samples which received antibiotics.

Out of the significant OTUs, the majority belong to the Proteobacteria.

[[File:DDseq 100OTU.png|450px]] [[File:DDseq 350OTU.png|450px]]

Kaiju exercise

2024-03-19T15:42:25Z

WikiSysop: Created page with " ==Introduction== [http://kaiju.binf.ku.dk/ Kaiju] is a protein-based sensitive taxonomic classification of high-throughput sequencing reads from metagenomic whole genome sequencing or metatranscriptomics experiments. <br> Kaiju translates metagenomic sequencing reads into the six possible reading frames and searches for maximum exact matches (MEMs) of amino acid sequences in a given database of annotat..."

==Introduction==
[http://kaiju.binf.ku.dk/ Kaiju] is a program for sensitive, protein-based taxonomic classification of high-throughput sequencing reads from metagenomic whole genome sequencing or metatranscriptomics experiments. <br>
Kaiju translates metagenomic sequencing reads into the six possible reading frames and searches for maximum exact matches (MEMs) of amino acid sequences in a given database of annotated proteins from microbial reference genomes. If matches to one or more database sequences are found for a read, Kaiju outputs the taxonomic identifier of the corresponding taxon, or it determines the Last Common Ancestor (LCA) in the case of equally good matches to different taxa. Kaiju’s underlying sequence comparison algorithm uses the Burrows–Wheeler transform (BWT) of the protein database, which enables exact string matching in time proportional to the length of the query, to achieve a high classification speed.

In k-mer-based methods, the size of k governs the sensitivity and precision of the search. If k is chosen too large, no identical k-mers between read and database might be found, especially for short or erroneous reads, as well as for evolutionarily distant sequences. If k is chosen too small, more false positive matches will be found. Therefore, in order to not be restricted by a prespecified k-mer size, Kaiju finds MEMs between reads and database to achieve both a high sensitivity and precision. Reads are directly assigned to a species or strain, or in case of ambiguity, to higher level nodes in the taxonomic tree. For example, if a read contains an amino acid sequence that is identical in two different species of the same genus then the read will be classified to this genus. Kaiju also offers the possibility to extend matches by allowing a certain number of amino acid substitutions at the end of an exact match in a greedy heuristic approach using the BLOSUM62 substitution matrix. See the [https://www.nature.com/articles/ncomms11257 paper] for a detailed description of Kaiju’s algorithm.

The most important settings are those that tune the sensitivity of Kaiju. Running Kaiju in greedy mode is more sensitive BUT less precise. For example, we may be able to see that we have Enterobacterales, but not whether it is <i>Salmonella</i> or <i>Escherichia</i>.
In short, sensitivity and precision are a trade-off.

===Construct Burrows-Wheeler transform and FM-index===
Before classification of reads, Kaiju's database index needs to be built from a reference protein database. You can either create a local index based on the currently available data from GenBank, or download one of the indexes used by the [http://kaiju.binf.ku.dk/server Kaiju web server].
For creating a local index, the program kaiju-makedb will download a source database and the taxonomy files from the NCBI FTP server, convert them into a protein database and construct Kaiju's index (the Burrows-Wheeler transform and the FM-index) in one go.
For this course, we have downloaded the pre-indexed database nr_euk (made 2019-06-25) from the Kaiju website so you do not need to create a database, but can use the <b> Kaiju database at /home/databases/databases/Kaiju/ </b>

===Assigning taxonomy using Kaiju ===
Kaiju needs to know the location of the database (kaiju_db_nr_euk.fmi in this case) which you want to use and the associated nodes.dmp file.
It is able to take paired-end reads as in the example below:
<pre>
kaiju -t /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp -f /home/databases/databases/Kaiju/kaiju_db_nr_euk.fmi -i reads.fastq [-j reads2.fastq]

Mandatory arguments:
-t FILENAME Name of nodes.dmp file
-f FILENAME Name of database (.fmi) file
-i FILENAME Name of input file containing reads in FASTA or FASTQ format

Optional arguments:
-j FILENAME Name of second input file for paired-end reads
-o FILENAME Name of output file. If not specified, output will be printed to STDOUT
-z INT Number of parallel threads for classification (default: 1)
-a STRING Run mode, either "mem" or "greedy" (default: greedy)
-e INT Number of mismatches allowed in Greedy mode (default: 3)
-m INT Minimum match length (default: 11)
-s INT Minimum match score in Greedy mode (default: 65)
-E FLOAT Minimum E-value in Greedy mode
-x Enable SEG low complexity filter (enabled by default)
-X Disable SEG low complexity filter
-p Input sequences are protein sequences
-v Enable verbose output
</pre>

==Exercise time !!!==
===Kaiju - Protein-based taxonomy===

We are going to compare the effects of using different settings in Kaiju and visualize using [https://github.com/marbl/Krona/wiki Krona]. <br>

In the exercises we will use the database nr_euk.

<b> Q1: What is nr_euk? And how does the choice of database influence the results of Kaiju? </b>

Different settings will lead to differences in the sensitivity and precision of the classification.

<b> Q2: Explain the terms precision and sensitivity in relation to testing. </b>

First, we will run Kaiju using the more precise but less sensitive MEM mode:
<pre>
kaiju -i /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/SRR7610114_1.fastq \
-j /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/SRR7610114_2.fastq -t /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp \
-f /home/databases/databases/Kaiju/kaiju_db_nr_euk.fmi -v -z 5 -a mem -o SRR7610114_mem.kaiju
</pre>
Second, we will run Kaiju in greedy mode with VERY permissive options, allowing 5 mismatches and an E-value cutoff of only 0.1:
<pre>kaiju -i /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/SRR7610114_1.fastq \
-j /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/SRR7610114_2.fastq -t /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp \
-f /home/databases/databases/Kaiju/kaiju_db_nr_euk.fmi -v -z 5 -a greedy -e 5 -E 0.1 -o SRR7610114_greedy.kaiju
</pre>

If you don't have the patience to wait for the program to run, you can also copy the files from
<pre>/home/projects/22126_NGS/exercises/metagenomics/kaiju
</pre>
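To get a first impression of how the two run modes differ, one option is to count how many reads were classified in each output file. The sketch below assumes the standard tab-separated Kaiju output, in which the first column is C (classified) or U (unclassified); the function name kaiju_summary is made up for this illustration.

```shell
# Hypothetical helper: summarise classified vs unclassified reads in a Kaiju output file.
# Assumes column 1 of the tab-separated file is "C" or "U".
kaiju_summary() {
    awk -F'\t' '
        $1 == "C" { classified++ }
        $1 == "U" { unclassified++ }
        END {
            total = classified + unclassified
            if (total > 0)
                printf "%d classified, %d unclassified (%.1f%% classified)\n",
                       classified, unclassified, 100 * classified / total
        }' "$1"
}

# Usage with the output files from the two runs above:
# kaiju_summary SRR7610114_mem.kaiju
# kaiju_summary SRR7610114_greedy.kaiju
```

A higher classified percentage in greedy mode would be consistent with its higher sensitivity, but it says nothing about how many of those assignments are correct.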

Try to visualise the results with Krona
<pre>kaiju2krona -i SRR7610114_mem.kaiju -o SRR7610114_mem.krona -t /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp \
-n /home/databases/databases/Kaiju/kaiju_db_nr_euk_names.dmp ; ktImportText -o SRR7610114_mem.html SRR7610114_mem.krona </pre>

Download the .html-file and open it in your browser.

If you are interested, [https://www.cell.com/cell/pdf/S0092-8674(19)30775-5.pdf many other tools] exist for taxonomic classification. However, having looked at the results for SRR7610114, we will continue with Kaiju, which we now want to run on all samples.

<b> <span style="color:red"> AS IT TAKES TOO MUCH TIME FOR EVERYONE TO RUN KAIJU ON ALL SAMPLES YOU SHOULD USE THE RESULTS FOUND IN /home/projects/22126_NGS/exercises/metagenomics/kaiju/all_samples and skip the parallelisation </span> </b>

<pre>parallel -j 1 --xapply "kaiju -i /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/{1} -j /home/projects/22126_NGS/exercises/metagenomics/Pacu/preprocessed/{2} \
-t /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp -f /home/databases/databases/Kaiju/kaiju_db_nr_euk.fmi -o {1.}.kaiju -v -z 30" \
:::: /home/projects/22126_NGS/exercises/metagenomics/Pacu/no_fish_r1.list :::: /home/projects/22126_NGS/exercises/metagenomics/Pacu/no_fish_r2.list;
cd all_samples; rename '_1.kaiju' '.kaiju' *.kaiju</pre>

Now we want to fuse all the information into one table. First we need to create tabulated tables and the tools provided by Kaiju are far from optimal, so we use some homebrewed ones.

<pre>mkdir all_samples;
cp /home/projects/22126_NGS/exercises/metagenomics/kaiju/all_samples/* all_samples/;
/home/ctools/misc_scripts/kaiju2phyloseq.py -i all_samples -n /home/databases/databases/Kaiju/kaiju_db_nr_euk_names.dmp \
-m /home/databases/databases/Kaiju/kaiju_db_nr_euk_nodes.dmp -o pacu_kaiju; rm -r all_samples </pre>

<b> Q3: Take a look at pacu_kaiju.otu.tab and pacu_kaiju.tax.tab and explain what information the files contain. </b>

===Kaiju - RStudio import===
<span style="color:red"> I recommend moving files to your laptop and running RStudio locally. </span>

Again, we will use [https://rstudio.com/ RStudio], but first we need RStudio set up with necessary packages etc. This workflow uses [https://www.tidyverse.org Tidyverse] and [https://cran.r-project.org/web/packages/broom/vignettes/broom.html broom]. Furthermore, we will use [http://joey711.github.io/phyloseq/ Phyloseq] and [https://github.com/Russel88/DAtest/wiki DAtest].

<pre>library(tidyverse)
library(broom)
library(phyloseq)
library(DAtest)
library(vegan)</pre>

Load in the data from Kaiju and use the metadata found in /home/projects/22126_NGS/exercises/metagenomics/kaiju/
<pre>otutab <- read.csv("pacu_kaiju.otu.tab", sep = "\t", row.names = 1, header = TRUE)
OTU = otu_table(otutab, taxa_are_rows = TRUE)

taxtab <- read.csv("pacu_kaiju.tax.tab", sep = "\t", row.names = 1, header = TRUE)
taxmat = as.matrix(taxtab)
TAX = tax_table(taxmat)

metadata = read.csv("metadata.csv", sep = ",", skip=1, header=FALSE)
metadata <- rename(metadata, Run=1, Day=2, Treatment=3) #renaming the columns </pre>

Once the data has been loaded we are now ready to perform the phyloseq analysis.
<pre>META = sample_data(metadata)
rownames(META) <-metadata$Run
physeq = phyloseq(OTU, TAX, META)
physeq </pre>

The phyloseq results can be saved as an rds object as:
<pre>saveRDS(physeq, "pacu.phyloseq.rds")</pre>

===Kaiju - Sample composition using R===
To get a feel for the samples we can plot the domain distribution but first we'll make it into a dataframe
<pre>physeq_df <- psmelt(physeq)
rawkaijubarplot <- ggplot(physeq_df, aes(x = Sample, y = Abundance, fill = Domain)) + theme_bw() + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle=90, size=6))
rawkaijubarplot
dir.create("Results")
ggsave("Results/rawkaijubarplot.png") </pre>

<b> Q4: Look at the plot. Which domains do you see in the samples?</b>

<b>For simplicity we will focus on the main components of the known bacterial community. Adjust according to your question, time and resources.</b>
We want to focus on the known bacterial composition, so we do:
<pre>physeq_bac <- subset_taxa(physeq, Domain == "Bacteria")</pre>

<b> Q5: Can you think of domains or fields that could be relevant to investigate for other research questions?</b>

We can filter low-abundance taxa based on three criteria:

*They should be present in a minimum number of samples (min.samples)

*They should have a minimum number of reads (min.reads)

*They should have a minimum average relative abundance (min.abundance)

You don't have to use all three criteria. The filtered taxa are grouped in a new taxon called "Others".

We only want to keep taxa that have at least a 0.00005 fraction of the total reads. All taxa below this threshold will be fused into the category “Others”. This filters out quite a lot, but it allows us to work faster in this example.
First we compute the number of reads that corresponds to the fraction 0.00005.
<pre> n_reads <- sum(sample_sums(physeq_bac))*0.00005 </pre>

This number of reads is used as the cutoff.
<pre>physeq_bac_cutoff = preDA(physeq_bac, min.reads = n_reads)
physeq_bac_cutoff</pre>

To visualise the result, we again make a dataframe:
<pre>physeq_bac_cutoff_df <- psmelt(physeq_bac_cutoff) </pre>

And we plot it:
<pre>ggplot(physeq_bac_cutoff_df, aes(x = Sample, y = Abundance, fill = Family)) + theme_bw() + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle=90, size=6))
ggsave("Results/kaijuClassbarplot.png", width = unit(15,"cm"))</pre>

There are two "NA" categories. One indicates that Kaiju was unable to assign taxonomy, while the other comes from the fused “Others” that are less abundant.

We can also look at the relative abundance of each phylum, with the samples grouped according to treatment.

<pre>physeq_relat_abund <- transform_sample_counts(physeq_bac_cutoff, function(x){x / sum(x)})
phyloseq::plot_bar(physeq_relat_abund, fill = "Phylum") +
geom_bar(aes(color = Phylum, fill = Phylum), stat = "identity", position = "stack") +
labs(x = "", y = "Relative Abundance\n") +
facet_wrap(~ Treatment, scales = "free") +
theme(panel.background = element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggsave("Results/kaijuPhylumRel_abund.png", width = unit(10, "cm"))</pre>

===Kaiju - Beta-diversity analysis using R===
When we start comparing samples, consequences of the compositional nature of sequencing become important.
To do beta-diversity we need to normalize somehow. The classical way is by rarefaction to the number of observations of the smallest sample, however this means removing valuable data. Also, relative abundances become less precise, as it produces false zeroes and [https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003531 several other problems]. Luckily, we now have much better methods available.
We will use a Centered Log-Ratio transformation (CLR), which is a dedicated compositional transformation using log-ratio to the geometric mean of the relative abundance.

Given an observation vector of D “counted” features (taxa, operational taxonomic units or OTUs, genes, etc.) in a sample, x = [x1, x2, …xD] and G(x) is the [https://en.wikipedia.org/wiki/Geometric_mean geometric mean] of x.
We use the geometric mean instead of the more common arithmetic mean, because it is the only correct mean when the values are presented as ratios to reference values.

CLR transformation for the sample can be obtained as follows:

[[File:CLR equation.png]]

However, if any count is 0, G(x) is 0 and the log-ratios are undefined, so the 0 count values have to be deleted, replaced or estimated first. Read more [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5695134/ here] and [https://academic.oup.com/gigascience/article/8/9/giz107/5572529 here] about CLR.

We perform a zero correction by substituting 0 with 1 and all non-zero values are corrected such that log-ratios before and after correction are the same.
<pre>physeq_zc <- transform_sample_counts(physeq_bac_cutoff, function(y) sapply(y, function(x) ifelse(x==0, 1, (1-(sum(y==0)*1)/sum(y))*x)))</pre>

Now we can perform the log transformation:
<pre>physeq_clr <- transform_sample_counts(physeq_zc, function(x) log(x/exp(mean(log(x))))) </pre>
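To see what the transformation does to a single sample, here is a minimal sketch on the command line. It uses the standard definition clr(x) = [log(x1/G(x)), ..., log(xD/G(x))] and computes log(G(x)) as the mean of the log counts, as in the R code above. The function name clr and the toy counts are made up for illustration, and all counts are assumed to be positive (zeros already replaced).

```shell
# Hypothetical helper: CLR-transform one sample.
# Input: one line of whitespace-separated positive counts.
clr() {
    awk '{
        sumlog = 0
        for (i = 1; i <= NF; i++) sumlog += log($i)
        meanlog = sumlog / NF                 # log of the geometric mean G(x)
        for (i = 1; i <= NF; i++)
            printf "%s%.4f", (i > 1 ? " " : ""), log($i) - meanlog
        printf "\n"
    }'
}

# Toy sample with three taxa
echo "1 1 4" | clr   # prints: -0.4621 -0.4621 0.9242
```

The transformed values of a sample always sum to zero, which is a quick sanity check for any CLR implementation.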

We can now perform the Principal Component Analysis (PCA).

<b> Q6: What is PCA used for?</b>

Here we do a redundancy analysis (RDA) without constraints, which is equivalent to a PCA.
<pre>ord_clr <- ordinate(physeq_clr, "RDA")</pre>

Usually, we only look at the first couple of principal components because we can plot them easily but with a scree plot we can look at many more.
<pre>plot_scree(ord_clr) +
geom_bar(stat="identity", fill = "blue") +
labs(x = "\nAxis", y = "Proportion of Variance\n")
ggsave("Results/kaijuclrscree.png")</pre>

<b> Q7: What does the plot tell us about the principal components and their associated amount of information?</b>

Now, we will plot PC1 and PC2 while scaling the plot to reflect the relative amount of information explained by each.
<pre>clr1 <- ord_clr$CA$eig[1] / sum(ord_clr$CA$eig)
clr2 <- ord_clr$CA$eig[2] / sum(ord_clr$CA$eig)</pre>

We can make a plot where the axis ratio is fixed to reflect the relative importance of the two components:
<pre>plot_ordination(physeq, ord_clr, type="samples", color="Treatment") +
geom_point(size = 4) +
coord_fixed(clr2 / clr1) +
geom_text(aes(label=Day), colour="black")
ggsave("Results/kaijuclrPCA.png")</pre>

Or more spread out for easy reading:
<pre>plot_ordination(physeq, ord_clr, type="samples", color="Treatment") +
geom_point(size = 6) +
geom_text(aes(label=Day), colour="black")
ggsave("Results/kaijuclrPCA1x1.png")</pre>

We see a clear tendency for especially late phase antibiotic treatment samples being separated from the other samples, but is it statistically significant?
Albeit very useful, PCA is just an exploratory data visualization tool. To test whether the samples cluster beyond what is expected by sampling variability we will use permutational multivariate analysis of variance (PERMANOVA).
It does this by partitioning the sums of squares for the within- and between-cluster components using the concept of centroids. Many permutations of the data (i.e. random shuffling) are used to generate the null distribution.
The ADONIS test can be confounded by differences in dispersion (or spread), so we want to check this as well.

First, we create a distance matrix of Euclidean distances between the CLR-transformed samples. This is also called the [https://academic.oup.com/bioinformatics/article/34/16/2870/4956011 Aitchison distance].
<pre>clr_dist_matrix <- distance(physeq_clr, method = "euclidean")
adonis(clr_dist_matrix ~ sample_data(physeq_clr)$Treatment, method = "euclidean")</pre>

Why do we not always end up with exactly this result?
So, despite grouping the less significant “Day 1” samples together with the other antibiotic treatment samples or not merging Pre, Post and control samples, the analysis still shows that the samples are not randomly clustered according to treatment.
In fact, Treatment explains 31% of the variance.

Now, we want to do a post-hoc pairwise test to see exactly which treatments drive the variance. This can be done easily using the package pairwiseAdonis - installation guide [https://github.com/pmartinezarbizu/pairwiseAdonis here].

<pre>library(pairwiseAdonis)
pairwise.adonis(clr_dist_matrix, sample_data(physeq_clr)$Treatment, sim.method = "euclidean", p.adjust.m = "holm")</pre>

<b> Q8: Do we see any significant pairs?</b>

===Kaiju - Differential abundance using R===

Now we want to focus on differential abundance. In this case we will look for differential abundance based on phylogeny, but the chosen method can also be used for genes. [https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html DESeq2] is fairly easy to use, shows consistent performance and works by fitting a negative binomial model to count data. The default DESeq2 normalization is Relative Log Expression (RLE), which scales each sample by the median ratio of the sample counts over the geometric mean counts across samples. This [https://www.biorxiv.org/content/10.1101/406264v1.full.pdf paper] explains more about RLE and CLR. What is also very useful is that DESeq2 easily allows for the inclusion of covariates in the analysis.

First we load the needed R packages and the R-object created in "Kaiju - RStudio import". You will need to install DESeq2.
<pre>library(phyloseq)
library(DAtest)
library(DESeq2)
pacuphyseq = readRDS("pacu.phyloseq.rds")</pre>

Firstly, we need to decide at which level we want to test our hypothesis. Do you want to be specific, or is it more fitting with your hypothesis to test at a higher taxonomic level? It really depends on the hypothesis and sometimes it makes sense to do it on several levels.
Secondly, we need to filter. In this example we filter a lot to speed things up, but there really is no gold standard here. Scientifically, it is a trade-off between keeping rare Amplicon sequence variants (ASVs) which could be interesting to test, and removing ASVs to increase statistical power.
Here we collapse the hierarchical taxonomic data at genus level:

<pre>phy_genus <- tax_glom(pacuphyseq, "Genus")</pre>
This takes 5-10 minutes. In the meantime you can read a bit about [https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html DESeq2].

We display a summary of the data
<pre>phy_genus</pre>

The DESeq2 analysis is carried out.
<pre>treatdds <- phyloseq_to_deseq2(phy_genus, ~ Treatment)
treatdds <- DESeq(treatdds)</pre>

We investigate the results, using a significance threshold of 0.05. Entries with adjusted p-values of 0.05 or above are removed.

<pre>res = results(treatdds, alpha = 0.05)
alpha = 0.05
sigtab = res[which(res$padj < alpha), ]
sigtab = cbind(as(sigtab, "data.frame"), as(tax_table(phy_genus)[rownames(sigtab), ], "matrix"))</pre>

We take a look at the data. The dimensions tell us how many OTUs are significantly differentially abundant between the treatments.
<pre>head(sigtab)
dim(sigtab)</pre>

<b> Q9: How many OTUs are significantly different between the treatments? Try to change the alpha to 0.01. How many OTUs are then significant? </b>

In order to visualise the data we sort the OTUs according to adjusted p-value and select the 100 OTUs with the lowest adjusted p-value.
<pre>sig100 <- sigtab[order(sigtab$padj),][1:100,]</pre>

We visualise the significant OTU's. We choose to color and fill by Phylum and label by Genus.
<pre>library("ggplot2")
theme_set(theme_bw())
scale_fill_discrete <- function(palname = "Set1", ...) {
scale_fill_brewer(palette = palname, ...)
}
# Phylum order (based on the 100 selected OTUs in sig100)
x = tapply(sig100$log2FoldChange, sig100$Phylum, function(x) max(x))
x = sort(x, TRUE)
sig100$Phylum = factor(as.character(sig100$Phylum), levels=names(x))
# Genus order
x = tapply(sig100$log2FoldChange, sig100$Genus, function(x) max(x))
x = sort(x, TRUE)
sig100$Genus = factor(as.character(sig100$Genus), levels=names(x))
# Plot only the selected OTUs, not the full sigtab
ggplot(sig100, aes(x=Genus, y=log2FoldChange, color=Phylum)) + geom_point(size=2) +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust=0.5))
ggsave("Results/DDseq_100OTU.png")</pre>

Try to select the 350 OTUs with the lowest adjusted p-value and make the same plot.

<b> Q10: What do the plots with 100 and 350 OTUs show? Are any phyla dominant? </b>

Please find answers [[Kaiju_solution|here]]

Longread exercise answers

2024-03-19T15:41:44Z

WikiSysop: Created page with "'''Q1''' Counting all the lines minus the header gives us: <pre> zcat BGI_hg38_chr20.vcf.gz |grep -v "^#"|wc -l </pre> 1878 variants '''Q2''' We can try the following: <pre> zcat BGI_hg38_chr20.vcf.gz |grep -v "^#" |cut -f 10 |sed "s/:.*//g"|sort | uniq -c |sort -n </pre> <OL> <LI>grep -v "^#" grep -v: Inverts the match, i.e., selects lines that do not match the given pattern. "^#": The pattern to match lines starting with a hash (#). These lines are usually he..."

'''Q1'''

Counting all the lines minus the header gives us:
<pre>
zcat BGI_hg38_chr20.vcf.gz |grep -v "^#"|wc -l
</pre>

1878 variants

'''Q2'''

We can try the following:
<pre>
zcat BGI_hg38_chr20.vcf.gz |grep -v "^#" |cut -f 10 |sed "s/:.*//g"|sort | uniq -c |sort -n
</pre>

<OL>
<LI> grep -v "^#": -v inverts the match, i.e. selects lines that do not match the given pattern. "^#" matches lines starting with a hash (#). These are the header lines in VCF files.
<LI> cut -f 10: extracts specific fields from each line. -f 10 takes the 10th field, i.e. the genotype information.
<LI> sed "s/:.*//g": removes everything from the first colon onwards in each line. This leaves us with the GT field (e.g., 0/1, 1/1).
<LI> sort: sorts lines of text in alphabetical order, so that identical genotypes end up next to each other.
<LI> uniq -c: counts the number of occurrences of each unique line. Here, it counts each unique genotype.
<LI> sort -n: sorts the output numerically. This arranges the genotypes in ascending order of frequency.
</OL>

<pre>
33 1/2
85 1|1
182 0|1
587 1/1
991 0/1
</pre>

So 85+182=267 variants are already phased.
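Instead of adding up the rows by hand, the phased genotypes can also be counted directly, since they are the ones that use "|" rather than "/" as separator. This is a minimal sketch; the function name count_phased is made up, and it assumes a single-sample VCF with the genotype as the first entry of the 10th column.

```shell
# Hypothetical helper: count records whose genotype uses the phased separator "|".
# Reads an uncompressed VCF on standard input.
count_phased() {
    grep -v "^#" | cut -f 10 | cut -d ":" -f 1 | grep -c -F "|"
}

# Usage:
# zcat BGI_hg38_chr20.vcf.gz | count_phased
```

The same function can be reused on the VCF produced after phasing with the long reads.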

'''Q3'''

<pre>
chr20 2855611 rs4364082 T C 938.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.690;DB;DP=47;ExcessHet=3.0103;FS=10.989;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=19.97;ReadPosRankSum=0.976;SOR=0.117 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:21,26:47:99:0|1:2855611_T_C:946,0,745:2855611
chr20 2855618 rs6051444 C T 886.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.09;DB;DP=46;ExcessHet=3.0103;FS=14.200;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=19.27;ReadPosRankSum=0.855;SOR=0.387 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:22,24:46:99:0|1:2855611_T_C:894,0,797:2855611
</pre>

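They are only 7 bases away from each other (positions 2855611 and 2855618), so they are covered by the same reads, which is what allows GATK to phase them.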

'''Q4'''

Sequences are quite long compared to Illumina/BGI. Once the reads have been aligned (see Q5), the average length can be computed with:
<pre>
samtools view HG002_pacbio.bam |awk '{print length($10)}' | awk '{sum+=$1; n++} END {if(n>0) print sum/n}'
</pre>

'''Q5''':

Aligning+indexing:
<pre>
/home/ctools/minimap2/minimap2-2.26_x64-linux/minimap2 -a /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.mini /home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz |samtools view -uS - |samtools sort /dev/stdin > HG002_pacbio.bam
samtools index HG002_pacbio.bam
</pre>
Then:
<pre>
samtools view HG002_pacbio.bam |awk '{print length($10)}' | awk '{sum+=$1; n++} END {if(n>0) print sum/n}'
</pre>

The average read length is 9474.43 bases, so almost 10 kb.
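The average can also be estimated directly from the FASTQ file, without aligning first. The sketch below assumes the usual four-line FASTQ records, where the sequence is the second line of each record; the function name mean_fastq_length is made up for this illustration.

```shell
# Hypothetical helper: mean sequence length of a FASTQ stream on standard input.
# Assumes four lines per record, with the sequence on line 2 of each record.
mean_fastq_length() {
    awk 'NR % 4 == 2 { sum += length($0); n++ } END { if (n > 0) print sum / n }'
}

# Usage:
# zcat /home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz | mean_fastq_length
```

The number may differ somewhat from the BAM-based value, since the BAM file can contain secondary or supplementary alignment records in addition to one record per read.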

'''Q6''': yes

'''Q7''':

First phase:
<pre>
whatshap phase --ignore-read-groups --reference=/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -o BGI_hg38_chr20_phased.vcf.gz BGI_hg38_chr20.vcf.gz HG002_pacbio.bam
</pre>

Then run:

<pre>
zcat BGI_hg38_chr20_phased.vcf.gz |grep -v "^#" |cut -f 10 |sed "s/:.*//g"|sort | uniq -c |sort -n
</pre>

To get:
<pre>
7 0/1
33 1/2
550 1|0
616 0|1
672 1/1
</pre>

So 550+616=1166 variants are phased, compared to 267 previously.

Longread exercise

2024-03-19T15:41:18Z

WikiSysop: Created page with "<H2>Overview</H2> First: <OL> <LI>Navigate to your home directory: <LI>Create a directory called "longread" <LI>Navigate to the directory you just created. </OL> We will phase some variants using [https://www.biorxiv.org/content/10.1101/085050v2 WhatsHap] (no not the messaging app). First, what is phasing? Phasing means that we determine which base is on the same chromosome as another base for neighboring variants. Let's consider a small example with just two varia..."

<H2>Overview</H2>

First:
<OL>
<LI>Navigate to your home directory:
<LI>Create a directory called "longread"
<LI>Navigate to the directory you just created.
</OL>

We will phase some variants using [https://www.biorxiv.org/content/10.1101/085050v2 WhatsHap] (no, not the messaging app).

First, what is phasing?

Phasing means determining, for neighboring heterozygous variants, which alleles lie on the same copy of the chromosome.
Let's consider a small example with just two variants (single nucleotide polymorphisms or SNPs) to illustrate phasing:

<OL>
<LI> SNP1: Located on chromosome 1 at position 1000. The individual is heterozygous A or G.
<LI> SNP2: Located on chromosome 1 at position 2000. The individual is heterozygous C or T.
</OL>

Great! But do we have:
<OL>
<LI> A and C on the same chromosome and G and T on the other chromosome
<LI> A and T on the same chromosome and G and C on the other chromosome
</OL>

Without phasing we don't have this information. This is important because the combination of alleles on each chromosome copy can affect the phenotype, and therefore health and the reaction to drugs/treatments.

In a VCF, unphased variants will appear like this:
<pre>
chr1 1000 rs123 A G 29 PASS INFO GT 0/1
chr1 2000 rs456 C T 29 PASS INFO GT 0/1
</pre>

'''0/1''' means heterozygous reference+alternative.

Now, if A and C are on the same chromosome, phased variants can appear as:
<pre>
chr1 1000 rs123 A G 29 PASS INFO GT 0|1
chr1 2000 rs456 C T 29 PASS INFO GT 0|1
</pre>

But if A and C are on different chromosomes, phased variants can appear as:
<pre>
chr1 1000 rs123 A G 29 PASS INFO GT 0|1
chr1 2000 rs456 C T 29 PASS INFO GT 1|0
</pre>
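To make the two cases concrete, the haplotypes can be spelled out from the phased genotypes: the allele index before the "|" belongs to one chromosome copy and the index after it to the other. The sketch below is only an illustration on a simplified input of three columns (REF, ALT, GT), not on a real VCF, and the function name haplotypes is made up.

```shell
# Hypothetical helper: reconstruct both haplotypes from phased biallelic genotypes.
# Input lines: REF ALT GT (whitespace-separated), with GT such as 0|1 or 1|0.
haplotypes() {
    awk '{
        allele[0] = $1; allele[1] = $2   # 0 = reference allele, 1 = alternative allele
        split($3, gt, "|")               # gt[1] = first copy, gt[2] = second copy
        hap1 = hap1 allele[gt[1]]
        hap2 = hap2 allele[gt[2]]
    } END { print hap1 "/" hap2 }'
}

printf 'A G 0|1\nC T 0|1\n' | haplotypes   # prints AC/GT: A and C on the same chromosome
printf 'A G 0|1\nC T 1|0\n' | haplotypes   # prints AT/GC: A and T on the same chromosome
```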

In this exercise, we will:
<OL>
<LI> Do standard genotyping using BGI sequencing from an [https://en.wikipedia.org/wiki/Ashkenazi_Jews Ashkenazi] individual
<LI> Align long reads from PacBio
<LI> Learn how to install software using [https://bioconda.github.io/ bioconda]
<LI> Use the long reads to phase our variants
</OL>

<H2>Genotyping with BGI reads</H2>

The reads are here:
<pre>
/home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz
/home/projects/22126_NGS/exercises/long_reads/BGI2.fq.gz
</pre>

They do not have adapters. As we have previously covered aligning and genotyping, you can copy and paste the commands; just make sure you understand what you are doing. First, let's have a look at the BGI-Seq data:
<pre>
zcat /home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz |head
</pre>

You will notice it is very much like Illumina in terms of read length and encoding.

Just go ahead and align them using bwa mem and sort them:

<pre>
bwa mem -R "@RG\tID:HG002\tSM:HG002" -t 10 /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa /home/projects/22126_NGS/exercises/long_reads/BGI1.fq.gz /home/projects/22126_NGS/exercises/long_reads/BGI2.fq.gz |samtools view -uS - |samtools sort /dev/stdin > BGI_hg38.bam
</pre>

Let's mark duplicates (with these options Picard MarkDuplicates flags duplicate reads rather than deleting them):
<pre>
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates -I BGI_hg38.bam -M BGI_hg38_metrics.txt -O BGI_hg38_rmdup.bam
</pre>

then index:
<pre>
samtools index BGI_hg38_rmdup.bam
</pre>

Let's genotype:
<pre>
gatk --java-options "-Xmx10g" HaplotypeCaller -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I BGI_hg38_rmdup.bam -L chr20:2000000-3000000 -O BGI_hg38_chr20.gvcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF
gatk IndexFeatureFile -I BGI_hg38_chr20.gvcf.gz
gatk GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V BGI_hg38_chr20.gvcf.gz -O BGI_hg38_chr20.vcf.gz -L chr20:2000000-3000000 --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
</pre>

Notice that GATK will phase variants on the same read (or pairs).

'''Q1''': How many variants are there? (hint: do not forget to remove the header)

'''Q2''': How many variants are phased? (hint: remove the header and look at the 10th column (the genotype info) using cut).

'''Q3''': Consider rs4364082 and rs6051444 (hint: search using grep). Why are these variants phased?
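
For Q1 and Q2, it helps to know that a VCF marks a phased genotype with a vertical bar (for example 0|1) and an unphased one with a slash (0/1). The sketch below shows the counting idea on a tiny made-up VCF. The records are invented for illustration and are not course data; for the real, gzipped file you would start the pipeline with zcat instead of reading a plain file.

```shell
# Toy VCF with three invented records (not course data)
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG002\n' >> "$vcf"
printf 'chr20\t100\t.\tA\tG\t50\t.\t.\tGT\t0/1\n' >> "$vcf"
printf 'chr20\t150\t.\tC\tT\t50\t.\t.\tGT\t0|1\n' >> "$vcf"
printf 'chr20\t160\t.\tG\tA\t50\t.\t.\tGT\t1|0\n' >> "$vcf"

# Q1 idea: count the lines that are not header lines
n_variants=$(grep -vc '^#' "$vcf")

# Q2 idea: keep column 10, keep its first ':'-separated field (GT), count those with '|'
n_phased=$(grep -v '^#' "$vcf" | cut -f10 | cut -d: -f1 | grep -c '|')

echo "variants: $n_variants, phased: $n_phased"
rm -f "$vcf"
```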

<H2>Align PacBio reads</H2>

First, let's have a look at PacBio data:

<pre>
/home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz
</pre>

'''Q4''': What do you notice?

Let's align to hg38 and sort:
<pre>
/home/ctools/minimap2/minimap2-2.26_x64-linux/minimap2 -a /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.mini [input fastq here] |samtools view -uS - |samtools sort /dev/stdin > [output BAM here]
</pre>

The -a option makes minimap2 output SAM, which is then converted to BAM and sorted.

Let's index:
<pre>
samtools index [bam file]
</pre>

'''Q5''': What is the average read length? (hint: awk '{print length($10)}' prints the length of the 10th field, the sequences. hint2: to compute the average of the first column of numbers use: awk '{sum+=$1; n++} END {if(n>0) print sum/n}')
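
The two awk hints can be chained. The sketch below runs them on two made-up SAM records (the read names and sequences are invented). With a real BAM file the records would come from samtools view, which omits the header by default.

```shell
# Two invented SAM records with sequences of length 4 and 8 in column 10
avg=$(printf 'r1\t0\tchr20\t1\t60\t4M\t*\t0\t0\tACGT\t*\nr2\t0\tchr20\t5\t60\t8M\t*\t0\t0\tACGTACGT\t*\n' \
  | awk '{print length($10)}' \
  | awk '{sum+=$1; n++} END {if(n>0) print sum/n}')

echo "average read length: $avg"
```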

<H2>Use bioconda to install software</H2>

Bioconda is a game-changer for anyone starting bioinformatics. It allows you to install software very easily and offers a vast repository of bioinformatics tools. It solves the problem of needing library A to install software B, needing C to build A, and so on. All you need is to set up an "environment" where your software will be installed. An "environment" is a directory in your home directory where the software and its dependencies will be installed.

Beware! The behavior of several commands like python will not be the same as it will use the python from your environment.
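
Part of the reason is that activating an environment puts its bin directory first on your PATH, so the shell finds the environment's programs before the system ones. Here is a minimal sketch of that lookup order, using a temporary directory and a made-up program name (mytool) rather than a real conda environment:

```shell
# Stand-in for an environment directory (hypothetical, not created by conda)
env_dir=$(mktemp -d)
mkdir -p "$env_dir/bin"
printf '#!/bin/sh\necho "tool from the environment"\n' > "$env_dir/bin/mytool"
chmod +x "$env_dir/bin/mytool"

# Prepending the directory to PATH makes the shell look there first
old_path=$PATH
PATH="$env_dir/bin:$PATH"
resolved=$(command -v mytool)
PATH=$old_path

echo "mytool resolved to: $resolved"
expected="$env_dir/bin/mytool"
rm -rf "$env_dir"
```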

Let's install WhatsHap through bioconda. First, let's create an environment called whatshap-env and install whatshap:

<pre>
/home/ctools/bin/conda create -n whatshap-env bioconda::whatshap
</pre>

Then init the environment:
<pre>
/home/ctools/bin/conda init bash
</pre>

'''Log out and log back in'''

Activate the environment:
<pre>
conda activate whatshap-env
</pre>

Check the installation:
<pre>
whatshap --help
</pre>

'''Q6''': Was that easy?

<H2>Phase variants using WhatsHap</H2>

Then let's phase our variants:
<pre>
whatshap phase --ignore-read-groups --reference=/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -o [output vcf] [input vcf] [long reads bam]
</pre>

'''Q7''': How many extra variants are phased?

To deactivate conda write:
<pre>
conda deactivate
</pre>

You did not like conda? Do not forget to remove the following from your ~/.bashrc:

<pre>
# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
__conda_setup="$('/home/ctools/anaconda3_2021.11/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
eval "$__conda_setup"
else
if [ -f "/home/ctools/anaconda3_2021.11/etc/profile.d/conda.sh" ]; then
. "/home/ctools/anaconda3_2021.11/etc/profile.d/conda.sh"
else
export PATH="/home/ctools/anaconda3_2021.11/bin:$PATH"
fi
fi
</pre>

and remove ~/.conda/:

<pre>
rm -rfv ~/.conda/
</pre>

Please find the answers [[Longread_exercise_answers|here]]

'''Congratulations, you finished the exercise!'''

File:Rnaseq fig3.png

2024-03-19T15:39:51Z

WikiSysop:

File:Rnaseq fig2.png

2024-03-19T15:39:24Z

WikiSysop:

File:Rnaseq fig1.png

2024-03-19T15:39:01Z

WikiSysop:

Rnaseq exercise answers

2024-03-19T15:37:57Z

WikiSysop:

<div class="page-content has-page-title">
<div id="overview-and-background" class="section level1">
<h1>Overview and background</h1>
<div id="groups" class="section level2">
<h2>Groups</h2>
<p>Please get into groups of 2-3. We don’t have enough computational power for all of you working alone. Please let the instructors know if you need help finding a group.</p>
</div>

<div id="assignment-notes" class="section level2">
<h2>Assignment notes</h2>
<p>While some questions might seem hard we naturally don’t ask questions/tasks which you have not been given the tools to solve in this assignment - so if you are stuck try thinking about what you have already learned before asking an instructor.</p>
</div>

<div id="assignment-overview" class="section level2">
<h2>Assignment overview</h2>
<p>In this assignment you are going to analyze RNA-sequencing data from real cancer patients to analyze the importance of alternative splicing in a clinical context</p>
</div>

<div id="biological-background" class="section level2">
<h2>Biological background</h2>
<p>Today you will be working with colorectal cancers - specifically Colon Adenocarcinoma (often abbreviated COAD). It is a very frequent cancer of the colon. The lifetime risk of developing colorectal cancer is ~4% for both males and females. That means COAD represents ~10% of all cancers and results in the death of hundreds of thousands of people each year! (More info on COAD can be found on [https://en.wikipedia.org/wiki/Colorectal_cancer Wikipedia].)</p>

<p>One important aspect of cancer is that tumors from different patients are extremely different even when they originate from the same tissue (more info on tumor heterogeneity [https://en.wikipedia.org/wiki/Tumour_heterogeneity here]). To improve treatment and prognosis we therefore try to classify COAD into cancer subtypes (a simple form of precision medicine). We currently think there are 5 subtypes (see [https://www.cell.com/cancer-cell/pdf/S1535-6108(18)30114-4.pdf Liu ''et al.'']) and today you will be working with CIN and GS. CIN is an abbreviation for Chromosomal INstable and GS means genome stable. More on that later.</p>

<p>To help us understand COAD subtypes you will today compare these to healthy adjacent tissue. For all samples a biopsy was taken and bulk RNA-seq performed. Low-quality samples have been removed.</p>

</div>
<div id="bioinformatic-background" class="section level2">
<h2>Bioinformatic background</h2>
<p>For background on transcriptomics and splicing please refer to today’s slides. The data you are working with is a randomly selected subset of the TCGA COAD data (google TCGA if you want to know more). The data was quantified with Kallisto against the human transcriptome.</p>

<p>Today you will be using the 'pairedGSEA' R package we developed. This package is specifically designed to make it easy to do the following analysis:</p>

<ol style="list-style-type: decimal">
<li>Differential gene expression (aka DGE) via DESeq(2)</li>
<li>Differential gene usage (differential splicing) (aka DGU)</li>
<li>gene-set over-representation analysis (ORA) on DGU and DGE
results</li>
</ol>
<p>At each step, the package also facilitates easy comparison of DGE and DGU.</p>
<hr />
</div>
</div>
<div id="assignment" class="section level1">
<h1>Assignment</h1>
<div id="step-1-determine-which-cancer-to-work-with" class="section level2">
<h2>Step 1: Determine which cancer to work with</h2>
<p>Determine which cancer type you will work with:</p>
<ul>
<li>If your birthday is within the first 6 months of the year (January-June) you will work with <strong>CIN</strong>.</li>
<li>If your birthday is within the last 6 months of the year (July-December) you will work with <strong>GS</strong>.</li>
</ul>
</div>
<div id="step-2-set-up-enviroment" class="section level2">
<h2>Step 2: Set up environment</h2>
<p>Log into the server as you usually do, except this time you have to use the '-X' option. That means using:</p>

<pre>
ssh -X username@pupil1.healthtech.dtu.dk
</pre>

<p>Make a directory for this exercise and move into it</p>
<pre>
mkdir transcriptomics_exercise
cd transcriptomics_exercise
</pre>

<p>Copy the exercise data of your cancer subtype to your folder</p>
<pre>
### for CIN subtype:
cp /home/projects/22126_NGS/exercises/transcriptomics/coad_iso_subset_cin.Rdata .

### For GS subtype:
cp /home/projects/22126_NGS/exercises/transcriptomics/coad_iso_subset_gs.Rdata .
</pre>

</div>
<div id="step-3-start-r-session-and-enviroment" class="section level2">
<h2>Step 3: Start R session and environment</h2>
<p>Start an R session in your terminal by typing (or copy/pasting)</p>
<pre>
R-4.2.2
</pre>
<p>And load the library we need by typing</p>
<pre>
library(pairedGSEA)
</pre>

<p>This loads the functionality of the “pairedGSEA” R package.</p>
</div>
<div id="step-4-load-and-inspect-data" class="section level2">
<h2>Step 4: Load and inspect data</h2>
<p>Load the assignment data into your R session:</p>
<pre>
### for CIN subtype:
load('coad_iso_subset_cin.Rdata')

### For GS subtype:
load('coad_iso_subset_gs.Rdata')
</pre>
<p>This will give you three data objects in your R session:</p>
<ol style="list-style-type: decimal">
<li>A count matrix</li>
<li>A matrix with meta information about each sample in the count matrix.</li>
<li>A list of gene_sets that you should use for your ORA analysis (step 7).</li>
</ol>

<p>All objects can be directly used by the 'pairedGSEA'
package - no need to do any data modifications.</p>
<p><br></p>
<p>Use the following functions to take a look at the data:</p>
<pre>
### List objects in an R session
ls()

### Inspect the first lines of the object
head( <object_name> )
</pre>

<p><strong>Question</strong>: Which object contains what data?</p>
<p><strong>Answer</strong>:</p>
<ol style="list-style-type: decimal">
<li>cinCountsSubset : Count data</li>
<li>cinMeta : Condition info (ctrl vs cancer)</li>
<li>gene_set_list : List of gene-sets</li>
</ol>
</div>
<div id="step-5-run-differential-analysis" class="section level2">
<h2>Step 5: Run differential analysis</h2>
<p>Next you will need to use the 'pairedGSEA' package and
here a bit of self-study is needed. <strong>Importantly</strong> you
should only run this analysis once per group - else we don’t have
enough computational power. You can download the
'pairedGSEA' vignette (short document showing how to use it)
<a href="https://www.dropbox.com/s/oalth29pxulffec/pairedGSEA.html?dl=1">here</a>.</p>
<p>Hints:</p>
<ol style="list-style-type: decimal">
<li>After reading the introduction you can skip to the
'3.3 Running the analysis' section.</li>
<li>For now you only need to use 'paired_diff()' as that
makes both differential analyses (both DGE and DGU).</li>
<li>There is no need to use the “store_results” option</li>
</ol>
<p><strong>Question</strong>: This will take a while to run (~10 min).
In the mean time take a closer look at the Liu <em>et al.</em> paper
(see above) and summarise what the difference between the CIN and GS
COAD subtypes are.</p>
<p><strong>Answer</strong>:</p>
<pre>
gi_diff_results <- paired_diff(
object = cinCountsSubset,
metadata = cinMeta, # Use with count matrix or if you want to change it in
# the input object
group_col = 'condition',
sample_col = 'sample_id',
baseline = 'Control',
case = 'COAD_genome_instable',
store_results = FALSE
)
</pre>
</div>
<div id="step-6-inspect-diffrential-result" class="section level2">
<h2>Step 6: Inspect differential result</h2>
<p><strong>Question</strong>: Look at the first 10 lines of the result
file. Which gene is most significant (smallest p-value) for the DGE and
DGU analysis (respectively DESeq2 and DEXSeq)?</p>
<p><strong>Answer</strong>:</p>
<ul>
<li>DESeq2 (DGE): AAR2</li>
<li>DEXSeq (DGU): A1BG</li>
</ul>
<p><br></p>
<p>The following code <em>example</em> counts how many significantly
differentially expressed genes are found:</p>
<pre>
sum( gi_diff_results$padj_deseq < 0.05, na.rm = T )
</pre>
<p><strong>Question</strong>: Modify the R code above to count how many
genes are DGE and DGU.</p>
<p><strong>Answer</strong></p>
<pre>
sum( gi_diff_results$padj_deseq < 0.05, na.rm = T )
# 4860
sum( gi_diff_results$padj_dexseq < 0.05, na.rm = T )
# 2117
</pre>

<p><strong>Question</strong>: Use the 'nrow()' function to
calculate the fraction of genes that are DGE and DGU.</p>
<p><strong>Answer</strong>:</p>
<pre>
sum( gi_diff_results$padj_deseq < 0.05, na.rm = T ) / nrow(gi_diff_results)
# 0.66
sum( gi_diff_results$padj_dexseq < 0.05, na.rm = T ) / nrow(gi_diff_results)
# 0.29
</pre>

<p>Now we are ready to do the gene-set enrichment analysis.</p>
</div>
<div id="step-7-run-gene-set-enrichment-analysis" class="section level2">
<h2>Step 7: Run Gene-Set Enrichment Analysis</h2>
<p>Use the vignette to help you use 'pairedGSEA' to run GSEA on both DGE and DGU results (see the vignette section 4: “Over-Representation Analysis”). You should use the 'gene_set_list' object you have already loaded into R instead of using the 'prepare_msigdb()' function.</p>

<p>Note: There is (again) no need to store the intermediary results.</p>
<p><strong>Answer</strong></p>
<pre>
gi_paired_ora <- paired_ora(
paired_diff_result = gi_diff_results,
gene_sets = gene_set_list,
experiment_title = NULL
)
</pre>
</div>
<div id="step-8-inspect-ora-result" class="section level2">
<h2>Step 8: Inspect ORA result</h2>
<p>What you have been analyzing so far is a subset of the entire dataset
(since the runtime would otherwise have been 3-4x longer). To enable a more
realistic last step, use <strong>one</strong> of these commands to load
the full results corresponding to what you have been working with.</p>
<pre>
### for CIN subtype:
load('/home/projects/22126_NGS/exercises/transcriptomics/03_coad_cin_ora.Rdata')
# loads the "cin_ora" object

### For GS subtype:
load('/home/projects/22126_NGS/exercises/transcriptomics/03_coad_gs_ora.Rdata')
# loads the gs_ora object
</pre>
<p>The following code <em>example</em> extracts the ORA analysis of
either DGU or DGE and sorts it so the most significant gene-sets are at
the top.</p>

<pre>
### DGE:
dge_ora_sorted <- gi_paired_ora[
sort.list(gi_paired_ora$pval_deseq), # sort part
c('pathway','pval_deseq','enrichment_score_deseq') # select part
]

### DGU ORA:
dgu_ora_sorted <- gi_paired_ora[
sort.list(gi_paired_ora$pval_dexseq), # sort part
c('pathway','pval_dexseq','enrichment_score_dexseq') # select part
]
</pre>

<p><strong>Question</strong>: Look at the 10-15 most significant gene
sets from both analyses. What are the similarities and differences?</p>

<p><strong>Answer</strong></p>
<pre>
### DGE:
dge_ora_sorted <- cin_ora[
sort.list(cin_ora$pval_deseq), # sort part
c('pathway','pval_deseq','enrichment_score_deseq') # select part
]

head(dge_ora_sorted, 15)
</pre>

<pre>
## pathway pval_deseq
## 3823 REACTOME_RRNA_PROCESSING 3.694775e-19
## 4433 GOBP_RIBONUCLEOPROTEIN_COMPLEX_BIOGENESIS 4.453501e-17
## 3879 GOBP_RIBOSOME_BIOGENESIS 5.320229e-16
## 3785 KEGG_RIBOSOME 2.192710e-14
## 1061 GOBP_MITOTIC_CELL_CYCLE_PROCESS 1.962214e-13
## 4700 HALLMARK_E2F_TARGETS 2.038376e-13
## 977 REACTOME_CELL_CYCLE 2.567524e-13
## 3759 REACTOME_EUKARYOTIC_TRANSLATION_ELONGATION 3.350223e-13
## 3828 REACTOME_SELENOAMINO_ACID_METABOLISM 4.766866e-13
## 4598 HALLMARK_G2M_CHECKPOINT 7.966641e-13
## 3923 REACTOME_EUKARYOTIC_TRANSLATION_INITIATION 3.734833e-12
## 864 GOCC_NUCLEOLUS 6.125940e-12
## 747 REACTOME_INFECTIOUS_DISEASE 7.879724e-12
## 425 GOBP_RESPONSE_TO_ORGANIC_CYCLIC_COMPOUND 8.207220e-12
## 2449 GOCC_ANCHORING_JUNCTION 9.689453e-12
## enrichment_score_deseq
## 3823 0.6239502
## 4433 0.4724997
## 3879 0.5166409
## 3785 0.7224417
## 1061 0.3588459
## 4700 0.5451123
## 977 0.3721868
## 3759 0.6923103
## 3828 0.6601218
## 4598 0.5398054
## 3923 0.6205530
## 864 0.3070925
## 747 0.3249413
## 425 0.3294544
## 2449 0.3259827
</pre>

<ul>
<li>DGE: something with RIBOSOME and CELL_CYCLE</li>
</ul>
<pre class="r">
### DGU ORA:
dgu_ora_sorted <- cin_ora[
sort.list(cin_ora$pval_dexseq), # sort part
c('pathway','pval_dexseq','enrichment_score_dexseq') # select part
]
head(dgu_ora_sorted, 15)
</pre>
<pre>
## pathway
## 2757 GOBP_ACTIN_FILAMENT_BASED_PROCESS
## 2449 GOCC_ANCHORING_JUNCTION
## 2787 REACTOME_SIGNALING_BY_RHO_GTPASES_MIRO_GTPASES_AND_RHOBTB3
## 3180 GOMF_NUCLEOSIDE_TRIPHOSPHATASE_REGULATOR_ACTIVITY
## 2259 GOMF_ENZYME_REGULATOR_ACTIVITY
## 2345 GOMF_CYTOSKELETAL_PROTEIN_BINDING
## 2682 GOMF_TRANSFERASE_ACTIVITY_TRANSFERRING_PHOSPHORUS_CONTAINING_GROUPS
## 3363 GOBP_REGULATION_OF_SMALL_GTPASE_MEDIATED_SIGNAL_TRANSDUCTION
## 2045 GOCC_SUPRAMOLECULAR_COMPLEX
## 2806 GOMF_PROTEIN_DOMAIN_SPECIFIC_BINDING
## 2869 GOBP_SMALL_GTPASE_MEDIATED_SIGNAL_TRANSDUCTION
## 1781 GOBP_POSITIVE_REGULATION_OF_CATALYTIC_ACTIVITY
## 3047 WP_VEGFAVEGFR2_SIGNALING_PATHWAY
## 2377 GOBP_ORGANOPHOSPHATE_METABOLIC_PROCESS
## 2072 GOBP_CELL_MORPHOGENESIS
## pval_dexseq enrichment_score_dexseq
## 2757 3.504255e-16 0.7528919
## 2449 7.263291e-15 0.7065107
## 2787 1.728464e-14 0.7489251
## 3180 1.837683e-14 0.8509545
## 2259 2.393632e-14 0.6135585
## 2345 4.224209e-14 0.6584802
## 2682 1.400344e-13 0.6628826
## 3363 3.661677e-13 0.9806811
## 2045 4.692953e-13 0.5934650
## 2806 2.857812e-12 0.7084235
## 2869 3.593536e-12 0.7767232
## 1781 5.678506e-12 0.5736414
## 3047 5.845276e-12 0.8127997
## 2377 6.168410e-12 0.6300335
## 2072 1.251178e-11 0.6002377
</pre>
<ul>
<li>DGU: something with ACTIN, JUNCTION and SIGNALING</li>
</ul>
</div>
<div id="step-9-visual-inspection-of-ora-result" class="section level2">
<h2>Step 9: Visual inspection of ORA result</h2>
<p><strong>Question</strong>: Based on your insights from step 8 use the 'plot_ora()' functionality to test if these are just examples or generalize to all the significant results. An example: If, among the top 10-15 gene-sets, I saw that only DGU had gene-sets covering “telomere” function, I would use the 'plot_ora()' function to test this.</p>
<p><strong>Answer</strong></p>

<pre class="r">
plot_ora(
ora=cin_ora,
plotly = FALSE,
pattern = "CELL_CYCLE", # Identify all gene sets about the cell cycle
cutoff = 0.1, # Only include significant gene sets
lines = TRUE, # Guide lines
colors = c('red','blue','black')
)
</pre>

[[File:Rnaseq_fig1.png]]

<p>Looks like cell cycle changes are mediated by both (enrichment is on the diagonal) and the majority is significant for both DGE and DGU.</p>

<pre>
plot_ora(
ora=cin_ora,
plotly = FALSE,
pattern = "RIBOSOME", # Identify all gene sets about ribosomes
cutoff = 0.33, # Only include significant gene sets
lines = TRUE, # Guide lines
colors = c('red','blue','black')
)
</pre>

<p>
[[File:Rnaseq_fig2.png]]

Ribosome is clearly mainly significant for DGE.</p>
<pre>
plot_ora(
ora=cin_ora,
plotly = FALSE,
pattern = "ACTIN", # Identify all gene sets about actin
cutoff = 0.33, # Only include significant gene sets
lines = TRUE, # Guide lines
colors = c('red','blue','black')
)
</pre>

[[File:Rnaseq_fig3.png]]

<p>Although many actin-related pathways are significant for both DGU and DGE more are DGU. Also, the enrichment among DGU is more pronounced (points are to the right of the diagonal line).</p>
<p><br></p>

<p>Lastly, note the low correlation suggesting an overall low similarity in biological signaling mediated through DGE and DGU.</p>
<p><strong>Question</strong>: Try to make a hypothesis as to why this/these molecular functions might be important for cancer.</p>

<p><strong>Answer</strong>:</p>
<ul>
<li>CELL_CYCLE: One of the main hallmarks of cancer - uncontrolled cell division.</li>
<li>RIBOSOME: Many ribosomes are needed when cells are dividing (as indicated by increased cell cycle).</li>
<li>ACTIN: Actin is involved in cell movement and thereby cancer invasion and metastasis.</li>
</ul>
</div>
<div id="step-10-critical-self-evaluation" class="section level2">

<h2>Step 10: Critical self evaluation</h2>
<p><strong>Question</strong>: Take a moment to think about what potential problems there could be with this assignment. Are there any obvious things we have not taken into consideration?</p>

<p><strong>Answer</strong>: The main problems are:</p>
<ol style="list-style-type: decimal">
<li>More QC should have been done (clustering, outliers, etc)</li>
<li>This is only a subset of the data (the real dataset has ~300 cancer samples)</li>
<li>We do not take co-factors into account. How many of the effects are due to e.g. gender and age differences?</li>
</ol>
</div>
<div id="step-11-repport-result" class="section level2">

<h2>Step 11: Report result</h2>
<p>Go to the blackboard and report one or more of the following:</p>
<ul>
<li>A keyword that showed a similar enrichment pattern in DGU and DGE</li>
<li>A keyword that showed preferential regulation through DGU or DGE</li>
</ul>
<hr/>
</div>
</div>
<div id="bonus-assignment" class="section level1">
<h1>Bonus Assignment</h1>
<p>Use 'pairedGSEA' to analyze the other COAD cancer subtype (the one you did not analyze). Are the gene-sets similar or different between the subtypes and analysis types?</p>
</div>
</div>

Rnaseq exercise

2024-03-19T15:37:12Z

Ancient DNA exercise answers

2024-03-19T15:36:27Z

WikiSysop:

'''Q1'''

the read length is about 100bp but the actual insert size is unknown.

'''Q2'''

very low, less than 1%

'''Q3'''

About 40bp.

'''Q4'''

About 25%.

'''Q5'''

As and Gs

'''Q6'''

The sample indeed looks ancient. If we did not see DNA fragmentation or damage it could be indicative of present-day human contamination.

'''Q7'''

<pre>
wc -l world.fam
wc -l world.bim
</pre>
297 samples and 587772 SNPs.

'''Q8'''

<pre>
cut -f2 world.sampleInfo.txt | tail -n +2 | sort | uniq -c|sort -rn
70 Yoruba
33 Han
29 Basque
27 Sardinian
25 French
20 Hungarian
20 Greek
19 Bedouin2
17 Adygei
10 Lithuanian
10 Armenian
8 Tuscan
1 UstIshim
1 Stuttgart
1 Samara
1 NE1
1 MA1
1 Loschbour
1 Karelia
1 Iceman
1 Brana
</pre>

'''Q9:'''

You should be getting the same:
<pre>
plink --bfile world --missing --out world
587772 variants loaded from .bim file.
297 people (0 males, 0 females, 297 ambiguous) loaded from .fam.
</pre>

'''Q10:'''

<pre>
cat world.imiss |grep -i ice
Iceman Iceman Y 11873 587772 0.0202
</pre>

so about 2%.
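
The last column of the .imiss file (F_MISS) is the number of missing genotype calls (N_MISS, column 4) divided by the number of genotypes considered (N_GENO, column 5). As a quick check, the fraction for the line above can be recomputed with awk:

```shell
# Recompute F_MISS from N_MISS and N_GENO for the Iceman line shown above
f_miss=$(printf 'Iceman Iceman Y 11873 587772 0.0202\n' \
  | awk '{printf "%.4f\n", $4/$5}')

echo "fraction missing: $f_miss"
```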

'''Q11'''

<pre>
zcat RISE507.pileup.gz |wc -l
102014
</pre>

'''Q12'''

Using:
<pre>
plink --bfile world --bmerge RISE507 --out RISE507.merge
</pre>

should result in:
<pre>
Error: 253 variants with 3+ alleles present.
</pre>

This is normally due to tri-allelic sites, which should be very few. However, in our case, there are a lot. This is likely due to damage that creates spurious variations.

'''Q13:'''

The Yoruba.

'''Q14:'''

The Han.

'''Q15:'''

The [https://en.wikipedia.org/wiki/Adyghe_people Adygei]

'''Q16'''

The Sardinians

'''Q17'''

the [https://en.wikipedia.org/wiki/Ust%27-Ishim_man Ust-Ishim] and the [https://en.wikipedia.org/wiki/Mal%27ta%E2%80%93Buret%27_culture Mal'ta–Buret' boy] (MA1).

There are many reasons that can explain this:
# the ancient individuals completely fall outside the range of genomic diversity of modern humans i.e. they were isolated populations that potentially died off.
# these were individuals with mixed ancestry
# they contain numerous errors due to damage

'''Q18:'''

The RISE507 sample from Afanasievo culture.

'''Q19:'''

Our individual is now a bit outside of the Hungarian / French cluster.

'''Q20'''

[https://www.nature.com/articles/nature14507 Allentoft et al. 2015] actually found that the individuals from the [https://en.wikipedia.org/wiki/Afanasievo_culture Afanasievo] were genetically indistinguishable from the [https://en.wikipedia.org/wiki/Yamnaya_culture Yamnaya culture], which is a culture closely related to [https://en.wikipedia.org/wiki/Western_Steppe_Herders Western Steppe Herders], one of the major genetic contributors to present-day Europeans.

Ancient DNA exercise

2024-03-19T15:35:43Z

WikiSysop:

<H2>Overview</H2>

Adapted from Martin Sikora.

First:
<OL>
<LI>Navigate to your home directory:
<LI>Create a directory called "adna"
<LI>Navigate to the directory you just created.
</OL>

We will try to
# Authenticate ancient DNA
# do some basic population genetics

<h2> Data authentication</h2>

Authentication involves making sure that the DNA that you have extracted from your fossil and sequenced is indeed from the fossil and not some modern contaminant. A big difference between modern DNA and ancient DNA is the presence of chemical damage due to the passage of time.

<h3> Direct measurements of the rate of chemical damage</h3>

First, create a directory:
<pre>
mkdir 01_authentication
cd 01_authentication
</pre>

We will characterize DNA damage patterns using mapDamage, a software to estimate the rate of nucleotide substitution. In this section, we will examine some example BAM files for the presence of DNA damage patterns typical of ancient DNA.

We have a set of 10 modern and 26 ancient individuals (subsampled to 100k reads)
<pre>
find /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ -name "*bam"
</pre>

First, run mapDamage on one of the modern individuals:

<pre>
mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/modern/NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.bam -r /home/databases/references/human/hs37d5.fa --no-stats
</pre>

Examine the output:

<pre>
cd NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.mapDamage/
okular Length_plot.pdf &
okular Fragmisincorporation_plot.pdf &
cd ..
</pre>

'''Q1:''' which fragment length occurs most frequently?

'''Q2:''' what is the frequency of 5' C>T and 3' G>A substitutions?

Run mapDamage on one of the ancient individuals
<pre>
mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ancient/allentoft_2015/RISE559.sort.rmdup.realign.md.100k.bam -r /home/databases/references/human/hs37d5.fa --no-stats
</pre>

Examine the output
<pre>
cd RISE559.sort.rmdup.realign.md.100k.mapDamage/
okular Length_plot.pdf &
okular Fragmisincorporation_plot.pdf &
cd ..
</pre>

'''Q3:''' At what fragment length does the distribution show its peak?

'''Q4:''' what are the frequencies of 5' C>T (red line) and 3' G>A substitutions (blue line)?

'''Q5:''' which bases are enriched at 5' flanking position?

'''Q6:''' does your sample look ancient? if not, what might be the reason?

<H2> Population genetics </H2>

Create a new subdirectory and navigate to it:
<pre>
cd ..
mkdir 02_popgen
cd 02_popgen
</pre>

<H3>Explore the reference panel dataset</H3>

Our reference panel dataset is in binary PLINK format, a widely used format in genetic studies (see documentation [https://www.cog-genomics.org/plink/1.9/ here]). We need to access the following files:

<pre>
ls /home/projects/22126_NGS/exercises/adna/02_popgen/plink/
</pre>

However, instead of copying them, we will create symbolic links using the ln command. These act as placeholders and tell the operating system to pretend that there is an actual file there. This saves considerable disk space compared to copying over the files.

<pre>
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bed .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bim .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.cluster .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.fam .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.sampleInfo.txt .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/eur.poplist .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/modern.poplist .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/noneur.poplist .
</pre>
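
If symbolic links are new to you, the following sketch shows what one does, using a temporary directory and made-up file names (original.txt and link.txt are only for illustration). Reading the link returns the content of the original file, and readlink shows where the link points.

```shell
# Work in a throwaway directory with invented file names
workdir=$(mktemp -d)
printf 'genotypes\n' > "$workdir/original.txt"

# Create the link; no data is copied
ln -s "$workdir/original.txt" "$workdir/link.txt"

content=$(cat "$workdir/link.txt")       # content comes from the original file
target=$(readlink "$workdir/link.txt")   # path the link points to
expected_target="$workdir/original.txt"

echo "link content: $content"
echo "link target:  $target"
rm -rf "$workdir"
```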

The PLINK binary format consists of 3 files:

{| class="wikitable"
| '''file'''
| '''description'''
|-
| world.bed
| genotype data in binary format ('''not to be confused with genomic intervals bed file but it is confusing''')
|-
| world.bim
| metadata for the variants, 1 line per variant
|-
| world.fam
| metadata for the samples, 1 line per sample
|}

We also have the following files that contain extra information:

{| class="wikitable"
| '''file'''
| '''description'''
|-
|world.cluster
| pre-defined population groupings for samples (for plink)
|-
| world.sampleInfo.txt
| additional sample metadata (for plotting etc)
|}

Let us explore the metadata files:

<pre>
head world.fam
head world.bim
head world.cluster
head world.sampleInfo.txt
</pre>

'''Q7:''' How many samples / SNPs are in our dataset?

'''Q8:''' what populations are in our reference panel and what sample size do they have? (trick: skip the header using "tail -n+2"; you need "sort" and "uniq", which prints one instance per repeated line; to tell "uniq" to count and print how many times each line was repeated, use "-c")
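
The trick can be tried on a small made-up table first. In this sketch the sample names and the two-column layout are invented for illustration; check with head which column holds the population in the real file.

```shell
# Invented sample table: header line, then sample name and population
counts=$(printf 'sample\tpopulation\ns1\tHan\ns2\tYoruba\ns3\tHan\n' \
  | tail -n +2 \
  | cut -f2 \
  | sort \
  | uniq -c \
  | sort -rn)

echo "$counts"

# Most frequent population and its count, with uniq's leading spaces stripped
top=$(echo "$counts" | head -n1 | awk '{print $2, $1}')
```

Sorting before uniq matters because uniq only collapses identical lines that are adjacent.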

Calculate basic summary statistics (a simple description of the data) for the dataset:

<pre>
/home/ctools/plink-1.9/plink --bfile world --missing --out world
</pre>

'''Q9:''' are you getting the same number of variants and individuals as you did via UNIX command lines?

The world.imiss file lists the number and fraction of missing genotypes for each sample

'''Q10:''' what fraction of SNPs have a missing genotype for the Tyrolean Iceman?

<H3>Genotype and merge an ancient individual</H3>

In this section, we will merge our ancient data with the reference panel to prepare our dataset for downstream analysis. Genotypes for our ancient data will be obtained by randomly sampling a read from the alignments (BAM files) at the reference dataset SNP positions.

We are going to use a low-coverage individual from [https://pubmed.ncbi.nlm.nih.gov/26062507/ Allentoft et al (RISE507)]. This data was obtained from an ~5100-year-old individual from the Early Bronze Age [https://en.wikipedia.org/wiki/Afanasievo_culture Afanasievo culture] in the Altai Mountains region.

<pre>
ls /home/projects/22126_NGS/exercises/adna/02_popgen/bam/
</pre>

First, we need to extract a genomic interval bed file for the SNP positions of the reference panel:
<pre>
awk '{print $1"\t"($4-1)"\t"$4}' world.bim | gzip > world.snps.bed.gz
</pre>

awk is a command to create small programs. In this example, we tell it to print the first column, the fourth column minus 1 and the fourth column again.
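
To see what the awk program does, here it is applied to two made-up .bim lines (the SNP names and positions are invented). BED intervals are 0-based and half-open, which is why the start is the position minus one.

```shell
# Invented .bim lines: chromosome, SNP id, genetic distance, position, allele 1, allele 2
bed=$(printf '1\trs123\t0\t1000\tA\tG\n1\trs456\t0\t2500\tC\tT\n' \
  | awk '{print $1"\t"($4-1)"\t"$4}')

echo "$bed"
first=$(echo "$bed" | head -n1)
```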

Inspect the results:

<pre>
zcat world.snps.bed.gz | head
</pre>

Create a read pileup file for the reference panel SNP positions (might take a few minutes)

<pre>
samtools mpileup -f /home/databases/references/human/hs37d5.fa -B -l world.snps.bed.gz /home/projects/22126_NGS/exercises/adna/02_popgen/bam/RISE507.sort.rmdup.realign.md.bam |gzip > RISE507.pileup.gz
</pre>

Examine the output:

<pre>
zcat RISE507.pileup.gz |head
</pre>

'''Q11''': how many SNPs of the reference panel are covered in RISE507?

Now we will randomly sample a DNA fragment at each position and output the results in VCF format (custom python script):
<pre>
zcat RISE507.pileup.gz | python2 /home/projects/22126_NGS/exercises/adna/02_popgen/get_haploid_vcf_from_pileup.py -r -s RISE507 |bgzip -c > RISE507.vcf.gz
</pre>
This is done because the coverage is insufficient to ensure proper genotyping.
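
The idea of sampling one read per site can be sketched in awk. This is not the course's Python script, only an illustration, and it assumes a simplified pileup in which the bases column contains only the letters A, C, G and T (real pileups also use symbols such as . and , for reference matches). The positions and bases are invented.

```shell
# Simplified pileup columns: chromosome, position, reference, depth, bases, qualities
sampled=$(printf '1\t1000\tA\t3\tGAg\tIII\n1\t2500\tC\t1\tT\tI\n' \
  | awk -v seed=42 'BEGIN { srand(seed) }
      {
        bases = toupper($5)                    # ignore strand (upper/lower case)
        i = int(rand() * length(bases)) + 1    # pick one read at random
        print $1 "\t" $2 "\t" substr(bases, i, 1)
      }')

echo "$sampled"
```

Each site ends up with a single allele, so the sample is treated as haploid at every position.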

Let us inspect the result:
<pre>
zcat RISE507.vcf.gz |grep -v "^#" |head
</pre>

We convert to plink binary format:
<pre>
/home/ctools/plink-1.9/plink --vcf RISE507.vcf.gz --make-bed --double-id --out RISE507
</pre>

Try to merge the sample with the reference panel:
<pre>
/home/ctools/plink-1.9/plink --bfile world --bmerge RISE507 --out RISE507.merge
</pre>

You should get an error.

'''Q12:''' How many SNPs failed the merge? What is the likely reason?

We will remove the failing SNPs and try again:
<pre>
/home/ctools/plink-1.9/plink --bfile RISE507 --exclude RISE507.merge.missnp --make-bed --out RISE507.merge2
/home/ctools/plink-1.9/plink --bfile world --bmerge RISE507.merge2 --out RISE507.world
</pre>

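Make a cluster file for subsetting. Each line contains the family ID, the individual ID and the cluster name, for which we reuse the family ID: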
<pre>
awk '{print $1,$2,$1}' RISE507.world.fam > RISE507.world.cluster
</pre>

<H3>Investigate the genetic affinities of the ancient sample using PCA</H3>

In this section, we will try to place our sample within a PCA of a set of modern and ancient individuals.

First, we will have a look at the modern populations in the reference panel:
<pre>
/home/ctools/plink-1.9/plink --bfile RISE507.world --keep-clusters modern.poplist --within RISE507.world.cluster --pca header tabs --out modern
</pre>

We can plot the first two principal components using the custom R script plotPca.R.

The three positional arguments are the eigenvector file, the sample info file and the prefix for the output:

<pre>
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R modern.eigenvec world.sampleInfo.txt modern
evince modern.pca.plot.pdf &
</pre>

'''Q13:''' Which populations are most differentiated along PC1?

'''Q14:''' Which populations are most differentiated along PC2?

We repeat the exercise on a subset of European populations:

<pre>
/home/ctools/plink-1.9/plink --bfile RISE507.world --keep-clusters eur.poplist --within RISE507.world.cluster --pca header tabs --out eur
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R eur.eigenvec world.sampleInfo.txt eur
evince eur.pca.plot.pdf &
</pre>

'''Q15:''' Which populations are most differentiated along PC1?

'''Q16:''' Which populations are most differentiated along PC2?

Now, let us examine how the ancient individuals cluster compared to the modern ones:

<pre>
/home/ctools/plink-1.9/plink --bfile RISE507.world --pca header tabs --out ancient.world
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.world.eigenvec world.sampleInfo.txt ancient.world
evince ancient.world.pca.plot.pdf &
</pre>

Here are some references if you want to read more about the different ancient samples:

{| class="wikitable"
| '''sample'''
| '''link'''
|-
| UstIshim
| [https://en.wikipedia.org/wiki/Ust%27-Ishim_man]
|-
| Loschbour
| [https://en.wikipedia.org/wiki/Loschbour_man] [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4170574/]
|-
| Brana
| [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4269527/]
|-
| NE1
| [https://www.pnas.org/content/113/2/368]
|-
|Stuttgart
| [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4170574/]
|-
| Iceman
| [https://www.iceman.it/en/the-iceman/]
|-
|Karelia
| [https://en.wikipedia.org/wiki/Karelians]
|-
| Samara
| [https://en.wikipedia.org/wiki/Samara_culture]
|-
| MA1
| [https://en.wikipedia.org/wiki/Mal%27ta%E2%80%93Buret%27_culture]
|-
| RISE507
|[https://pubmed.ncbi.nlm.nih.gov/26062507/]
|}

'''Q17:''' Which ancient individuals do not cluster close to any modern individuals? What could be a plausible reason?

Repeat the exercise but remove the non-European modern individuals:

<pre>
/home/ctools/plink-1.9/plink --bfile RISE507.world --within RISE507.world.cluster --remove-clusters noneur.poplist --pca header tabs --out ancient.eur
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.eur.eigenvec world.sampleInfo.txt ancient.eur
evince ancient.eur.pca.plot.pdf &
</pre>

'''Q18:''' Which populations are most differentiated along PC1? What could be a plausible reason?

As a final exercise, we now project the ancient individual on PCs inferred from modern Europeans:

<pre>
/home/ctools/plink-1.9/plink --bfile RISE507.world --within RISE507.world.cluster --pca-clusters eur.poplist --remove-clusters noneur.poplist --pca header tabs --out ancient_proj.eur --maf 0.01
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient_proj.eur.eigenvec world.sampleInfo.txt ancient_proj.eur
evince ancient_proj.eur.pca.plot.pdf &
</pre>
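Projection means that the principal axes are estimated from one set of samples only (here the modern Europeans), and the remaining samples are then placed on those axes without influencing them. The Python 3 sketch below illustrates this for a single axis on a made-up genotype matrix. It is a simplification and not how PLINK is implemented: the function names are hypothetical, genotypes are only centered rather than standardized, missing data are not handled, and the axis is found by simple power iteration.

```python
def first_principal_axis(rows, n_iter=100):
    """Return (column means, first principal axis) of the reference rows."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / float(n) for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Arbitrary non-zero starting vector; power iteration on X^T X then
    # converges to the direction of largest variance (this assumes the
    # start is not orthogonal to that direction and the data vary).
    axis = [float(j + 1) for j in range(m)]
    for _ in range(n_iter):
        scores = [sum(c[j] * axis[j] for j in range(m)) for c in centered]
        new = [sum(centered[i][j] * scores[i] for i in range(n))
               for j in range(m)]
        norm = sum(x * x for x in new) ** 0.5
        axis = [x / norm for x in new]
    return means, axis


def project(row, means, axis):
    """Coordinate of one sample on the axis, using the reference means."""
    return sum((row[j] - means[j]) * axis[j] for j in range(len(row)))


# Made-up genotypes (0/1/2 copies of an allele) at three SNPs
reference = [[0, 0, 1], [2, 2, 1], [0, 0, 1], [2, 2, 1]]
ancient = [2, 1, 1]   # not used to estimate the axis

means, axis = first_principal_axis(reference)
print([round(project(r, means, axis), 3) for r in reference])
print(round(project(ancient, means, axis), 3))
```

The projected sample receives a coordinate on the reference axis, but adding or removing projected samples never changes the axis itself. This is why a low-coverage ancient sample cannot distort the PCs in the projection setup.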

'''Q19:''' Where does our study individual cluster now?

'''Q20:''' How do you explain that an individual found close to the modern-day Chinese border is genetically closer to modern Europeans than to the Han Chinese?

Please find answers [[Ancient_DNA_exercise_answers|here]]

Denovo solution

2024-03-19T15:35:07Z

WikiSysop: Created page.

Q1. Illumina

Q1A. The discarded file contains reads that are too short, the pair1 and pair2 files contain the read pairs where both reads passed trimming, and the singleton file contains reads where the other read of the pair was discarded.

Q2. Around 84

Q3. N = (M*L)/(L-K+1) = (84*99)/(99-15+1) = 97.84

Genome_size = T/N = (213721367+212523694)/97.84 ≈ 4.36 Mb
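The arithmetic can be checked with a few lines of Python 3. The function and variable names are chosen for illustration only; the numbers are the ones used in the answer above (k-mer peak M = 84, read length L = 99, k-mer size K = 15, and T as the total number of bases in the two read files).

```python
def kmer_genome_size(total_bases, peak_depth, read_length, k):
    """Estimate base coverage N and genome size from a k-mer depth peak."""
    # N = M * L / (L - K + 1): convert k-mer depth to base coverage
    coverage = peak_depth * read_length / float(read_length - k + 1)
    # Genome_size = T / N
    return coverage, total_bases / coverage


total = 213721367 + 212523694
coverage, genome_size = kmer_genome_size(total, peak_depth=84,
                                         read_length=99, k=15)
print("coverage N = %.2f" % coverage)
print("genome size = %.2f Mb" % (genome_size / 1e6))
```

A higher k-mer peak M gives a higher N and therefore a smaller genome size estimate, which is the relationship used in the answer to Q5.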

Q4. Mean = 259 ; SD = 11

Q5. It is lower. This means that the actual k-mer peak is higher than the one we found (unless you found one higher than 84), since a higher peak would give a lower genome size.

Q6. Only 10 of 195 contigs were joined into scaffolds. This is quite few; normally the number is much higher. A reason for this could be that our insert size is quite low (~250 bp) and the repeats in the genome are larger than this.

Q7. Repeat regions

Q8. Contaminations

Q9. Because we use the reference genome as the truth it may be hard to distinguish what is a misassembly and what is true variation from the reference genome.

Q10. This is of course just a visual assessment, but it seems that most of the reference genome is covered by our assembly, so yes.

Q11. Yes, a couple of the small contigs do not map at all, and C1097 only maps partially. These could be sequences that are present in our strain but not in the reference genome.

Q12. This is a region with a lot of repeats, which is also why we cannot really assemble it. It is used by V. cholerae to integrate new genes into its genome.

Q13. The Nanopore assembly has only 2 contigs and the PacBio assembly only 1!