22126 - User contributions [en]

SNP calling exercise part 2

2026-01-17T16:20:11Z

Mick:

<h2>Filtering</h2>

We have seen that the VCF contains some low-quality or unreliable variant calls. Before downstream analyses, we generally want to remove poor-quality sites or annotate them so they can be excluded later. In this exercise we explore how to apply hard filters and how to remove variants in regions of poor mappability.

Please use the VCF file generated in Part 1.

<h3>Hard Filtering</h3>

Soft filtering approaches (e.g. VQSR) attempt to statistically learn which variants are “true.” However, these approaches require large cohorts or population-level resources, which may not exist for many organisms or under-sampled populations. For this reason, we often fall back on hard filtering, i.e. applying fixed cutoffs.

Hard filtering is simple but may introduce bias if the filter correlates with variant type (e.g. heterozygous sites often have lower depth). Filters should be chosen thoughtfully.

We will use the following genomic mask file:

<pre>
/home/databases/databases/GRCh38/mask99.bed.gz
</pre>

This file is in the BED interval format, which stores genomic regions as:

<pre>
chromosome start(0-based) end(1-based)
</pre>

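<ul>
<li>0-based: first base has coordinate 0</li>
<li>1-based: first base has coordinate 1</li>
<li>With a 0-based start and a 1-based end, the first 100 bases of a chromosome are written as start 0, end 100, and the interval length is simply end minus start</li>
</ul>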
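To make the two conventions concrete, the short sketch below converts a BED interval into the 1-based, inclusive coordinates used by the POS column of a VCF. The function name <code>bed_to_one_based</code> and the example interval are made up for illustration.

```shell
# Convert one BED interval (0-based start, 1-based end) into 1-based
# inclusive coordinates. Function name and values are hypothetical.
bed_to_one_based() {
    # Only the start shifts by one; the BED end already equals the
    # 1-based coordinate of the last base in the interval.
    echo "$1:$(( $2 + 1 ))-$3"
}

# The first 100 bases of chr1 are written "chr1 0 100" in BED.
bed_to_one_based chr1 0 100      # prints chr1:1-100
```

A single base at 1-based position 1000 would therefore appear in BED as start 999, end 1000.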

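This mask contains genomic regions to exclude (often low-quality or repetitive regions). Most genotypers cannot tell that reads piling up in a duplicated region may originate from several genomic copies, so hard filters alone will not catch these calls. It is therefore good practice to combine hard filtering with a mappability filter.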

A typical hard-filtering command using GATK is:

<pre>
/home/ctools/gatk-4.6.2.0/gatk VariantFiltration \
-V [INPUT VCF] \
-O [OUTPUT VCF] \
-filter "DP < 10.0" --filter-name "DP" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40"
</pre>

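Explanation of filters (<code>VariantFiltration</code> does not delete failing sites; it writes the name of each failed filter into the FILTER column, and sites that fail none are marked <code>PASS</code>):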

<table class="wikitable">
<tr><th>Filter</th><th>Meaning</th></tr>

<tr>
<td><code>DP < 10</code></td>
<td>Flag sites with less than 10× coverage</td>
</tr>

<tr>
<td><code>QUAL < 30</code></td>
<td>Flag sites where the variant quality is below 30 (variant QUAL is not the same as genotype quality GQ; see [https://gatk.broadinstitute.org/hc/en-us/articles/360035531392 Variant QUAL vs GQ])</td>
</tr>

<tr>
<td><code>SOR > 3.0</code></td>
<td>Flag sites with strong strand bias as measured by the strand odds ratio
([https://gatk.broadinstitute.org/hc/en-us/articles/360036361772 StrandOddsRatio])</td>
</tr>

<tr>
<td><code>FS > 60</code></td>
<td>Flag variants that fail the Fisher strand bias test
([https://gatk.broadinstitute.org/hc/en-us/articles/360036361992 FisherStrand])</td>
</tr>

<tr>
<td><code>MQ < 40</code></td>
<td>Flag sites where the root-mean-square mapping quality of the reads is below 40</td>
</tr>

</table>

Note: No filter is perfect — you should progressively add filters, evaluate their impact, and ensure that you do not introduce unwanted biases.
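As a rough illustration of how the five thresholds combine, the sketch below applies them to a small table of invented annotation values using <code>awk</code>. It is not how GATK is implemented and it ignores details such as missing annotations; the site names and numbers are hypothetical.

```shell
# Toy per-site annotation values (all invented): ID, DP, QUAL, SOR, FS, MQ.
sites=$(mktemp)
{
  printf 'site1\t35\t250.0\t0.8\t2.1\t60.0\n'
  printf 'site2\t6\t45.0\t1.2\t0.0\t60.0\n'
  printf 'site3\t8\t20.0\t3.5\t1.0\t38.0\n'
  printf 'site4\t40\t500.0\t0.9\t75.3\t59.0\n'
} > "$sites"

# Test every threshold and collect the names of the filters that fail,
# joined by ";" as in the FILTER column; sites failing none get PASS.
labels=$(awk -F'\t' '{
  f = ""
  if ($2 + 0 < 10)   f = f ";DP"
  if ($3 + 0 < 30)   f = f ";QUAL30"
  if ($4 + 0 > 3.0)  f = f ";SOR3"
  if ($5 + 0 > 60.0) f = f ";FS60"
  if ($6 + 0 < 40.0) f = f ";MQ40"
  if (f == "") f = "PASS"
  else f = substr(f, 2)
  print $1 "\t" f
}' "$sites")
printf '%s\n' "$labels"
rm -f "$sites"
```

Note that a single site can fail several filters at once, which is why the FILTER column may hold more than one name.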

<h4>Q1</h4>
How many sites were filtered out?
Sites that pass all filters have <code>PASS</code> in the 7th column. Use <code>grep</code> to count PASS vs non-PASS entries.

<h4>Q2</h4>
The 7th column contains the name(s) of the filters that failed.
Using <code>cut</code>, <code>sort</code>, and <code>uniq -c</code>, determine which filter removed the most sites.
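The general pattern behind Q1 and Q2 can be tried on a tiny, invented VCF before running it on the real file. The sketch below assumes tab-separated records with the FILTER value in column 7; the records themselves are hypothetical.

```shell
# Toy VCF (records invented); FILTER is the 7th tab-separated column.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr20\t1000\t.\tA\tG\t250\tPASS\tDP=35\n'
  printf 'chr20\t2000\t.\tC\tT\t45\tDP\tDP=6\n'
  printf 'chr20\t3000\t.\tG\tA\t20\tDP;QUAL30\tDP=8\n'
  printf 'chr20\t4000\t.\tT\tC\t500\tPASS\tDP=40\n'
  printf 'chr20\t5000\t.\tG\tC\t90\tDP\tDP=7\n'
  printf 'chr20\t6000\t.\tA\tT\t60\tDP\tDP=9\n'
} > "$vcf"

# Drop header lines and keep only the FILTER column.
filters=$(grep -v '^#' "$vcf" | cut -f7)

# Count records that pass and records that fail at least one filter.
n_pass=$(printf '%s\n' "$filters" | grep -c '^PASS$')
n_fail=$(printf '%s\n' "$filters" | grep -vc '^PASS$')
echo "PASS: $n_pass, filtered: $n_fail"

# Tally each distinct FILTER value, most frequent first.
printf '%s\n' "$filters" | sort | uniq -c | sort -k1,1nr

# Combined values such as "DP;QUAL30" can be split so that every
# filter name is counted on its own.
per_filter=$(printf '%s\n' "$filters" | grep -v '^PASS$' | tr ';' '\n' | sort | uniq -c | sort -k1,1nr)
printf '%s\n' "$per_filter"
rm -f "$vcf"
```

For a compressed VCF, the same pipeline works if the file is first streamed through a decompressor such as <code>zcat</code>.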

<hr>

<h3>Filtering by Mappability</h3>

Next, we remove variants that fall inside low-mappability regions, because reads cannot be uniquely mapped there and false positives are common.

Use bedtools intersect to retain only variants located in high-mappability intervals (≥99% unique mappability):

<pre>
bedtools intersect -header \
-a [INPUT VCF] \
-b /home/databases/databases/GRCh38/filter99.bed.gz \
| /home/ctools/htslib-1.20/bgzip -c > [OUTPUT VCF]
</pre>

Name your output:

<pre>
NA24694_hf_map99.vcf.gz
</pre>

The "99" refers to the percentage of synthetic reads that map uniquely at that position: the retained intervals are those where at least 99% of such reads map uniquely.
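The overlap rule that this step relies on can be sketched with <code>awk</code> for single-base variants: a record is kept when its 1-based POS lies inside a BED interval, that is when start < POS ≤ end. This is only an illustration of the coordinate logic on invented data, not a replacement for bedtools, and it would be far too slow for a whole genome.

```shell
# Toy intervals to retain (BED) and toy variants (VCF); all values invented.
bed=$(mktemp)
vcf=$(mktemp)
printf 'chr20\t0\t1500\nchr20\t4000\t6000\n' > "$bed"
{
  printf '#CHROM\tPOS\tID\tREF\tALT\n'
  printf 'chr20\t1000\t.\tA\tG\n'
  printf 'chr20\t1500\t.\tC\tT\n'
  printf 'chr20\t1501\t.\tG\tA\n'
  printf 'chr20\t4000\t.\tT\tC\n'
  printf 'chr20\t4001\t.\tT\tC\n'
} > "$vcf"

kept=$(awk -F'\t' '
  # First file (BED): remember every interval.
  NR == FNR { n++; chrom[n] = $1; start[n] = $2 + 0; stop[n] = $3 + 0; next }
  # Second file (VCF): always keep header lines.
  /^#/ { print; next }
  # Keep a record when its 1-based POS lies inside an interval,
  # that is start < POS <= end.
  {
    pos = $2 + 0
    for (i = 1; i <= n; i++)
      if ($1 == chrom[i] && pos > start[i] && pos <= stop[i]) { print; break }
  }
' "$bed" "$vcf")
printf '%s\n' "$kept"

# Number of variant records that survive.
n_kept=$(printf '%s\n' "$kept" | grep -vc '^#')
echo "kept: $n_kept"
rm -f "$bed" "$vcf"
```

In this toy example position 1500 is kept because it is the last base of the first interval, while position 4000 is dropped because the second interval only begins at 1-based position 4001.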

<h4>Q3</h4>
How many variants remain after removing low-mappability regions?

<hr>

<h2>Annotation of Variants</h2>

Next, we examine the genomic context of variants: intronic, exonic, intergenic, UTR, etc. We use snpEff for variant annotation.

<pre>
java -jar /home/ctools/snpEff/snpEff.jar eff \
-dataDir /home/databases/databases/snpEff/ \
-htmlStats [OUTPUT HTML] \
GRCh38.99 \
[INPUT VCF] \
| /home/ctools/htslib-1.20/bgzip -c > [OUTPUT VCF]
</pre>

<ul>
<li><code>-dataDir</code>: location of snpEff databases</li>
<li><code>GRCh38.99</code>: genome version — must match the reference genome you used earlier</li>
</ul>

Run <code>snpEff</code> on your hard-filtered VCF (before mappability filtering).
This produces:

<ul>
<li>HTML report: <code>NA24694_hf.html</code></li>
<li>Annotated VCF: <code>NA24694_hf_ann.vcf.gz</code></li>
</ul>
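Besides the HTML report, the annotations can be inspected directly in the VCF. The sketch below assumes snpEff's default output, where each record carries an <code>ANN=</code> entry in the INFO column made of pipe-separated subfields beginning with allele, effect, impact and gene name. The records and gene names are invented, and the ANN entries are truncated to four subfields for brevity.

```shell
# Toy annotated VCF; records, gene names and shortened ANN entries are invented.
ann_vcf=$(mktemp)
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr20\t1000\t.\tA\tG\t250\tPASS\tDP=35;ANN=G|intron_variant|MODIFIER|GENE1\n'
  printf 'chr20\t2000\t.\tC\tT\t90\tPASS\tDP=22;ANN=T|missense_variant|MODERATE|GENE1,T|upstream_gene_variant|MODIFIER|GENE2\n'
  printf 'chr20\t3000\t.\tG\tA\t80\tPASS\tDP=18;ANN=A|intron_variant|MODIFIER|GENE2\n'
} > "$ann_vcf"

# INFO is column 8. Split it into key=value pairs, keep the ANN entry,
# take the first annotation of each variant and extract its effect
# (the second pipe-separated subfield), then tally the effects.
effects=$(grep -v '^#' "$ann_vcf" | cut -f8 | tr ';' '\n' | grep '^ANN=' \
  | cut -d',' -f1 | cut -d'|' -f2 | sort | uniq -c | sort -k1,1nr)
printf '%s\n' "$effects"
rm -f "$ann_vcf"
```

Only the first annotation per variant is counted here; a variant overlapping several transcripts or genes carries several comma-separated annotations.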

Viewing the snpEff HTML report:

If you are using MobaXterm, you can open the HTML file directly from the
left-hand file panel.

If you are using macOS (or a standard terminal), copy the HTML file to
your local computer and open it in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/NA24694_hf.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on.

<h4>Q4</h4>
Which genomic region category contains the most variants (exon, intron, upstream, downstream, UTR, etc.)?

<h4>Q5</h4>
How many variants are predicted to cause a codon change?
See explanations at: [https://en.wikipedia.org/wiki/Point_mutation Point mutation]

<hr>

Please find answers here: [[SNP_calling_exercise_part_2_answers]]

Congratulations — you finished the exercise!

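Note: When piping <code>bcftools view</code> into other tools, consider specifying the output type using:

<pre>
-O {b|u|z|v}
</pre>

The letters select compressed BCF (<code>b</code>), uncompressed BCF (<code>u</code>), compressed VCF (<code>z</code>) and uncompressed VCF (<code>v</code>). Choosing an uncompressed type for intermediate steps (for example <code>-Ou</code> when the next tool is also bcftools) avoids unnecessary compression/decompression and speeds up workflows.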

Program 2026

2026-01-15T14:36:26Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and for collaboration with your project partners. Rather than emailing questions to the teaching staff, please post them on Discord.

The course has two main parts: the first half is lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secure laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>
</DL>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DL>
<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/7/77/Microbial_genomics_course22126_slide1-30_compressed.pdf Lecture slides 1-30])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-11:10am</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:10am-12:00pm</DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/6/6d/Microbial_genomics_course22126_slide31-75_compressed.pdf Lecture slides 31-75])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:55pm</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00pm-2:55pm</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00pm-3:55pm</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>Metabarcoding ([https://teaching.healthtech.dtu.dk/material/22126/2026/Metabarcoding_slides.pdf Lecture slides])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: Metabarcoding ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation: prepare an outline for tomorrow. Please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation: check the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet] to see when your 3 minutes are</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Program 2026

2026-01-15T14:35:12Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post them on Discord.

The course has two main parts: the first half is lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secured laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/7/77/Microbial_genomics_course22126_slide1-30_compressed.pdf Lecture slides 1-30])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-11:10am</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:10am-12:00pm</DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/6/6d/Microbial_genomics_course22126_slide31-75_compressed.pdf Lecture slides 31-75])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:55pm</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00pm-2:55pm</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00pm-3:55pm</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>Metabarcoding ([https://teaching.healthtech.dtu.dk/material/22126/2026/Metabarcoding_slides.pdf Lecture slides])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: Metabarcoding ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation, prepare an outline for tomorrow. Please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation, check when your 3 minutes are in the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Program 2026

2026-01-15T14:34:23Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post them on Discord.

The course has two main parts: the first half is lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secured laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/7/77/Microbial_genomics_course22126_slide1-30_compressed.pdf Lecture slides 1-30])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-11:10am</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:10am-12:00pm</DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/6/6d/Microbial_genomics_course22126_slide31-75_compressed.pdf Lecture slides 31-75])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:55pm</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00pm-2:55pm</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00pm-3:55pm</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>Metabarcoding ([https://teaching.healthtech.dtu.dk/material/22126/2026/Metabarcoding_slides.pdf Lecture slides])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation, prepare an outline for tomorrow. Please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation, check when your 3 minutes are in the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Program 2026

2026-01-14T12:20:48Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post them on Discord.

The course has two main parts: the first half is lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secured laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/7/77/Microbial_genomics_course22126_slide1-30_compressed.pdf Lecture slides 1-30])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-11:10am</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:10am-12:00pm</DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/6/6d/Microbial_genomics_course22126_slide31-75_compressed.pdf Lecture slides 31-75])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:55pm</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00pm-2:55pm</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00pm-3:55pm</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation: prepare an outline for tomorrow. Please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation: check when your 3 minutes are in the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Postprocess exercise

2026-01-13T16:50:56Z

Mick:

<h2>Overview</h2>

In this exercise, you will perform essential post-alignment processing on BAM files to prepare them for reliable variant calling. Raw aligned BAM files often contain artifacts that can lead to false variants if not handled correctly. Today you will:

<ol>
<li>Mark duplicate reads in BAM files</li>
<li>Examine the effect of duplicate marking on read interpretation</li>
<li>Merge multiple sequencing libraries from the same individual into a single BAM file</li>
</ol>

First:
<ol>
<li>Navigate to your home directory</li>
<li>Create a directory called <code>postalign</code></li>
<li>Enter the <code>postalign</code> directory</li>
</ol>

<hr>

<h2>Duplicate Marking</h2>

We will work with data from a Han Chinese individual (HG00418), sequenced to approximately 40× coverage using Illumina paired-end sequencing. For speed, we only use reads mapping to chromosome 20 and only two sequencing libraries.

Library 1 BAM file:
<pre>
/home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam
</pre>

The file is already trimmed, aligned, and sorted.

We will mark duplicate reads using Picard MarkDuplicates. The general command is:

<pre>
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
-I [input.bam] \
-M [metrics.txt (this is part of the output, call it what you wish)] \
-O [output.bam]
</pre>

Suggested output name: <code>ERR016028_chr20_sort_markdup.bam</code>

Q1: After running Picard, how many reads were marked as duplicates?
(Hint: this number is printed in the Picard metrics output file.)
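Picard metrics files are tab-separated, and the column names and their values sit on two long lines that are awkward to read in a terminal. As a minimal sketch, the helper below prints each column name next to its value. It assumes the usual Picard layout (a line starting with <code>## METRICS CLASS</code>, then a header row, then a data row); the function name <code>picard_metrics_table</code> is made up for this example, and only the first data row is shown.

```shell
# Hypothetical helper: print "column<TAB>value" pairs from a Picard metrics file.
# Assumes the header row directly follows the "## METRICS CLASS" line.
picard_metrics_table() {
    awk -F'\t' '
        /^## METRICS CLASS/ { grab = 1; next }                   # header row comes next
        grab == 1 { n = split($0, name, "\t"); grab = 2; next }  # remember column names
        grab == 2 && NF > 0 {                                    # first data row
            for (i = 1; i <= n; i++) print name[i] "\t" $i
            exit
        }
    ' "$1"
}
```

For example, <code>picard_metrics_table metrics.txt</code> lists one metric per line, which makes it easier to spot the duplicate counts.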

<hr>

<h3>Inspecting the Effect of Duplicate Marking</h3>

To view reads in a specific genomic region, use:

<pre>
samtools view [input.sorted.bam] [chrom]:[start]-[end]
</pre>

The BAM file must be indexed. Picard preserves sorting, so you do not need to re-sort it.
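Region queries only work when an index file sits next to the BAM. The sketch below checks for one before you call <code>samtools view</code>. The helper name <code>index_status</code> is hypothetical, and the file names it looks for (<code>.bam.bai</code>, <code>.bai</code>, <code>.bam.csi</code>) are assumed to be the conventional index names.

```shell
# Hypothetical helper: report whether a BAM file has an index next to it.
# Assumes conventional index names: file.bam.bai, file.bai or file.bam.csi.
index_status() {
    bam=$1
    if [ -e "$bam.bai" ] || [ -e "${bam%.bam}.bai" ] || [ -e "$bam.csi" ]; then
        echo "index present"
    else
        echo "index missing"
    fi
}

# Example use (file name taken from the suggested output name):
#   index_status ERR016028_chr20_sort_markdup.bam
#   samtools index ERR016028_chr20_sort_markdup.bam   # only needed if the index is missing
```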

Inspect reads in the following region for both the original file and your duplicate-marked file:

<pre>
chr20:45996339-45996839
</pre>

Identify the two reads:
<ul>
<li><code>ERR016028.5947720</code></li>
<li><code>ERR016028.18808080</code></li>
</ul>

Q2: Why did MarkDuplicates consider these reads to be duplicates?

Q3: Which of the two reads was marked as a duplicate, and how can you tell from the SAM flag or tags?
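The second column of each SAM record is the FLAG, an integer in which each bit describes one property of the read. A minimal sketch for decoding it with plain shell arithmetic follows. The bit meanings are those of the SAM specification, while the function name <code>decode_flag</code> and the short labels are made up for this example. <code>samtools flags</code> offers similar functionality.

```shell
# Hypothetical helper: list the properties encoded in a SAM FLAG value.
# Labels are in bit order: 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048.
decode_flag() {
    flag=$1
    bit=1
    for name in paired proper_pair unmapped mate_unmapped reverse mate_reverse \
                read1 read2 secondary qc_fail duplicate supplementary; do
        # Print the label if this bit is set in the flag
        if [ $(( flag & bit )) -ne 0 ]; then
            echo "$name"
        fi
        bit=$(( bit * 2 ))
    done
}
```

Running <code>decode_flag</code> on the FLAG values of the two reads, in both the original and the duplicate-marked file, shows which property changed.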

<hr>

<h2>Merging BAM Files</h2>

Often, multiple sequencing libraries (or sequencing runs) exist for the same biological sample. Before variant calling, these must be merged into a single BAM file.

You will merge:

<ul>
<li>Your duplicate-marked file</li>
<li>The second library file:</li>
</ul>

<pre>
/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam
</pre>

Run:

<pre>samtools</pre>

and find the command capable of merging multiple BAM files while preserving read groups and writing an index automatically. The command should keep the file sorted and generate the <code>.bai</code> index in the same step.

Q4: Which <code>samtools</code> command performs merging, with options to keep read groups and write the index?

Use the options:

<pre>
-c --write-index
</pre>

Your merged BAM file should be named:

<pre>
HG00418_chr20_sort_markdup.bam
</pre>

Inspect your merged file using:

<pre>
samtools view HG00418_chr20_sort_markdup.bam | less -S
</pre>

Q5: Which SAM/BAM field indicates the sample or library of origin for each read?

Q6: What is the term for pooling multiple samples together into a single sequencing run?

Q7: What is the computational step where we separate pooled reads back into individual samples?

<hr>

You can find the answers [[Postprocess_exercise_answers|here]]

Congratulations—you have completed the exercise!

Postprocess exercise

2026-01-13T16:50:20Z

Mick:

<h2>Overview</h2>

In this exercise, you will perform essential post-alignment processing on BAM files to prepare them for reliable variant calling. Raw aligned BAM files often contain artifacts that can lead to false variants if not handled correctly. Today you will:

<ol>
<li>Mark duplicate reads in BAM files</li>
<li>Examine the effect of duplicate marking on read interpretation</li>
<li>Merge multiple sequencing libraries from the same individual into a single BAM file</li>
</ol>

First:
<ol>
<li>Navigate to your home directory</li>
<li>Create a directory called <code>postalign</code></li>
<li>Enter the <code>postalign</code> directory</li>
</ol>

<hr>

<h2>Duplicate Marking</h2>

We will work with data from a Han Chinese individual (HG00418), sequenced to approximately 40× coverage using Illumina paired-end sequencing. For speed, we only use reads mapping to chromosome 20 and only two sequencing libraries.

Library 1 BAM file:
<pre>
/home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam
</pre>

The file is already trimmed, aligned, and sorted.

We will mark duplicate reads using Picard MarkDuplicates. The general command is:

<pre>
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
-I [input.bam] \
-M [metrics.txt (this is part of the output, call it what you wish)] \
-O [output.bam]
</pre>

Suggested output name: <code>ERR016028_chr20_sort_markdup.bam</code>

Q1: After running Picard, how many reads were marked as duplicates?
(Hint: this number is printed in the Picard metrics output file.)

<hr>

<h3>Inspecting the Effect of Duplicate Marking</h3>

To view reads in a specific genomic region, use:

<pre>
samtools view [input.sorted.bam] [chrom]:[start]-[end]
</pre>

The BAM file must be indexed. Picard preserves sorting, so you do not need to re-sort it.

Inspect reads in the following region for both the original file and your duplicate-marked file:

<pre>
chr20:45996339-45996839
</pre>

Identify the two reads:
<ul>
<li><code>ERR016028.5947720</code></li>
<li><code>ERR016028.18808080</code></li>
</ul>

Q2: Why did MarkDuplicates consider these reads to be duplicates?

Q3: Which of the two reads was marked as a duplicate, and how can you tell from the SAM flag or tags?

<hr>

<h2>Merging BAM Files</h2>

Often, multiple sequencing libraries (or sequencing runs) exist for the same biological sample. Before variant calling, these must be merged into a single BAM file.

You will merge:

<ul>
<li>Your duplicate-marked file</li>
<li>The second library file:</li>
</ul>

<pre>
/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam
</pre>

Run:

<pre>samtools</pre>

and find the command capable of merging multiple BAM files while preserving read groups and writing an index automatically. The command should keep the file sorted and generate the <code>.bai</code> index in the same step.

Q4: Which <code>samtools</code> command performs merging, with options to keep read groups and write the index?

Use the options:

<pre>
-c --write-index
</pre>

Your merged BAM file should be named:

<pre>
HG00418_chr20_sort_markdup.bam
</pre>

Inspect your merged file using:

<pre>
samtools view HG00418_chr20_sort_markdup.bam | less -S
</pre>

Q5: Which SAM/BAM field indicates the sample or library of origin for each read?

Q6: What is the term for pooling multiple samples together into a single sequencing run?

Q7: What is the computational step where we separate pooled reads back into individual samples?

<hr>

You can find the answers [[Postprocess_exercise_answers|here]]

Congratulations—you have completed the exercise!

Exercise

2026-01-12T15:41:47Z

Mick:

<h2>Phylogenomics</h2>


David A. Duchene



Marsupials are a group of mammals that are unique to Australasia and the Americas. Several major groups of marsupials first appeared between 50 and 70 million years ago, during events of fast diversification. Given these are ancient and fast events, resolving the relationships among early marsupials is difficult, and remains a matter of interest in mammalian biology.



Today's exercises focus on the most fundamental concepts in phylogenomics, with the aim of resolving longstanding questions of the evolution of Australasian marsupials.



Let's make sense of our current understanding of marsupial relationships by coding them in Newick format.


<hr>

<h2>Exercise 1</h2>


Open R and load the required packages:


<pre>
library(ape, lib.loc = "/home/ctools/Rlibs")
library(strap, lib.loc = "/home/ctools/Rlibs")

</pre>


This is how you create a phylogenetic tree object in R from a Newick string:


<pre>
myTree <- read.tree(text = "WRITE NEWICK HERE")
</pre>


To write this tree, translate the following verbal description of marsupial relationships into Newick format:



The Wallabies are sisters to the Kangaroos, and this broader grouping is sister to the Possums. Sister to all these is the grouping that contains the Koalas and the Wombats. Yet another, separate group of marsupials contains the carnivorous Numbats, whose sister is a group containing the Tasmanian Devil and the now-extinct Tasmanian Tiger. It is hypothesised that the sister to these carnivorous marsupials is a group containing the Marsupial Mole, whose closer sister is a group containing the Bandicoots and the Bilby. Sister to all of the marsupials mentioned so far is the enigmatic American Monito del Monte, and sister yet to all of these are the American Opossums. Finally, the Platypus and the Echidna form a group that is sister to all other mammals.



Make sure you add a semicolon (<code>;</code>) at the end of your tree. Now try to rearrange the names so that the groups are ordered from the least diverse to the most diverse, while keeping the relationships intact.



Compare your tree with the student sitting next to you. Discuss whether the Newick trees are different. Then evaluate whether the relationships in your trees are the same, even if the exact written text string is different.



If you get too many errors, then use:


<pre>
myTree <- read.tree(text = "((Elephant,Armadillo),(((Squirrel,Rabbit),(Monkey,Treeshrew)),(Shrew,(Whale,(Bat,(Cat,Rhinoceros))))));")
</pre>


Now plot your tree into a PDF using two different representations:


<pre>
pdf("myTree.pdf", height = 7, width = 14)
par(mfrow = c(1, 2))
plot(myTree, type = "phylogram")
plot(myTree, type = "unrooted")
dev.off()
</pre>

Q1. Do the two trees in the file contain the same information?
Q2. Can you draw any information from the branch lengths in these trees?
Q3. What information about the timing of each of these divergence events is available in the first tree?
Q4. Which of the two trees might be the most appropriate in cases where you have little prior information about the data set?

<hr>

<h2>Exercise 2</h2>


Load two data alignments, and then open the basic information about them and visualize a small portion:


<pre>
# Read data
unaligned_mars <- read.FASTA("/home/projects/22126_NGS/exercises/phylogenomics/marsupials_unaligned.fasta")
aligned_mars <- as.matrix(read.FASTA("/home/projects/22126_NGS/exercises/phylogenomics/marsupials_aligned.fasta"))

###########################
# Summary of unaligned data
###########################
unaligned_mars

###########################
# Summary of aligned data
###########################
aligned_mars

#########################
# Start of unaligned data
#########################
noquote(do.call(rbind, lapply(as.character(unaligned_mars), `[`, 1:10)))

#########################
# Start of aligned data
#########################
noquote(as.character(aligned_mars)[1:11, 1:10])
</pre>

Q5. What are the primary differences between these two alignments, and why is only one of them suitable for phylogenetic inference?


The following code will remove any alignment sites (columns) with missing data (aka gaps or indels). It then builds basic trees from the complete and filtered alignments using two methods (ordinary least squares, <code>ols</code>, and balanced minimum evolution, <code>bme</code>):


<pre>
# Filter out sites with missing data
filtered_mars <- aligned_mars[, colSums(as.character(aligned_mars) == "-") == 0]

# Make matrices of pairwise distances between taxa
dists_full <- dist.dna(aligned_mars, model = "K80", pairwise.deletion = T)
dists_filt <- dist.dna(filtered_mars, model = "K80")

# Make trees for the two data sets, under two methods each
basicTrees <- list()
basicTrees$full_ols <- fastme.ols(dists_full)
basicTrees$full_bme <- fastme.bal(dists_full)
basicTrees$filt_ols <- fastme.ols(dists_filt)
basicTrees$filt_bme <- fastme.bal(dists_filt)
</pre>

Q6. Before looking at any of the trees, what do you think are the benefits and drawbacks of removing sites with missing data?


Try plotting a few of these trees with the approaches that you used in Exercise 1.



Next, we will look at the total lengths of these trees:


<pre>
lapply(basicTrees, function(x) sum(x$edge.length))
</pre>

Q7. What do these tree lengths measure?
Q8. Why is there a difference between the filtered and unfiltered data sets?


Do not worry at this stage about the differences between the two methods, but if you have time discuss with your partner what the difference is and what it means for your interpretation of the data.


<hr>

<h2>Exercise 3</h2>


From within R, let's run IQ-TREE 3 under two different substitution models, adding statistical supports for the branches (upcoming lecture):


<pre>
# Run maximum-likelihood with a very simple model
system("/home/ctools/iqtree-3.0.1-Linux/bin/iqtree3 -s /home/projects/22126_NGS/exercises/phylogenomics/marsupials_aligned.fasta -m JC -bb 1000 -pre mars_jc")

# Run maximum-likelihood with a more complex model
system("/home/ctools/iqtree-3.0.1-Linux/bin/iqtree3 -s /home/projects/22126_NGS/exercises/phylogenomics/marsupials_aligned.fasta -m GTR+R6 -bb 1000 -pre mars_gtr")
</pre>


Now let's visualise the four trees inferred so far with the three different methods:


<pre>
# Read maximum likelihood trees
mars_jc <- read.tree("mars_jc.treefile")
mars_gtr <- read.tree("mars_gtr.treefile")

# Plot all four inferred trees into a PDF
pdf("marsupial_trees.pdf", height = 10, width = 10)
par(mfrow = c(2, 2))
plot(basicTrees$full_ols, type = "unrooted", main = "Ordinary Least Squares")
plot(basicTrees$full_bme, type = "unrooted", main = "Balanced Min Evolution")
plot(mars_jc, type = "unrooted", main = "Max Likelihood (JC)")
plot(mars_gtr, type = "unrooted", main = "Max Likelihood (GTR)")
dev.off()
</pre>

Q9. Do you think that these methods lead to substantially different results? Lay out a few reasons for your answer.

<hr>

<h2>Exercise 4</h2>


Using the runs from the previous exercise, let's open the <code>.iqtree</code> files from each run (you can then exit by pressing <code>q</code>) and examine some of the details of the analyses.


<pre>
# Output summary for the run with the simple model
system("less mars_jc.iqtree")

# Output summary for the run with the more complex model
system("less mars_gtr.iqtree")
</pre>

Q10. Which of the two models has more parameters (more complexity) and which model has the best BIC score (i.e. the lowest), and what does this tell you about the two models?
Q11. Does one model infer the total tree length to be much greater than the other? Discuss the possible reason for this with the student beside you.
Q12. From the GTR run and “Rate Parameter R”, which pair of nucleotides has the most common type of substitution, and does this tell you anything about the biochemistry of the molecules analysed?
Q13. From the same run and the “Model of rate heterogeneity”, are there any portions of the data that evolve much faster than the rest? Note that, for example, a relative rate of 2 means a portion of the data is evolving twice as fast as the mean.
Q14. From the time stamps at the bottom of this file, did one model take much longer than the other, and what could this mean if you have a very large data set?


Now let's examine the branch supports from one of these runs, using the tree that you loaded previously into R.


<pre>
pdf("mars_branch_supports.pdf")
plot(mars_gtr, type = "unrooted", use.edge.length = F)
nodelabels(mars_gtr$node.label, frame = "circle", bg = "white")
dev.off()
</pre>

Q15. What does this tell you about our overall confidence in marsupial relationships from these data, and which are likely the most difficult relationships to resolve?

<hr>

<h2>Exercise 5</h2>


A previous set of analyses has led to the gene trees for several genomic regions. Read and briefly explore these data in R.


<pre>
mars_trs <- read.tree("/home/projects/22126_NGS/exercises/phylogenomics/marsupials.tree")

# Plot 9 randomly chosen trees from the set
pdf("mars_example_gene_trees.pdf", height = 15, width = 15)
par(mfrow = c(3, 3), mar = c(0.5, 0.5, 0.5, 0.5))
for(i in sample(1:length(mars_trs), 9)) plot(mars_trs[[i]], type = "unrooted", cex = 1.5)
dev.off()
</pre>


Examine the trees in the PDF and determine whether any of them have surprising relationships at deep branches.
Speculate on the possible causes of the discordance (hint: what could be the influence/relevance of the branch lengths?).



Let's use these trees and a fast consensus method of tree inference, and compare the tree with that from maximum likelihood.


<pre>
mars_cons <- consensus(mars_trs, p = 0.5)

pdf("mars_main_trees.pdf", height = 7, width = 14)
par(mfrow = c(1, 2))
plot(mars_gtr, type = "unrooted", main = "Max Likelihood (GTR)")
plot(mars_cons, type = "unrooted", main = "Majority-Rule Consensus")
dev.off()
</pre>

Q16. How would you qualify the signal in the gene trees regarding the early branching events in the marsupial tree, and what do you think were the biological processes that led to this signal?

<hr>

<h2>Exercise 6</h2>


Molecular dating is a difficult and advanced analysis. However, we can sometimes rely on fast methods for very large data sets or exploratory analysis. In the following, we root our tree of the marsupials and provide it to a fast dating method. We apply two calibrations: one for the root (65–90 Mya) and one for the split between Koalas and Wombats (25–55 Mya). In the code, these ages are entered in units of 10 million years (6.5–9 and 2.5–5.5), and the branch lengths are multiplied by 10 afterwards.


<pre>
# Root IQ-TREE inference
mars_tr <- root(mars_gtr, "Opossum", resolve.root = T)

# Perform dating analysis
ctrl <- chronos.control(dual.iter.max = 1000)
cal <- data.frame(node = c(20, 12), age.min = c(2.5, 6.5), age.max = c(5.5, 9))
mars_dated <- chronos(mars_tr, calibration = cal, control = ctrl)
mars_dated$edge.length <- mars_dated$edge.length * 10
mars_dated$root.time <- max(branching.times(mars_dated))

# Plot dating analysis
pdf("marsupials_dated.pdf", height = 10, width = 10)
geoscalePhylo(
mars_dated,
units = c("Period", "Epoch"),
boxes = "Epoch",
width = 3,
cex.age = 1.5,
cex.ts = 1.5,
cex.tip = 1.5
)
dev.off()
</pre>

Q17. What do these date inferences suggest about the diversification of marsupials with relation to the Cretaceous/Palaeogene mass extinction event, or other major geological transitions?

Q18. What forms of uncertainty are missing in this dated tree figure, and how would you consider incorporating them?

Program 2026

2026-01-12T09:14:14Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and for collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post them on Discord.

The course has two main parts: the first half consists of lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secure laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<DD>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:55pm</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00pm-2:55pm</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00pm-3:55pm</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation: prepare an outline for tomorrow. Please write your group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026].</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation: check when your 3-minute slot is in the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet].</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Program 2026

2026-01-12T08:44:33Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and for collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Discord.

The course has two main parts: the first half consists of lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secure laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:55pm</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00pm-2:55pm</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00pm-3:55pm</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Metabarcoding Exercise]]) ([[Metabarcoding Solution]])</DD>
<DD> Luke Holman</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation: prepare an outline for tomorrow. Please write your group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026].</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation: check when your 3-minute slot is in the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet].</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Exercise

2026-01-11T07:50:14Z

Mick:

<h2>Phylogenomics</h2>


David A. Duchene



Marsupials are a group of mammals that are unique to Australasia and the Americas. Several major groups of marsupials first appeared between 50 and 70 million years ago, during events of fast diversification. Given these are ancient and fast events, resolving the relationships among early marsupials is difficult, and remains a matter of interest in mammalian biology.



Today's exercises focus on the most fundamental concepts in phylogenomics, with the aim of resolving longstanding questions of the evolution of Australasian marsupials.



Let's make sense of our current understanding of marsupial relationships by coding them in Newick format.


<hr>

<h2>Exercise 1</h2>


Open R and load the required packages:


<pre>
library(ape, lib.loc = "/home/ctools/Rlibs")
library(strap, lib.loc = "/home/ctools/Rlibs")

</pre>


Create an object containing a Newick tree:


<pre>
myTree <- read.tree(text = "WRITE NEWICK HERE")
</pre>


To write this tree, translate the following verbal description of marsupial relationships into Newick format:



The Wallabies are sister to the Kangaroos, and this broader grouping is sister to the Possums. Sister to all these is the grouping that contains the Koalas and the Wombats. Yet another, separate group of marsupials contains the carnivorous Numbats, whose sister is a group containing the Tasmanian Devil and the now-extinct Tasmanian Tiger. It is hypothesised that the sister to these carnivorous marsupials is a group containing the Marsupial Mole, whose closer sister is a group containing the Bandicoots and the Bilby. Sister to all of the marsupials mentioned so far is the enigmatic American Monito del Monte, and sister yet to all of these are the American Opossums. Finally, the Platypus and the Echidna form a group that is sister to all other mammals.



Make sure you add a semicolon (<code>;</code>) at the end of your tree. Now try to rearrange the names so that they are ordered from the least diverse group to the most diverse, while keeping the relationships intact.



Compare your tree with the student sitting next to you. Discuss whether the Newick trees are different. Then evaluate whether the relationships in your trees are the same, even if the exact written text string is different.
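As a small illustration of this last point, the sketch below compares Newick strings that are written differently. The four-taxon trees and the names <code>t1</code>, <code>t2</code> and <code>t3</code> are made up for this example and are not part of the exercise data. It also assumes that <code>ape</code> can be loaded from your default library path; if not, add the <code>lib.loc</code> argument as in the loading step above.

```r
library(ape)

# Two ways of writing the same relationships: the clades are only rotated
t1 <- read.tree(text = "((A,B),(C,D));")
t2 <- read.tree(text = "((D,C),(B,A));")

# A tree with genuinely different relationships
t3 <- read.tree(text = "((A,C),(B,D));")

# Compare topologies only (these toy trees carry no branch lengths)
same_12 <- isTRUE(all.equal(t1, t2, use.edge.length = FALSE))
same_13 <- isTRUE(all.equal(t1, t3, use.edge.length = FALSE))
print(c(t1_vs_t2 = same_12, t1_vs_t3 = same_13))
```

The same comparison can be applied to your own tree and your neighbour's, provided both use identical tip names.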



If you get too many errors, then use:


<pre>
myTree <- read.tree(text = "((Elephant,Armadillo),(((Squirrel,Rabbit),(Monkey,Treeshrew)),(Shrew,(Whale,(Bat,(Cat,Rhinoceros))))));")
</pre>


Now plot your tree into a PDF using two different representations:


<pre>
pdf("myTree.pdf", height = 7, width = 14)
par(mfrow = c(1, 2))
plot(myTree, type = "phylogram")
plot(myTree, type = "unrooted")
dev.off()
</pre>

Q1. Do the two trees in the file contain the same information?

Q2. Can you draw any information from the branch lengths in these trees?

Q3. What information about the timing of each of these divergence events is available in the first tree?

Q4. Which of the two trees might be the most appropriate in cases where you have little prior information about the data set?

<hr>

<h2>Exercise 2</h2>


Load two sequence data sets (one unaligned, one aligned), then print basic information about them and view a small portion of each:


<pre>
# Read data
unaligned_mars <- read.FASTA("/home/projects/22126_NGS/exercises/phylogenomics/marsupials_unaligned.fasta")
aligned_mars <- as.matrix(read.FASTA("/home/projects/22126_NGS/exercises/phylogenomics/marsupials_aligned.fasta"))

###########################
# Summary of unaligned data
###########################
unaligned_mars

###########################
# Summary of aligned data
###########################
aligned_mars

#########################
# Start of unaligned data
#########################
noquote(do.call(rbind, lapply(as.character(unaligned_mars), `[`, 1:10)))

#########################
# Start of aligned data
#########################
noquote(as.character(aligned_mars)[1:11, 1:10])
</pre>

Q5. What are the primary differences between these two alignments, and why is only one of them suitable for phylogenetic inference?


The following code will remove any alignment sites (columns) with missing data (aka gaps or indels). It then builds basic trees from the complete and filtered alignments using two methods (ordinary least squares, <code>ols</code>, and balanced minimum evolution, <code>bme</code>):


<pre>
# Filter out sites with missing data (keep only columns without any gap)
filtered_mars <- aligned_mars[, colSums(as.character(aligned_mars) == "-") == 0]

# Make matrices of pairwise distances between taxa
dists_full <- dist.dna(aligned_mars, model = "K80", pairwise.deletion = TRUE)
dists_filt <- dist.dna(filtered_mars, model = "K80")

# Make trees for the two data sets, under two methods each
basicTrees <- list()
basicTrees$full_ols <- fastme.ols(dists_full)
basicTrees$full_bme <- fastme.bal(dists_full)
basicTrees$filt_ols <- fastme.ols(dists_filt)
basicTrees$filt_bme <- fastme.bal(dists_filt)
</pre>
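To see what the site filter does, here is a minimal sketch on a made-up alignment of three sequences and five sites. The object names <code>toy</code>, <code>has_gap</code> and <code>toy_filtered</code> are hypothetical, and the example assumes <code>ape</code> is loadable from your default library path.

```r
library(ape)

# Toy alignment: rows are taxa, columns are sites; "-" marks a gap
toy <- as.DNAbin(rbind(
  tax1 = c("a", "c", "g", "t", "-"),
  tax2 = c("a", "c", "-", "t", "a"),
  tax3 = c("a", "t", "g", "t", "a")
))

# A site is flagged if at least one taxon has a gap there
has_gap <- colSums(as.character(toy) == "-") > 0

# Keep only the sites without gaps
toy_filtered <- toy[, !has_gap]
print(dim(toy_filtered))
```

Sites 3 and 5 each contain one gap, so both are dropped for every taxon, including the taxa that do have a nucleotide at those positions.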

Q6. Before looking at any of the trees, what do you think are the benefits and drawbacks of removing sites with missing data?


Try plotting a few of these trees with the approaches that you used in Exercise 1.



Next, we will look at the total lengths of these trees:


<pre>
lapply(basicTrees, function(x) sum(x$edge.length))
</pre>

Q7. What do these tree lengths measure?

Q8. Why is there a difference between the filtered and unfiltered data sets?


Do not worry at this stage about the differences between the two methods, but if you have time discuss with your partner what the difference is and what it means for your interpretation of the data.


<hr>

<h2>Exercise 3</h2>


From within R, let's run IQ-TREE 3 under two different substitution models, adding statistical supports for the branches (upcoming lecture):


<pre>
# Run maximum-likelihood with a very simple model
system("/home/ctools/iqtree-3.0.1-Linux/bin/iqtree3 -s /home/projects/22126_NGS/exercises/phylogenomics/marsupials_aligned.fasta -m JC -bb 1000 -pre mars_jc")

# Run maximum-likelihood with a more complex model
system("/home/ctools/iqtree-3.0.1-Linux/bin/iqtree3 -s /home/projects/22126_NGS/exercises/phylogenomics/marsupials_aligned.fasta -m GTR+R6 -bb 1000 -pre mars_gtr")
</pre>


Now let's visualise the trees from the three different methods so far:


<pre>
# Read maximum likelihood trees
mars_jc <- read.tree("mars_jc.treefile")
mars_gtr <- read.tree("mars_gtr.treefile")

# Plot all four inferred trees into a PDF
pdf("marsupial_trees.pdf", height = 10, width = 10)
par(mfrow = c(2, 2))
plot(basicTrees$full_ols, type = "unrooted", main = "Ordinary Least Squares")
plot(basicTrees$full_bme, type = "unrooted", main = "Balanced Min Evolution")
plot(mars_jc, type = "unrooted", main = "Max Likelihood (JC)")
plot(mars_gtr, type = "unrooted", main = "Max Likelihood (GTR)")
dev.off()
</pre>

Q9. Do you think that these methods lead to substantially different results? Lay out a few reasons for your answer.

<hr>

<h2>Exercise 4</h2>


Using the runs from the previous exercise, let's open the <code>.iqtree</code> files from each run (you can then exit by pressing <code>q</code>) and examine some of the details of the analyses.


<pre>
# Output summary for the run with the simple model
system("less mars_jc.iqtree")

# Output summary for the run with the more complex model
system("less mars_gtr.iqtree")
</pre>

Q10. Which of the two models has more parameters (more complexity) and which model has the best BIC score (i.e. the lowest), and what does this tell you about the two models?

Q11. Does one model infer the total tree length to be much greater than the other? Discuss the possible reason for this with the student beside you.

Q12. From the GTR run and “Rate Parameter R”, which pair of nucleotides has the most common type of substitution, and does this tell you anything about the biochemistry of the molecules analysed?

Q13. From the same run and the “Model of rate heterogeneity”, are there any portions of the data that evolve much faster than the rest? Note that, for example, a relative rate of 2 means a portion of the data is evolving twice as fast as the mean.

Q14. From the time stamps at the bottom of this file, did one model take much longer than the other, and what could this mean if you have a very large data set?
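For reference when reading the BIC scores, the sketch below computes BIC from its standard definition, BIC = k·ln(n) − 2·lnL, where k is the number of free parameters, n the number of alignment sites and lnL the log-likelihood. The helper <code>bic</code> and all numbers are hypothetical and are not taken from the marsupial runs.

```r
# BIC = k * ln(n) - 2 * lnL; lower values indicate a better trade-off
# between model fit and model complexity
bic <- function(lnL, k, n) k * log(n) - 2 * lnL

# Hypothetical values, NOT taken from the marsupial analyses
n_sites <- 1000
bic_simple  <- bic(lnL = -5200, k = 20, n = n_sites)
bic_complex <- bic(lnL = -5100, k = 40, n = n_sites)
print(c(simple = bic_simple, complex = bic_complex))
```

In this made-up case the more complex model is preferred, because its gain in log-likelihood outweighs the penalty for its additional parameters.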


Now let's examine the branch supports from one of these runs, using the tree that you loaded previously into R.


<pre>
pdf("mars_branch_supports.pdf")
plot(mars_gtr, type = "unrooted", use.edge.length = FALSE)
nodelabels(mars_gtr$node.label, frame = "circle", bg = "white")
dev.off()
</pre>

Q15. What does this tell you about our overall confidence in marsupial relationships from these data, and which are likely the most difficult relationships to resolve?

<hr>

<h2>Exercise 5</h2>


A previous set of analyses has led to the gene trees for several genomic regions. Read and briefly explore these data in R.


<pre>
mars_trs <- read.tree("/home/projects/22126_NGS/exercises/phylogenomics/marsupials.tree")

# Plot 9 randomly chosen trees from the set
pdf("mars_example_gene_trees.pdf", height = 15, width = 15)
par(mfrow = c(3, 3), mar = c(0.5, 0.5, 0.5, 0.5))
for(i in sample(1:length(mars_trs), 9)) plot(mars_trs[[i]], type = "unrooted", cex = 1.5)
dev.off()
</pre>


Examine the trees in the PDF and determine whether any of them have surprising relationships at deep branches.
Speculate on the possible causes of the discordance (hint: what could be the influence/relevance of the branch lengths?).



Let's build a tree from these gene trees using a fast consensus method, and compare it with the tree from maximum likelihood.


<pre>
mars_cons <- consensus(mars_trs, p = 0.5)

pdf("mars_main_trees.pdf", height = 7, width = 14)
par(mfrow = c(1, 2))
plot(mars_gtr, type = "unrooted", main = "Max Likelihood (GTR)")
plot(mars_cons, type = "unrooted", main = "Majority-Rule Consensus")
dev.off()
</pre>
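The idea behind a majority-rule consensus can be sketched with three made-up gene trees. The taxon names A to E and the objects <code>toy_trees</code> and <code>toy_cons</code> are hypothetical, and <code>ape</code> is assumed to be loadable from your default library path.

```r
library(ape)

# Three toy gene trees: A and B group together in two of the three
toy_trees <- read.tree(text = c(
  "((A,B),(C,(D,E)));",
  "((A,B),(D,(C,E)));",
  "((A,C),(B,(D,E)));"
))

# Keep only groupings found in more than half of the trees
toy_cons <- consensus(toy_trees, p = 0.5)

# Is the A+B grouping retained in the consensus?
ab_kept <- is.monophyletic(toy_cons, c("A", "B"))
print(ab_kept)
```

Groupings supported by only one of the three trees, such as A with C, do not reach the majority threshold and are left out of the consensus.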

Q16. How would you qualify the signal in the gene trees regarding the early branching events in the marsupial tree, and what do you think were the biological processes that led to this signal?

<hr>

<h2>Exercise 6</h2>


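Molecular dating is a difficult and advanced analysis. However, we can sometimes rely on fast methods for very large data sets or exploratory analysis. In the following, we root our tree of the marsupials and provide it to a fast dating method. We apply two calibrations: one for the root (65–90 Mya) and one for the split between Koalas and Wombats (25–55 Mya). In the code, the calibration ages are entered in units of 10 million years, and the branch lengths of the dated tree are multiplied by 10 afterwards to convert them to millions of years.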


<pre>
# Root IQ-TREE inference
mars_tr <- root(mars_gtr, "Opossum", resolve.root = TRUE)

# Perform dating analysis
ctrl <- chronos.control(dual.iter.max = 1000)
cal <- data.frame(node = c(20, 12), age.min = c(2.5, 6.5), age.max = c(5.5, 9))
mars_dated <- chronos(mars_tr, calibration = cal, control = ctrl)
mars_dated$edge.length <- mars_dated$edge.length * 10
mars_dated$root.time <- max(branching.times(mars_dated))

# Plot dating analysis
pdf("marsupials_dated.pdf", height = 10, width = 10)
geoscalePhylo(
mars_dated,
units = c("Period", "Epoch"),
boxes = "Epoch",
width = 3,
cex.age = 1.5,
cex.ts = 1.5,
cex.tip = 1.5
)
dev.off()
</pre>
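The <code>node</code> column of the calibration table holds the internal node numbers that <code>ape</code> uses for the rooted tree. As a minimal sketch of how such numbers can be looked up, the example below uses a made-up four-taxon tree. The names <code>toy</code>, <code>cd_node</code> and <code>toy_cal</code> and the age bounds are hypothetical, and <code>ape</code> is assumed to be loadable from your default library path.

```r
library(ape)

toy <- read.tree(text = "((A,B),(C,D));")

# In a rooted "phylo" object the root is numbered Ntip + 1
root_node <- Ntip(toy) + 1

# Node number of the most recent common ancestor of C and D
cd_node <- getMRCA(toy, c("C", "D"))

# Calibration table in the same layout as used above (made-up ages)
toy_cal <- data.frame(
  node = c(cd_node, root_node),
  age.min = c(1, 5),
  age.max = c(2, 6)
)
print(toy_cal)
```

Looking up nodes by their descendant tips in this way avoids hard-coding numbers that change if the tree is rooted or ordered differently.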

Q17. What do these date inferences suggest about the diversification of marsupials in relation to the Cretaceous/Palaeogene mass extinction event, or other major geological transitions?

Q18. What forms of uncertainty are missing in this dated tree figure, and how would you consider incorporating them?

Exercise

2026-01-11T07:47:45Z

Mick:

<h2>Phylogenomics</h2>


David A. Duchene



Marsupials are a group of mammals that are unique to Australasia and the Americas. Several major groups of marsupials first appeared between 50 and 70 million years ago, during events of fast diversification. Given these are ancient and fast events, resolving the relationships among early marsupials is difficult, and remains a matter of interest in mammalian biology.



Today's exercises focus on the most fundamental concepts in phylogenomics, with the aim of resolving longstanding questions of the evolution of Australasian marsupials.



Let's make sense of our current understanding of marsupial relationships by coding them in Newick format.


<hr>

<h2>Exercise 1</h2>


Open R and load the required packages:


<pre>
library(ape, lib.loc = "/home/ctools/Rlibs")
library(strap, lib.loc = "/home/ctools/Rlibs")

</pre>


Create an object containing a Newick tree:


<pre>
myTree <- read.tree(text = "WRITE NEWICK HERE")
</pre>


To write this tree, follow the verbal description of marsupial relationships in Newick format:



The Wallabies are sisters to the Kangaroos, and this broader grouping is sister to the Possums. Sister to all these is the grouping that contains the Koalas and the Wombats. Yet another, separate group of marsupials contains the carnivorous Numbats, whose sister is a group containing the Tasmanian Devil and the now-extinct Tasmanian Tiger. It is hypothesised that the sister to these carnivorous marsupials is a group containing the Marsupial Mole, whose closer sister is a group containing the Bandicoots and the Bilby. Sister to all of the marsupials mentioned so far is the enigmatic American Monito del Monte, and sister yet to all of these are the American Opossums. Finally, the Platypus and the Echidna form a group that is sister to all other mammals.



Make sure you add a semicolon (<code>;</code>) at the end of your tree. Now attempt to rearrange the names around so that they are in order of the least diverse to the most, while maintaining the relationships intact.



Compare your tree with the student sitting next to you. Discuss whether the Newick trees are different. Then evaluate whether the relationships in your trees are the same, even if the exact written text string is different.
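
One way to check this comparison in R is sketched below. Two Newick strings are written differently but describe the same relationships, and a third groups the tips differently. The four-taxon trees and the object names are illustrative assumptions rather than part of the exercise data, and the sketch assumes <code>ape</code> is available on the default library path.

<pre>
library(ape)

# Two hypothetical Newick strings: same relationships, clades written in a different order
tree_a <- read.tree(text = "((A,B),(C,D));")
tree_b <- read.tree(text = "((D,C),(B,A));")

# A third string that groups the tips differently
tree_c <- read.tree(text = "((A,C),(B,D));")

# Compare the relationships only (these trees carry no branch lengths)
same_ab <- isTRUE(all.equal(tree_a, tree_b, use.edge.length = FALSE))
same_ac <- isTRUE(all.equal(tree_a, tree_c, use.edge.length = FALSE))

same_ab
same_ac
</pre>

The same call can be applied to your own tree and your neighbour's tree, provided both use identical tip names.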



If you get too many errors, then use:


<pre>
myTree <- read.tree(text = "((Elephant,Armadillo),(((Squirrel,Rabbit),(Monkey,Treeshrew)),(Shrew,(Whale,(Bat,(Cat,Rhinoceros))))));")
</pre>


Now plot your tree into a PDF using two different representations:


<pre>
pdf("myTree.pdf", height = 7, width = 14)
par(mfrow = c(1, 2))
plot(myTree, type = "phylogram")
plot(myTree, type = "unrooted")
dev.off()
</pre>

Q1. Do the two trees in the file contain the same information?
Q2. Can you draw any information from the branch lengths in these trees?
Q3. What information about the timing of each of these divergence events is available in the first tree?
Q4. Which of the two trees might be the most appropriate in cases where you have little prior information about the data set?

<hr>

<h2>Exercise 2</h2>


Load two data alignments, and then open the basic information about them and visualize a small portion:


<pre>
# Read data
unaligned_mars <- read.FASTA("marsupials_unaligned.fasta")
aligned_mars <- as.matrix(read.FASTA("marsupials_aligned.fasta"))

###########################
# Summary of unaligned data
###########################
unaligned_mars

###########################
# Summary of aligned data
###########################
aligned_mars

#########################
# Start of unaligned data
#########################
noquote(do.call(rbind, lapply(as.character(unaligned_mars), `[`, 1:10)))

#########################
# Start of aligned data
#########################
noquote(as.character(aligned_mars)[1:11, 1:10])
</pre>

Q5. What are the primary differences between these two alignments, and why is only one of them suitable for phylogenetic inference?


The following code will remove any alignment sites (columns) with missing data (aka gaps or indels). It then builds basic trees from the complete and filtered alignments using two methods (ordinary least squares, <code>ols</code>, and balanced minimum evolution, <code>bme</code>):


<pre>
# Filter out sites with missing data (keep columns without any gap character)
filtered_mars <- aligned_mars[, colSums(as.character(aligned_mars) == "-") == 0]

# Make matrices of pairwise distances between taxa
dists_full <- dist.dna(aligned_mars, model = "K80", pairwise.deletion = TRUE)
dists_filt <- dist.dna(filtered_mars, model = "K80")

# Make trees for the two data sets, under two methods each
basicTrees <- list()
basicTrees$full_ols <- fastme.ols(dists_full)
basicTrees$full_bme <- fastme.bal(dists_full)
basicTrees$filt_ols <- fastme.ols(dists_filt)
basicTrees$filt_bme <- fastme.bal(dists_filt)
</pre>
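
To see what the column filter does, the following sketch applies the same idea to a small character matrix in base R. The three toy sequences and the object names are made up for illustration and are not taken from the marsupial alignment.

<pre>
# Hypothetical alignment: rows are taxa, columns are sites, "-" marks a gap
toy_aln <- rbind(
  taxonA = c("a", "c", "-", "g", "t"),
  taxonB = c("a", "c", "g", "g", "-"),
  taxonC = c("a", "t", "g", "g", "t")
)

# Flag every column that contains at least one gap
has_gap <- colSums(toy_aln == "-") > 0

# Keep only the gap-free columns
toy_filtered <- toy_aln[, !has_gap, drop = FALSE]

ncol(toy_aln)
ncol(toy_filtered)
</pre>

A single gap in one taxon is enough to discard the whole site, so this filter can shorten a gappy alignment considerably.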

Q6. Before looking at any of the trees, what do you think are the benefits and drawbacks of removing sites with missing data?


Try plotting a few of these trees with the approaches that you used in Exercise 1.



Next, we will look at the total lengths of these trees:


<pre>
lapply(basicTrees, function(x) sum(x$edge.length))
</pre>

Q7. What do these tree lengths measure?
Q8. Why is there a difference between the filtered and unfiltered data sets?


Do not worry at this stage about the differences between the two methods, but if you have time discuss with your partner what the difference is and what it means for your interpretation of the data.


<hr>

<h2>Exercise 3</h2>


From within R, let's run IQ-TREE 3 under two different substitution models, adding statistical supports for the branches (upcoming lecture):


<pre>
# Run maximum-likelihood with a very simple model
system("/home/ctools/iqtree-3.0.1-Linux/bin/iqtree3 -s marsupials_aligned.fasta -m JC -bb 1000 -pre mars_jc")

# Run maximum-likelihood with a more complex model
system("/home/ctools/iqtree-3.0.1-Linux/bin/iqtree3 -s marsupials_aligned.fasta -m GTR+R6 -bb 1000 -pre mars_gtr")
</pre>
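
As an optional variation, the sketch below assembles the same IQ-TREE arguments as a character vector and only launches the run when the program and the alignment are found. The helper <code>iqtree_args</code> is a hypothetical name introduced here for illustration. The flags are the ones used in the commands above.

<pre>
# Hypothetical helper: assemble the IQ-TREE arguments used above as a character vector
iqtree_args <- function(alignment, model, prefix, n_boot = 1000) {
  c("-s", alignment, "-m", model, "-bb", as.character(n_boot), "-pre", prefix)
}

iqtree_bin <- "/home/ctools/iqtree-3.0.1-Linux/bin/iqtree3"
args_jc <- iqtree_args("marsupials_aligned.fasta", "JC", "mars_jc")

# Only launch the run when the binary and the alignment are both present
if (file.exists(iqtree_bin) && file.exists("marsupials_aligned.fasta")) {
  status <- system2(iqtree_bin, args_jc)
  if (status != 0) warning("IQ-TREE exited with status ", status)
}
</pre>

Checking the exit status makes it easier to notice a failed run before trying to read the <code>.treefile</code> output.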


Now let's visualise the trees from the three different methods so far:


<pre>
# Read maximum likelihood trees
mars_jc <- read.tree("mars_jc.treefile")
mars_gtr <- read.tree("mars_gtr.treefile")

# Plot all four inferred trees into a PDF
pdf("marsupial_trees.pdf", height = 10, width = 10)
par(mfrow = c(2, 2))
plot(basicTrees$full_ols, type = "unrooted", main = "Ordinary Least Squares")
plot(basicTrees$full_bme, type = "unrooted", main = "Balanced Min Evolution")
plot(mars_jc, type = "unrooted", main = "Max Likelihood (JC)")
plot(mars_gtr, type = "unrooted", main = "Max Likelihood (GTR)")
dev.off()
</pre>

Q9. Do you think that these methods lead to substantially different results? Lay out a few reasons for your answer.

<hr>

<h2>Exercise 4</h2>


Using the runs from the previous exercise, let's open the <code>.iqtree</code> files from each run (you can then exit by pressing <code>q</code>) and examine some of the details of the analyses.


<pre>
# Output summary for the run with the simple model
system("less mars_jc.iqtree")

# Output summary for the run with the more complex model
system("less mars_gtr.iqtree")
</pre>

Q10. Which of the two models has more parameters (more complexity) and which model has the best BIC score (i.e. the lowest), and what does this tell you about the two models?
Q11. Does one model infer the total tree length to be much greater than the other? Discuss the possible reason for this with the student beside you.
Q12. From the GTR run and “Rate Parameter R”, which pairs of nucleotides show the most common types of substitution, and does this tell you anything about the biochemistry of the molecules analysed?
Q13. From the same run and the “Model of rate heterogeneity”, are there any portions of the data that evolve much faster than the rest? Note that, for example, a relative rate of 2 means a portion of the data is evolving twice as fast as the mean.
Q14. From the time stamps at the bottom of this file, did one model take much longer than the other, and what could this mean if you have a very large data set?


Now let's examine the branch supports from one of these runs, using the tree that you loaded previously into R.


<pre>
pdf("mars_branch_supports.pdf")
plot(mars_gtr, type = "unrooted", use.edge.length = FALSE)
nodelabels(mars_gtr$node.label, frame = "circle", bg = "white")
dev.off()
</pre>
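
Support values can also be inspected numerically rather than only on the plot. The sketch below uses a hypothetical five-taxon tree with made-up support values. The cutoff of 95 is a commonly cited guideline for ultrafast bootstrap support and is used here as an assumption, not as a rule from this exercise.

<pre>
library(ape)

# Hypothetical tree with support values stored as internal node labels
toy <- read.tree(text = "(((A,B)100,(C,D)62)88,E);")

# Node labels are read as text; the unlabelled root becomes NA after conversion
support <- suppressWarnings(as.numeric(toy$node.label))

# Internal nodes are numbered from Ntip + 1 upward, in the order of node.label
weak_nodes <- (Ntip(toy) + seq_along(support))[!is.na(support) & support < 95]
weak_nodes
</pre>

The same lines can be run on <code>mars_gtr</code> to list the nodes that fall below the chosen cutoff.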

Q15. What does this tell you about our overall confidence in marsupial relationships from these data, and which are likely the most difficult relationships to resolve?

<hr>

<h2>Exercise 5</h2>


A previous set of analyses has led to the gene trees for several genomic regions. Read and briefly explore these data in R.


<pre>
mars_trs <- read.tree("marsupials.tree")

# Plot 9 randomly chosen trees from the set
pdf("mars_example_gene_trees.pdf", height = 15, width = 15)
par(mfrow = c(3, 3), mar = c(0.5, 0.5, 0.5, 0.5))
for(i in sample(1:length(mars_trs), 9)) plot(mars_trs[[i]], type = "unrooted", cex = 1.5)
dev.off()
</pre>


Examine the trees in the PDF and determine whether any of them have surprising relationships at deep branches.
Speculate on the possible causes of the discordance (hint: what could be the influence/relevance of the branch lengths?).



Let's use these trees and a fast consensus method of tree inference, and compare the tree with that from maximum likelihood.


<pre>
mars_cons <- consensus(mars_trs, p = 0.5)

pdf("mars_main_trees.pdf", height = 7, width = 14)
par(mfrow = c(1, 2))
plot(mars_gtr, type = "unrooted", main = "Max Likelihood (GTR)")
plot(mars_cons, type = "unrooted", main = "Majority-Rule Consensus")
dev.off()
</pre>
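
To build intuition for the majority-rule consensus, here is a minimal sketch with three hypothetical gene trees for five made-up taxa. The groupings A+B and D+E each appear in two of the three trees, so they are the groupings expected to be retained when <code>p = 0.5</code>.

<pre>
library(ape)

# Three hypothetical gene trees that disagree on some groupings
gene_trees <- read.tree(text = c(
  "((A,B),(C,(D,E)));",
  "((A,B),(D,(C,E)));",
  "((A,C),(B,(D,E)));"
))

# Keep groupings that are found in more than half of the trees
toy_cons <- consensus(gene_trees, p = 0.5)

toy_cons$tip.label
Nnode(toy_cons)
</pre>

Groupings that fail to reach the majority threshold are collapsed, which is why a consensus tree can contain unresolved nodes.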

Q16. How would you qualify the signal in the gene trees regarding the early branching events in the marsupial tree, and what do you think were the biological processes that led to this signal?

<hr>

<h2>Exercise 6</h2>


Molecular dating is a difficult and advanced analysis. However, we can sometimes rely on fast methods for very large data sets or exploratory analysis. In the following, we root our tree of the marsupials and provide it to a fast dating method. We apply two calibrations: one for the root (65–90 Mya) and one for the split between Koalas and Wombats (25–55 Mya). In the code, both calibrations are entered in units of 10 million years, and the branch lengths are multiplied by 10 after dating.


<pre>
# Root the IQ-TREE inference on the outgroup
mars_tr <- root(mars_gtr, "Opossum", resolve.root = TRUE)

# Perform dating analysis
# Calibration bounds are in units of 10 million years:
# node 20 is bounded at 2.5-5.5 and node 12 at 6.5-9
ctrl <- chronos.control(dual.iter.max = 1000)
cal <- data.frame(node = c(20, 12), age.min = c(2.5, 6.5), age.max = c(5.5, 9))
mars_dated <- chronos(mars_tr, calibration = cal, control = ctrl)
# Rescale branch lengths by 10 so that ages read in millions of years
mars_dated$edge.length <- mars_dated$edge.length * 10
mars_dated$root.time <- max(branching.times(mars_dated))

# Plot dating analysis
pdf("marsupials_dated.pdf", height = 10, width = 10)
geoscalePhylo(
mars_dated,
units = c("Period", "Epoch"),
boxes = "Epoch",
width = 3,
cex.age = 1.5,
cex.ts = 1.5,
cex.tip = 1.5
)
dev.off()
</pre>
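
The node numbers in <code>cal</code> depend on how the tree is stored in R, and the bounds are expressed in the same units as the branch lengths before they are multiplied by 10. The sketch below shows one way to look up node numbers and to read the bounds in millions of years, assuming the factor of 10 converts to millions of years. The toy tree and its tip names are hypothetical apart from the outgroup name.

<pre>
library(ape)

# Hypothetical rooted tree; tip names other than Opossum are illustrative
toy <- read.tree(text = "(Opossum,((Koala,Wombat),(Kangaroo,Wallaby)));")

# In a phylo object the root is numbered Ntip + 1
root_node <- Ntip(toy) + 1

# Node number of the most recent common ancestor of two tips
kw_node <- getMRCA(toy, c("Koala", "Wombat"))

# Same bounds as in the exercise, then rescaled by the assumed factor of 10
scale_factor <- 10
toy_cal <- data.frame(node = c(kw_node, root_node),
                      age.min = c(2.5, 6.5),
                      age.max = c(5.5, 9))
toy_cal_mya <- transform(toy_cal,
                         age.min = age.min * scale_factor,
                         age.max = age.max * scale_factor)
toy_cal_mya
</pre>

Looking up nodes by tip names in this way avoids hard-coding numbers that change whenever the tree is re-rooted or re-read.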

Q17. What do these date inferences suggest about the diversification of marsupials with relation to the Cretaceous/Palaeogene mass extinction event, or other major geological transitions?

Q18. What forms of uncertainty are missing in this dated tree figure, and how would you consider incorporating them?

Exercise

2026-01-11T07:43:39Z

Mick:

<h2>Phylogenomics</h2>


David A. Duchene



Marsupials are a group of mammals that are unique to Australasia and the Americas. Several major groups of marsupials first appeared between 50 and 70 million years ago, during events of fast diversification. Given these are ancient and fast events, resolving the relationships among early marsupials is difficult, and remains a matter of interest in mammalian biology.



Today's exercises focus on the most fundamental concepts in phylogenomics, with the aim of resolving longstanding questions of the evolution of Australasian marsupials.



Let's make sense of our current understanding of marsupial relationships by coding them in Newick format.


<hr>

<h2>Exercise 1</h2>


Open R and load the required packages:


<pre>
library(ape)
library(strap)
</pre>


Create an object containing a Newick tree:


<pre>
myTree <- read.tree(text = "WRITE NEWICK HERE")
</pre>


To write this tree, follow the verbal description of marsupial relationships in Newick format:



The Wallabies are sisters to the Kangaroos, and this broader grouping is sister to the Possums. Sister to all these is the grouping that contains the Koalas and the Wombats. Yet another, separate group of marsupials contains the carnivorous Numbats, whose sister is a group containing the Tasmanian Devil and the now-extinct Tasmanian Tiger. It is hypothesised that the sister to these carnivorous marsupials is a group containing the Marsupial Mole, whose closer sister is a group containing the Bandicoots and the Bilby. Sister to all of the marsupials mentioned so far is the enigmatic American Monito del Monte, and sister yet to all of these are the American Opossums. Finally, the Platypus and the Echidna form a group that is sister to all other mammals.



Make sure you add a semicolon (<code>;</code>) at the end of your tree. Now attempt to rearrange the names around so that they are in order of the least diverse to the most, while maintaining the relationships intact.



Compare your tree with the student sitting next to you. Discuss whether the Newick trees are different. Then evaluate whether the relationships in your trees are the same, even if the exact written text string is different.



If you get too many errors, then use:


<pre>
myTree <- read.tree(text = "((Elephant,Armadillo),(((Squirrel,Rabbit),(Monkey,Treeshrew)),(Shrew,(Whale,(Bat,(Cat,Rhinoceros))))));")
</pre>


Now plot your tree into a PDF using two different representations:


<pre>
pdf("myTree.pdf", height = 7, width = 14)
par(mfrow = c(1, 2))
plot(myTree, type = "phylogram")
plot(myTree, type = "unrooted")
dev.off()
</pre>

Q1. Do the two trees in the file contain the same information?
Q2. Can you draw any information from the branch lengths in these trees?
Q3. What information about the timing of each of these divergence events is available in the first tree?
Q4. Which of the two trees might be the most appropriate in cases where you have little prior information about the data set?

<hr>

<h2>Exercise 2</h2>


Load two data alignments, and then open the basic information about them and visualize a small portion:


<pre>
# Read data
unaligned_mars <- read.FASTA("marsupials_unaligned.fasta")
aligned_mars <- as.matrix(read.FASTA("marsupials_aligned.fasta"))

###########################
# Summary of unaligned data
###########################
unaligned_mars

###########################
# Summary of aligned data
###########################
aligned_mars

#########################
# Start of unaligned data
#########################
noquote(do.call(rbind, lapply(as.character(unaligned_mars), `[`, 1:10)))

#########################
# Start of aligned data
#########################
noquote(as.character(aligned_mars)[1:11, 1:10])
</pre>

Q5. What are the primary differences between these two alignments, and why is only one of them suitable for phylogenetic inference?


The following code will remove any alignment sites (columns) with missing data (aka gaps or indels). It then builds basic trees from the complete and filtered alignments using two methods (ordinary least squares, <code>ols</code>, and balanced minimum evolution, <code>bme</code>):


<pre>
# Filter out sites with missing data (keep columns without any gap character)
filtered_mars <- aligned_mars[, colSums(as.character(aligned_mars) == "-") == 0]

# Make matrices of pairwise distances between taxa
dists_full <- dist.dna(aligned_mars, model = "K80", pairwise.deletion = TRUE)
dists_filt <- dist.dna(filtered_mars, model = "K80")

# Make trees for the two data sets, under two methods each
basicTrees <- list()
basicTrees$full_ols <- fastme.ols(dists_full)
basicTrees$full_bme <- fastme.bal(dists_full)
basicTrees$filt_ols <- fastme.ols(dists_filt)
basicTrees$filt_bme <- fastme.bal(dists_filt)
</pre>

Q6. Before looking at any of the trees, what do you think are the benefits and drawbacks of removing sites with missing data?


Try plotting a few of these trees with the approaches that you used in Exercise 1.



Next, we will look at the total lengths of these trees:


<pre>
lapply(basicTrees, function(x) sum(x$edge.length))
</pre>

Q7. What do these tree lengths measure?
Q8. Why is there a difference between the filtered and unfiltered data sets?


Do not worry at this stage about the differences between the two methods, but if you have time discuss with your partner what the difference is and what it means for your interpretation of the data.


<hr>

<h2>Exercise 3</h2>


From within R, let's run IQ-TREE 3 under two different substitution models, adding statistical supports for the branches (upcoming lecture):


<pre>
# Run maximum-likelihood with a very simple model
system("/home/ctools/iqtree-3.0.1-Linux/bin/iqtree3 -s marsupials_aligned.fasta -m JC -bb 1000 -pre mars_jc")

# Run maximum-likelihood with a more complex model
system("/home/ctools/iqtree-3.0.1-Linux/bin/iqtree3 -s marsupials_aligned.fasta -m GTR+R6 -bb 1000 -pre mars_gtr")
</pre>


Now let's visualise the trees from the three different methods so far:


<pre>
# Read maximum likelihood trees
mars_jc <- read.tree("mars_jc.treefile")
mars_gtr <- read.tree("mars_gtr.treefile")

# Plot all four inferred trees into a PDF
pdf("marsupial_trees.pdf", height = 10, width = 10)
par(mfrow = c(2, 2))
plot(basicTrees$full_ols, type = "unrooted", main = "Ordinary Least Squares")
plot(basicTrees$full_bme, type = "unrooted", main = "Balanced Min Evolution")
plot(mars_jc, type = "unrooted", main = "Max Likelihood (JC)")
plot(mars_gtr, type = "unrooted", main = "Max Likelihood (GTR)")
dev.off()
</pre>

Q9. Do you think that these methods lead to substantially different results? Lay out a few reasons for your answer.

<hr>

<h2>Exercise 4</h2>


Using the runs from the previous exercise, let's open the <code>.iqtree</code> files from each run (you can then exit by pressing <code>q</code>) and examine some of the details of the analyses.


<pre>
# Output summary for the run with the simple model
system("less mars_jc.iqtree")

# Output summary for the run with the more complex model
system("less mars_gtr.iqtree")
</pre>

Q10. Which of the two models has more parameters (more complexity) and which model has the best BIC score (i.e. the lowest), and what does this tell you about the two models?
Q11. Does one model infer the total tree length to be much greater than the other? Discuss the possible reason for this with the student beside you.
Q12. From the GTR run and “Rate Parameter R”, which pairs of nucleotides show the most common types of substitution, and does this tell you anything about the biochemistry of the molecules analysed?
Q13. From the same run and the “Model of rate heterogeneity”, are there any portions of the data that evolve much faster than the rest? Note that, for example, a relative rate of 2 means a portion of the data is evolving twice as fast as the mean.
Q14. From the time stamps at the bottom of this file, did one model take much longer than the other, and what could this mean if you have a very large data set?


Now let's examine the branch supports from one of these runs, using the tree that you loaded previously into R.


<pre>
pdf("mars_branch_supports.pdf")
plot(mars_gtr, type = "unrooted", use.edge.length = FALSE)
nodelabels(mars_gtr$node.label, frame = "circle", bg = "white")
dev.off()
</pre>

Q15. What does this tell you about our overall confidence in marsupial relationships from these data, and which are likely the most difficult relationships to resolve?

<hr>

<h2>Exercise 5</h2>


A previous set of analyses has led to the gene trees for several genomic regions. Read and briefly explore these data in R.


<pre>
mars_trs <- read.tree("marsupials.tree")

# Plot 9 randomly chosen trees from the set
pdf("mars_example_gene_trees.pdf", height = 15, width = 15)
par(mfrow = c(3, 3), mar = c(0.5, 0.5, 0.5, 0.5))
for(i in sample(1:length(mars_trs), 9)) plot(mars_trs[[i]], type = "unrooted", cex = 1.5)
dev.off()
</pre>


Examine the trees in the PDF and determine whether any of them have surprising relationships at deep branches.
Speculate on the possible causes of the discordance (hint: what could be the influence/relevance of the branch lengths?).



Let's use these trees and a fast consensus method of tree inference, and compare the tree with that from maximum likelihood.


<pre>
mars_cons <- consensus(mars_trs, p = 0.5)

pdf("mars_main_trees.pdf", height = 7, width = 14)
par(mfrow = c(1, 2))
plot(mars_gtr, type = "unrooted", main = "Max Likelihood (GTR)")
plot(mars_cons, type = "unrooted", main = "Majority-Rule Consensus")
dev.off()
</pre>

Q16. How would you qualify the signal in the gene trees regarding the early branching events in the marsupial tree, and what do you think were the biological processes that led to this signal?

<hr>

<h2>Exercise 6</h2>


Molecular dating is a difficult and advanced analysis. However, we can sometimes rely on fast methods for very large data sets or exploratory analysis. In the following, we root our tree of the marsupials and provide it to a fast dating method. We apply two calibrations: one for the root (65–90 Mya) and one for the split between Koalas and Wombats (25–55 Mya). In the code, both calibrations are entered in units of 10 million years, and the branch lengths are multiplied by 10 after dating.


<pre>
# Root the IQ-TREE inference on the outgroup
mars_tr <- root(mars_gtr, "Opossum", resolve.root = TRUE)

# Perform dating analysis
# Calibration bounds are in units of 10 million years:
# node 20 is bounded at 2.5-5.5 and node 12 at 6.5-9
ctrl <- chronos.control(dual.iter.max = 1000)
cal <- data.frame(node = c(20, 12), age.min = c(2.5, 6.5), age.max = c(5.5, 9))
mars_dated <- chronos(mars_tr, calibration = cal, control = ctrl)
# Rescale branch lengths by 10 so that ages read in millions of years
mars_dated$edge.length <- mars_dated$edge.length * 10
mars_dated$root.time <- max(branching.times(mars_dated))

# Plot dating analysis
pdf("marsupials_dated.pdf", height = 10, width = 10)
geoscalePhylo(
mars_dated,
units = c("Period", "Epoch"),
boxes = "Epoch",
width = 3,
cex.age = 1.5,
cex.ts = 1.5,
cex.tip = 1.5
)
dev.off()
</pre>

Q17. What do these date inferences suggest about the diversification of marsupials with relation to the Cretaceous/Palaeogene mass extinction event, or other major geological transitions?

Q18. What forms of uncertainty are missing in this dated tree figure, and how would you consider incorporating them?

Solution

2026-01-11T07:37:45Z

Mick: Created page with "<hr> <h2>Answers</h2> Marsupials Newick <pre> myTree <- read.tree(text = "((Echidna,Platypus),(American_Opossums,(Monito_del_Monte,((((Koalas,Wombats),(Possums,(Kangaroos,Wallabies))),(((Bandicoots,Bilby),Marsupial_Mole),((Tasmanian_Devil,Tasmanian_Tiger),Numbats)))))));") </pre> Q1. The two trees contain the same information about relationships, but the rooted tree additionally contains information about the most recent common ancestor of..."

<hr>

<h2>Answers</h2>

Marsupials Newick

<pre>
myTree <- read.tree(text = "((Echidna,Platypus),(American_Opossums,(Monito_del_Monte,((((Koalas,Wombats),(Possums,(Kangaroos,Wallabies))),(((Bandicoots,Bilby),Marsupial_Mole),((Tasmanian_Devil,Tasmanian_Tiger),Numbats)))))));")
</pre>

Q1. The two trees contain the same information about relationships, but the rooted tree additionally contains information about the most recent common ancestor of the whole set (the root node), and this also adds the order of divergence events in time.

Q2. Branch lengths were not provided, so no information about amount of evolution or time can be drawn from these data.

Q3. The rooted tree gives information about the relative timing of events. For instance, we know that the Opossums were the first to diverge from all other marsupials.

Q4. We should prefer an unrooted tree unless we have strong confidence in the root placement in our data set. Rooting is often done by trusting the signal from an unrelated “outgroup” taxon, but even this method can be misleading, for example when the outgroup is too distantly related and has a poor signal of placement among ingroup branches.

Q5. The unaligned data have different sequence lengths and do not contain gaps/indels. This means that we have not made an inference about the homology of sites in the data. Since homology is a fundamental assumption in phylogenetics, we cannot use the unaligned data for phylogenetic inference.

Q6. Regions that are difficult to align might have excess missing data, such that removing them can be beneficial. That is, we remove regions with excess uncertainty in alignment. Conversely, a small amount of missing data might indicate a genuine insertion/deletion, such that these regions can be highly informative and should not always be removed.

Q7. These lengths measure the summed amount of molecular change across the history of all samples in the data.
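
As a small illustration of this sum, the sketch below reads a hypothetical four-taxon tree with made-up branch lengths and adds them up in the same way as in the exercise.

<pre>
library(ape)

# Hypothetical tree with branch lengths in substitutions per site
toy <- read.tree(text = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.05);")

# Tree length is the sum of all branch lengths
tree_length <- sum(toy$edge.length)
tree_length
</pre>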

Q8. Gappy regions are often fast evolving or poorly aligned, such that they induce a greater amount of evolutionary change in inferences than complete data.

Q9. These methods do not lead to substantially different results. This can occur because the data are highly informative (or extremely uninformative). Another possible reason is that the parameter of interest is not difficult to infer. This is the case with phylogenetic tree topology, but phylogenetic branch lengths are more difficult, and this can be seen in modest differences among methods.

Q10. The GTR+R6 model has more parameters and is therefore more complex. It also has a lower BIC score, suggesting that there is substantial complexity in these data that requires multiple processes to be accounted for. Examining an even richer range of models could be beneficial.
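
For reference, BIC is computed as −2 × log-likelihood + (number of free parameters) × ln(sample size), so extra parameters are only rewarded when they improve the likelihood enough to offset the penalty. The sketch below uses invented log-likelihoods, parameter counts, and alignment length purely for illustration. They are not the values from the marsupial runs.

<pre>
# BIC: lower values indicate a better balance of fit and complexity
bic <- function(log_lik, n_par, n_sites) {
  -2 * log_lik + n_par * log(n_sites)
}

# Hypothetical values for a simple and a complex model on the same alignment
bic_simple <- bic(log_lik = -12000, n_par = 19, n_sites = 5000)
bic_complex <- bic(log_lik = -11500, n_par = 37, n_sites = 5000)

# TRUE when the complex model is preferred despite its larger penalty
bic_complex < bic_simple
</pre>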

Q11. The GTR+R6 model leads to a longer tree, suggesting that it can identify a greater number of evolutionary changes. Its lower BIC indicates that the simple JC model is failing to identify real change, probably because it does not incorporate realistic forms of variation such as rates across sites.

Q12. A–G and C–T changes are far more common than others. These are the two types of transitions, which are comparatively energetically cheap and therefore expected to be far more common than transversions. The data set therefore follows a biochemical expectation.

Q13. There is a portion (~13%) that is evolving 4 times faster than the mean in the data. These could be sites with poor alignment or with limited biological importance and therefore low selection constraints.

Q14. The analysis using GTR+R6 took twice the amount of time, suggesting that in a large data set it could impose a substantial computational and energetic burden. If this burden is judged excessive, it might be necessary to find a compromise with a simpler model that nonetheless captures important forms of variation in the data (e.g., transitions and transversions).

Q15. The branch supports suggest that these data offer limited confidence regarding some deep relationships among marsupial taxa. It seems particularly unclear whether the two types of possums are actually sisters, or whether the Monito del Monte is the sister to all other Australasian marsupials (versus embedded within them).

Q16. The signal across gene trees is largely consistent with that of our inferences directly from nucleotide data. However, the gene trees have substantial uncertainty regarding one of the deepest marsupial nodes, suggesting either the data are insufficient or there was a near-simultaneous diversification event among multiple groups.

Q17. The dates suggest that the split between American and Australasian marsupials occurred around the time of the final split of Gondwana. The Eocene and Oligocene saw the diversification of most of the major groupings of marsupials sampled.

Q18. The figure does not show any uncertainty in the tree topology or in the timing of divergence events. The tree topology could be shown via bootstrap values or a “cloud” of trees, while uncertainty in divergence events could be shown as bars traversing the plausible time period.

Exercise

2026-01-11T07:35:32Z

Mick:

<h2>Phylogenomics</h2>


David A. Duchene



Marsupials are a group of mammals that are unique to Australasia and the Americas. Several major groups of marsupials first appeared between 50 and 70 million years ago, during events of fast diversification. Given these are ancient and fast events, resolving the relationships among early marsupials is difficult, and remains a matter of interest in mammalian biology.



Today's exercises focus on the most fundamental concepts in phylogenomics, with the aim of resolving longstanding questions of the evolution of Australasian marsupials.



Let's make sense of our current understanding of marsupial relationships by coding them in Newick format.


<hr>

<h2>Exercise 1</h2>


Open R and load the required packages:


<pre>
library(ape)
library(strap)
</pre>


Create an object containing a Newick tree:


<pre>
myTree <- read.tree(text = "WRITE NEWICK HERE")
</pre>


To write this tree, follow the verbal description of marsupial relationships in Newick format:



The Wallabies are sisters to the Kangaroos, and this broader grouping is sister to the Possums. Sister to all these is the grouping that contains the Koalas and the Wombats. Yet another, separate group of marsupials contains the carnivorous Numbats, whose sister is a group containing the Tasmanian Devil and the now-extinct Tasmanian Tiger. It is hypothesised that the sister to these carnivorous marsupials is a group containing the Marsupial Mole, whose closer sister is a group containing the Bandicoots and the Bilby. Sister to all of the marsupials mentioned so far is the enigmatic American Monito del Monte, and sister yet to all of these are the American Opossums. Finally, the Platypus and the Echidna form a group that is sister to all other mammals.



Make sure you add a semicolon (<code>;</code>) at the end of your tree. Now attempt to rearrange the names around so that they are in order of the least diverse to the most, while maintaining the relationships intact.



Compare your tree with the student sitting next to you. Discuss whether the Newick trees are different. Then evaluate whether the relationships in your trees are the same, even if the exact written text string is different.



If you get too many errors, then use:


<pre>
myTree <- read.tree(text = "((Elephant,Armadillo),(((Squirrel,Rabbit),(Monkey,Treeshrew)),(Shrew,(Whale,(Bat,(Cat,Rhinoceros))))));")
</pre>


Now plot your tree into a PDF using two different representations:


<pre>
pdf("myTree.pdf", height = 7, width = 14)
par(mfrow = c(1, 2))
plot(myTree, type = "phylogram")
plot(myTree, type = "unrooted")
dev.off()
</pre>

Q1. Do the two trees in the file contain the same information?
Q2. Can you draw any information from the branch lengths in these trees?
Q3. What information about the timing of each of these divergence events is available in the first tree?
Q4. Which of the two trees might be the most appropriate in cases where you have little prior information about the data set?

<hr>

<h2>Exercise 2</h2>


Load two data alignments, and then open the basic information about them and visualize a small portion:


<pre>
# Read data
unaligned_mars <- read.FASTA("marsupials_unaligned.fasta")
aligned_mars <- as.matrix(read.FASTA("marsupials_aligned.fasta"))

###########################
# Summary of unaligned data
###########################
unaligned_mars

###########################
# Summary of aligned data
###########################
aligned_mars

#########################
# Start of unaligned data
#########################
noquote(do.call(rbind, lapply(as.character(unaligned_mars), `[`, 1:10)))

#########################
# Start of aligned data
#########################
noquote(as.character(aligned_mars)[1:11, 1:10])
</pre>

Q5. What are the primary differences between these two alignments, and why is only one of them suitable for phylogenetic inference?


The following code will remove any alignment sites (columns) with missing data (aka gaps or indels). It then builds basic trees from the complete and filtered alignments using two methods (ordinary least squares, <code>ols</code>, and balanced minimum evolution, <code>bme</code>):


<pre>
# Filter out sites with missing data (keep columns without any gap character)
filtered_mars <- aligned_mars[, colSums(as.character(aligned_mars) == "-") == 0]

# Make matrices of pairwise distances between taxa
dists_full <- dist.dna(aligned_mars, model = "K80", pairwise.deletion = TRUE)
dists_filt <- dist.dna(filtered_mars, model = "K80")

# Make trees for the two data sets, under two methods each
basicTrees <- list()
basicTrees$full_ols <- fastme.ols(dists_full)
basicTrees$full_bme <- fastme.bal(dists_full)
basicTrees$filt_ols <- fastme.ols(dists_filt)
basicTrees$filt_bme <- fastme.bal(dists_filt)
</pre>
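For reference, the K80 (Kimura two-parameter) distance requested with <code>model = "K80"</code> is usually written as below, where <code>P</code> and <code>Q</code> are the observed proportions of sites that differ by a transition and by a transversion, respectively. This is the standard textbook form, given as a reminder and not as a description of how the function is implemented internally:

```latex
d_{\mathrm{K80}} = -\tfrac{1}{2}\,\ln\!\left[(1 - 2P - Q)\,\sqrt{1 - 2Q}\right]
```

With pairwise deletion, <code>P</code> and <code>Q</code> are computed separately for each pair of taxa, using only the sites where neither sequence of that pair has missing data. This is why the unfiltered alignment can still yield a complete distance matrix.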

Q6. Before looking at any of the trees, what do you think are the benefits and drawbacks of removing sites with missing data?


Try plotting a few of these trees with the approaches that you used in Exercise 1.



Next, we will look at the total lengths of these trees:


<pre>
lapply(basicTrees, function(x) sum(x$edge.length))
</pre>

Q7. What do these tree lengths measure?
Q8. Why is there a difference between the filtered and unfiltered data sets?


Do not worry at this stage about the differences between the two methods, but if you have time discuss with your partner what the difference is and what it means for your interpretation of the data.


<hr>

<h2>Exercise 3</h2>


From within R, let's run IQ-TREE 3 under two different substitution models, adding statistical supports for the branches (upcoming lecture):


<pre>
# Run maximum-likelihood with a very simple model
system("iqtree3 -s marsupials_aligned.fasta -m JC -bb 1000 -pre mars_jc")

# Run maximum-likelihood with a more complex model
system("iqtree3 -s marsupials_aligned.fasta -m GTR+R6 -bb 1000 -pre mars_gtr")
</pre>


Now let's visualise the trees from the three different methods so far:


<pre>
# Read maximum likelihood trees
mars_jc <- read.tree("mars_jc.treefile")
mars_gtr <- read.tree("mars_gtr.treefile")

# Plot all four inferred trees into a PDF
pdf("marsupial_trees.pdf", height = 10, width = 10)
par(mfrow = c(2, 2))
plot(basicTrees$full_ols, type = "unrooted", main = "Ordinary Least Squares")
plot(basicTrees$full_bme, type = "unrooted", main = "Balanced Min Evolution")
plot(mars_jc, type = "unrooted", main = "Max Likelihood (JC)")
plot(mars_gtr, type = "unrooted", main = "Max Likelihood (GTR)")
dev.off()
</pre>

Q9. Do you think that these methods lead to substantially different results? Lay out a few reasons for your answer.

<hr>

<h2>Exercise 4</h2>


Using the runs from the previous exercise, let's open the <code>.iqtree</code> files from each run (you can then exit by pressing <code>q</code>) and examine some of the details of the analyses.


<pre>
# Output summary for the run with the simple model
system("less mars_jc.iqtree")

# Output summary for the run with the more complex model
system("less mars_gtr.iqtree")
</pre>

Q10. Which of the two models has more parameters (more complexity) and which model has the best BIC score (i.e. the lowest), and what does this tell you about the two models?
Q11. Does one model infer the total tree length to be much greater than the other? Discuss the possible reason for this with the student beside you.
Q12. From the GTR run and “Rate Parameter R”, which pair of nucleotides undergoes the most common type of substitution, and does this tell you anything about the biochemistry of the molecules analysed?
Q13. From the same run and the “Model of rate heterogeneity”, are there any portions of the data that evolve much faster than the rest? Note that, for example, a relative rate of 2 means a portion of the data is evolving twice as fast as the mean.
Q14. From the time stamps at the bottom of this file, did one model take much longer than the other, and what could this mean if you have a very large data set?
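
As a reminder for Q10, the Bayesian information criterion balances model fit against model complexity. In its standard form, with <code>L</code> the maximised likelihood, <code>k</code> the number of free parameters, and <code>n</code> the number of alignment sites:

```latex
\mathrm{BIC} = -2\,\ln \hat{L} + k\,\ln n
```

A lower BIC is preferred, so under this criterion an extra parameter is only worthwhile if it raises the log-likelihood by more than <code>ln(n)/2</code>.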


Now let's examine the branch supports from one of these runs, using the tree that you loaded previously into R.


<pre>
pdf("mars_branch_supports.pdf")
plot(mars_gtr, type = "unrooted", use.edge.length = F)
nodelabels(mars_gtr$node.label, frame = "circle", bg = "white")
dev.off()
</pre>

Q15. What does this tell you about our overall confidence in marsupial relationships from these data, and which are likely the most difficult relationships to resolve?

<hr>

<h2>Exercise 5</h2>


A previous set of analyses has led to the gene trees for several genomic regions. Read and briefly explore these data in R.


<pre>
mars_trs <- read.tree("marsupials.tree")

# Plot 9 randomly chosen trees from the set
pdf("mars_example_gene_trees.pdf", height = 15, width = 15)
par(mfrow = c(3, 3), mar = c(0.5, 0.5, 0.5, 0.5))
for(i in sample(1:length(mars_trs), 9)) plot(mars_trs[[i]], type = "unrooted", cex = 1.5)
dev.off()
</pre>


Examine the trees in the PDF and determine whether any of them have surprising relationships at deep branches.
Speculate on the possible causes of the discordance (hint: what could be the influence/relevance of the branch lengths?).



Let's use these trees and a fast consensus method of tree inference, and compare the tree with that from maximum likelihood.


<pre>
mars_cons <- consensus(mars_trs, p = 0.5)

pdf("mars_main_trees.pdf", height = 7, width = 14)
par(mfrow = c(1, 2))
plot(mars_gtr, type = "unrooted", main = "Max Likelihood (GTR)")
plot(mars_cons, type = "unrooted", main = "Majority-Rule Consensus")
dev.off()
</pre>

Q16. How would you qualify the signal in the gene trees regarding the early branching events in the marsupial tree, and what do you think were the biological processes that led to this signal?

<hr>

<h2>Exercise 6</h2>


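Molecular dating is a difficult and advanced analysis. However, we can sometimes rely on fast methods for very large data sets or exploratory analysis. In the following, we root our tree of the marsupials and provide it to a fast dating method. We apply two calibrations: one for the root (65–90 Mya) and one for the split between Koalas and Wombats (25–55 Mya). In the code, the calibration ages are entered in units of 10 million years (6.5–9 and 2.5–5.5), and the resulting branch lengths are then multiplied by 10 to convert them to millions of years.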


<pre>
# Root IQ-TREE inference
mars_tr <- root(mars_gtr, "Opossum", resolve.root = T)

# Perform dating analysis
ctrl <- chronos.control(dual.iter.max = 1000)
cal <- data.frame(node = c(20, 12), age.min = c(2.5, 6.5), age.max = c(5.5, 9))
mars_dated <- chronos(mars_tr, calibration = cal, control = ctrl)
mars_dated$edge.length <- mars_dated$edge.length * 10
mars_dated$root.time <- max(branching.times(mars_dated))

# Plot dating analysis
pdf("marsupials_dated.pdf", height = 10, width = 10)
geoscalePhylo(
mars_dated,
units = c("Period", "Epoch"),
boxes = "Epoch",
width = 3,
cex.age = 1.5,
cex.ts = 1.5,
cex.tip = 1.5
)
dev.off()
</pre>

Q17. What do these date inferences suggest about the diversification of marsupials with relation to the Cretaceous/Palaeogene mass extinction event, or other major geological transitions?

Q18. What forms of uncertainty are missing in this dated tree figure, and how would you consider incorporating them?

Program 2026

2026-01-09T19:01:34Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and for collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Discord.

The course has two main parts: the first half consists of lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secured laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>
</DL>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''
<DL>

<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: TBA ([[ Microbial_genomics_exercise ]]) ([[ solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Questions]]) ([[Solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([])</DD>
<DD>David Duchene</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Exercise]]) ([[Solution]])</DD>
<DD> David Duchene</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([Lecture slides])</DD>
<DD>David Duchene</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Exercises]]) ([[Solution]]) </DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Metabarcoding Exercise]]) ([[Metabarcoding Solution]])</DD>
<DD> Luke Holman</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation; prepare an outline for tomorrow. Please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation; check when your 3 minutes are scheduled in the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Ancient DNA exercise

2026-01-09T10:18:44Z

Mick:

<H2>Overview</H2>

Adapted from Martin Sikora.

First:
<OL>
<LI>Navigate to your home directory:
<LI>Create a directory called "adna"
<LI>Navigate to the directory you just created.
</OL>

We will try to:
# Authenticate ancient DNA
# Do some basic population genetics

<h2> Data authentication</h2>

Authentication involves making sure that the DNA that you have extracted from your fossil and sequenced is indeed from the fossil and not from some modern contaminant. A big difference between modern DNA and ancient DNA is the presence of chemical damage due to the passage of time.

<h3> Direct measurements of the rate of chemical damage</h3>

First, create a directory:
<pre>
mkdir 01_authentication
cd 01_authentication
</pre>

We will characterize DNA damage patterns using mapDamage, a program that quantifies nucleotide misincorporation rates along sequencing reads. In this section, we will examine some example BAM files for the presence of DNA damage patterns typical of ancient DNA.

We have a set of 10 modern and 26 ancient individuals (subsampled to 100k reads)
<pre>
find /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ -name "*bam"
</pre>

First, run mapDamage on one of the modern individuals:

<pre>
/home/ctools/apps/mapdamage2/2.2.3/venv/bin/mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/modern/NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.bam -r /home/databases/references/human/hs37d5.fa --no-stats
</pre>

Examine the output (either via mobaxterm or by downloading it locally):

<pre>
cd NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.mapDamage/
Length_plot.pdf
Fragmisincorporation_plot.pdf
cd ..
</pre>

'''Q1:''' which fragment length occurs most frequently?

'''Q2:''' what are the frequencies of 5' C>T and 3' G>A substitutions?

Run mapDamage on one of the ancient individuals
<pre>
/home/ctools/apps/mapdamage2/2.2.3/venv/bin/mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ancient/allentoft_2015/RISE559.sort.rmdup.realign.md.100k.bam -r /home/databases/references/human/hs37d5.fa --no-stats
</pre>

Examine the output (either via mobaxterm or by downloading it locally)
<pre>
cd RISE559.sort.rmdup.realign.md.100k.mapDamage/
Length_plot.pdf
Fragmisincorporation_plot.pdf
</pre>

'''Q3:''' At what fragment length does the distribution show its peak?

'''Q4:''' what are the frequencies of 5' C>T (red line) and 3' G>A substitutions (blue line)?

'''Q5:''' which bases are enriched at 5' flanking position?

'''Q6:''' does your sample look ancient? if not, what might be the reason?

<H2> Population genetics </H2>

Create a new subdirectory and navigate to it:
<pre>
cd ..
mkdir 02_popgen
cd 02_popgen
</pre>

<H3>Explore the reference panel dataset</H3>

Our reference panel dataset is in binary PLINK format, a widely used format in genetic studies (see documentation [https://www.cog-genomics.org/plink/1.9/ here]). We need to access the following files:

<pre>
ls /home/projects/22126_NGS/exercises/adna/02_popgen/plink/
</pre>

However, instead of copying them, we will create symbolic links using the ln command. These act as placeholders: they tell the operating system to behave as if the actual file were there. This saves considerable disk space compared to copying over the files.

<pre>
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bed .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bim .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.cluster .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.fam .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.sampleInfo.txt .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/eur.poplist .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/modern.poplist .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/noneur.poplist .
</pre>
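
To see what a symbolic link actually is, the following self-contained sketch builds one in a temporary directory. The file names (<code>shared/big.txt</code>) are made up for illustration and are unrelated to the course data.

```shell
# Work in a throwaway directory so nothing in your exercise folder is touched
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in for a large shared data file
mkdir shared
printf 'pretend this is a big genotype file\n' > shared/big.txt

# Create a symbolic link in the current directory pointing at the shared file
ln -s "$workdir/shared/big.txt" .

# The link is only a small pointer, but reading it returns the target's contents
ls -l big.txt
link_target=$(readlink big.txt)
link_content=$(cat big.txt)
echo "$link_target"
echo "$link_content"
```

The <code>ls -l</code> line shows an arrow from the link to its target. Deleting the link leaves the original file untouched.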

The PLINK binary format consists of 3 files:

{| class="wikitable"
| '''file'''
| '''description'''
|-
| world.bed
| genotype data in binary format ('''not to be confused with the genomic-interval BED format, even though the extension is the same''')
|-
| world.bim
| metadata for the variants, 1 line per variant
|-
| world.fam
| metadata for the samples, 1 line per sample
|}

We also have the following files that contain extra information:

{| class="wikitable"
| '''file'''
| '''description'''
|-
|world.cluster
| pre-defined population groupings for samples (for plink)
|-
| world.sampleInfo.txt
| additional sample metadata (for plotting etc)
|}

Let us explore the metadata files:

<pre>
head world.fam
head world.bim
head world.cluster
head world.sampleInfo.txt
</pre>

'''Q7:''' How many samples / SNPs are in our dataset?

'''Q8:''' what populations are in our reference panel, and what is the sample size of each? (Hint: skip the header using "tail -n+2"; you will also need "sort" and "uniq", which prints one instance of each run of repeated lines. To make "uniq" also count and print how many times each line was repeated, use "-c".)
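
The counting idiom from the hint can be tried on a toy table first. Everything in this sketch is made up (the sample names, the population labels, and the assumption that the label sits in the second column), so check the layout of the real file before adapting it.

```shell
# Made-up two-column table with a header line (not the real sample info file)
toy=$(printf 'sample\tpopulation\ns1\tPopA\ns2\tPopB\ns3\tPopA\ns4\tPopA\n')

# Skip the header, keep the population column, sort so that identical labels
# are adjacent, then let uniq -c count each run of identical lines.
# The final awk step only tidies the output into "label count".
counts=$(printf '%s\n' "$toy" | tail -n +2 | cut -f 2 | sort | uniq -c | awk '{print $2, $1}')
echo "$counts"
```

The <code>sort</code> step matters because <code>uniq</code> only collapses lines that are next to each other.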

Calculate basic summary statistics (a simple description of the data) for the dataset:

<pre>
/home/ctools/plink --bfile world --missing --out world
</pre>

'''Q9:''' are you getting the same number of variants and individuals as you did via UNIX command lines?

The world.imiss file lists the number and fraction of missing genotypes for each sample

'''Q10:''' what fraction of SNPs have a missing genotype for the Tyrolean Iceman?

<H3>Genotype and merge an ancient individual</H3>

In this section, we will merge our ancient data with the reference panel to prepare our dataset for downstream analysis. Genotypes for our ancient data will be obtained by randomly sampling a read from the alignments (BAM files) at the reference dataset SNP positions.

We are going to use a low-coverage individual (RISE507) from [https://pubmed.ncbi.nlm.nih.gov/26062507/ Allentoft et al.]. These data were obtained from an ~5100-year-old individual from the Early Bronze Age [https://en.wikipedia.org/wiki/Afanasievo_culture Afanasievo culture] in the Altai Mountains region.

<pre>
ls /home/projects/22126_NGS/exercises/adna/02_popgen/bam/
</pre>

First, we need to extract a genomic interval bed file for the SNP positions of the reference panel:
<pre>
awk '{print $1"\t"($4-1)"\t"$4}' world.bim | gzip > world.snps.bed.gz
</pre>

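awk is a small programming language for processing text files line by line. In this example, we tell it to print, for each line, the first column (the chromosome), the fourth column minus 1 (the 0-based start), and the fourth column again (the 1-based end), separated by tabs.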
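
The coordinate conversion can be checked on a toy input. The two lines below are made up and only mimic the column layout of a .bim file; they are not taken from the reference panel.

```shell
# Two made-up .bim-style lines: chromosome, variant ID, genetic distance,
# base-pair position (1-based), allele 1, allele 2
toy_bim=$(printf '1\trsA\t0\t1000\tA\tG\n2\trsB\t0\t2500\tC\tT\n')

# Same awk program as above: a SNP at 1-based position p becomes the
# BED interval with start p-1 (0-based) and end p
toy_bed=$(printf '%s\n' "$toy_bim" | awk '{print $1"\t"($4-1)"\t"$4}')
echo "$toy_bed"
```

Each output interval has a length of exactly one base, which is what a single SNP position should look like in BED coordinates.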

Inspect the results:

<pre>
zcat world.snps.bed.gz | head
</pre>

Create a read pileup file for the reference panel SNP positions (might take a few minutes)

<pre>
samtools mpileup -f /home/databases/references/human/hs37d5.fa -B -l world.snps.bed.gz /home/projects/22126_NGS/exercises/adna/02_popgen/bam/RISE507.sort.rmdup.realign.md.bam |gzip > RISE507.pileup.gz
</pre>

Examine the output:

<pre>
zcat RISE507.pileup.gz |head
</pre>

'''Q11''': how many SNPs of the reference panel are covered in RISE507?

Now we will randomly sample a DNA fragment at each position and output the results in VCF format (custom python script):
<pre>
zcat RISE507.pileup.gz | /home/ctools/Python-2.7.18/bin/python2.7 /home/projects/22126_NGS/exercises/adna/02_popgen/get_haploid_vcf_from_pileup.py -r -s RISE507 |/home/ctools/htslib-1.20/bgzip -c > RISE507.vcf.gz
</pre>
This is done because the coverage is insufficient to ensure proper genotyping.
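
The idea of random read sampling can be sketched with awk on a simplified toy input. This is only an illustration of the principle and not the course's Python script: the input format is a made-up simplification, and the seed is fixed purely to make the sketch repeatable.

```shell
# Simplified, made-up pileup: chromosome, position, and the bases observed in
# the reads covering that position (real mpileup output encodes reference
# matches as "." and "," and carries extra columns)
toy_pileup=$(printf '1\t1000\tAAG\n1\t2000\tC\n2\t500\tTTTT\n')

# For each site, draw one read base uniformly at random and report it as the
# pseudo-haploid call
calls=$(printf '%s\n' "$toy_pileup" | awk -v seed=42 '
    BEGIN { srand(seed); OFS = "\t" }
    {
        i = int(rand() * length($3)) + 1   # random index into the base string
        print $1, $2, substr($3, i, 1)
    }')
echo "$calls"
```

Sites covered by a single read, or by reads that all agree, always give the same call. Only the first toy site can come out as either of two bases.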

Let us inspect the result:
<pre>
zcat RISE507.vcf.gz |grep -v "^#" |head
</pre>

We convert to plink binary format:
<pre>
/home/ctools/plink --vcf RISE507.vcf.gz --make-bed --double-id --out RISE507
</pre>

Try to merge the sample with the reference panel
<pre>
/home/ctools/plink --bfile world --bmerge RISE507 --out RISE507.merge
</pre>

You should get an error.

'''Q12''': how many SNPs failed the merge? What is the likely reason?

We will remove the failing SNPs and try again
<pre>
/home/ctools/plink --bfile RISE507 --exclude RISE507.merge.missnp --make-bed --out RISE507.merge2
/home/ctools/plink --bfile world --bmerge RISE507.merge2 --out RISE507.world
</pre>

Make a cluster file for subsetting
<pre>
awk '{print $1,$2,$1}' RISE507.world.fam > RISE507.world.cluster
</pre>

<H3>Investigate the genetic affinities of the ancient sample using PCA</H3>

In this section, we will try to place our sample within a PCA of a set of modern and ancient individuals.

First, we will have a look at the modern populations in the reference panel:
<pre>
/home/ctools/plink --bfile RISE507.world --keep-clusters modern.poplist --within RISE507.world.cluster --pca header tabs --out modern
</pre>

We can plot the first two principal components using the custom R script plotPca.R

The three positional arguments are the eigenvector file, sample info file and prefix for the output (view the pdf either via mobaxterm or by downloading it locally):

<pre>
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R modern.eigenvec world.sampleInfo.txt modern
modern.pca.plot.pdf
</pre>

'''Q13:''' which populations are most differentiated along PC1?
'''Q14:''' which populations are most differentiated along PC2?

We repeat the exercise on a subset of European populations (view the pdf either via mobaxterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --keep-clusters eur.poplist --within RISE507.world.cluster --pca header tabs --out eur
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R eur.eigenvec world.sampleInfo.txt eur
eur.pca.plot.pdf
</pre>

'''Q15:''' which populations are most differentiated along PC1?
'''Q16:''' which populations are most differentiated along PC2?

Now, let us examine how the ancient individuals cluster compared with the modern ones (view the pdf either via mobaxterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --pca header tabs --out ancient.world
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.world.eigenvec world.sampleInfo.txt ancient.world
ancient.world.pca.plot.pdf
</pre>

Here are some references if you want to read more about the different ancient samples:

{| class="wikitable"
| '''sample'''
| '''link'''
|-
| UstIshim
| [https://en.wikipedia.org/wiki/Ust%27-Ishim_man]
|-
| Loschbour
| [https://en.wikipedia.org/wiki/Loschbour_man] [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4170574/]
|-
| Brana
| [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4269527/]
|-
| NE1
| [https://www.pnas.org/content/113/2/368]
|-
|Stuttgart
| [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4170574/]
|-
| Iceman
| [https://www.iceman.it/en/the-iceman/]
|-
|Karelia
| [https://en.wikipedia.org/wiki/Karelians]
|-
| Samara
| [https://en.wikipedia.org/wiki/Samara_culture]
|-
| MA1
| [https://en.wikipedia.org/wiki/Mal%27ta%E2%80%93Buret%27_culture]
|-
| RISE507
|[https://pubmed.ncbi.nlm.nih.gov/26062507/]
|}

'''Q17:''' which ancient individuals don't cluster close to any modern individuals? what could be a plausible reason?

Repeat the exercise but remove the non-European modern individuals (view the pdf either via mobaxterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --within RISE507.world.cluster --remove-clusters noneur.poplist --pca header tabs --out ancient.eur
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.eur.eigenvec world.sampleInfo.txt ancient.eur
ancient.eur.pca.plot.pdf
</pre>

'''Q18:''' which populations are most differentiated along PC1? what could be a plausible reason?

As a final exercise, we now project the ancient individual onto PCs inferred from modern Europeans (view the pdf either via mobaxterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --within RISE507.world.cluster --pca-clusters eur.poplist --remove-clusters noneur.poplist --pca header tabs --out ancient_proj.eur --maf 0.01
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient_proj.eur.eigenvec world.sampleInfo.txt ancient_proj.eur
ancient_proj.eur.pca.plot.pdf
</pre>

'''Q19:''' where does our study individual cluster now?

'''Q20:''' How do you explain that an individual found close to the modern-day Chinese border is genetically closer to modern Europeans than to the Han Chinese?

Please find answers [[Ancient_DNA_exercise_answers|here]]

Ancient DNA exercise

2026-01-09T10:07:16Z

Mick:

<H2>Overview</H2>

Adapted from Martin Sikora.

First:
<OL>
<LI>Navigate to your home directory:
<LI>Create a directory called "adna"
<LI>Navigate to the directory you just created.
</OL>

We will try to:
# Authenticate ancient DNA
# Do some basic population genetics

<h2> Data authentication</h2>

Authentication involves making sure that the DNA that you have extracted from your fossil and sequenced is indeed from the fossil and not from some modern contaminant. A big difference between modern DNA and ancient DNA is the presence of chemical damage due to the passage of time.

<h3> Direct measurements of the rate of chemical damage</h3>

First, create a directory:
<pre>
mkdir 01_authentication
cd 01_authentication
</pre>

We will characterize DNA damage patterns using mapDamage, a program that quantifies nucleotide misincorporation rates along sequencing reads. In this section, we will examine some example BAM files for the presence of DNA damage patterns typical of ancient DNA.

We have a set of 10 modern and 26 ancient individuals (subsampled to 100k reads)
<pre>
find /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ -name "*bam"
</pre>

First, run mapDamage on one of the modern individuals:

<pre>
/home/ctools/apps/mapdamage2/2.2.3/venv/bin/mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/modern/NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.bam -r /home/databases/references/human/hs37d5.fa --no-stats
</pre>

Examine the output (either via mobaxterm or by downloading it locally):

<pre>
cd NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.mapDamage/
Length_plot.pdf
Fragmisincorporation_plot.pdf
cd ..
</pre>

'''Q1:''' Which fragment length occurs most frequently?

'''Q2:''' What are the frequencies of 5' C>T and 3' G>A substitutions?

Run mapDamage on one of the ancient individuals:
<pre>
/home/ctools/apps/mapdamage2/2.2.3/venv/bin/mapDamage/mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ancient/allentoft_2015/RISE559.sort.rmdup.realign.md.100k.bam -r /home/databases/references/human/hs37d5.fa --no-stats
</pre>

Examine the output (either via mobaxterm or by downloading it locally):
<pre>
cd RISE559.sort.rmdup.realign.md.100k.mapDamage/
Length_plot.pdf
Fragmisincorporation_plot.pdf
cd ..
</pre>

'''Q3:''' At what fragment length does the distribution show its peak?

'''Q4:''' What are the frequencies of 5' C>T (red line) and 3' G>A substitutions (blue line)?

'''Q5:''' Which bases are enriched at the 5' flanking position?

'''Q6:''' Does your sample look ancient? If not, what might be the reason?

<H2> Population genetics </H2>

Create a new subdirectory and navigate to it:
<pre>
cd ..
mkdir 02_popgen
cd 02_popgen
</pre>

<H3>Explore the reference panel dataset</H3>

Our reference panel dataset is in binary PLINK format, a widely used format in genetic studies (see documentation [https://www.cog-genomics.org/plink/1.9/ here]). We need to access the following files:

<pre>
ls /home/projects/22126_NGS/exercises/adna/02_popgen/plink/
</pre>

However, instead of copying them, we will create symbolic links using the ln command. A symbolic link acts as a placeholder that points to the original file, so the operating system treats it as if the file were actually there. This saves considerable disk space compared to copying the files.

<pre>
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bed .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bim .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.cluster .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.fam .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.sampleInfo.txt .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/eur.poplist .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/modern.poplist .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/noneur.poplist .
</pre>
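To see what a symbolic link actually stores, here is a small sketch in a scratch directory. All file names in it are hypothetical toy names, not course files:

```shell
# Toy demonstration in a scratch directory (all file names are hypothetical)
tmp=$(mktemp -d)
mkdir "$tmp/shared" "$tmp/work"
echo "genotypes" > "$tmp/shared/toy.bed"

# Create the link; no data is copied
ln -s "$tmp/shared/toy.bed" "$tmp/work/toy.bed"

# The link itself records only the path of its target ...
link_target=$(readlink "$tmp/work/toy.bed")
# ... but reading through it returns the target's content
link_content=$(cat "$tmp/work/toy.bed")
echo "$link_target"
echo "$link_content"
rm -r "$tmp"
```

Note that a link breaks if its target is moved or deleted, which is why the links above point to the stable course directory.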

The PLINK binary format consists of 3 files:

{| class="wikitable"
| '''file'''
| '''description'''
|-
| world.bed
| genotype data in binary format ('''not to be confused with the genomic-interval BED format, even though the name is the same''')
|-
| world.bim
| metadata for the variants, 1 line per variant
|-
| world.fam
| metadata for the samples, 1 line per sample
|}

We also have the following files that contain extra information:

{| class="wikitable"
| '''file'''
| '''description'''
|-
|world.cluster
| pre-defined population groupings for samples (for plink)
|-
| world.sampleInfo.txt
| additional sample metadata (for plotting etc)
|}

Let us explore the metadata files:

<pre>
head world.fam
head world.bim
head world.cluster
head world.sampleInfo.txt
</pre>

'''Q7:''' How many samples / SNPs are in our dataset?

'''Q8:''' Which populations are in our reference panel, and what is the sample size of each? (Hint: skip the header with "tail -n+2", then use "sort" followed by "uniq -c". "uniq" prints one instance of each run of repeated lines, and "-c" tells it to also print how many times each line was repeated.)
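The counting idiom from the hint can be tried on a toy file first. This is only a sketch: the two-column layout below is an assumption, so check with "head" which column of the real file holds the population label before adapting it.

```shell
# Toy sample sheet (hypothetical columns and values): header line, then sample and population
tmp=$(mktemp -d)
printf 'sample\tpopulation\nS1\tPopA\nS2\tPopB\nS3\tPopA\nS4\tPopA\n' > "$tmp/toy.sampleInfo.txt"

# Skip the header, keep the population column, sort so identical values
# are adjacent, count each run, then print "population count"
pop_counts=$(tail -n +2 "$tmp/toy.sampleInfo.txt" | cut -f 2 | sort | uniq -c | awk '{print $2, $1}')
echo "$pop_counts"
rm -r "$tmp"
```

The "sort" step matters because "uniq" only collapses lines that are next to each other.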

Calculate basic summary statistics (a simple description of the data) for the dataset:

<pre>
/home/ctools/plink --bfile world --missing --out world
</pre>

'''Q9:''' Are you getting the same number of variants and individuals as you did via UNIX command lines?

The world.imiss file lists the number and fraction of missing genotypes for each sample.

'''Q10:''' What fraction of SNPs have a missing genotype for the Tyrolean Iceman?
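A single sample can be looked up in an .imiss-style table with awk. The sketch below uses a toy table with the PLINK 1.9 column layout (FID, IID, MISS_PHENO, N_MISS, N_GENO, F_MISS); the sample names and numbers are made up:

```shell
# Toy table in the .imiss column layout (hypothetical samples and values)
tmp=$(mktemp -d)
cat > "$tmp/toy.imiss" <<'EOF'
 FID    IID MISS_PHENO   N_MISS   N_GENO   F_MISS
PopA     S1          Y       10     1000     0.01
PopB     S2          Y      250     1000     0.25
EOF

# Print the missing-genotype fraction (6th column) for one sample, matched on the IID column
f_miss=$(awk -v id="S2" '$2 == id {print $6}' "$tmp/toy.imiss")
echo "$f_miss"
rm -r "$tmp"
```

Matching on the exact IID avoids accidental hits that a plain "grep" could produce when one sample name is contained in another.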

<H3>Genotype and merge an ancient individual</H3>

In this section, we will merge our ancient data with the reference panel to prepare our dataset for downstream analysis. Genotypes for our ancient data will be obtained by randomly sampling a read from the alignments (BAM files) at the reference dataset SNP positions.

We are going to use a low-coverage individual (RISE507) from [https://pubmed.ncbi.nlm.nih.gov/26062507/ Allentoft et al.]. This data was obtained from an ~5100-year-old individual from the Early Bronze Age [https://en.wikipedia.org/wiki/Afanasievo_culture Afanasievo culture] in the Altai Mountains region.

<pre>
ls /home/projects/22126_NGS/exercises/adna/02_popgen/bam/
</pre>

First, we need to extract a genomic interval bed file for the SNP positions of the reference panel:
<pre>
awk '{print $1"\t"($4-1)"\t"$4}' world.bim | gzip > world.snps.bed.gz
</pre>

awk is a small text-processing language. In this example, we tell it to print, for every line, the first column (chromosome), the fourth column minus 1 (the 0-based start) and the fourth column again (the end), separated by tabs.
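The same awk command can be checked on a two-line toy .bim file (made-up variants) to see how the 1-based position in column 4 becomes a 0-based, half-open BED interval:

```shell
# Toy .bim file: chromosome, variant ID, genetic distance, 1-based position, allele 1, allele 2
tmp=$(mktemp -d)
printf '1\trs1\t0\t1000\tA\tG\n2\trs2\t0\t2500\tC\tT\n' > "$tmp/toy.bim"

# BED start = position - 1 (0-based), BED end = position (exclusive end)
bed_lines=$(awk '{print $1"\t"($4-1)"\t"$4}' "$tmp/toy.bim")
echo "$bed_lines"
rm -r "$tmp"
```

Each SNP therefore becomes an interval of length 1, for example position 1000 becomes 999 to 1000.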

Inspect the results:

<pre>
zcat world.snps.bed.gz | head
</pre>

Create a read pileup file for the reference panel SNP positions (might take a few minutes):

<pre>
samtools mpileup -f /home/databases/references/human/hs37d5.fa -B -l world.snps.bed.gz /home/projects/22126_NGS/exercises/adna/02_popgen/bam/RISE507.sort.rmdup.realign.md.bam |gzip > RISE507.pileup.gz
</pre>

Examine the output:

<pre>
zcat RISE507.pileup.gz |head
</pre>

'''Q11:''' How many SNPs of the reference panel are covered in RISE507?

Now we will randomly sample a DNA fragment at each position and output the results in VCF format (custom python script):
<pre>
zcat RISE507.pileup.gz | /home/ctools/Python-2.7.18/bin/python2.7 /home/projects/22126_NGS/exercises/adna/02_popgen/get_haploid_vcf_from_pileup.py -r -s RISE507 |/home/ctools/htslib-1.20/bgzip -c > RISE507.vcf.gz
</pre>
We sample a single read per site because the coverage is too low to call diploid genotypes reliably.
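The idea of random read sampling can be sketched with awk on a toy pileup. This is not the course's Python script, only an illustration under simplifying assumptions: the toy data are made up, and indels and deletions are not handled.

```shell
# Toy pileup lines (hypothetical data): chrom, pos, ref, depth, read bases, base qualities
tmp=$(mktemp -d)
printf '1\t1000\tA\t3\t.,G\tIII\n2\t2500\tC\t2\t^F.$t\tII\n' > "$tmp/toy.pileup"

# For each site: drop read-start markers (^ plus one mapping-quality character)
# and read-end markers ($), draw one base at random, and translate "." or ","
# (match to the reference) into the reference base.
# Indels and deletions ("+", "-", "*") are NOT handled in this sketch.
sampled=$(awk -F'\t' 'BEGIN { srand(1) }
{
    bases = $5
    gsub(/\^./, "", bases)
    gsub(/\$/, "", bases)
    pick = substr(bases, int(rand() * length(bases)) + 1, 1)
    if (pick == "." || pick == ",") pick = $3
    print $1 "\t" $2 "\t" toupper(pick)
}' "$tmp/toy.pileup")
echo "$sampled"
rm -r "$tmp"
```

Each site ends up with exactly one allele, which is why the resulting calls are called pseudo-haploid: the sample looks homozygous everywhere, but no allele is favoured by uneven coverage.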

Let us inspect the result:
<pre>
zcat RISE507.vcf.gz |grep -v "^#" |head
</pre>

We convert to plink binary format:
<pre>
/home/ctools/plink --vcf RISE507.vcf.gz --make-bed --double-id --out RISE507
</pre>

Try to merge the sample with the reference panel:
<pre>
/home/ctools/plink --bfile world --bmerge RISE507 --out RISE507.merge
</pre>

You should get an error.

'''Q12:''' How many SNPs failed the merge? What is the likely reason?

We will remove the failing SNPs and try again:
<pre>
/home/ctools/plink --bfile RISE507 --exclude RISE507.merge.missnp --make-bed --out RISE507.merge2
/home/ctools/plink --bfile world --bmerge RISE507.merge2 --out RISE507.world
</pre>

Make a cluster file for subsetting:
<pre>
awk '{print $1,$2,$1}' RISE507.world.fam > RISE507.world.cluster
</pre>
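The cluster file has three columns per sample: family ID, individual ID and cluster name. The awk command simply reuses the family ID as the cluster name. A sketch on a toy .fam file, assuming (as the command above implies) that the family ID carries the population label:

```shell
# Toy .fam file (hypothetical samples): FID, IID, father, mother, sex, phenotype
tmp=$(mktemp -d)
printf 'PopA S1 0 0 1 -9\nRISE507 RISE507 0 0 0 -9\n' > "$tmp/toy.fam"

# Cluster file = FID, IID, and the FID again as the cluster name
cluster_lines=$(awk '{print $1,$2,$1}' "$tmp/toy.fam")
echo "$cluster_lines"
rm -r "$tmp"
```

Because the ancient sample was converted with --double-id, its family ID equals its sample name, so it ends up in a cluster of its own.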

<H3>Investigate the genetic affinities of the ancient sample using PCA</H3>

In this section, we will try to place our sample within a PCA of a set of modern and ancient individuals.

First, we will have a look at the modern populations in the reference panel:
<pre>
/home/ctools/plink --bfile RISE507.world --keep-clusters modern.poplist --within RISE507.world.cluster --pca header tabs --out modern
</pre>

We can plot the first two principal components using the custom R script plotPca.R.

The three positional arguments are the eigenvector file, sample info file and prefix for the output (view the pdf either via mobaxterm or by downloading it locally):

<pre>
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R modern.eigenvec world.sampleInfo.txt modern
modern.pca.plot.pdf
</pre>

'''Q13:''' Which populations are most differentiated along PC1?

'''Q14:''' Which populations are most differentiated along PC2?

We repeat the exercise on a subset of European populations (view the pdf either via mobaxterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --keep-clusters eur.poplist --within RISE507.world.cluster --pca header tabs --out eur
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R eur.eigenvec world.sampleInfo.txt eur
eur.pca.plot.pdf
</pre>

'''Q15:''' Which populations are most differentiated along PC1?

'''Q16:''' Which populations are most differentiated along PC2?

Now, let us examine how the ancient individuals cluster compared to the modern ones (view the pdf either via mobaxterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --pca header tabs --out ancient.world
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.world.eigenvec world.sampleInfo.txt ancient.world
ancient.world.pca.plot.pdf
</pre>

Here are some references if you want to read more about the different ancient samples:

{| class="wikitable"
| '''sample'''
| '''link'''
|-
| UstIshim
| [https://en.wikipedia.org/wiki/Ust%27-Ishim_man]
|-
| Loschbour
| [https://en.wikipedia.org/wiki/Loschbour_man] [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4170574/]
|-
| Brana
| [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4269527/]
|-
| NE1
| [https://www.pnas.org/content/113/2/368]
|-
|Stuttgart
| [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4170574/]
|-
| Iceman
| [https://www.iceman.it/en/the-iceman/]
|-
|Karelia
| [https://en.wikipedia.org/wiki/Karelians]
|-
| Samara
| [https://en.wikipedia.org/wiki/Samara_culture]
|-
| MA1
| [https://en.wikipedia.org/wiki/Mal%27ta%E2%80%93Buret%27_culture]
|-
| RISE507
|[https://pubmed.ncbi.nlm.nih.gov/26062507/]
|}

'''Q17:''' Which ancient individuals don't cluster close to any modern individuals? What could be a plausible reason?

Repeat the exercise but remove the non-European modern individuals (view the pdf either via mobaxterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --within RISE507.world.cluster --remove-clusters noneur.poplist --pca header tabs --out ancient.eur
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.eur.eigenvec world.sampleInfo.txt ancient.eur
ancient.eur.pca.plot.pdf
</pre>

'''Q18:''' Which populations are most differentiated along PC1? What could be a plausible reason?

As a final exercise, we now project the ancient individual onto PCs inferred from modern Europeans (view the pdf either via mobaxterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --within RISE507.world.cluster --pca-clusters eur.poplist --remove-clusters noneur.poplist --pca header tabs --out ancient_proj.eur --maf 0.01
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient_proj.eur.eigenvec world.sampleInfo.txt ancient_proj.eur
ancient_proj.eur.pca.plot.pdf
</pre>

'''Q19:''' Where does our study individual cluster now?

'''Q20:''' How do you explain that an individual that is found closer to the modern-day Chinese border is closer to modern Europeans than he is to the Han Chinese?

Please find answers [[Ancient_DNA_exercise_answers|here]]

Program 2026

2026-01-09T08:48:36Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and for collaboration with your project partners. Rather than emailing questions to the teaching staff, please post them on Discord.

The course has two main parts: the first half is lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secured laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: TBA ([[ Microbial_genomics_exercise ]]) ([[ solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Questions]]) ([[Solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([])</DD>
<DD>David Duchene</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Exercise]]) ([[Solution]])</DD>
<DD> David Duchene</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([Lecture slides])</DD>
<DD>David Duchene</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Exercises]]) ([[Solution]]) </DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Metabarcoding Exercise]]) ([[Metabarcoding Solution]])</DD>
<DD> Luke Holman</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2024/test_2024.pdf Test 2025])([https://teaching.healthtech.dtu.dk/material/22126/2024/test_2024_withA.pdf answers])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation: prepare an outline for tomorrow. Please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation: check the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet] to see when your 3 minutes are</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Exercise and answers

2026-01-09T08:09:07Z

Mick:

<h2>Overview</h2>


In this mini-workshop you will familiarize yourself with TADbit
([https://pubmed.ncbi.nlm.nih.gov/28723903/ Serra et al., 2017]):
from FASTQ files to contact matrix and beyond.



A Primer into 3D Genomics: A Mini-Workshop 
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 
9 January 2026, DTU


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index reference genome</li>
<li>Use TADbit to:
<ol>
<li>Map reads to reference genome (<code>map</code>)</li>
<li>Get intersection (<code>parse</code>)</li>
<li>Filter reads (<code>filter</code>)</li>
<li>Normalize (<code>normalize</code>)</li>
<li>Generate matrices (<code>bin</code>)</li>
<li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
</ol>
</li>
</ol>

<hr>

<h2>Slurm / sbatch on the cluster</h2>


The cluster uses Slurm to schedule jobs. Instead of running long commands directly in the terminal,
you write a small script (an <code>.sbatch</code> file) that requests resources (time, CPUs, memory) and runs your commands.
Slurm decides when and where your job runs so users don’t interfere with each other.



After you submit a job you can safely disconnect without your job crashing (similar idea to <code>screen</code>).
Slurm writes standard output and errors to log files, so you can always see what happened.


<ul>
<li><code>sbatch myjob.sbatch</code> — submit a job (returns a <code><job_id></code>)</li>
<li><code>squeue -u $USER</code> — check your jobs</li>
<li><code>scancel <job_id></code> — cancel a job</li>
</ul>
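For orientation, a job script is an ordinary shell script whose <code>#SBATCH</code> comment lines carry the resource requests. The sketch below writes a minimal one to a scratch directory. The resource values and file names are hypothetical and the course template may differ; <code>%x</code> and <code>%j</code> are Slurm's filename patterns for job name and job ID.

```shell
# Write a minimal job script (hypothetical resource values; the course template may differ)
tmp=$(mktemp -d)
cat > "$tmp/myjob.sbatch" <<'EOF'
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=log/%x_%j.out
#SBATCH --error=log/%x_%j.err

echo "Running on $(hostname)"
EOF

# Count the resource-request lines that Slurm will read from the script
n_directives=$(grep -c '^#SBATCH' "$tmp/myjob.sbatch")
echo "$n_directives"
rm -r "$tmp"
```

Slurm does not create the log directory for you, so a path such as <code>log/</code> has to exist before the job is submitted.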

<hr>

<h2>Setup conda environment to run TADbit later</h2>
You may be notified that something is missing; if so, just accept.

<pre>
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh
</pre>


You should get (as the only output) the help from the program — this means the environment is up and running.



Make yourself familiar with the directory structure. Inside
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:


<ul>
<li><code>fastq</code> – raw data</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.



Remember to activate the tadbit conda environment.


<pre>
# Move to your home
cd;

# Activate TADbit environment
eval "$(/home/ctools/miniconda3/bin/conda shell.bash hook)"
conda activate "$HOME/envs/tadbit_course"

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
mkdir -p fastq/clean
</pre>

<h3>Putting things into an SBATCH script</h3>


A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:


<pre>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch \
/home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 00_index.sbatch
</pre>


Open the new copy with your favourite editor (for example: <code>emacs 00_index.sbatch</code>)
and paste the following at the bottom of the file after the existing <code>#SBATCH ...</code> lines.


<pre>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE
cd ${data_dir}

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 \
-i refGenome/GCF_000002315.6_GRCg6a_genomic.fna \
-o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
</pre>


If you were to run this, you would submit the job with:


<pre>
sbatch 00_index.sbatch
</pre>

⚠️ NO NEED TO RUN THIS. WE WILL GENERATE A SYMBOLIC LINK INSTEAD.


We make a symlink to the reference genome in our folder so that we do not have to copy it:


<pre>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem \
/home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna \
/home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
</pre>

<hr>

<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>


While the indexing job would be running, start looking at the data and pre-process it.
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.



Create <code>01_fastp.sbatch</code> from the template:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 01_fastp.sbatch
</pre>


Open <code>01_fastp.sbatch</code> and paste the following at the bottom (after the <code>#SBATCH</code> lines):


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Launch fastp: raw reads are read from course folder, clean fastqs are stored in your folder.
# Enable adapter detection; trim the first 5 bases (often lower quality).
# Use 4 threads and minimum read length 30 (remove reads shorter than this after trimming).
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 4 \
-l 30 \
-h ${sample}.html
</pre>


Submit the script to the queue:


<pre>
sbatch 01_fastp.sbatch
</pre>


When it finishes, copy the HTML report to your local computer and open it in a browser
(replace <code>SERVER</code> and <code>USER</code> as needed):


<pre>
scp USER@SERVER:/home/people/USER/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


⏰ It should take ~1 min to complete.


Question: Check the HTML report. What percentage of reads are kept?
Answer: It should be about 96.4%. There is no substantial adapter content and no excess of low-quality sequences. After mapping we will inspect ligation/digestion patterns in more detail.

<hr>

<h2>Mapping to the reference genome</h2>


TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.



Create <code>02_map.sbatch</code> from the template and add the commands below (paste at the bottom after <code>#SBATCH</code> lines).


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 02_map.sbatch
</pre>

<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"
wd="tadbit_dirs/${sample}"
mkdir -p ${wd}

# Two enzymes used in this experiment (double digestion)
enz="MboI HinfI"

# Map read 1
rd=1
tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--renz ${enz} \
-C 6

# Map read 2
rd=2
tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--renz ${enz} \
-C 6
</pre>


Submit the job:


<pre>
sbatch 02_map.sbatch
</pre>


If you want to inspect the job output, check the log files in <code>SCRIPTS/log</code> (or the directory where you submitted the job,
depending on the template). TADbit also produces PNG plots inside <code>tadbit_dirs/liver/</code>.



After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.


Question: How may restriction enzyme choice influence the experiment? ✂️
Answer: Cutting frequency differs between 4-cutters and 6-cutters and influences fragment size distribution, ligation probabilities, and contact resolution. Using two enzymes increases the diversity of ligation junctions. Compare with Micro-C, which uses MNase digestion, so it cuts evenly through the genome.
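The difference in cutting frequency follows from simple arithmetic: a recognition site with n fixed bases is expected once every 4^n bp, assuming the four bases are equally frequent and independent (a simplification that real genomes do not meet exactly). A quick sketch:

```shell
# Expected spacing between cut sites for a recognition site with n fixed bases,
# assuming equal and independent base frequencies (a simplification): 4^n
four_cutter=$(awk 'BEGIN { print 4^4 }')
six_cutter=$(awk 'BEGIN { print 4^6 }')
echo "4-cutter: one site every ~${four_cutter} bp"
echo "6-cutter: one site every ~${six_cutter} bp"
```

Both enzymes used here have four fixed bases in their recognition sites (MboI: GATC; HinfI: GANTC, where N is any base), so each is expected to cut frequently, and combining them shortens the fragments further.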


Useful background figures (open locally):


<ul>
<li>[https://teaching.healthtech.dtu.dk/material/22126/2026/Fragment_histogram.pdf Fragment size histogram]</li>
<li>[https://teaching.healthtech.dtu.dk/material/22126/2026/ligation_deconvolution.png Hi-C sequencing quality: digestion/ligation deconvolution ]</li>
</ul>

<hr>

<h2>Finding the intersection of mapped reads (parse)</h2>


Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end).
We identify pairs and build fragment associations with <code>tadbit parse</code>.



Note on runtime: this step can be quick or can take longer depending on load and filesystem performance.
Try to get it launched, then continue reading / looking at the provided plots while it runs (this is a good time for a break).


⚠️ Note: The pattern given to <code>--filter_chrom</code> is matched against the sequence names in the reference genome FASTA file, so check those names beforehand.
Only chromosomes whose names start with the string in <code>--filter_chrom</code> are matched.



Create <code>03_parse.sbatch</code> from the template and (important!) update the template to request 10 CPUs
(e.g. change <code>#SBATCH -c 4</code> to <code>#SBATCH -c 10</code>), then paste the commands below at the bottom.


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 03_parse.sbatch
</pre>

<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
ref="/home/people/$USER/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/${sample}"

tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input
</pre>


Submit the job:


<pre>
sbatch 03_parse.sbatch
</pre>

Question: Is it possible to retrieve multiple contacting regions?
Answer: Consider complex ligation products (read pairs mapping to different fragments in the same molecule, i.e., multiple contacts) and multi-mapping artifacts; TADbit focuses on valid pairs as operationally defined by the filters. Multi-contact methods (e.g., Pore-C, SPRITE) address this explicitly, but standard Hi-C largely models binary contacts per ligation event. Such reads can be inspected in the BAM file in the next step.

<hr>

<h2>Filtering interactions</h2>


TADbit allows flexible filtering of unwanted interactions. In many cases, the defaults work well across datasets.
To see what the filter numbers mean, check:


<pre>
tadbit filter --help
</pre>


Run filtering (you can run in an sbatch script or interactively):


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/${sample}"

tadbit filter \
--workdir ${wd} \
--apply 1 2 3 4 6 7 8 9 10 \
--cpus 6
</pre>

<hr>

<h2>Check the amount of filtered data and past commands</h2>


<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
cd tadbit_dirs/$sample

tadbit describe . | less
</pre>

Question: How many valid pairs do we keep?
Answer: Check the “valid pairs” section of <code>tadbit describe</code> after filtering to get the exact count and percentage.

Question: The total number of filtered reads is not equal to the initial number of reads… Why?
Answer: Because a read pair can be assigned to more than one category (e.g., a dangling end that is also a duplicate). Categories are not mutually exclusive, so percentages can overlap.

<hr>

<h2>To normalize or to not normalize</h2>


The filter step only catalogues the reads into categories; nothing has actually been removed yet.
It is during normalization that we specify which categories to include or exclude, so that the normalization is performed accordingly.



Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.



Important: Bad columns (low counts, low mappability, etc.) are removed from the matrix during normalization.



Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
A simple and commonly used option is to filter based on a minimum number of counts per bin.
If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.



Create an sbatch script (e.g. <code>04_norm_vanilla.sbatch</code>) if you like, then paste the commands below at the bottom.


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 04_norm_vanilla.sbatch
</pre>

<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

sample="liver"
wd="tadbit_dirs/${sample}"

# Define the resolution
res="100000" # 100 kb

# Normalization method
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100

tadbit normalize \
-w ${wd} \
-r ${res} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
</pre>


Submit it:


<pre>
sbatch 04_norm_vanilla.sbatch
</pre>


Use <code>tadbit describe</code> to check how many bins were removed.
A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.



Optional reading on normalization strategies:



[https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105 Overview article on Hi-C normalization strategies]


<hr>

<h2>Binning and viewing matrices</h2>


Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.
This step can take a while, so using an sbatch script is recommended.


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 05_bin_chr1.sbatch
</pre>

<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/${sample}"
res="100000"
chrom="chr1"
norm="Vanilla"

tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm ${norm} \
--format "png" \
--cpus 6
</pre>


Submit it:


<pre>
sbatch 05_bin_chr1.sbatch
</pre>


Example matrices (open locally):


<ul>
<li>[https://teaching.healthtech.dtu.dk/material/22126/2026/Raw_HiC.png Raw Hi-C matrix] </li>
<li>[https://teaching.healthtech.dtu.dk/material/22126/2026/Normal_HiC.png Normalized Hi-C matrix] </li>
</ul>

Congratulations, you finished the exercise!

Exercise and answers

2026-01-09T08:02:44Z

Mick:

<h2>Overview</h2>


In this mini-workshop you will familiarize yourself with TADbit
(<a href="https://pubmed.ncbi.nlm.nih.gov/28723903/" target="_blank" rel="noopener">Serra et al., 2017</a>):
from FASTQ files to contact matrix and beyond.



A Primer into 3D Genomics: A Mini-Workshop 
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 
9 January 2026, DTU


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index reference genome</li>
<li>Use TADbit to:
<ol>
<li>Map reads to reference genome (<code>map</code>)</li>
<li>Get intersection (<code>parse</code>)</li>
<li>Filter reads (<code>filter</code>)</li>
<li>Normalize (<code>normalize</code>)</li>
<li>Generate matrices (<code>bin</code>)</li>
<li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
</ol>
</li>
</ol>

<hr>

<h2>Slurm / sbatch on the cluster</h2>


The cluster uses Slurm to schedule jobs. Instead of running long commands directly in the terminal,
you write a small script (an <code>.sbatch</code> file) that requests resources (time, CPUs, memory) and runs your commands.
Slurm decides when and where your job runs so users don’t interfere with each other.



After you submit a job you can safely disconnect without your job crashing (similar idea to <code>screen</code>).
Slurm writes standard output and errors to log files, so you can always see what happened.


<ul>
<li><code>sbatch myjob.sbatch</code>: submit a job (returns a <code>&lt;job_id&gt;</code>)</li>
<li><code>squeue -u $USER</code>: check your jobs</li>
<li><code>scancel &lt;job_id&gt;</code>: cancel a job</li>
</ul>
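
A minimal sketch of what an <code>.sbatch</code> file can look like. The job name, resource values, and log path below are assumptions for illustration only; the course <code>template.sbatch</code> may request different resources.

```shell
# Write a hypothetical job script; the directive values are illustrative only.
write_example_job() {
    local out="$1"
    cat > "$out" <<'EOF'
#!/bin/bash
#SBATCH --job-name=example
#SBATCH -c 4
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --output=log/%x_%j.out

# Commands to run go below the #SBATCH lines
echo "Running on $(hostname)"
EOF
}

tmp=$(mktemp -d)
write_example_job "$tmp/example.sbatch"
# Count the resource directives in the generated script
grep -c '^#SBATCH' "$tmp/example.sbatch"
```

In the <code>--output</code> pattern, <code>%x</code> expands to the job name and <code>%j</code> to the job ID, so each run gets its own log file.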

<hr>

<h2>Setup conda environment to run TADbit later</h2>
You may be notified that something is missing; if so, just accept.

<pre>
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh
</pre>


You should get (as the only output) the help from the program — this means the environment is up and running.



Make yourself familiar with the directory structure. Inside
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:


<ul>
<li><code>fastq</code> – raw data</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before analyzing Hi-C data with TADbit, index the reference genome that the GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.



Remember to activate the tadbit conda environment.


<pre>
# Move to your home
cd;

# Activate TADbit environment
eval "$(/home/ctools/miniconda3/bin/conda shell.bash hook)"
conda activate "$HOME/envs/tadbit_course"

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
mkdir -p fastq/clean
</pre>

Putting things into an SBATCH script


A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:


<pre>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch \
/home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 00_index.sbatch
</pre>


Open the new copy with your favourite editor (for example: <code>emacs 00_index.sbatch</code>)
and paste the following at the bottom of the file after the existing <code>#SBATCH ...</code> lines.


<pre>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE
cd ${data_dir}

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 \
-i refGenome/GCF_000002315.6_GRCg6a_genomic.fna \
-o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
</pre>


If you were to run this, you would submit the job with:


<pre>
sbatch 00_index.sbatch
</pre>

⚠️ NO NEED TO RUN THIS. WE WILL GENERATE A SYMBOLIC LINK INSTEAD.


We make a symlink to the reference genome in our folder so that we do not have to copy it:


<pre>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem \
/home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna \
/home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
</pre>
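
A quick way to confirm that a symlink resolves to an existing file is sketched below. It uses a throwaway directory and hypothetical file names rather than the course paths.

```shell
check_link() {
    # Succeeds only if $1 is a symlink whose target exists
    [ -L "$1" ] && [ -e "$1" ]
}

tmp=$(mktemp -d)
touch "$tmp/genome.fna"
ln -s "$tmp/genome.fna" "$tmp/link.fna"       # valid link
ln -s "$tmp/missing.gem" "$tmp/broken.gem"    # dangling link

check_link "$tmp/link.fna"   && echo "link.fna OK"
check_link "$tmp/broken.gem" || echo "broken.gem is dangling"
```

The same function can be pointed at the two links in your own <code>refGenome</code> folder.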

<hr>

<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>


While an indexing job would be running, you can start looking at the data and pre-process it.
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.



Create <code>01_fastp.sbatch</code> from the template:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 01_fastp.sbatch
</pre>


Open <code>01_fastp.sbatch</code> and paste the following at the bottom (after the <code>#SBATCH</code> lines):


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Launch fastp: raw reads are read from course folder, clean fastqs are stored in your folder.
# Enable adapter detection; trim the first 5 bases (often lower quality).
# Use 4 threads and minimum read length 30 (remove reads shorter than this after trimming).
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 4 \
-l 30 \
-h ${sample}.html
</pre>


Submit the script to the queue:


<pre>
sbatch 01_fastp.sbatch
</pre>


When it finishes, copy the HTML report to your local computer and open it in a browser
(replace <code>SERVER</code> and <code>USER</code> as needed):


<pre>
scp USER@SERVER:/home/people/USER/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


⏰ It should take ~1 min to complete.


Question: Check the HTML report. What percentage of reads are kept?
Answer: About 96.4%. The report shows no massive adapter content or low-quality sequences. After mapping we will inspect ligation/digestion patterns in more detail.

<hr>

<h2>Mapping to the reference genome</h2>


TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.



Create <code>02_map.sbatch</code> from the template and add the commands below (paste at the bottom after <code>#SBATCH</code> lines).


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 02_map.sbatch
</pre>

<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"
wd="tadbit_dirs/${sample}"
mkdir -p ${wd}

# Two enzymes used in this experiment (double digestion)
enz="MboI HinfI"

# Map read 1
rd=1
tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--renz ${enz} \
-C 6

# Map read 2
rd=2
tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--renz ${enz} \
-C 6
</pre>


Submit the job:


<pre>
sbatch 02_map.sbatch
</pre>


If you want to inspect the job output, check the log files in <code>SCRIPTS/log</code> (or the directory where you submitted the job,
depending on the template). TADbit also produces PNG plots inside <code>tadbit_dirs/liver/</code>.



After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.


Question: How may restriction enzyme choice influence the experiment? ✂️
Answer: Cutting frequency differs between 4-cutters and 6-cutters and influences the fragment size distribution, ligation probabilities, and contact resolution. Using two enzymes increases the diversity of ligation junctions. Compare with Micro-C, which uses MNase digestion and is therefore not tied to restriction sites, so it fragments the genome more evenly.


Useful background figures (open locally):


<ul>
<li><a href="https://teaching.healthtech.dtu.dk/material/22126/2026/Fragment_histogram.pdf" target="_blank" rel="noopener">Fragment size histogram (PDF)</a></li>
<li><a href="https://teaching.healthtech.dtu.dk/material/22126/2026/ligation_deconvolution.png" target="_blank" rel="noopener">Hi-C sequencing quality: digestion/ligation deconvolution (PNG)</a></li>
</ul>

<hr>

<h2>Finding the intersection of mapped reads (parse)</h2>


The two mates of a Hi-C pair come from the same ligation product, which normally joins two different restriction fragments (in a dangling end, both mates fall on the same unligated fragment).
We identify pairs and build fragment associations with <code>tadbit parse</code>.



Note on runtime: this step can be quick or can take longer depending on load and filesystem performance.
Try to get it launched, then continue reading / looking at the provided plots while it runs (this is a good time for a break).


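⚠️ Note: <code>--filter_chrom</code> is matched against the sequence names in the reference genome FASTA file, so the chromosome prefix must already be present there.
Only chromosomes whose names match the pattern given to <code>--filter_chrom</code> (here, names starting with <code>chr</code>) are kept.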
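
To preview which sequence names a pattern such as <code>chr.*</code> would keep, you can list the FASTA headers and test them with <code>grep</code>. This is only an approximation of how TADbit applies the pattern, and the toy FASTA file below is hypothetical.

```shell
matching_chroms() {
    # $1 = FASTA file, $2 = extended regular expression
    # Take header lines, keep the first word after '>', test it against the pattern
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*//' | grep -E "^($2)\$"
}

tmp=$(mktemp -d)
printf '>chr1 toy\nACGT\n>chrZ toy\nACGT\n>NW_0001 unplaced\nACGT\n' > "$tmp/toy.fna"
matching_chroms "$tmp/toy.fna" 'chr.*'
```

Sequences whose names do not appear in the output (here the unplaced scaffold) would be left out.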



Create <code>03_parse.sbatch</code> from the template and (important!) update the template to request 10 CPUs
(e.g. change <code>#SBATCH -c 4</code> to <code>#SBATCH -c 10</code>), then paste the commands below at the bottom.


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 03_parse.sbatch
</pre>
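
The CPU request can also be changed from the command line instead of in an editor. A minimal sketch, assuming the template contains a line of the form <code>#SBATCH -c 4</code> as described above; it is run on a throwaway file here.

```shell
set_cpus() {
    # $1 = sbatch file, $2 = number of CPUs to request
    # Rewrite the '#SBATCH -c N' line; a .bak copy of the original is kept
    sed -i.bak -E "s/^#SBATCH -c [0-9]+\$/#SBATCH -c $2/" "$1"
}

tmp=$(mktemp -d)
printf '#!/bin/bash\n#SBATCH -c 4\necho hello\n' > "$tmp/toy.sbatch"
set_cpus "$tmp/toy.sbatch" 10
grep '^#SBATCH -c' "$tmp/toy.sbatch"
```

If your template writes the request differently (for example <code>--cpus-per-task</code>), the pattern has to be adapted.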

<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
ref="/home/people/$USER/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/${sample}"

tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input
</pre>


Submit the job:


<pre>
sbatch 03_parse.sbatch
</pre>

Question: Is it possible to retrieve multiple contacting regions?
Answer: Consider complex ligation products (read pairs mapping to different fragments in the same molecule, i.e., multiple contacts) and multi-mapping artifacts; TADbit focuses on valid pairs as operationally defined by the filters. Multi-contact methods (e.g., Pore-C, SPRITE) address this explicitly, but standard Hi-C largely models binary contacts per ligation event. Such reads can be inspected in the BAM file in the next step.

<hr>

<h2>Filtering interactions</h2>


TADbit allows flexible filtering of unwanted interactions. In many cases, the defaults work well across datasets.
To see what the filter numbers mean, check:


<pre>
tadbit filter --help
</pre>


Run filtering (you can run in an sbatch script or interactively):


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/${sample}"

tadbit filter \
--workdir ${wd} \
--apply 1 2 3 4 6 7 8 9 10 \
--cpus 6
</pre>

<hr>

<h2>Check the amount of filtered data and past commands</h2>


<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
cd tadbit_dirs/$sample

tadbit describe . | less
</pre>

Question: How many valid pairs do we keep?
Answer: Check the “valid pairs” section of <code>tadbit describe</code> after filtering to get the exact count and percentage.

Question: The total number of filtered reads is not equal to the initial number of reads… Why?
Answer: Because a read pair can be assigned to more than one category (e.g., a dangling end that is also a duplicate). Categories are not mutually exclusive, so percentages can overlap.
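
The overlap can be illustrated with a toy table in which each read pair lists the categories it was assigned to. The pair IDs and category labels are made up and do not reproduce the <code>tadbit describe</code> output format.

```shell
tmp=$(mktemp -d)
# Hypothetical toy table: read-pair id <tab> comma-separated categories
printf 'pair1\tdangling-end,duplicated\npair2\tduplicated\npair3\tvalid\npair4\tdangling-end\n' \
    > "$tmp/pairs.tsv"

summarise_categories() {
    # Print the number of pairs, then the sum of all per-category assignments
    awk -F'\t' '{ n++; total += split($2, cats, ",") }
                END { print n, total }' "$1"
}

summarise_categories "$tmp/pairs.tsv"
```

Four pairs produce five category assignments because <code>pair1</code> is counted twice, which is why the per-category totals do not add up to the number of reads.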

<hr>

<h2>To normalize or to not normalize</h2>


The filter step only catalogues the reads into categories; nothing has actually been removed yet.
It is during normalization that we specify which categories to include or exclude, so that the normalization is performed accordingly.



Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.
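
The idea of a per-bin bias vector can be sketched on a tiny matrix. The example uses one common formulation of coverage ("vanilla") normalization, where the bias of a bin is its row sum divided by the mean row sum and each cell is divided by the product of its two biases. This is an assumption for illustration; TADbit's own implementation may differ in detail.

```shell
vanilla_normalise() {
    # $1 = file holding a symmetric matrix of raw counts (space separated)
    # A bin with a row sum of zero would give a zero bias, which is one
    # reason why bad columns are removed before normalizing.
    awk '{
            for (j = 1; j <= NF; j++) { m[NR, j] = $j; rowsum[NR] += $j }
            n = NR; total += rowsum[NR]
         }
         END {
            mean = total / n
            for (i = 1; i <= n; i++) bias[i] = rowsum[i] / mean   # one bias per bin
            for (i = 1; i <= n; i++) {
                line = ""
                for (j = 1; j <= n; j++)
                    line = line sprintf("%s%.3f", (j > 1 ? " " : ""), m[i, j] / (bias[i] * bias[j]))
                print line
            }
         }' "$1"
}

tmp=$(mktemp -d)
printf '2 2 2\n2 1 0\n2 0 1\n' > "$tmp/raw.txt"   # toy 3-bin matrix
vanilla_normalise "$tmp/raw.txt"
```

In this toy matrix the first bin has twice the coverage of the other two, so its counts are scaled down while the counts of the low-coverage bins are scaled up.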



Important: Bad columns (low counts, low mappability, etc.) are removed from the matrix during normalization.



Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
A simple and commonly used option is to filter based on a minimum number of counts per bin.
If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.



Create an sbatch script (e.g. <code>04_norm_vanilla.sbatch</code>) if you like, then paste the commands below at the bottom.


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 04_norm_vanilla.sbatch
</pre>

<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

sample="liver"
wd="tadbit_dirs/${sample}"

# Define the resolution
res="100000" # 100 kb

# Normalization method
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100

tadbit normalize \
-w ${wd} \
-r ${res} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
</pre>


Submit it:


<pre>
sbatch 04_norm_vanilla.sbatch
</pre>


Use <code>tadbit describe</code> to check how many bins were removed.
A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.



Optional reading on normalization strategies:



<a href="https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105" target="_blank" rel="noopener">
Overview article on Hi-C normalization strategies
</a>


<hr>

<h2>Binning and viewing matrices</h2>


Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.
This step can take a while, so using an sbatch script is recommended.


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 05_bin_chr1.sbatch
</pre>

<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/${sample}"
res="100000"
chrom="chr1"
norm="Vanilla"

tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm ${norm} \
--format "png" \
--cpus 6
</pre>


Submit it:


<pre>
sbatch 05_bin_chr1.sbatch
</pre>


Example matrices (open locally):


<ul>
<li><a href="https://teaching.healthtech.dtu.dk/material/22126/2026/Raw_HiC.png" target="_blank" rel="noopener">Raw Hi-C matrix (PNG)</a></li>
<li><a href="https://teaching.healthtech.dtu.dk/material/22126/2026/Normal_HiC.png" target="_blank" rel="noopener">Normalized Hi-C matrix (PNG)</a></li>
</ul>

Congratulations, you finished the exercise!

Denovo exercise

2026-01-08T10:26:24Z

Mick:

<h2>Overview</h2>

First:
<ol>
<li>Navigate to your home directory.</li>
<li>Create a directory called <code>denovo</code>.</li>
<li>Navigate to the directory you just created.</li>
</ol>

In this exercise we will perform a de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will:

<ol>
<li>Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).</li>
<li>Count k-mers and estimate genome size.</li>
<li>Correct reads using Musket.</li>
<li>Determine insert size of paired-end reads.</li>
<li>Run de novo assembly using MEGAHIT.</li>
<li>Calculate assembly statistics.</li>
<li>Plot coverage and length histograms of the assembly.</li>
<li>Evaluate the assembly quality.</li>
<li>Visualize the assembly using Circoletto.</li>
<li>(Bonus) Try assembling the genome with SPAdes.</li>
<li>Annotation of a prokaryotic genome.</li>
</ol>

<hr>

<h3>FastQC and trimming</h3>

Make sure you are in the <code>denovo</code> directory you created. You can double-check with:

<pre>
pwd
</pre>

Copy the sequencing data:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/* .
</pre>

Run FastQC on the reads:

<pre>
mkdir fastqc
/home/ctools/FastQC/fastqc -o fastqc *.txt.gz
</pre>

Viewing FastQC HTML reports:

If you are using MobaXterm, you can open the FastQC HTML files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/fastqc/Vchol-001_6_1_sequence_fastqc.html .
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/fastqc/Vchol-001_6_2_sequence_fastqc.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:

<pre>
/home/ctools/bin/fastx_detect_fq.sh Vchol-001_6_1_sequence.txt.gz
</pre>

Q1. Which quality encoding format is used?

Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The <code>--basename</code> option defines the output prefix and <code>--gzip</code> compresses the output.

<pre>
/home/ctools/adapterremoval-2.3.4/build/AdapterRemoval \
--file1 Vchol-001_6_1_sequence.txt.gz \
--file2 Vchol-001_6_2_sequence.txt.gz \
--adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \
--adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \
--qualitybase 64 \
--basename Vchol-001_6 \
--gzip \
--trimqualities \
--minquality 20 \
--minlength 40
</pre>

When it finishes, inspect <code>Vchol-001_6.settings</code> for trimming statistics (how many reads were trimmed, discarded, etc.).

Q1A. The output includes <code>discarded.gz</code>, <code>pair1.truncated.gz</code>, <code>pair2.truncated.gz</code>, and <code>singleton.truncated.gz</code>. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)

Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:

<pre>
/home/ctools/bin/fastx_readlength.sh Vchol-001_6.pair1.truncated.gz
/home/ctools/bin/fastx_readlength.sh Vchol-001_6.pair2.truncated.gz
</pre>

<hr>

<h3>Genome size estimation</h3>

We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use jellyfish to count 15-mers. We combine counts from forward and reverse-complement strands and then create a histogram. This may take some time to run, so it could be a good opportunity to practice using <code>screen</code>.

Manual: [http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.html jellyfish]

<pre>
gzip -dc Vchol-001_6.pair*.truncated.gz \
| /home/ctools/jellyfish-2.3.1/bin/jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0

/home/ctools/jellyfish-2.3.1/bin/jellyfish histo Vchol-001 > Vchol-001.histo
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
dat <- read.table("Vchol-001.histo")

pdf("Vchol-001.histo.pdf")
barplot(dat[,2],
xlim = c(0,150),
ylim = c(0,5e5),
ylab = "No of kmers",
xlab = "Counts of a k-mer",
names.arg = dat[,1],
cex.names = 0.8)
dev.off()
</pre>

If you are using MobaXterm, you can open the pdf files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/Vchol-001.histo.pdf .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

The plot shows:
<ul>
<li>x-axis: how many times a k-mer occurs (its count)</li>
<li>y-axis: number of distinct k-mers with that count</li>
</ul>

K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.

Q2. Where is the k-mer coverage peak (approximately)?

We can estimate genome size using:

<pre>
N = (M * L) / (L - K + 1)
Genome_size = T / N
</pre>

<ul>
<li>N = depth (coverage)</li>
<li>M = k-mer peak (from the histogram)</li>
<li>K = k-mer size (here: 15)</li>
<li>L = average read length (from fastx_readlength)</li>
<li>T = total number of bases (from fastx_readlength)</li>
</ul>
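
The two formulas can be evaluated directly on the command line. The sketch below uses made-up input values (they are not the answer to the exercise); replace them with your own k-mer peak, average read length, and total number of bases.

```shell
estimate_genome_size() {
    # $1 = M (k-mer peak), $2 = L (average read length),
    # $3 = K (k-mer size), $4 = T (total number of bases)
    awk -v M="$1" -v L="$2" -v K="$3" -v T="$4" 'BEGIN {
        N = (M * L) / (L - K + 1)    # sequencing depth
        printf "%.0f\n", T / N       # estimated genome size in bp
    }'
}

# Hypothetical values: M=60, L=90, K=15, T=288,000,000
estimate_genome_size 60 90 15 288000000
```

With these illustrative numbers the depth is about 71× and the estimate is roughly 4.05 Mb.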

Compute the estimated genome size and compare with the known V. cholerae genome (~4 Mb). You should be within roughly ±10%.

Q3. What is your estimated genome size?

<hr>

<h3>Error correction</h3>

We will correct errors in the reads using Musket.

Musket: [http://musket.sourceforge.net/homepage.htm Musket]

First, get the number of distinct k-mers (needed for memory allocation in Musket):

<pre>
/home/ctools/jellyfish-2.3.1/bin/jellyfish stats Vchol-001
</pre>

Use the reported number of distinct k-mers (here an example: <code>8423098</code>) in the Musket command:

<pre>
/home/ctools/musket-1.1/musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \
Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
</pre>

The output files are named <code>Vchol-001_6.cor.0</code> and <code>Vchol-001_6.cor.1</code>. Rename them:

<pre>
mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz
</pre>

If this takes too long, you can copy precomputed corrected reads:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .
</pre>

<hr>

<h3>De novo assembly with MEGAHIT</h3>

We will now assemble the corrected reads using MEGAHIT (a de Bruijn graph assembler). The k-mer size is critical: by default MEGAHIT iterates over several k-mer sizes, but here we start with a single fixed k-mer size of 35.

First, set the number of threads:

<pre>
export OMP_NUM_THREADS=4
</pre>

Run MEGAHIT with k=35:

<pre>
/home/ctools/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit \
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \
--k-list 35 \
-t 4 \
-m 2000000000 \
-o 35
</pre>

When finished, you should have <code>35/final.contigs.fa</code>. Compress it:

<pre>
gzip 35/final.contigs.fa
</pre>

To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):

<pre>
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq
</pre>
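
A FASTQ record spans four lines, so the number of reads in an uncompressed file is its line count divided by four. A small sketch on a toy file (the two reads below are made up):

```shell
count_reads() {
    # A FASTQ record is 4 lines, so reads = lines / 4
    awk 'END { print NR / 4 }' "$1"
}

tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmp/toy.fastq"
count_reads "$tmp/toy.fastq"
```

Run on <code>Vchol_sample_1.fastq</code> and <code>Vchol_sample_2.fastq</code>, both counts should be equal, and 100000 if the input files held at least that many reads.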

Index the assembly and map:

<pre>
bwa index 35/final.contigs.fa.gz

bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \
| samtools view -Sb - > Vchol_35bp.bam
</pre>

Extract insert sizes (TLEN field, column 9):

<pre>
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
a = read.table("initial.insertsizes.txt")
# Keep positive TLEN values only
a.v = a[a[,1] > 0, 1]
# Trim outliers: keep values between the 15th and 85th percentiles
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx]) # mean insert size
sd(a.v[a.v >= mn & a.v <= mx]) # standard deviation
</pre>
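
Only positive values are used because a properly mapped pair reports its template length twice: positive for the leftmost mate and negative for the other. The same selection can be sketched with <code>awk</code> on a few made-up values; unlike the R code, this sketch does not trim outliers.

```shell
mean_positive() {
    # Average the positive values in column 1, skipping zero and negative entries
    awk '$1 > 0 { sum += $1; n++ } END { if (n) printf "%.1f\n", sum / n }' "$1"
}

tmp=$(mktemp -d)
printf '300\n-300\n0\n320\n-320\n280\n-280\n' > "$tmp/tlen.txt"
mean_positive "$tmp/tlen.txt"
```

Zero values (for example from unpaired or unmapped mates) are dropped along with the negative duplicates.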

Q4. What are the mean insert size and standard deviation of the library?

Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:

[https://docs.google.com/spreadsheets/d/1trUMlSwNLoNW67D-OkgA93iOQRp2iioyJSBYyW30P4U/edit?usp=sharing Google sheet for k-mer assignment]

Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing <code>[KMER]</code> below:

<pre>
export OMP_NUM_THREADS=4
/home/ctools/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit \
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \
--k-list [KMER] \
-t 4 \
-m 2000000000 \
-o [KMER]

gzip [KMER]/final.contigs.fa
</pre>

Compute assembly statistics using <code>QUAST</code>. Note that by default QUAST ignores contigs shorter than 500 bp:

<pre>
python3 /home/ctools/quast/quast.py \
[KMER]/final.contigs.fa.gz \
--threads 1 \
-o [KMER]/quast
</pre>

Open the file <code>[KMER]/quast/report.txt</code> (or <code>report.tsv</code>) and
record the following values in the Google sheet for your k-mer:

<ul>
<li>Number of contigs (≥ 500 bp)</li>
<li>Total assembly length</li>
<li>Largest contig</li>
<li>N50</li>
</ul>

As a class, compare results across k-mer sizes and discuss which k-mer produces
the most reasonable assembly and why.
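
To make the N50 value concrete: sort the contigs from longest to shortest and add up their lengths; N50 is the length of the contig at which the running total first reaches half of the assembly length. A minimal sketch on made-up contig lengths:

```shell
n50() {
    # Reads one contig length per line on stdin
    sort -rn | awk '{ len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running >= total / 2) { print len[i]; exit }
            }
        }'
}

# Toy assembly of 300 bp in five contigs
printf '100\n80\n60\n40\n20\n' | n50
```

Here half of the assembly is 150 bp, which is reached with the second contig (100 + 80), so the N50 is 80.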

Copy the best assembly to your folder, or use a precomputed multi-k assembly:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .
</pre>

Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?

Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?

<hr>

<h3>Coverage of the assembly</h3>

We will now calculate per-contig coverage and lengths, and visualize them in R.

<pre>
zcat default_final.contigs.fa.gz | /home/ctools/bin/fastx_megahit.sh --i /dev/stdin > default_finalt.cov
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
library(plotrix)
dat <- read.table("default_finalt.cov", sep = "\t")

## ---- Coverage plots (2 panels) ----
pdf("best.coverage.pdf", width = 10, height = 5)
par(mfrow = c(1, 2))

weighted.hist(w = dat[,2],
x = dat[,1],
breaks = seq(0, 100, 1),
main = "Weighted coverage",
xlab = "Contig coverage")

hist(dat[,1],
xlim = c(0, 100),
breaks = seq(0, 1000, 1),
main = "Raw coverage",
xlab = "Contig coverage")

dev.off()

## ---- Scaffold lengths (1 panel) ----
pdf("scaffold.lengths.pdf", width = 7, height = 5)
par(mfrow = c(1, 1))

barplot(rev(sort(dat[,2])),
xlab = "# Scaffold",
ylab = "Length",
main = "Scaffold Lengths")

dev.off()
</pre>

Viewing the PDF plots:

If you are using MobaXterm, you can open the PDF files directly from the
left-hand file panel.

If you are using macOS (or a standard terminal), copy the PDF files to your
local computer and open them with any PDF viewer. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/best.coverage.pdf .
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/scaffold.lengths.pdf .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on.

The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.

Q7. Why might some short contigs have much higher coverage than the rest of the assembly?

Q8. Why might some short contigs have much lower coverage than the rest of the assembly?

<hr>

<h3>Assembly evaluation</h3>

We will use QUAST to evaluate the assembly using various reference-based metrics.

QUAST: [https://quast.sourceforge.net/quast quast]

Run QUAST against the V. cholerae reference genome:

<pre>
python3 /home/ctools/quast/quast.py \
default_final.contigs.fa.gz \
--threads 1 \
-R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa
</pre>

QUAST writes its results to <code>quast_results/latest/</code>, including an HTML report (<code>report.html</code>).

Viewing the QUAST HTML report:

If you are using MobaXterm, you can open the HTML files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/quast_results/latest/report.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?

<hr>

<h3>Visualization using Circoletto</h3>

We will visualize the assembly against the V. cholerae reference using Circoletto.

First, filter out contigs shorter than 500 bp:

<pre>
/home/ctools/bin/fastx_filterfasta.sh default_final.contigs.fa.gz 500 > default_final.contigs_filtered_500.fa
</pre>

On your local machine, open a browser and go to:

[https://bat.infspire.org/circoletto/ Circoletto]

Open the filtered assembly in a text editor on the server, for example:

<pre>
gedit default_final.contigs_filtered_500.fa &
</pre>

Copy–paste the FASTA content into the “Query fasta” box on the Circoletto page.

Then open the reference genome:

<pre>
gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &
</pre>

Copy–paste this into the “Subject fasta” box.

In the “Output” section, select “ONLY show the best hit per query”, then click Submit to Circoletto.

If Circoletto does not work, you can use this precomputed image:

<pre>
/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png
</pre>

You should see the two V. cholerae chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).

Q10. Does your assembled genome appear broadly similar to the reference genome?

Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?

Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the V. cholerae genome paper and search for “V. cholerae integron island”: [https://www.nature.com/articles/35020000 V. cholerae genome paper]

<hr>

<h3>Try to assemble the genome using SPAdes (bonus)</h3>

Different assemblers can perform very differently. SPAdes is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.

SPAdes: [https://ablab.github.io/spades/ SPAdes]

Check the help output:

<pre>
python3 /home/ctools/SPAdes-4.2.0-Linux/bin/spades.py -h
</pre>

Note: A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.

Link to the SPAdes assembly:

<pre>
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
# from here you can compute stats and run QUAST
</pre>

<h3>Annotation of a prokaryotic genome</h3>

We will annotate genes in <code>/home/projects/22126_NGS/exercises/denovo/canu/ecoli_pacbio.contigs.fasta</code> using Prodigal.

Prodigal: [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119 prodigal]

The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:

<pre>
/home/ctools/Prodigal/prodigal \
-f gff \
-i [input genome in fasta] \
-a [output proteins in fasta] \
-o [output annotations in gff]
</pre>

GFF format: [https://www.ensembl.org/info/website/upload/gff.html GFF format description]

Next, index the protein FASTA file (the commands below assume you named the protein output <code>ecoli_pacbio.contigs.aa</code>):

<pre>
samtools faidx ecoli_pacbio.contigs.aa
</pre>

Extract the protein sequence for gene ID <code>tig00000001_4582</code>:

<pre>
samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582
</pre>

Use BLASTP against the nr database:

[https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLAST for proteins]

Paste the sequence and run BLASTP.

Q14. Which protein (function) does <code>tig00000001_4582</code> correspond to?

<hr>

Please find answers here: [[Denovo_solution|Denovo_solution]]

<hr>

Congratulations, you finished the exercise!

Denovo exercise

2026-01-08T10:20:00Z

Mick:

<h2>Overview</h2>

First:
<ol>
<li>Navigate to your home directory.</li>
<li>Create a directory called <code>denovo</code>.</li>
<li>Navigate to the directory you just created.</li>
</ol>

In this exercise we will perform a de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will:

<ol>
<li>Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).</li>
<li>Count k-mers and estimate genome size.</li>
<li>Correct reads using Musket.</li>
<li>Determine insert size of paired-end reads.</li>
<li>Run de novo assembly using MEGAHIT.</li>
<li>Calculate assembly statistics.</li>
<li>Plot coverage and length histograms of the assembly.</li>
<li>Evaluate the assembly quality.</li>
<li>Visualize the assembly using Circoletto.</li>
<li>(Bonus) Try assembling the genome with SPAdes.</li>
<li>Annotation of a prokaryotic genome.</li>
</ol>

<hr>

<h3>FastQC and trimming</h3>

Make sure you are in the <code>denovo</code> directory you created. You can double-check with:

<pre>
pwd
</pre>

Copy the sequencing data:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/* .
</pre>

Run FastQC on the reads:

<pre>
mkdir fastqc
/home/ctools/FastQC/fastqc -o fastqc *.txt.gz
</pre>

Viewing FastQC HTML reports:

If you are using MobaXterm, you can open the FastQC HTML files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/fastqc/Vchol-001_6_1_sequence_fastqc.html .
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/fastqc/Vchol-001_6_2_sequence_fastqc.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:

<pre>
/home/ctools/bin/fastx_detect_fq.sh Vchol-001_6_1_sequence.txt.gz
</pre>

Q1. Which quality encoding format is used?

Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The <code>--basename</code> option defines the output prefix and <code>--gzip</code> compresses the output.

<pre>
/home/ctools/adapterremoval-2.3.4/build/AdapterRemoval \
--file1 Vchol-001_6_1_sequence.txt.gz \
--file2 Vchol-001_6_2_sequence.txt.gz \
--adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \
--adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \
--qualitybase 64 \
--basename Vchol-001_6 \
--gzip \
--trimqualities \
--minquality 20 \
--minlength 40
</pre>

When it finishes, inspect <code>Vchol-001_6.settings</code> for trimming statistics (how many reads were trimmed, discarded, etc.).

Q1A. The output includes <code>discarded.gz</code>, <code>pair1.truncated.gz</code>, <code>pair2.truncated.gz</code>, and <code>singleton.truncated.gz</code>. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)

Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:

<pre>
/home/ctools/bin/fastx_readlength.sh Vchol-001_6.pair1.truncated.gz
/home/ctools/bin/fastx_readlength.sh Vchol-001_6.pair2.truncated.gz
</pre>
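If you want to see how these statistics are derived, the sketch below recomputes them with <code>gzip</code> and <code>awk</code>. The function name <code>fastq_stats</code> is made up for this illustration and is not one of the course scripts; it assumes standard FASTQ with four lines per read.

```shell
# Hypothetical helper (not one of the course scripts): summarise read lengths
# in a gzipped FASTQ file, assuming four lines per record.
fastq_stats() {
    gzip -dc "$1" | awk 'NR % 4 == 2 {
            # Line 2 of every record is the sequence
            n++; len = length($0); total += len
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        END {
            if (n == 0) { print "no reads found"; exit }
            printf "reads=%d total_bases=%.0f avg_len=%.1f min=%d max=%d\n", n, total, total / n, min, max
        }'
}

# Usage example:
# fastq_stats Vchol-001_6.pair1.truncated.gz
```

The average read length and total number of bases reported here are the L and T values needed later for the genome size estimate.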

<hr>

<h3>Genome size estimation</h3>

We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use jellyfish to count 15-mers, combining counts from the forward and reverse-complement strands, and then create a histogram. (This may take some time to run, so it is a good opportunity to practice using <code>screen</code>.)

Manual: [http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.html jellyfish]

<pre>
gzip -dc Vchol-001_6.pair*.truncated.gz \
| /home/ctools/jellyfish-2.3.1/bin/jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0

/home/ctools/jellyfish-2.3.1/bin/jellyfish histo Vchol-001 > Vchol-001.histo
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
dat <- read.table("Vchol-001.histo")

pdf("Vchol-001.histo.pdf")
barplot(dat[,2],
xlim = c(0,150),
ylim = c(0,5e5),
ylab = "No of kmers",
xlab = "Counts of a k-mer",
names.arg = dat[,1],
cex.names = 0.8)
dev.off()
</pre>

If you are using MobaXterm, you can open the PDF file directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the file to
your local computer and open it with any PDF viewer. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/Vchol-001.histo.pdf .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

The plot shows:
<ul>
<li>x-axis: how many times a k-mer occurs (its count)</li>
<li>y-axis: number of distinct k-mers with that count</li>
</ul>

K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.
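One way to locate the main peak without reading it off the plot is sketched below. It assumes the two-column output of <code>jellyfish histo</code> (count, number of distinct k-mers). The helper name <code>kmer_peak</code> and the default cutoff of 5 for skipping the error-dominated low counts are assumptions for illustration, so check the result against your plot.

```shell
# Hypothetical helper: report the k-mer count with the most distinct k-mers,
# ignoring counts below a cutoff where sequencing errors dominate.
# The default cutoff of 5 is an assumption; choose it from your own histogram.
kmer_peak() {
    awk -v min="${2:-5}" '$1 >= min && $2 > best { best = $2; peak = $1 }
        END { print peak }' "$1"
}

# Usage example (second argument is the optional cutoff):
# kmer_peak Vchol-001.histo 5
```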

Q2. Where is the k-mer coverage peak (approximately)?

We can estimate genome size using:

<pre>
N = (M * L) / (L - K + 1)
Genome_size = T / N
</pre>

<ul>
<li>N = depth (coverage)</li>
<li>M = k-mer peak (from the histogram)</li>
<li>K = k-mer size (here: 15)</li>
<li>L = average read length (from fastx_readlength)</li>
<li>T = total number of bases (from fastx_readlength)</li>
</ul>
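As a worked example of the two formulas above, the sketch below plugs in made-up numbers. The values for M, L, and T are hypothetical and only illustrate the arithmetic; substitute your own k-mer peak, average read length, and total base count. The function name <code>estimate_genome_size</code> is likewise an assumption.

```shell
# Hypothetical helper: apply N = (M * L) / (L - K + 1) and Genome_size = T / N.
# Arguments: M (k-mer peak), L (average read length), K (k-mer size), T (total bases)
estimate_genome_size() {
    awk -v M="$1" -v L="$2" -v K="$3" -v T="$4" 'BEGIN {
        N = (M * L) / (L - K + 1)
        printf "depth=%.1f genome_size_Mb=%.2f\n", N, T / N / 1000000
    }'
}

# Illustrative values only (not the answer to Q3):
estimate_genome_size 60 95 15 300000000
# prints: depth=70.4 genome_size_Mb=4.26
```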

Compute the estimated genome size and compare with the known V. cholerae genome (~4 Mb). You should be within roughly ±10%.

Q3. What is your estimated genome size?

<hr>

<h3>Error correction</h3>

We will correct errors in the reads using Musket.

Musket: [http://musket.sourceforge.net/homepage.htm Musket]

First, get the number of distinct k-mers (needed for memory allocation in Musket):

<pre>
/home/ctools/jellyfish-2.3.1/bin/jellyfish stats Vchol-001
</pre>

Use the reported number of distinct k-mers (here an example: <code>8423098</code>) in the Musket command:

<pre>
/home/ctools/musket-1.1/musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \
Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
</pre>

The output files are named <code>Vchol-001_6.cor.0</code> and <code>Vchol-001_6.cor.1</code>. Rename them:

<pre>
mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz
</pre>

If this takes too long, you can copy precomputed corrected reads:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .
</pre>

<hr>

<h3>De novo assembly with MEGAHIT</h3>

We will now assemble the corrected reads using MEGAHIT (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.

First, set the number of threads:

<pre>
export OMP_NUM_THREADS=4
</pre>

Run MEGAHIT with k=35:

<pre>
/home/ctools/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit \
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \
--k-list 35 \
-t 4 \
-m 2000000000 \
-o 35
</pre>

When finished, you should have <code>35/final.contigs.fa</code>. Compress it:

<pre>
gzip 35/final.contigs.fa
</pre>

To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):

<pre>
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq
</pre>
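A quick sanity check on the subsample is to count the records in each file, since a FASTQ record spans four lines. This is a minimal sketch; the helper name <code>count_fastq_reads</code> is hypothetical, and it assumes uncompressed FASTQ without wrapped sequence lines.

```shell
# Hypothetical helper: number of reads in an uncompressed FASTQ file
# (four lines per record).
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Usage example: both files should report the same number of reads.
# count_fastq_reads Vchol_sample_1.fastq
# count_fastq_reads Vchol_sample_2.fastq
```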

Index the assembly and map:

<pre>
bwa index 35/final.contigs.fa.gz

bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \
| samtools view -Sb - > Vchol_35bp.bam
</pre>

Extract insert sizes (TLEN field, column 9):

<pre>
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1] > 0, 1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx]) # mean insert size
sd(a.v[a.v >= mn & a.v <= mx]) # standard deviation
</pre>

Q4. What are the mean insert size and standard deviation of the library?

Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:

[https://docs.google.com/spreadsheets/d/1trUMlSwNLoNW67D-OkgA93iOQRp2iioyJSBYyW30P4U/edit?usp=sharing Google sheet for k-mer assignment]

Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing <code>[KMER]</code> below:

<pre>
export OMP_NUM_THREADS=4
/home/ctools/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit \
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \
--k-list [KMER] \
-t 4 \
-m 2000000000 \
-o [KMER]

gzip [KMER]/final.contigs.fa
</pre>

Compute assembly statistics using <code>QUAST</code>. Note that QUAST ignores contigs shorter than 500 bp by default:

<pre>
python3 /home/ctools/quast/quast.py \
[KMER]/final.contigs.fa.gz \
--threads 1 \
-o [KMER]/quast
</pre>

Open the file <code>[KMER]/quast/report.txt</code> (or <code>report.tsv</code>) and
record the following values in the Google sheet for your k-mer:

<ul>
<li>Number of contigs (≥ 500 bp)</li>
<li>Total assembly length</li>
<li>Largest contig</li>
<li>N50</li>
</ul>

As a class, compare results across k-mer sizes and discuss which k-mer produces
the most reasonable assembly and why.
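N50 is the contig length at which contigs of that length or longer contain at least half of the assembled bases. As a rough cross-check of the QUAST value, the sketch below computes it directly from a gzipped FASTA file. The helper names <code>contig_lengths</code> and <code>n50</code> are hypothetical, and because this version counts all contigs (QUAST skips those shorter than 500 bp by default), the numbers can differ slightly.

```shell
# Hypothetical helper: print one length per contig from a gzipped FASTA file.
contig_lengths() {
    gzip -dc "$1" | awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
        { n += length($0) }
        END { if (seen) print n }'
}

# Hypothetical helper: read lengths on stdin and print the N50.
n50() {
    sort -nr | awk '{ len[NR] = $1; total += $1 }
        END {
            # Walk from the longest contig down until half the bases are covered
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum * 2 >= total) { print len[i]; exit }
            }
        }'
}

# Usage example:
# contig_lengths 35/final.contigs.fa.gz | n50
```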

Copy the best assembly to your folder, or use a precomputed multi-k assembly:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .
</pre>

Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?

Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?

<hr>

<h3>Coverage of the assembly</h3>

We will now calculate per-contig coverage and lengths, and visualize them in R.

<pre>
zcat default_final.contigs.fa.gz | /home/ctools/bin/fastx_megahit.sh --i /dev/stdin > default_finalt.cov
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
library(plotrix)
dat <- read.table("default_finalt.cov", sep = "\t")

## ---- Coverage plots (2 panels) ----
pdf("best.coverage.pdf", width = 10, height = 5)
par(mfrow = c(1, 2))

weighted.hist(w = dat[,2],
x = dat[,1],
breaks = seq(0, 100, 1),
main = "Weighted coverage",
xlab = "Contig coverage")

hist(dat[,1],
xlim = c(0, 100),
breaks = seq(0, 1000, 1),
main = "Raw coverage",
xlab = "Contig coverage")

dev.off()

## ---- Scaffold lengths (1 panel) ----
pdf("scaffold.lengths.pdf", width = 7, height = 5)
par(mfrow = c(1, 1))

barplot(rev(sort(dat[,2])),
xlab = "# Scaffold",
ylab = "Length",
main = "Scaffold Lengths")

dev.off()
</pre>

Viewing the PDF files:

If you are using MobaXterm, you can open the PDF files directly from the
left-hand file panel.

If you are using macOS (or a standard terminal), copy the PDF files to your
local computer and open them with any PDF viewer. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/best.coverage.pdf .
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/scaffold.lengths.pdf .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on.

The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.
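To make the idea of length weighting concrete, the sketch below computes a length-weighted mean coverage from the <code>.cov</code> file. It assumes the same tab-separated layout the R code relies on (column 1 = coverage, column 2 = contig length); the helper name <code>weighted_mean_coverage</code> is hypothetical.

```shell
# Hypothetical helper: length-weighted mean coverage,
# i.e. sum(coverage * length) / sum(length).
weighted_mean_coverage() {
    awk -F '\t' '{ sum += $1 * $2; bases += $2 }
        END { if (bases > 0) printf "%.1f\n", sum / bases }' "$1"
}

# Usage example:
# weighted_mean_coverage default_finalt.cov
```

With one long contig at low coverage and one short contig at high coverage, the weighted mean stays close to the coverage of the long contig, whereas a plain average would be pulled towards the short one.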

Q7. Why might some short contigs have much higher coverage than the main assembly?

Q8. Why might some short contigs have much lower coverage than the main assembly?

<hr>

<h3>Assembly evaluation</h3>

We will use QUAST to evaluate the assembly using various reference-based metrics.

QUAST: [https://quast.sourceforge.net/quast quast]

Run QUAST against the V. cholerae reference genome:

<pre>
python3 /home/ctools/quast/quast.py \
default_final.contigs.fa.gz \
--threads 1 \
-R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa
</pre>

QUAST writes its results to <code>quast_results/latest/</code>, including an HTML report (<code>report.html</code>).

Viewing the QUAST HTML report:

If you are using MobaXterm, you can open the HTML files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/quast_results/latest/report.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?

<hr>

<h3>Visualization using Circoletto</h3>

We will visualize the assembly against the V. cholerae reference using Circoletto.

First, filter out contigs shorter than 500 bp:

<pre>
/home/ctools/bin/fastx_filterfasta.sh default_final.contigs.fa.gz 500 > default_final.contigs_filtered_500.fa
</pre>

On your local machine, open a browser and go to:

[https://bat.infspire.org/circoletto/ Circoletto]

Open the filtered assembly in a text editor on the server, for example:

<pre>
gedit default_final.contigs_filtered_500.fa &
</pre>

Copy–paste the FASTA content into the “Query fasta” box on the Circoletto page.

Then open the reference genome:

<pre>
gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &
</pre>

Copy–paste this into the “Subject fasta” box.

In the “Output” section, select “ONLY show the best hit per query”, then click Submit to Circoletto.

If Circoletto does not work, you can use this precomputed image:

<pre>
/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png
</pre>

You should see the two V. cholerae chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).

Q10. Does your assembled genome appear broadly similar to the reference genome?

Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?

Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the V. cholerae genome paper and search for “V. cholerae integron island”: [https://www.nature.com/articles/35020000 V. cholerae genome paper]

<hr>

<h3>Try to assemble the genome using SPAdes (bonus)</h3>

Different assemblers can perform very differently. SPAdes is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.

SPAdes: [https://ablab.github.io/spades/ SPAdes]

Check the help output:

<pre>
python3 /home/ctools/SPAdes-4.2.0-Linux/bin/spades.py -h
</pre>

Note: A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.

Link to the SPAdes assembly:

<pre>
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
# from here you can compute stats and run QUAST
</pre>

<h3>Annotation of a prokaryotic genome</h3>

We will annotate genes in <code>/home/projects/22126_NGS/exercises/denovo/canu/ecoli_pacbio.contigs.fasta</code> using Prodigal.

Prodigal: [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119 prodigal]

The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:

<pre>
/home/ctools/Prodigal/prodigal \
-f gff \
-i [input genome in fasta] \
-a [output proteins in fasta] \
-o [output annotations in gff]
</pre>

GFF format: [https://www.ensembl.org/info/website/upload/gff.html GFF format description]
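Once the annotation has finished, a quick way to see how many genes were predicted is to count the feature lines in the GFF file. This is a minimal sketch that assumes each predicted gene appears as one tab-separated <code>CDS</code> feature line; the helper name <code>count_predicted_genes</code> and the file name in the usage example are hypothetical.

```shell
# Hypothetical helper: count CDS features in a GFF file, skipping comment lines.
count_predicted_genes() {
    awk -F '\t' '!/^#/ && $3 == "CDS" { n++ } END { print n + 0 }' "$1"
}

# Usage example (use whatever name you gave the -o output):
# count_predicted_genes ecoli_pacbio.contigs.gff
```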

Next, index the protein FASTA file (the commands below assume you named the protein output <code>ecoli_pacbio.contigs.aa</code>):

<pre>
samtools faidx ecoli_pacbio.contigs.aa
</pre>

Extract the protein sequence for gene ID <code>tig00000001_4582</code>:

<pre>
samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582
</pre>

Use BLASTP against the nr database:

[https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLAST for proteins]

Paste the sequence and run BLASTP.

Q14. Which protein (function) does <code>tig00000001_4582</code> correspond to?

<hr>

Please find answers here: [[Denovo_solution|Denovo_solution]]

<hr>

Congratulations, you finished the exercise!

Exercise and answers

2026-01-08T09:23:36Z

Mick:

<h2>Overview</h2>


In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017):
from FASTQ files to contact matrix and beyond.



A Primer into 3D Genomics: A Mini-Workshop 
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 
9 January 2026, DTU


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index reference genome</li>
<li>Use TADbit to:
<ol>
<li>Map reads to reference genome (<code>map</code>)</li>
<li>Get intersection (<code>parse</code>)</li>
<li>Filter reads (<code>filter</code>)</li>
<li>Normalize (<code>normalize</code>)</li>
<li>Generate matrices (<code>bin</code>)</li>
<li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
</ol>
</li>
</ol>

<hr>

<h2>Setup conda environment to run TADbit later</h2>

During setup you may be notified that something is missing; just accept.

<pre>
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh
</pre>


You should get (as the only output) the help from the program — this means the environment is up and running.



Make yourself familiar with the directory structure. Inside
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:


<ul>
<li><code>fastq</code> – raw data</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.



Remember to activate the tadbit conda environment.


<pre>
# Move to your home
cd;

# Activate TADbit environment
eval "$(/home/ctools/miniconda3/bin/conda shell.bash hook)"
conda activate "$HOME/envs/tadbit_course"

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean
</pre>

Putting things into an SBATCH script


A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:


<pre>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch
</pre>


Open <code>00_index.sbatch</code> with your favorite editor (for example <code>emacs 00_index.sbatch</code>), paste the following into the file, and save it.


<pre>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
</pre>


Submit the job:


<pre>
sbatch 00_index.sbatch;
</pre>

⚠️ You do not need to run this job: a precomputed index is available in the course folder, and we will create symbolic links to it instead.

Create symlinks to the index and the reference FASTA in your own folder so that you do not have to copy them:


<pre>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
</pre>


⏰ If you do run the indexing job, it should take ~5–10 min to complete.



A prepared script is also available:


<pre>
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch
</pre>

<hr>

<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>


While genome indexing runs, start looking at the data and pre-process it.
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.



Copy the template and create <code>01_fastp.sbatch</code>:


<pre>
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;
</pre>


Put the following into the SBATCH script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Options used below (comments cannot be placed between continued lines):
#   -i / -I          raw R1 / R2 FASTQ, read from the course folder
#   -o / -O          clean R1 / R2 FASTQ, stored in your own folder
#   --trim_front1 5  trim the first 5 bases of read 1 (often lower quality)
#   -w 10            threads
#   -l 30            minimal read length (shorter reads are removed after trimming)
#   -h               HTML report
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 10 \
-l 30 \
-h ${sample}.html
</pre>


Copy the HTML report to your local computer and open it in a browser:


<pre>
USER="juanrod" # replace with your own username
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


⏰ It should take ~1 min to complete with 6 CPUs.


Question: Check the HTML report. What percentage of reads are kept?

Answer: It should be about ~96.4%. No massive adapter content or low quality sequences. After mapping we will inspect ligation/digestion patterns in more detail.

<hr>

<h2>Mapping to the reference genome</h2>


TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.



Put the following into your mapping script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}

# Two enzymes used in this experiment
enz="MboI HinfI" # Double digestion (relevant for Arima/Phase Genomics)

# Map read 1
rd=1;

tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--tmpdb ${TMPDIR} \
--renz ${enz} \
-C 6

# Map read 2
rd=2
# >>> Re-run the tadbit map command above with rd=2 (for example, by making rd a script parameter).
</pre>
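Since the two mapping runs differ only in the read number, they can be wrapped in a loop. The sketch below reuses the options from the script above; the function name <code>map_both_reads</code>, the <code>RUN</code> dry-run switch, and the <code>/tmp</code> fallback for <code>TMPDIR</code> are assumptions added for illustration. It also assumes you are in your <code>3D_GENOMICS_COURSE</code> folder, that the index symlink sits in <code>refGenome/</code>, and that the workdir has already been created.

```shell
# Sketch: run the same tadbit map command for read 1 and read 2.
# RUN is a hypothetical switch: by default the commands are only printed
# (dry run); call the function as RUN="" map_both_reads to execute them.
map_both_reads() {
    sample="liver"
    ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"
    wd="tadbit_dirs/${sample}"
    enz="MboI HinfI"   # left unquoted below so each enzyme is a separate argument
    for rd in 1 2; do
        ${RUN-echo} tadbit map \
            --fastq "fastq/clean/${sample}_R${rd}.clean.fastq.gz" \
            --workdir "${wd}" \
            --index "${ref}" \
            --read "${rd}" \
            --tmpdb "${TMPDIR:-/tmp}" \
            --renz ${enz} \
            -C 6
    done
}

# Dry run: print the two commands without executing them
map_both_reads
```

Printing the commands first makes it easy to confirm the paths before spending the mapping time.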


⏰ It should take ~5 min to complete with 6 CPUs.



Note: We are not using iterative mapping. Fragment-based mapping is the default in TADbit.



After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.


Question: How may restriction enzyme choice influence the experiment? ✂️

Answer: Cutting frequency differs between 4‑cutters and 6‑cutters and influences the fragment size distribution, ligation probabilities, and contact resolution. Using two enzymes increases the diversity of ligation junctions. By comparison, Micro‑C uses MNase digestion, which cuts more evenly across the genome.

<ul>
<li>[https://teaching.healthtech.dtu.dk/material/22126/2026/Fragment_histogram.pdf Fragment size histogram]</li>
<li>[https://teaching.healthtech.dtu.dk/material/22126/2026/ligation_deconvolution.png HiC sequencing quality and digestion/ligation deconvolution]</li>
</ul>

<hr>

<h2>Finding the intersection of mapped reads (parse)</h2>


Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end).
We identify pairs and build fragment associations with <code>tadbit parse</code>.


⚠️ Note: The chromosome name prefixes used for filtering must already be present in the reference genome FASTA file.
Only chromosomes whose names start with the string given to <code>--filter_chrom</code> are matched.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver" # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input;
</pre>


⏰ It should take ~35 min to complete with 10 CPUs.


Question: Is it possible to retrieve multiple contacting regions?
Answer: Complex ligation products (read pairs mapping to different fragments in the same molecule, i.e., multiple contacts) and multi‑mapping artifacts do occur, but TADbit focuses on valid pairs as operationally defined by the filters. Multi‑contact methods (e.g., Pore‑C, SPRITE) address this explicitly, whereas standard Hi‑C largely models binary contacts per ligation event. We can inspect this in the BAM file in the next step.

<hr>

<h2>Filtering interactions</h2>


TADbit allows flexible filtering of unwanted interactions. In many cases, the defaults work well across datasets.


Run filtering:

<pre>
tadbit filter \
--workdir ${wd} \
--apply 1 2 3 4 6 7 8 9 10 \
--cpus 6 \
--tmpdb ${TMPDIR}
</pre>

<hr>

<h2>Check the amount of filtered data and past commands</h2>


<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.


<pre>
# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less
</pre>

Question: How many valid pairs do we keep?
Answer: Check the “valid pairs” section of <code>tadbit describe</code> after filtering to get the exact count and the percentage relative to the initial read pairs.
Question: The total number of filtered reads is not equal to the initial number of reads… Why?
Answer: Because a read pair can be assigned to more than one category (e.g., a dangling end that is also a duplicate). Categories are not mutually exclusive, so percentages can overlap.
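
The overlap between categories can be illustrated with a toy table. This is only a sketch: the pair IDs are invented, and the category labels are used here purely as examples, not as real output of <code>tadbit describe</code>.

```shell
# Toy table: one read pair per line, followed by the categories it was assigned to.
# Pair IDs and assignments are invented for illustration.
pairs='p1 dangling-end duplicated
p2 self-circle
p3
p4 duplicated
p5 dangling-end'

# Sum of the per-category counts (a pair is counted once per category it falls in)
per_category=$(printf '%s\n' "$pairs" | awk '{ n += NF - 1 } END { print n }')
# Pairs flagged by at least one category (each pair counted once)
flagged=$(printf '%s\n' "$pairs" | awk 'NF > 1 { n++ } END { print n }')
# Pairs without any flag, i.e. the ones that would be kept
valid=$(printf '%s\n' "$pairs" | awk 'NF == 1 { n++ } END { print n }')

echo "sum over categories: ${per_category}"
echo "pairs flagged:       ${flagged}"
echo "valid pairs:         ${valid}"
```

With five pairs in total, the per-category counts plus the valid pairs add up to more than five, because pair <code>p1</code> is counted in two categories.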

<hr>

<h2>To normalize or to not normalize</h2>


The filter step only assigned each read pair to one or more categories; nothing has actually been removed yet.
It is during normalization that we specify which categories to include/exclude, so that the normalization is performed accordingly.



Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.



Important: Normalization is also the step where bad columns (low counts, low mappability, etc.) are removed from the matrix.



Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
A simple and commonly used option is to filter based on a minimum number of counts per bin.



If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# First time we define the resolution
res="100000" # 100 kb

# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100
</pre>

<pre>
tadbit normalize -w ${wd} \
-r ${res} \
--tmpdb ${TMPDIR} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
</pre>


⏰ It should take ~2 min to complete with 6 CPUs.



⚠️ Run another version with <code>norm="raw"</code> to compare later.



Use <code>tadbit describe</code> to check how many bins were removed.
A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.
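
As a quick sanity check of the rule of thumb, the percentage can be computed from the two numbers reported by <code>tadbit describe</code>. The values below are hypothetical placeholders, not results from this dataset.

```shell
# Hypothetical numbers: replace with the values reported for your own run.
total_bins=10000
removed_bins=350

# Percentage of bins removed, with one decimal
pct=$(awk -v r="$removed_bins" -v t="$total_bins" 'BEGIN { printf "%.1f", 100 * r / t }')
echo "removed ${pct}% of bins"
```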



Each job is assigned a <code><job_id></code>. This helps retrieve results from specific runs (especially when testing parameters).



If you want, you can take a quick look at the different normalization strategies and extract your own conclusions:



https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105


<hr>

<h2>Binning and viewing matrices</h2>


Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.


<pre>
# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
</pre>

<pre>
tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm "norm" \
--format "png" \
--cpus 6;
</pre>

<hr>


[https://teaching.healthtech.dtu.dk/material/22126/2026/Raw_HiC.png Raw HiC matrix]


[https://teaching.healthtech.dtu.dk/material/22126/2026/Normal_HiC.png Normalized HiC matrix]


Congratulations, you finished the exercise!

Exercise and answers

2026-01-08T09:18:15Z

Mick:

<h2>Overview</h2>


In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017):
from FASTQ files to contact matrix and beyond.



A Primer into 3D Genomics: A Mini-Workshop 
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 
9 January 2026, DTU


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index reference genome</li>
<li>Use TADbit to:
<ol>
<li>Map reads to reference genome (<code>map</code>)</li>
<li>Get intersection (<code>parse</code>)</li>
<li>Filter reads (<code>filter</code>)</li>
<li>Normalize (<code>normalize</code>)</li>
<li>Generate matrices (<code>bin</code>)</li>
<li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
</ol>
</li>
</ol>

<hr>

<h2>Set up the conda environment to run TADbit later</h2>

If you are notified that something is missing, just accept.

<pre>
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh
</pre>


You should get (as the only output) the help from the program — this means the environment is up and running.



Make yourself familiar with the directory structure. Inside
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:


<ul>
<li><code>fastq</code> – raw data</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.



Remember to activate the tadbit conda environment.


<pre>
# Move to your home
cd;

# Activate TADbit environment
eval "$(/home/ctools/miniconda3/bin/conda shell.bash hook)"
conda activate "$HOME/envs/tadbit_course"

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean
</pre>

<h3>Putting things into an SBATCH script</h3>


A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:


<pre>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch
</pre>


Open the template with your favorite editor, paste the following into the file, and save it.
For example: <code>emacs 00_index.sbatch</code>


<pre>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
</pre>


Submit the job:


<pre>
sbatch 00_index.sbatch;
</pre>

⚠️ NO NEED TO RUN THIS. WE WILL GENERATE A SYMBOLIC LINK.


We can make a symlink to the reference genome in our folder so that we do not have to copy it:


<pre>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
</pre>


⏰ It should take ~5–10 min to complete.



A prepared script is also available:


<pre>
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch
</pre>

<hr>

<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>


While genome indexing runs, start looking at the data and pre-process it.
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.



Copy the template and create <code>01_fastp.sbatch</code>:


<pre>
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;
</pre>


Put the following into the SBATCH script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Options used below:
# -i/-I: read the raw FASTQ files from the course folder
# -o/-O: store the clean FASTQ versions in your folder
# --trim_front1 5: trim the first 5 bases (often lower quality)
# -w 10: threads
# -l 30: minimal read length (remove reads shorter than this after trimming)
# Comments are kept above the command: a comment line placed between
# backslash-continued lines would cut the command short.
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 10 \
-l 30 \
-h ${sample}.html
</pre>


Copy the HTML report to your local computer and open it in a browser:


<pre>
# Run this on your local computer; replace with your own username
USER="juanrod"
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


⏰ It should take ~1 min to complete with 6 CPUs.


Question: Check the HTML report. What percentage of reads are kept?

Answer: It should be about 96.4%. There is no massive adapter content and there are few low-quality sequences. After mapping we will inspect ligation/digestion patterns in more detail.

<hr>

<h2>Mapping to the reference genome</h2>


TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.



Put the following into your mapping script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem" # relative to the course folder (symlink created earlier)
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}

# Two enzymes used in this experiment
enz="MboI HinfI" # Double digestion (relevant for Arima/Phase Genomics)

# Map read 1
rd=1;

tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--tmpdb ${TMPDIR} \
--renz ${enz} \
-C 6

# Map read 2
rd=2
# >>> Just change the script to take that as a parameter.
</pre>


⏰ It should take ~5 min to complete with 6 CPUs.



Note: We are not using iterative mapping. Fragment-based mapping is the default in TADbit.



After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.


Question: How may restriction enzyme choice influence the experiment? ✂️
Answer: Cutting frequency differs between 4‑cutters and 6‑cutters and influences fragment size distribution, ligation probabilities, and contact resolution. Using two enzymes increases the diversity of ligation junctions. Compare with Micro‑C, which uses MNase digestion, so it cuts evenly through the genome.
[https://teaching.healthtech.dtu.dk/material/22126/2026/Fragment_histogram.pdf Fragment size histogram]
[https://teaching.healthtech.dtu.dk/material/22126/2026/ligation_deconvolution.png HiC Sequencing Quality and digestion - ligation deconvolution]

<ul>
<li>Fragment size histogram</li>
<li>HiC sequencing quality and digestion/ligation deconvolution</li>
</ul>

<hr>

<h2>Finding the intersection of mapped reads (parse)</h2>


Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end).
We identify pairs and build fragment associations with <code>tadbit parse</code>.


⚠️ Note: The chromosome prefixes to filter have to be defined in the reference genome FASTA file beforehand.
It will only match chromosomes that start with the string in <code>--filter_chrom</code>.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver" # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input;
</pre>
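
To preview which sequences a pattern such as <code>chr.*</code> would keep, you can test it against the FASTA header names first. The sketch below uses invented header names and assumes that the pattern is matched from the start of the name, as described in the note above; on real data you would list the headers with <code>grep '>'</code> on the reference FASTA.

```shell
# Invented header names for illustration
headers='>chr1
>chr2
>chrZ
>NW_020109737.1
>chrUn_scaffold1'

# Strip the leading ">" and keep names that start with the pattern
kept=$(printf '%s\n' "$headers" | sed 's/^>//' | grep -E '^chr.*')
n_kept=$(printf '%s\n' "$kept" | wc -l | tr -d ' ')

echo "$kept"
echo "kept ${n_kept} of 5 sequences"
```

Note that a name such as <code>chrUn_scaffold1</code> also matches, so a broad pattern may keep more than the canonical chromosomes.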


⏰ It should take ~35 min to complete with 10 CPUs.


Question: Is it possible to retrieve multiple contacting regions?
Answer: Consider complex ligation products (read pairs mapping to different fragments in the same molecule, i.e., multiple contacts) and multi‑mapping artifacts; TADbit focuses on valid pairs as operationally defined by the filters. Multi‑contact methods (e.g., Pore‑C, SPRITE) address this explicitly, but standard Hi‑C largely models binary contacts per ligation event. We can view it on the bam file in the next step.

<hr>

<h2>Filtering interactions</h2>


TADbit allows flexible filtering of unwanted interactions. In many cases, the defaults work well across datasets.


Run filtering:

<pre>
tadbit filter \
--workdir ${wd} \
--apply 1 2 3 4 6 7 8 9 10 \
--cpus 6 \
--tmpdb ${TMPDIR}
</pre>

<hr>

<h2>Check the amount of filtered data and past commands</h2>


<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.


<pre>
# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less
</pre>

Question: How many valid pairs do we keep?
Answer: Check the “valid pairs” section of <code>tadbit describe</code> after filtering to get the exact count and the percentage relative to the initial read pairs.
Question: The total number of filtered reads is not equal to the initial number of reads… Why?
Answer: Because a read pair can be assigned to more than one category (e.g., a dangling end that is also a duplicate). Categories are not mutually exclusive, so percentages can overlap.

<hr>

<h2>To normalize or to not normalize</h2>


The filter step only assigned each read pair to one or more categories; nothing has actually been removed yet.
It is during normalization that we specify which categories to include/exclude, so that the normalization is performed accordingly.



Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.



Important: Normalization is also the step where bad columns (low counts, low mappability, etc.) are removed from the matrix.



Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
A simple and commonly used option is to filter based on a minimum number of counts per bin.



If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# First time we define the resolution
res="100000" # 100 kb

# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100
</pre>

<pre>
tadbit normalize -w ${wd} \
-r ${res} \
--tmpdb ${TMPDIR} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
</pre>
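
To get a feeling for how much of the matrix the <code>--badcols</code> regions cover at the chosen resolution, the number of affected bins can be worked out with integer arithmetic. This is a sketch: <code>bins_for_region</code> is a hypothetical helper, and it assumes bins are aligned to the start of each chromosome.

```shell
res=100000   # 100 kb, as above

# bins_for_region START END  (1-based, inclusive coordinates)
# Counts every bin touched by the region, assuming bins start at position 1.
bins_for_region() {
    echo $(( ($2 - 1) / res - ($1 - 1) / res + 1 ))
}

w_bins=$(bins_for_region 1 7000000)    # chrW:1-7000000
z_bins=$(bins_for_region 1 83000000)   # chrZ:1-83000000

echo "chrW bins masked: ${w_bins}"
echo "chrZ bins masked: ${z_bins}"
echo "total:            $(( w_bins + z_bins ))"
```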


⏰ It should take ~2 min to complete with 6 CPUs.



⚠️ Run another version with <code>norm="raw"</code> to compare later.



Use <code>tadbit describe</code> to check how many bins were removed.
A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.



Each job is assigned a <code><job_id></code>. This helps retrieve results from specific runs (especially when testing parameters).



If you want, you can take a quick look at the different normalization strategies and extract your own conclusions:



https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105


<hr>

<h2>Binning and viewing matrices</h2>


Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.


<pre>
# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
</pre>

<pre>
tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm "norm" \
--format "png" \
--cpus 6;
</pre>

<hr>

Congratulations, you finished the exercise!

Exercise and answers

2026-01-08T09:15:52Z

Mick:

<h2>Overview</h2>


In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017):
from FASTQ files to contact matrix and beyond.



A Primer into 3D Genomics: A Mini-Workshop 
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 
9 January 2026, DTU


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index reference genome</li>
<li>Use TADbit to:
<ol>
<li>Map reads to reference genome (<code>map</code>)</li>
<li>Get intersection (<code>parse</code>)</li>
<li>Filter reads (<code>filter</code>)</li>
<li>Normalize (<code>normalize</code>)</li>
<li>Generate matrices (<code>bin</code>)</li>
<li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
</ol>
</li>
</ol>

<hr>

<h2>Set up the conda environment to run TADbit later</h2>

If you are notified that something is missing, just accept.

<pre>
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh
</pre>


You should get (as the only output) the help from the program — this means the environment is up and running.



Make yourself familiar with the directory structure. Inside
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:


<ul>
<li><code>fastq</code> – raw data</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.



Remember to activate the tadbit conda environment.


<pre>
# Move to your home
cd;

# Activate TADbit environment
eval "$(/home/ctools/miniconda3/bin/conda shell.bash hook)"
conda activate "$HOME/envs/tadbit_course"

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean
</pre>

<h3>Putting things into an SBATCH script</h3>


A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:


<pre>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch
</pre>


Open the template with your favorite editor, paste the following into the file, and save it.
For example: <code>emacs 00_index.sbatch</code>


<pre>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
</pre>


Submit the job:


<pre>
sbatch 00_index.sbatch;
</pre>

⚠️ NO NEED TO RUN THIS. WE WILL GENERATE A SYMBOLIC LINK.


We can make a symlink to the reference genome in our folder so that we do not have to copy it:


<pre>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
</pre>


⏰ It should take ~5–10 min to complete.



A prepared script is also available:


<pre>
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch
</pre>

<hr>

<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>


While genome indexing runs, start looking at the data and pre-process it.
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.



Copy the template and create <code>01_fastp.sbatch</code>:


<pre>
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;
</pre>


Put the following into the SBATCH script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Options used below:
# -i/-I: read the raw FASTQ files from the course folder
# -o/-O: store the clean FASTQ versions in your folder
# --trim_front1 5: trim the first 5 bases (often lower quality)
# -w 10: threads
# -l 30: minimal read length (remove reads shorter than this after trimming)
# Comments are kept above the command: a comment line placed between
# backslash-continued lines would cut the command short.
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 10 \
-l 30 \
-h ${sample}.html
</pre>


Copy the HTML report to your local computer and open it in a browser:


<pre>
# Run this on your local computer; replace with your own username
USER="juanrod"
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


⏰ It should take ~1 min to complete with 6 CPUs.


Question: Check the HTML report. What percentage of reads are kept?

Answer: It should be about 96.4%. There is no massive adapter content and there are few low-quality sequences. After mapping we will inspect ligation/digestion patterns in more detail.

<hr>

<h2>Mapping to the reference genome</h2>


TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.



Put the following into your mapping script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem" # relative to the course folder (symlink created earlier)
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}

# Two enzymes used in this experiment
enz="MboI HinfI" # Double digestion (relevant for Arima/Phase Genomics)

# Map read 1
rd=1;

tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--tmpdb ${TMPDIR} \
--renz ${enz} \
-C 6

# Map read 2
rd=2
# >>> Just change the script to take that as a parameter.
</pre>


⏰ It should take ~5 min to complete with 6 CPUs.



Note: We are not using iterative mapping. Fragment-based mapping is the default in TADbit.



After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.


Question: How may restriction enzyme choice influence the experiment? ✂️
Answer: Cutting frequency differs between 4‑cutters and 6‑cutters and influences fragment size distribution, ligation probabilities, and contact resolution. Using two enzymes increases the diversity of ligation junctions. Compare with Micro‑C, which uses MNase digestion, so it cuts evenly through the genome.
([https://teaching.healthtech.dtu.dk/material/22126/2026/Fragment_histogram.pdf Fragment size histogram])
([https://teaching.healthtech.dtu.dk/material/22126/2026/ligation_deconvolution.png HiC Sequencing Quality and digestion - ligation deconvolution])

<ul>
<li>Fragment size histogram</li>
<li>HiC sequencing quality and digestion/ligation deconvolution</li>
</ul>

<hr>

<h2>Finding the intersection of mapped reads (parse)</h2>


Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end).
We identify pairs and build fragment associations with <code>tadbit parse</code>.


⚠️ Note: The chromosome prefixes to filter have to be defined in the reference genome FASTA file beforehand.
It will only match chromosomes that start with the string in <code>--filter_chrom</code>.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver" # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input;
</pre>


⏰ It should take ~35 min to complete with 10 CPUs.


Question: Is it possible to retrieve multiple contacting regions?

<hr>

<h2>Filtering interactions</h2>


TADbit allows flexible filtering of unwanted interactions. In many cases, the defaults work well across datasets.


Run filtering:

<pre>
tadbit filter \
--workdir ${wd} \
--apply 1 2 3 4 6 7 8 9 10 \
--cpus 6 \
--tmpdb ${TMPDIR}
</pre>

<hr>

<h2>Check the amount of filtered data and past commands</h2>


<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.


<pre>
# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less
</pre>

Question: How many valid pairs do we keep?
Question: The total number of filtered reads is not equal to the initial number of reads… Why?

<hr>

<h2>To normalize or to not normalize</h2>


The filter step only assigned each read pair to one or more categories; nothing has actually been removed yet.
It is during normalization that we specify which categories to include/exclude, so that the normalization is performed accordingly.



Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.



Important: Normalization is also the step where bad columns (low counts, low mappability, etc.) are removed from the matrix.



Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
A simple and commonly used option is to filter based on a minimum number of counts per bin.



If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# First time we define the resolution
res="100000" # 100 kb

# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100
</pre>

<pre>
tadbit normalize -w ${wd} \
-r ${res} \
--tmpdb ${TMPDIR} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
</pre>


⏰ It should take ~2 min to complete with 6 CPUs.



⚠️ Run another version with <code>norm="raw"</code> to compare later.



Use <code>tadbit describe</code> to check how many bins were removed.
A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.



Each job is assigned a <code><job_id></code>. This helps retrieve results from specific runs (especially when testing parameters).



If you want, you can take a quick look at the different normalization strategies and extract your own conclusions:



https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105


<hr>

<h2>Binning and viewing matrices</h2>


Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.


<pre>
# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
</pre>

<pre>
tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm "norm" \
--format "png" \
--cpus 6;
</pre>

<hr>

Congratulations, you finished the exercise!

Denovo exercise

2026-01-08T09:12:47Z

Mick:

<h2>Overview</h2>

First:
<ol>
<li>Navigate to your home directory.</li>
<li>Create a directory called <code>denovo</code>.</li>
<li>Navigate to the directory you just created.</li>
</ol>

In this exercise we will perform a de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will:

<ol>
<li>Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).</li>
<li>Count k-mers and estimate genome size.</li>
<li>Correct reads using Musket.</li>
<li>Determine insert size of paired-end reads.</li>
<li>Run de novo assembly using MEGAHIT.</li>
<li>Calculate assembly statistics.</li>
<li>Plot coverage and length histograms of the assembly.</li>
<li>Evaluate the assembly quality.</li>
<li>Visualize the assembly using Circoletto.</li>
<li>(Bonus) Try assembling the genome with SPAdes.</li>
<li>Annotation of a prokaryotic genome.</li>
</ol>

<hr>

<h3>FastQC and trimming</h3>

Make sure you are in the <code>denovo</code> directory you created. You can double-check with:

<pre>
pwd
</pre>

Copy the sequencing data:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/* .
</pre>

Run FastQC on the reads:

<pre>
mkdir fastqc
/home/ctools/FastQC/fastqc -o fastqc *.txt.gz
</pre>

Viewing FastQC HTML reports:

If you are using MobaXterm, you can open the FastQC HTML files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/fastqc/Vchol-001_6_1_sequence_fastqc.html .
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/fastqc/Vchol-001_6_2_sequence_fastqc.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:

<pre>
/home/ctools/bin/fastx_detect_fq.sh Vchol-001_6_1_sequence.txt.gz
</pre>

Q1. Which quality encoding format is used?

Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The <code>--basename</code> option defines the output prefix and <code>--gzip</code> compresses the output.

<pre>
/home/ctools/adapterremoval-2.3.4/build/AdapterRemoval \
--file1 Vchol-001_6_1_sequence.txt.gz \
--file2 Vchol-001_6_2_sequence.txt.gz \
--adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \
--adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \
--qualitybase 64 \
--basename Vchol-001_6 \
--gzip \
--trimqualities \
--minquality 20 \
--minlength 40
</pre>
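
The quality base is the offset subtracted from the ASCII code of each quality character to obtain the Phred score. The following sketch decodes a single, arbitrarily chosen character under both common offsets; it shows why using the wrong offset shifts every score by 31.

```shell
# Arbitrary example character taken from a quality string
qual_char='h'

# ASCII code of the character (the leading quote asks printf for the code point)
ascii=$(printf '%d' "'${qual_char}")

phred64=$(( ascii - 64 ))   # interpretation with quality base 64
phred33=$(( ascii - 33 ))   # interpretation with quality base 33

echo "ASCII ${ascii}: Q${phred64} with base 64, Q${phred33} with base 33"
```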

When it finishes, inspect <code>Vchol-001_6.settings</code> for trimming statistics (how many reads were trimmed, discarded, etc.).

Q1A. The output includes <code>discarded.gz</code>, <code>pair1.truncated.gz</code>, <code>pair2.truncated.gz</code>, and <code>singleton.truncated.gz</code>. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)

Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:

<pre>
/home/ctools/bin/fastx_readlength.sh Vchol-001_6.pair1.truncated.gz
/home/ctools/bin/fastx_readlength.sh Vchol-001_6.pair2.truncated.gz
</pre>

<hr>

<h3>Genome size estimation</h3>

We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use jellyfish to count 15-mers, combining the counts from the forward and reverse-complement strands (the <code>-C</code> option), and then create a histogram of the counts. This may take some time to run, so it is a good opportunity to practice using <code>screen</code>.

Manual: [http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.html jellyfish]

<pre>
gzip -dc Vchol-001_6.pair*.truncated.gz \
| /home/ctools/jellyfish-2.3.1/bin/jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0

/home/ctools/jellyfish-2.3.1/bin/jellyfish histo Vchol-001 > Vchol-001.histo
</pre>
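
The counting step can be sketched in a few lines of Python. This is only an illustration of what canonical k-mer counting and the resulting histogram mean, assuming a small in-memory list of reads. Jellyfish itself is far more memory-efficient, and the function names are made up for this example.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers that contain N."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[canonical(kmer)] += 1
    return counts


def histogram(counts):
    """Map each occurrence count to the number of distinct k-mers with that count."""
    return Counter(counts.values())


# Toy reads with k = 3 (the exercise uses k = 15 on the trimmed reads)
counts = count_kmers(["ACGTA", "TACGT"], 3)
print(dict(counts))
print(dict(histogram(counts)))
```

The histogram corresponds to the two columns of <code>Vchol-001.histo</code>: the occurrence count and the number of distinct k-mers with that count.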

Start R:

<pre>
R
</pre>

Then paste:

<pre>
dat <- read.table("Vchol-001.histo")

pdf("Vchol-001.histo.pdf")
barplot(dat[,2],
xlim = c(0,150),
ylim = c(0,5e5),
ylab = "No of kmers",
xlab = "Counts of a k-mer",
names.arg = dat[,1],
cex.names = 0.8)
dev.off()
</pre>

If you are using MobaXterm, you can open the PDF file directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the file to
your local computer and open it with any PDF viewer. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/Vchol-001.histo.pdf .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

The plot shows:
<ul>
<li>x-axis: how many times a k-mer occurs (its count)</li>
<li>y-axis: number of distinct k-mers with that count</li>
</ul>

K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.

Q2. Where is the k-mer coverage peak (approximately)?

We can estimate genome size using:

<pre>
N = (M * L) / (L - K + 1)
Genome_size = T / N
</pre>

<ul>
<li>N = depth (coverage)</li>
<li>M = k-mer peak (from the histogram)</li>
<li>K = k-mer size (here: 15)</li>
<li>L = average read length (from fastx_readlength)</li>
<li>T = total number of bases (from fastx_readlength)</li>
</ul>
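
As a worked example of the two formulas, the following Python sketch plugs in made-up numbers. The values for M, L and T are hypothetical and are not the answer for this dataset; use your own values from the histogram and from <code>fastx_readlength.sh</code>.

```python
def estimate_genome_size(peak, read_length, k, total_bases):
    """Apply N = (M * L) / (L - K + 1) and Genome_size = T / N."""
    depth = (peak * read_length) / (read_length - k + 1)
    genome_size = total_bases / depth
    return genome_size, depth


# Hypothetical inputs: M = 50, L = 100, K = 15, T = 250,000,000
size, depth = estimate_genome_size(peak=50, read_length=100, k=15,
                                   total_bases=250_000_000)
print(f"depth: {depth:.1f}x, genome size: {size / 1e6:.2f} Mb")
```

With these made-up inputs the depth is about 58x and the estimate is 4.3 Mb. Note that the depth N is always somewhat higher than the k-mer peak M, because a read of length L contains only L - K + 1 k-mers.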

Compute the estimated genome size and compare with the known V. cholerae genome (~4 Mb). You should be within roughly ±10%.

Q3. What is your estimated genome size?

<hr>

<h3>Error correction</h3>

We will correct errors in the reads using Musket.

Musket: [http://musket.sourceforge.net/homepage.htm Musket]

First, get the number of distinct k-mers (needed for memory allocation in Musket):

<pre>
/home/ctools/jellyfish-2.3.1/bin/jellyfish stats Vchol-001
</pre>

Use the reported number of distinct k-mers (<code>8423098</code> in this example) in the Musket command:

<pre>
/home/ctools/musket-1.1/musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \
Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
</pre>

The output files are named <code>Vchol-001_6.cor.0</code> and <code>Vchol-001_6.cor.1</code>. Rename them:

<pre>
mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz
</pre>

If this takes too long, you can copy precomputed corrected reads:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .
</pre>

<hr>

<h3>De novo assembly with MEGAHIT</h3>

We will now assemble the corrected reads using MEGAHIT (a de Bruijn graph assembler). The choice of k-mer size is critical. By default MEGAHIT iterates over several k-mer sizes, but here we start with a single fixed k-mer size of 35.

First, set the number of threads:

<pre>
export OMP_NUM_THREADS=4
</pre>

Run MEGAHIT with k=35:

<pre>
/home/ctools/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit \
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \
--k-list 35 \
-t 4 \
-m 2000000000 \
-o 35
</pre>

When finished, you should have <code>35/final.contigs.fa</code>. Compress it:

<pre>
gzip 35/final.contigs.fa
</pre>

To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):

<pre>
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq
</pre>

Index the assembly and map:

<pre>
bwa index 35/final.contigs.fa.gz

bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \
| samtools view -Sb - > Vchol_35bp.bam
</pre>

Extract insert sizes (TLEN field, column 9):

<pre>
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1] > 0, 1]                   # keep positive TLEN values (leftmost read of each pair)
mn = quantile(a.v, seq(0,1,0.05))[4]    # 15th percentile
mx = quantile(a.v, seq(0,1,0.05))[18]   # 85th percentile
mean(a.v[a.v >= mn & a.v <= mx]) # mean insert size
sd(a.v[a.v >= mn & a.v <= mx]) # standard deviation
</pre>
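
The R commands compute a trimmed estimate: only positive TLEN values between the 15th and 85th percentiles are used, so that mis-mapped pairs with extreme values do not distort the mean and standard deviation. A hedged Python sketch of the same idea is shown below; the function name is hypothetical and the input is a plain list of integers rather than the exercise file.

```python
import statistics


def trimmed_insert_stats(tlens):
    """Mean and standard deviation of positive TLEN values between the 15th and 85th percentiles."""
    positive = sorted(t for t in tlens if t > 0)
    # 19 cut points at 5%, 10%, ..., 95%; "inclusive" interpolates like R's default quantile()
    cuts = statistics.quantiles(positive, n=20, method="inclusive")
    low, high = cuts[2], cuts[16]          # 15th and 85th percentiles
    core = [t for t in positive if low <= t <= high]
    return statistics.mean(core), statistics.stdev(core)


# Synthetic TLEN values: 1..101 plus their negative mates and one unpaired 0
tlens = list(range(1, 102)) + [-t for t in range(1, 102)] + [0]
mean, sd = trimmed_insert_stats(tlens)
print(mean, round(sd, 2))
```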

Q4. What are the mean insert size and standard deviation of the library?

Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:

[https://docs.google.com/spreadsheets/d/1trUMlSwNLoNW67D-OkgA93iOQRp2iioyJSBYyW30P4U/edit?usp=sharing Google sheet for k-mer assignment]

Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing <code>[KMER]</code> below:

<pre>
export OMP_NUM_THREADS=4
/home/ctools/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit \
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \
--k-list [KMER] \
-t 4 \
-m 2000000000 \
-o [KMER]

gzip [KMER]/final.contigs.fa
</pre>

Compute assembly statistics using <code>QUAST</code>. Note that by default QUAST ignores contigs shorter than 500 bp:

<pre>
python3 /home/ctools/quast/quast.py \
[KMER]/final.contigs.fa.gz \
--threads 1 \
-o [KMER]/quast
</pre>

Open the file <code>[KMER]/quast/report.txt</code> (or <code>report.tsv</code>) and
record the following values in the Google sheet for your k-mer:

<ul>
<li>Number of contigs (≥ 500 bp)</li>
<li>Total assembly length</li>
<li>Largest contig</li>
<li>N50</li>
</ul>
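
N50 is the contig length such that contigs of that length or longer contain at least half of the assembled bases. A minimal Python sketch, assuming a plain list of contig lengths and the same 500 bp cutoff (the function name is hypothetical, and QUAST's own implementation may differ in details):

```python
def n50(lengths, min_length=500):
    """Return the N50 of the contigs that are at least min_length long."""
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    half = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half:
            return length
    return 0  # no contig passed the length cutoff


# Toy contig lengths, not from the exercise
print(n50([100, 500, 1000, 2000, 4000]))
print(n50([800, 600, 600]))
```

In the first toy example the 100 bp contig is ignored, and the single 4000 bp contig already covers more than half of the remaining 7500 bp.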

As a class, compare results across k-mer sizes and discuss which k-mer produces
the most reasonable assembly and why.

Copy the best assembly to your folder, or use a precomputed multi-k assembly:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .
</pre>

Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?

Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?

<hr>

<h3>Coverage of the assembly</h3>

We will now calculate per-contig coverage and lengths, and visualize them in R.

<pre>
zcat default_final.contigs.fa.gz | /home/ctools/bin/fastx_megahit.sh --i /dev/stdin > default_finalt.cov
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
library(plotrix)
dat <- read.table("default_finalt.cov", sep = "\t")

## ---- Coverage plots (2 panels) ----
pdf("best.coverage.pdf", width = 10, height = 5)
par(mfrow = c(1, 2))

weighted.hist(w = dat[,2],
x = dat[,1],
breaks = seq(0, 100, 1),
main = "Weighted coverage",
xlab = "Contig coverage")

hist(dat[,1],
xlim = c(0, 100),
breaks = seq(0, 1000, 1),
main = "Raw coverage",
xlab = "Contig coverage")

dev.off()

## ---- Scaffold lengths (1 panel) ----
pdf("scaffold.lengths.pdf", width = 7, height = 5)
par(mfrow = c(1, 1))

barplot(rev(sort(dat[,2])),
xlab = "# Scaffold",
ylab = "Length",
main = "Scaffold Lengths")

dev.off()
</pre>

Viewing the PDF files:

If you are using MobaXterm, you can open the PDF files directly from the
left-hand file panel.

If you are using macOS (or a standard terminal), copy the PDF files to your
local computer and open them with any PDF viewer. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/best.coverage.pdf .
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/scaffold.lengths.pdf .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on.

The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.
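
The difference between the two histograms can be sketched in Python as follows. This assumes, as the R code suggests, that the first column is contig coverage and the second is contig length; the function name and the toy values are hypothetical.

```python
from collections import defaultdict


def coverage_histograms(contigs, bin_width=1.0):
    """Bin contigs by coverage; return (raw counts, length-weighted sums) per bin."""
    raw = defaultdict(int)
    weighted = defaultdict(int)
    for coverage, length in contigs:
        bin_index = int(coverage // bin_width)
        raw[bin_index] += 1            # every contig counts once
        weighted[bin_index] += length  # every contig counts by its length
    return dict(raw), dict(weighted)


# Two long contigs near 70x and three short contigs near 5x
contigs = [(70.2, 100000), (70.9, 50000), (5.5, 600), (5.1, 700), (5.9, 500)]
raw, weighted = coverage_histograms(contigs)
print(raw)
print(weighted)
```

In the raw histogram the short low-coverage contigs dominate by number, whereas in the weighted histogram almost all bases fall in the bin of the long contigs.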

Q7. Why might some short contigs have much higher coverage than the main assembly?

Q8. Why might some short contigs have much lower coverage than the main assembly?

<hr>

<h3>Assembly evaluation</h3>

We will use QUAST to evaluate the assembly using various reference-based metrics.

QUAST: [https://quast.sourceforge.net/quast quast]

Run QUAST against the V. cholerae reference genome:

<pre>
python3 /home/ctools/quast/quast.py \
default_final.contigs.fa.gz \
--threads 1 \
-R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa
</pre>

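When no output directory is given with <code>-o</code>, QUAST writes its results to <code>quast_results/</code> and keeps a <code>latest</code> link to the most recent run. Open <code>quast_results/latest/report.html</code> to view the report.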

If you are using MobaXterm, you can open the HTML files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/quast_results/latest/report.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?

<hr>

<h3>Visualization using Circoletto</h3>

We will visualize the assembly against the V. cholerae reference using Circoletto.

First, filter out contigs shorter than 500 bp:

<pre>
/home/ctools/bin/fastx_filterfasta.sh default_final.contigs.fa.gz 500 > default_final.contigs_filtered_500.fa
</pre>

On your local machine, open a browser and go to:

[https://bat.infspire.org/circoletto/ Circoletto]

Open the filtered assembly in a text editor on the server, for example:

<pre>
gedit default_final.contigs_filtered_500.fa &
</pre>

Copy–paste the FASTA content into the “Query fasta” box on the Circoletto page.

Then open the reference genome:

<pre>
gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &
</pre>

Copy–paste this into the “Subject fasta” box.

In the “Output” section, select “ONLY show the best hit per query”, then click Submit to Circoletto.

If Circoletto does not work, you can use this precomputed image:

<pre>
/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png
</pre>

You should see the two V. cholerae chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).

Q10. Does your assembled genome appear broadly similar to the reference genome?

Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?

Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the V. cholerae genome paper and search for “V. cholerae integron island”: [https://www.nature.com/articles/35020000 V. cholerae genome paper]

<hr>

<h3>Try to assemble the genome using SPAdes (bonus)</h3>

Different assemblers can perform very differently. SPAdes is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.

SPAdes: [https://ablab.github.io/spades/ SPAdes]

Check the help output:

<pre>
python3 /home/ctools/SPAdes-4.2.0-Linux/bin/spades.py -h
</pre>

Note: A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.

Link to the SPAdes assembly:

<pre>
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
# from here you can compute stats and run QUAST
</pre>

<h3>Annotation of a prokaryotic genome</h3>

We will annotate genes in <code>ecoli_pacbio.contigs.fasta</code> using Prodigal.

Prodigal: [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119 prodigal]

The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:

<pre>
prodigal \
-f gff \
-i [input genome in fasta] \
-a [output proteins in fasta] \
-o [output annotations in gff]
</pre>

GFF format: [https://www.ensembl.org/info/website/upload/gff.html GFF format description]
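
A GFF line has nine tab-separated columns (sequence, source, feature type, start, end, score, strand, phase, attributes), with 1-based inclusive coordinates. The Python sketch below shows one way to read such a line. The example line is made up for illustration and is not taken from the exercise output, and the function name is hypothetical.

```python
def parse_gff_line(line):
    """Parse one GFF feature line into a dict; return None for comments or malformed lines."""
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = dict(item.split("=", 1)
                      for item in attrs.strip(";").split(";") if "=" in item)
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "length": int(end) - int(start) + 1,  # both coordinates are inclusive
        "attributes": attributes,
    }


# Hypothetical example line
example = "tig00000001\tProdigal\tCDS\t337\t2799\t338.7\t+\t0\tID=1_1;partial=00;start_type=ATG;"
gene = parse_gff_line(example)
print(gene["length"], gene["strand"], gene["attributes"]["ID"])
```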

Next, index the protein FASTA file:

<pre>
samtools faidx ecoli_pacbio.contigs.aa
</pre>

Extract the protein sequence for gene ID <code>tig00000001_4582</code>:

<pre>
samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582
</pre>

Use BLASTP against the nr database:

[https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLAST for proteins]

Paste the sequence and run BLASTP.

Q14. Which protein (function) does <code>tig00000001_4582</code> correspond to?

<hr>

Please find answers here: [[Denovo_solution|Denovo_solution]]

<hr>

Congratulations, you finished the exercise!

Program 2026

2026-01-08T08:53:52Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and for collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Discord.

The course has two main parts: the first half consists of lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secure laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''
<DL>
<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: TBA ([[ Microbial_genomics_exercise ]]) ([[ solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Questions]]) ([[Solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([])</DD>
<DD>David Duchene</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Exercise]]) ([[Solution]])</DD>
<DD> David Duchene</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([Lecture slides])</DD>
<DD>David Duchene</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Exercises]]) ([[Solution]]) </DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Metabarcoding Exercise]]) ([[Metabarcoding Solution]])</DD>
<DD> Luke Holman</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2024/test_2024.pdf Test 2025])([https://teaching.healthtech.dtu.dk/material/22126/2024/test_2024_withA.pdf answers])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation: prepare an outline for tomorrow. Please write your group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026].</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation; check when your 3 minutes are scheduled in the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Program 2026

2026-01-08T08:41:10Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and for collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Discord.

The course has two main parts: the first half consists of lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secure laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00pm</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2025/dtu_adna_2025_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: TBA ([[ Microbial_genomics_exercise ]]) ([[ solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Questions]]) ([[Solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([])</DD>
<DD>David Duchene</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Exercise]]) ([[Solution]])</DD>
<DD> David Duchene</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([Lecture slides])</DD>
<DD>David Duchene</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Exercises]]) ([[Solution]]) </DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Metabarcoding Exercise]]) ([[Metabarcoding Solution]])</DD>
<DD> Luke Holman</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2024/test_2024.pdf Test 2025])([https://teaching.healthtech.dtu.dk/material/22126/2024/test_2024_withA.pdf answers])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation: prepare an outline for tomorrow. Please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation; check the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet] to see when your 3 minutes are</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Program 2026

2026-01-08T08:34:25Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and for collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on Discord.

The course has two main parts: the first half consists of lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secured laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:30am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:30am-9:45am</DT>
<DD>''Break''</DD>

<DT>9:45am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2025/dtu_adna_2025_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: TBA ([[ Microbial_genomics_exercise ]]) ([[ solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Questions]]) ([[Solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([])</DD>
<DD>David Duchene</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Exercise]]) ([[Solution]])</DD>
<DD> David Duchene</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([Lecture slides])</DD>
<DD>David Duchene</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Exercises]]) ([[Solution]]) </DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Metabarcoding Exercise]]) ([[Metabarcoding Solution]])</DD>
<DD> Luke Holman</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2024/test_2024.pdf Test 2025])([https://teaching.healthtech.dtu.dk/material/22126/2024/test_2024_withA.pdf answers])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation: prepare an outline for tomorrow. Please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation; check the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet] to see when your 3 minutes are</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

SNP calling exercise answers part 1

2026-01-07T14:55:07Z

Mick:

<h2>Answers</h2>

<h3>Q1</h3>

First run HaplotypeCaller to produce the gVCF:

<pre>
/home/ctools/gatk-4.6.2.0/gatk --java-options "-Xmx10g" HaplotypeCaller \
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \
-L chr20 \
-O NA24694.gvcf.gz \
--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
-ERC GVCF
</pre>

Then index the gVCF (<code>-f</code> overwrites an index that already exists):

<pre>
/home/ctools/htslib-1.20/tabix -f -p vcf NA24694.gvcf.gz
</pre>

Then genotype the gVCF:

<pre>
/home/ctools/gatk-4.6.2.0/gatk GenotypeGVCFs \
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-V NA24694.gvcf.gz \
-L chr20 \
--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
-O NA24694.vcf.gz
</pre>

Finally, count SNPs using <code>bcftools stats</code>:

<pre>
/home/ctools/bcftools-1.23/bcftools stats NA24694.vcf.gz
</pre>

You should see something like:

<pre>
SN 0 number of SNPs: 75684
</pre>

Answer: There are 75,684 SNPs on chromosome 20 in this sample.

<hr>

<h3>Q2</h3>

Index the VCF if not already indexed:

<pre>
/home/ctools/htslib-1.20/tabix -p vcf NA24694.vcf.gz
</pre>

Then query the 1 Mb region:

<pre>
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l
</pre>

Answer: 1290 variant sites in this region (all variant types).

<hr>

<h3>Q3</h3>

Count only SNPs (exclude indels and multi-allelic sites):

<pre>
/home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l
</pre>

or equivalently:

<pre>
/home/ctools/htslib-1.20/tabix -h NA24694.vcf.gz chr20:32000000-33000000 \
| /home/ctools/bcftools-1.23/bcftools view -H --types snps - \
| wc -l
</pre>

Answer: 956 SNPs in the region.

<hr>

<h3>Q4</h3>

Variant at chr20:32011209

<pre>
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32011209-32011209
</pre>

<pre>
chr20 32011209 rs147652161 G A 264.64 .
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552
GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533
</pre>

Interpretation:
<ul>
<li>Genotype: 0/1 (heterozygous G/A)</li>
<li>Allele depth (AD): 15 G and 9 A</li>
<li>Depth (DP): 24</li>
<li>Genotype quality (GQ): 99</li>
<li>Genotype likelihoods (PL): 272,0,533 (het most likely)</li>
</ul>

Variant at chr20:32044279

<pre>
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32044279-32044279
</pre>

<pre>
chr20 32044279 rs4525768 C T 799.06 .
AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99
GT:AD:DP:GQ:PL 1/1:0,21:21:63:813,63,0
</pre>

Interpretation:
<ul>
<li>Genotype: 1/1 (homozygous T/T)</li>
<li>Allele depth (AD): 0 C, 21 T</li>
<li>Depth (DP): 21</li>
<li>Genotype quality (GQ): 63</li>
<li>Genotype likelihoods (PL): 813,63,0 (homozygous alt most likely)</li>
</ul>
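
The FORMAT column and the sample column pair up key by key, which is how the interpretations above were read off. A minimal sketch, using only <code>awk</code> and the chr20:32011209 record written as one tab-separated line (the INFO field is abbreviated here for brevity), prints each key next to its value:

```shell
# Toy input: the chr20:32011209 record as a single tab-separated line.
# The INFO column is shortened; only columns 9 and 10 matter for this sketch.
record=$(printf 'chr20\t32011209\trs147652161\tG\tA\t264.64\t.\tAC=1;AF=0.500;AN=2;DP=24\tGT:AD:DP:GQ:PL\t0/1:15,9:24:99:272,0,533')

# Split FORMAT (column 9) and the sample column (10) on ":" and pair them up.
pairs=$(printf '%s\n' "$record" | awk -F'\t' '{
    n = split($9, keys, ":")
    split($10, vals, ":")
    for (i = 1; i <= n; i++) print keys[i] "=" vals[i]
}')
printf '%s\n' "$pairs"
```

The same pairing applies to any single-sample record, since the order of keys in FORMAT defines the order of values in the sample column.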

<hr>

<h3>Q5</h3>

Higher-quality heterozygous SNP:

<pre>
chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL 0/1:8,13:21:99:411,0,247
</pre>

Lower-quality heterozygous SNP:

<pre>
chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL 0/1:4,2:6:38:38,0,114
</pre>

Why?
<ul>
<li>The first site has much higher depth (21× vs 6×)</li>
<li>Allele balance is more reasonable: 8/13 vs 4/2</li>
<li>The genotype quality (GQ) is much higher: 99 vs 38</li>
<li>Overall likelihoods strongly support the correct genotype</li>
</ul>
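
To put a number on allele balance, the fraction of reads supporting the ALT allele can be computed from the two AD values. The helper below is a hypothetical illustration (<code>alt_fraction</code> is not part of any tool used in this exercise); for a true heterozygote the value is expected to be near 0.5, and at low depth a single read shifts it considerably:

```shell
# alt_fraction REF_DEPTH ALT_DEPTH
# Prints the fraction of reads that support the ALT allele (hypothetical helper).
alt_fraction() {
    awk -v ref="$1" -v alt="$2" 'BEGIN { printf "%.2f\n", alt / (ref + alt) }'
}

alt_fraction 8 13   # AD=8,13 at chr20:32974911
alt_fraction 4 2    # AD=4,2  at chr20:64291638
```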

Conclusion: More data ⇒ higher confidence.

<hr>

<h3>Q6</h3>

Count SNPs with no dbSNP ID (column 3 = <code>.</code>):

<pre>
/home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
| cut -f 3 \
| grep -v rs \
| wc -l
</pre>

or equivalently:

<pre>
/home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
| cut -f 3 \
| grep "\." \
| wc -l
</pre>

Answer: 17 novel SNPs

<hr>

<h3>Q7</h3>

You found:
<ul>
<li>956 total SNPs in the region (Q3)</li>
<li>17 novel SNPs (Q6)</li>
</ul>

Only ~1.8% of SNPs are novel.

This is expected because:
<ul>
<li>Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.</li>
<li>dbSNP contains >100 million known variants, so most common variation is already catalogued.</li>
<li>Novel variants tend to be rare or extremely rare.</li>
</ul>

Conclusion: The number of novel variants is small and biologically reasonable.
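
The percentage quoted above follows directly from the two counts. A small sketch of the arithmetic, with the counts from Q3 and Q6 typed in by hand (the variable names are only for illustration):

```shell
total_snps=956   # SNPs in the region (Q3)
novel_snps=17    # SNPs without a dbSNP ID (Q6)

# Percentage of SNPs that are novel, rounded to one decimal place.
novel_pct=$(awk -v n="$novel_snps" -v t="$total_snps" 'BEGIN { printf "%.1f", 100 * n / t }')
echo "novel SNPs: ${novel_pct}%"
```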

SNP calling exercise answers part 2

2026-01-07T13:24:01Z

Mick:

<h2>Answers</h2>

<h3>Q1 — How many sites were filtered out?</h3>

After running:

<pre>
/home/ctools/gatk-4.6.2.0/gatk VariantFiltration \
-V NA24694.vcf.gz \
-O NA24694_hf.vcf.gz \
-filter "DP < 10.0" --filter-name "DP" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40"
</pre>

Count all filtered variants (non-PASS):

<pre>
/home/ctools/bcftools-1.23/bcftools view -H NA24694_hf.vcf.gz | grep -v PASS | wc -l
</pre>

4005 sites were filtered out (all variant types).

Count only SNPs filtered out:

<pre>
/home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694_hf.vcf.gz | grep -v PASS | wc -l
</pre>

2630 SNPs were filtered out.
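
Note that <code>grep -v PASS</code> tests the whole line, so a record with the string PASS anywhere else (for example in its ID) would be dropped from the count. A more defensive variant tests only the FILTER column (column 7). The sketch below uses three made-up records, not data from the exercise, to show where the two approaches can differ:

```shell
# Three hypothetical VCF body lines (tab-separated, truncated after INFO).
# The third has "PASS" inside its ID but a non-PASS FILTER value.
toy=$(printf 'chr20\t100\t.\tA\tG\t50\tPASS\tDP=20\nchr20\t200\t.\tC\tT\t12\tDP;QUAL30\tDP=4\nchr20\t300\tBYPASS1\tG\tA\t40\tSOR3\tDP=15\n')

# Count records whose FILTER column (7) is anything other than PASS.
by_column=$(printf '%s\n' "$toy" | awk -F'\t' '$7 != "PASS" { n++ } END { print n + 0 }')

# Same count with a whole-line grep, for comparison.
by_grep=$(printf '%s\n' "$toy" | grep -vc PASS)

echo "column test: $by_column, whole-line grep: $by_grep"
```

On the exercise VCF the two approaches are expected to agree, since the counts above add up; the column test is simply safer on arbitrary input.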

<hr>

<h3>Q2 — Which filter removed the most sites?</h3>

One approach:

<pre>
/home/ctools/bcftools-1.23/bcftools view -H NA24694_hf.vcf.gz \
| grep -v PASS \
| cut -f7 \
| sort \
| uniq -c \
| sort -n
</pre>

This produces:

<pre>
5 FS60;SOR3
30 DP;MQ40;SOR3
74 MQ40;SOR3
158 DP;SOR3
197 DP;MQ40
390 MQ40
1340 SOR3
1811 DP
</pre>

Depth filter (DP) removed the most sites, followed by SOR (strand bias) and MQ (mapping quality).
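
Because one site can fail several filters, the table above lists combinations rather than per-filter totals. A short sketch that splits each combination on <code>;</code> and sums the counts per filter name, with the table pasted in by hand:

```shell
# Combination counts copied from the output above: "<count> <FILTER string>".
combos='5 FS60;SOR3
30 DP;MQ40;SOR3
74 MQ40;SOR3
158 DP;SOR3
197 DP;MQ40
390 MQ40
1340 SOR3
1811 DP'

# Credit each combination's count to every filter name it contains,
# then list the filters from most to least frequent.
per_filter=$(printf '%s\n' "$combos" | awk '{
    n = split($2, names, ";")
    for (i = 1; i <= n; i++) total[names[i]] += $1
} END {
    for (f in total) print total[f], f
}' | sort -k1,1nr)
printf '%s\n' "$per_filter"
```

The per-filter totals sum to more than 4005 because sites failing several filters are counted once per filter.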

<hr>

<h3>Q3 — How many sites remain after applying the mappability filter?</h3>

First extract sites that passed all hard filters:

<pre>
/home/ctools/bcftools-1.23/bcftools view -f PASS NA24694_hf.vcf.gz \
| /home/ctools/htslib-1.20/bgzip -c > NA24694_hf_pass.vcf.gz
</pre>

Count variants that passed all filters:

<pre>
/home/ctools/bcftools-1.23/bcftools view -H NA24694_hf_pass.vcf.gz | wc -l
</pre>

88,594 total variants (SNPs + indels + multi-allelic sites).

Now retain only variants in high-mappability regions:

<pre>
bedtools intersect -header \
-a NA24694_hf_pass.vcf.gz \
-b /home/databases/databases/GRCh38/filter99.bed.gz \
| /home/ctools/htslib-1.20/bgzip -c > NA24694_hf_map99.vcf.gz
</pre>

Count remaining sites:

<pre>
/home/ctools/bcftools-1.23/bcftools view -H NA24694_hf_map99.vcf.gz | wc -l
</pre>

51,624 variants remain after mappability filtering.
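
<code>bedtools intersect</code> takes care of the coordinate conventions, but the rule it applies is worth seeing once: a 1-based VCF position lies inside a BED interval (0-based start, end not included) when start &lt; POS &lt;= end. A sketch with a hypothetical helper and made-up coordinates:

```shell
# in_interval POS START END
# Prints "yes" if the 1-based position POS lies in the BED interval
# with 0-based START and exclusive END, i.e. START < POS <= END.
in_interval() {
    awk -v pos="$1" -v s="$2" -v e="$3" \
        'BEGIN { print ((pos > s && pos <= e) ? "yes" : "no") }'
}

in_interval 101 100 200   # first base covered by the interval
in_interval 100 100 200   # one base before the interval
in_interval 200 100 200   # last base covered by the interval
```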

<hr>

<h3>Q4 — Most common genomic category (snpEff)</h3>

Run snpEff:

<pre>
java -jar /home/ctools/snpEff/snpEff.jar eff \
-dataDir /home/databases/databases/snpEff/ \
-htmlStats NA24694_hf.html \
GRCh38.99 \
NA24694_hf.vcf.gz \
| /home/ctools/htslib-1.20/bgzip -c > NA24694_hf_ann.vcf.gz
</pre>

From the snpEff HTML output:

Intron variants = 64.368%

Answer: Most variants fall in intronic regions.

<hr>

<h3>Q5 — Variants that cause a codon change</h3>

From the snpEff HTML report:

MISSENSE = 584 variants (44.242%)

Answer: 584 missense mutations are predicted to change a codon and alter the protein sequence.

<hr>

End of Answers

Rnaseq exercise

2026-01-07T11:48:34Z

Mick:

<div class="page-content has-page-title">
<div id="overview-and-background" class="section level1">
<h1>Overview and background</h1>
<div id="groups" class="section level2">
<h2>Groups</h2>
Please get into groups of 2-3. We don’t have enough computational power for all of you working alone. Please let the instructors know if you need help finding a group.
</div>

<div id="assignment-notes" class="section level2">
<h2>Assignment notes</h2>
While some questions might seem hard we naturally don’t ask questions/tasks which you have not been given the tools to solve in this assignment - so if you are stuck try thinking about what you have already learned before asking an instructor.
</div>

<div id="assignment-overview" class="section level2">
<h2>Assignment overview</h2>
In this assignment you will analyze RNA-sequencing data from real cancer patients to investigate the importance of alternative splicing in a clinical context.
</div>

<div id="biological-background" class="section level2">
<h2>Biological background</h2>
Today you will be working with colorectal cancers - specifically Colon Adenocarcinoma (often abbreviated COAD). It is a very frequent cancer of the colon. The lifetime risk of developing colorectal cancer is ~4% for both males and females. That means COAD represents ~10% of all cancers and results in the death of hundreds of thousands of people each year! (More info on COAD can be found on [https://en.wikipedia.org/wiki/Colorectal_cancer Wikipedia].)

One important aspect of cancer is that tumors from different patients are extremely different even when they originate from the same tissue (more info on tumor heterogeneity [https://en.wikipedia.org/wiki/Tumour_heterogeneity here]). To improve treatment and prognosis we therefore try to classify COAD into cancer subtypes (a simple form of precision medicine). We currently think there are 5 subtypes (see [https://www.cell.com/cancer-cell/pdf/S1535-6108(18)30114-4.pdf Liu ''et al.'']) and today you will be working with CIN and GS. CIN is an abbreviation for Chromosomal INstable and GS means genome stable. More on that later.

To help us understand COAD subtypes you will today compare these to healthy adjacent tissue. For all samples a biopsy was taken and bulk RNA-seq performed. Low-quality samples have been removed.

</div>
<div id="bioinformatic-background" class="section level2">
<h2>Bioinformatic background</h2>
For background on transcriptomics and splicing please refer to today’s slides. The data you are working with is a randomly selected subset of the TCGA COAD data (google TCGA if you want to know more). The data was quantified with Kallisto against the human transcriptome.

Today you will be using the 'pairedGSEA' R package we developed. This package is specifically designed to make it easy to do the following analysis:

<ol style="list-style-type: decimal">
<li>Differential gene expression (aka DGE) via DESeq(2)</li>
<li>Differential gene usage (differential splicing) (aka DGU)</li>
<li>gene-set over-representation analysis (ORA) on DGU and DGE
results</li>
</ol>
While at each step facilitating easy comparison of DGE and DGU.
<hr />
</div>
</div>
<div id="assignment" class="section level1">
<h1>Assignment</h1>
<div id="step-1-determine-which-cancer-to-work-with" class="section level2">
<h2>Step 1: Determine which cancer to work with</h2>
Determine which cancer type you will work with:
<ul>
<li>If your birthday is within the first 6 months of the year (January-June) you will work with CIN.</li>
<li>If your birthday is within the last 6 months of the year (July-December) you will work with GS.</li>
</ul>
</div>
<div id="step-2-set-up-enviroment" class="section level2">
<h2>Step 2: Set up environment</h2>
Log into the server as you usually do except this time you have to use the '-X' option. That means using:

<pre>
ssh -X username@pupil1.healthtech.dtu.dk
</pre>


Make a directory for this exercise and move into it
<pre>
mkdir transcriptomics_exercise
cd transcriptomics_exercise
</pre>

Copy the exercise data of your cancer subtype to your folder
<pre>
### for CIN subtype:
cp /home/projects/22126_NGS/exercises/transcriptomics/coad_iso_subset_cin.Rdata .

### For GS subtype:
cp /home/projects/22126_NGS/exercises/transcriptomics/coad_iso_subset_gs.Rdata .
</pre>

</div>
<div id="step-3-start-r-session-and-enviroment" class="section level2">
<h2>Step 3: Start R session and environment</h2>
Start an R session in your terminal by typing (or copy/pasting):
<pre>
/home/ctools/opt/R-4.4.2_22140/bin/R
</pre>
And load the library we need by typing
<pre>
library(pairedGSEA)
</pre>

This loads the functionality of the “pairedGSEA” R package.
</div>
<div id="step-4-load-and-inspect-data" class="section level2">
<h2>Step 4: Load and inspect data</h2>
Load the assignment data into your R session:
<pre>
### for CIN subtype:
load('coad_iso_subset_cin.Rdata')

### For GS subtype:
load('coad_iso_subset_gs.Rdata')
</pre>
This will give you three data objects in your R session:
<ol style="list-style-type: decimal">
<li>A count matrix</li>
<li>A matrix with meta information about each sample in the count matrix.</li>
<li>A list of gene_sets that you should use for your ORA analysis (step 7).</li>
</ol>

All objects can be directly used by the 'pairedGSEA'
package - no need to do any data modifications.
 
Use the following functions to take a look at the data:
<pre>
### List objects in an R session
ls()

### Inspect the first lines of the object
head( <object_name> )
</pre>

Question: Which object contains what data?

</div>
<div id="step-5-run-differential-analysis" class="section level2">
<h2>Step 5: Run differential analysis</h2>
Next you will need to use the 'pairedGSEA' package, and
here a bit of self-study is needed. Importantly, you
should only run this analysis once per group; otherwise we don’t have
enough computational power. You can download the
'pairedGSEA' vignette (a short document showing how to use it)
<a href="https://www.dropbox.com/s/oalth29pxulffec/pairedGSEA.html?dl=1">here</a>.
Hints:
<ol style="list-style-type: decimal">
<li>After reading the introduction you can skip to the
'3.3 Running the analysis' section.</li>
<li>For now you only need to use 'paired_diff()' as that
makes both differential analyses (both DGE and DGU).</li>
<li>There is no need to use the “store_results” option</li>
</ol>
Question: This will take a while to run (~10 min).
In the meantime, take a closer look at the Liu et al. paper
(see above) and summarise what the differences between the CIN and GS
COAD subtypes are.

</div>
<div id="step-6-inspect-diffrential-result" class="section level2">
<h2>Step 6: Inspect differential result</h2>
Question: Look at the first 10 lines of the result.
Which gene is most significant (smallest p-value) in the DGE and
DGU analyses (DESeq2 and DEXSeq, respectively)?

The following code example counts how many significantly
differentially expressed genes are found:
<pre>
sum( gi_diff_results$padj_deseq < 0.05, na.rm = T )
</pre>
Question: Modify the R code above to count how many
genes are DGE and DGU.

Question: Use the 'nrow()' function to
calculate the fraction of genes that are DGE and DGU.

Now we are ready to do the gene-set enrichment analysis.
</div>
<div id="step-7-run-gene-set-enrichment-analysis" class="section level2">
<h2>Step 7: Run Gene-Set Enrichment Analysis</h2>
Use the vignette to help you use 'pairedGSEA' to run GSEA on both DGE and DGU results (see the vignette section 4: “Over-Representation Analysis”). You should use the 'gene_set_list' object you have already loaded into R instead of using the 'prepare_msigdb()' function.

Note: There is (again) no need to store the intermediary results.

</div>
<div id="step-8-inspect-ora-result" class="section level2">
<h2>Step 8: Inspect ORA result</h2>
What you have been analyzing so far is a subset of the entire dataset
(since the runtime would otherwise have been 3-4x longer). To enable a more
realistic last step, use one of these commands to load
the full results corresponding to what you have been working with.
<pre>
### for CIN subtype:
load('/home/projects/22126_NGS/exercises/transcriptomics/03_coad_cin_ora.Rdata')
# loads the "cin_ora" object

### For GS subtype:
load('/home/projects/22126_NGS/exercises/transcriptomics/03_coad_gs_ora.Rdata')
# loads the gs_ora object
</pre>
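The following code example extracts the ORA results for
either DGE or DGU and sorts them so the most significant gene-sets are at
the top. The example uses an object called 'gi_paired_ora'; substitute the
object you loaded above (e.g. 'cin_ora' or 'gs_ora').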

<pre>
### DGE:
dge_ora_sorted <- gi_paired_ora[
sort.list(gi_paired_ora$pval_deseq), # sort part
c('pathway','pval_deseq','enrichment_score_deseq') # select part
]

### DGU ORA:
dgu_ora_sorted <- gi_paired_ora[
sort.list(gi_paired_ora$pval_dexseq), # sort part
c('pathway','pval_dexseq','enrichment_score_dexseq') # select part
]
</pre>

Question: Look at the 10-15 most significant gene
sets from both analyses. What are the similarities and differences?
</div>
<div id="step-9-visual-inspection-of-ora-result" class="section level2">
<h2>Step 9: Visual inspection of ORA result</h2>
Question: Based on your insights from step 8, use the 'plot_ora()' functionality to test whether these are just examples or generalize to all the significant results. For example, if among the top 10-15 gene-sets only DGU had gene-sets covering “telomer” function, I would use the 'plot_ora()' function to test this.
Question: Try to form a hypothesis as to why this/these molecular functions might be important for cancer.

</div>
<div id="step-10-critical-self-evaluation" class="section level2">

<h2>Step 10: Critical self evaluation</h2>
Question: Take a moment to think about what potential problems there could be with this assignment. Are there any obvious things we have not taken into consideration?

</div>
<div id="step-11-repport-result" class="section level2">

<h2>Step 11: Report result</h2>
Go to the blackboard and report one or more of the following:
<ul>
<li>A keyword that showed a similar enrichment pattern in DGU and DGE</li>
<li>A keyword that showed preferential regulation through DGU or DGE</li>
</ul>
<hr/>
</div>
</div>
<div id="bonus-assignment" class="section level1">
<h1>Bonus Assignment</h1>
Use 'pairedGSEA' to analyze the other COAD cancer subtype (the one you did not analyze). Are the gene-sets similar or different between the subtypes and analysis types?
</div>
</div>

SNP calling exercise part 2

2026-01-07T10:59:25Z

Mick:

<h2>Filtering</h2>

We have seen that the VCF contains some low-quality or unreliable variant calls. Before downstream analyses, we generally want to remove poor-quality sites or annotate them so they can be excluded later. In this exercise we explore how to apply hard filters and how to remove variants in regions of poor mappability.

Please use the VCF file generated in Part 1.

<h3>Hard Filtering</h3>

Soft filtering approaches (e.g. VQSR) attempt to statistically learn which variants are “true.” However, these approaches require large cohorts or population-level resources, which may not exist for many organisms or under-sampled populations. For this reason, we often fall back on hard filtering, i.e. applying fixed cutoffs.

Hard filtering is simple but may introduce bias if the filter correlates with variant type (e.g. heterozygous sites often have lower depth). Filters should be chosen thoughtfully.

We will use the following genomic mask file:

<pre>
/home/databases/databases/GRCh38/mask99.bed.gz
</pre>

This file is in the BED interval format, which stores genomic regions as:

<pre>
chromosome start(0-based) end(1-based)
</pre>

<ul>
<li>0-based: first base has coordinate 0</li>
<li>1-based: first base has coordinate 1</li>
</ul>
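The coordinate convention above can be sketched with a small conversion. This is a minimal illustration, assuming a hypothetical SNP at 1-based VCF position 100 on chr1; the variable names are made up for the example.

```shell
# Hypothetical SNP at 1-based VCF position 100 on chr1 (illustration only)
vcf_pos=100

# BED start is 0-based, so subtract 1; BED end equals the 1-based position
bed_start=$((vcf_pos - 1))
bed_end=$vcf_pos

# A single base therefore spans an interval of length 1 in BED
bed_line=$(printf 'chr1\t%d\t%d' "$bed_start" "$bed_end")
echo "$bed_line"
```

The interval length is always end minus start, which is why a single base is written as 99 to 100 rather than 100 to 100.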

This mask contains genomic regions to exclude (often low-quality or repetitive regions). Because most genotypers do not recognize duplicated regions, combining hard filtering with mappability filters is best practice.

A typical hard-filtering command using GATK is:

<pre>
/home/ctools/gatk-4.6.2.0/gatk VariantFiltration \
-V [INPUT VCF] \
-O [OUTPUT VCF] \
-filter "DP < 10.0" --filter-name "DP" \
-filter "QUAL < 30.0" --filter-name "QUAL30" \
-filter "SOR > 3.0" --filter-name "SOR3" \
-filter "FS > 60.0" --filter-name "FS60" \
-filter "MQ < 40.0" --filter-name "MQ40"
</pre>

Explanation of filters:

<table class="wikitable">
<tr><th>Filter</th><th>Meaning</th></tr>

<tr>
<td><code>DP < 10</code></td>
<td>Remove sites with <10× coverage</td>
</tr>

<tr>
<td><code>QUAL < 30</code></td>
<td>Remove sites where variant quality <30
(variant QUAL ≠ genotype quality GQ — explanation:
[https://gatk.broadinstitute.org/hc/en-us/articles/360035531392 Variant QUAL vs GQ])</td>
</tr>

<tr>
<td><code>SOR > 3.0</code></td>
<td>Remove sites with strong strand bias
([https://gatk.broadinstitute.org/hc/en-us/articles/360036361772 StrandOddsRatio])</td>
</tr>

<tr>
<td><code>FS > 60</code></td>
<td>Remove variants failing Fisher Strand bias test
([https://gatk.broadinstitute.org/hc/en-us/articles/360036361992 FisherStrand])</td>
</tr>

<tr>
<td><code>MQ < 40</code></td>
<td>Remove sites where reads have low mapping quality</td>
</tr>

</table>

Note: No filter is perfect — you should progressively add filters, evaluate their impact, and ensure that you do not introduce unwanted biases.

<h4>Q1</h4>
How many sites were filtered out?
Sites that pass all filters have <code>PASS</code> in the 7th column. Use <code>grep</code> to count PASS vs non-PASS entries.

<h4>Q2</h4>
The 7th column contains the name(s) of the filters that failed.
Using <code>cut</code>, <code>sort</code>, and <code>uniq -c</code>, determine which filter removed the most sites.
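As a hedged sketch of the counting approach described in Q1 and Q2, the block below builds a tiny toy VCF (all records are invented for illustration) and tallies the 7th column. It is meant to show the mechanics only; the file name and records are assumptions, not course data.

```shell
# Build a toy VCF with invented records (illustration only)
toy=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$toy"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$toy"
printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n' >> "$toy"
printf 'chr1\t200\t.\tC\tT\t12\tQUAL30\t.\n' >> "$toy"
printf 'chr1\t300\t.\tG\tA\t45\tDP\t.\n' >> "$toy"
printf 'chr1\t400\t.\tT\tC\t8\tDP;QUAL30\t.\n' >> "$toy"
printf 'chr1\t500\t.\tA\tT\t60\tPASS\t.\n' >> "$toy"

# Skip header lines, keep only the FILTER column (7th), then count
n_pass=$(grep -v '^#' "$toy" | cut -f7 | grep -c -x 'PASS')
n_fail=$(grep -v '^#' "$toy" | cut -f7 | grep -c -v -x 'PASS')

# Tally each distinct FILTER value, most frequent first
filter_table=$(grep -v '^#' "$toy" | cut -f7 | sort | uniq -c | sort -k1,1nr)

echo "PASS: $n_pass, non-PASS: $n_fail"
echo "$filter_table"
rm -f "$toy"
```

Note that a site failing several filters carries them as one semicolon-separated value (here <code>DP;QUAL30</code>), so it is tallied as its own category unless you split the column further. For a compressed VCF, replace the first <code>grep</code> input with <code>zcat file.vcf.gz</code>.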

<hr>

<h3>Filtering by Mappability</h3>

Next, we remove variants that fall inside low-mappability regions, because reads cannot be uniquely mapped there and false positives are common.

Use bedtools intersect to retain only variants located in high-mappability intervals (≥99% unique mappability):

<pre>
bedtools intersect -header \
-a [INPUT VCF] \
-b /home/databases/databases/GRCh38/filter99.bed.gz \
| /home/ctools/htslib-1.20/bgzip -c > [OUTPUT VCF]
</pre>

Name your output:

<pre>
NA24694_hf_map99.vcf.gz
</pre>

The "99" refers to the proportion of synthetic reads that map uniquely at that position.
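To make the interval logic concrete, here is a minimal sketch of what keeping variants inside BED intervals means in terms of coordinates. It uses <code>awk</code> on invented toy data and is not a replacement for <code>bedtools intersect</code>; the interval and positions are assumptions for illustration.

```shell
# Toy BED interval (0-based start, 1-based end) and toy variant positions (1-based)
bed=$(mktemp)
vcf=$(mktemp)
printf 'chr1\t99\t250\n' > "$bed"
printf 'chr1\t100\nchr1\t250\nchr1\t251\nchr2\t120\n' > "$vcf"

# A 1-based position p lies inside a BED interval when start < p <= end
kept=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == FNR { n++; c[n] = $1; s[n] = $2; e[n] = $3; next }
  { for (i = 1; i <= n; i++)
      if ($1 == c[i] && $2 > s[i] && $2 <= e[i]) { print; break } }' "$bed" "$vcf")

n_kept=$(printf '%s\n' "$kept" | grep -c .)
echo "$kept"
echo "kept $n_kept of 4 toy variants"
rm -f "$bed" "$vcf"
```

Positions 100 and 250 on chr1 are kept, while 251 (just past the interval end) and the chr2 variant are dropped.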

<h4>Q3</h4>
How many variants remain after removing low-mappability regions?

<hr>

<h2>Annotation of Variants</h2>

Next, we examine the genomic context of variants: intronic, exonic, intergenic, UTR, etc. We use snpEff for variant annotation.

<pre>
java -jar /home/ctools/snpEff/snpEff.jar eff \
-dataDir /home/databases/databases/snpEff/ \
-htmlStats [OUTPUT HTML] \
GRCh38.99 \
[INPUT VCF] \
| /home/ctools/htslib-1.20/bgzip -c > [OUTPUT VCF]
</pre>

<ul>
<li><code>-dataDir</code>: location of snpEff databases</li>
<li><code>GRCh38.99</code>: genome version — must match the reference genome you used earlier</li>
</ul>

Run <code>snpEff</code> on your hard-filtered VCF (before mappability filtering).
This produces:

<ul>
<li>HTML report: <code>NA24694_hf.html</code></li>
<li>Annotated VCF: <code>NA24694_hf_ann.vcf.gz</code></li>
</ul>

Viewing the snpEff HTML report:

If you are using MobaXterm, you can open the HTML file directly from the
left-hand file panel.

If you are using macOS (or a standard terminal), copy the HTML file to
your local computer and open it in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/NA24694_hf.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on.

<h4>Q4</h4>
Which genomic region category contains the most variants (exon, intron, upstream, downstream, UTR, etc.)?

<h4>Q5</h4>
How many variants are predicted to cause a codon change?
See explanations at: [https://en.wikipedia.org/wiki/Point_mutation Point mutation]

<hr>

Please find answers here:
<a href="SNP_calling_exercise_part_2_answers">SNP_calling_exercise_part_2_answers</a>

Congratulations — you finished the exercise!

Note: When piping <code>bcftools view</code> into other tools, consider specifying the output type using:

<pre>
-O {b|u|z|v}
</pre>

This avoids unnecessary compression/decompression and speeds up workflows.

Exercise and answers

2026-01-07T10:17:33Z

Mick:

<h2>Overview</h2>


In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017):
from FASTQ files to contact matrix and beyond.



A Primer into 3D Genomics: A Mini-Workshop 
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 
9 January 2026, DTU


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index reference genome</li>
<li>Use TADbit to:
<ol>
<li>Map reads to reference genome (<code>map</code>)</li>
<li>Get intersection (<code>parse</code>)</li>
<li>Filter reads (<code>filter</code>)</li>
<li>Normalize (<code>normalize</code>)</li>
<li>Generate matrices (<code>bin</code>)</li>
<li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
</ol>
</li>
</ol>

<hr>

<h2>Set up the conda environment to run TADbit later</h2>
You will be notified that something is missing; just accept.

<pre>
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh
</pre>


You should get (as the only output) the help from the program — this means the environment is up and running.



Make yourself familiar with the directory structure. Inside
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:


<ul>
<li><code>fastq</code> – raw data</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.



Remember to activate the tadbit conda environment.


<pre>
# Move to your home
cd;

# Activate TADbit environment
eval "$(/home/ctools/miniconda3/bin/conda shell.bash hook)"
conda activate "$HOME/envs/tadbit_course"

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean
</pre>

Putting things into an SBATCH script


A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:


<pre>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch
</pre>


Open the template with your favorite editor, paste the following into the file, and save it.
For example: <code>emacs 00_index.sbatch</code>


<pre>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
</pre>


Submit the job:


<pre>
sbatch 00_index.sbatch;
</pre>

⚠️ NO NEED TO RUN THIS. WE WILL GENERATE A SYMBOLIC LINK.


We can make a symlink to the reference genome in our folder so that we do not have to copy it:


<pre>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
</pre>


⏰ If you do run the indexing job, it should take ~5–10 min to complete.



A prepared script is also available:


<pre>
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch
</pre>

<hr>

<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>


While genome indexing runs, start looking at the data and pre-process it.
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.



Copy the template and create <code>01_fastp.sbatch</code>:


<pre>
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;
</pre>


Put the following into the SBATCH script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Options used below (comments cannot be placed between "\" continuation lines):
# -i / -I : read the raw R1 / R2 fastq from the course folder
# -o / -O : store the clean R1 / R2 fastq in your folder
# --trim_front1 5 : trim the first 5 bases of read 1 (often lower quality)
# -w 10 : threads
# -l 30 : minimal read length (reads shorter than this after trimming are removed)
# -h : HTML report
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 10 \
-l 30 \
-h ${sample}.html
</pre>


Copy the HTML report to your local computer and open it in a browser:


<pre>
USER="juanrod" # replace with your own username
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


⏰ It should take ~1 min to complete with 6 CPUs.


Question: Check the HTML report. What percentage of reads are kept?

<hr>

<h2>Mapping to the reference genome</h2>


TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.



Put the following into your mapping script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem" # relative to the course folder (symlink created earlier)
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}

# Two enzymes used in this experiment
enz="MboI HinfI" # Double digestion (relevant for Arima/Phase Genomics)

# Map read 1
rd=1;

tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--tmpdb ${TMPDIR} \
--renz ${enz} \
-C 6

# Map read 2
rd=2
# >>> Just change the script to take that as a parameter.
</pre>
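One way to handle both reads is a small loop over the read number. The sketch below is a dry run that only prints the two <code>tadbit map</code> commands, reusing the flags from the script above; the relative <code>ref</code> path and the <code>/tmp</code> fallback for <code>TMPDIR</code> are assumptions.

```shell
# Variables as in the mapping script (ref path assumed relative to the course folder)
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"
wd="tadbit_dirs/${sample}"
enz="MboI HinfI"

n_cmds=0
for rd in 1 2; do
  # Build the command for this read; only the read number changes
  cmd="tadbit map --fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz --workdir ${wd} --index ${ref} --read ${rd} --tmpdb ${TMPDIR:-/tmp} --renz ${enz} -C 6"
  echo "$cmd"   # dry run: prints the command instead of executing it
  n_cmds=$((n_cmds + 1))
  last_cmd=$cmd
done
```

To run the mapping for real inside an sbatch script, replace the <code>echo "$cmd"</code> line with the command itself. Alternatively, pass the read number as the first script argument with <code>rd=$1</code>.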


⏰ It should take ~5 min to complete with 6 CPUs.



Note: We are not using iterative mapping. Fragment-based mapping is the default in TADbit.



After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.


Question: How may restriction enzyme choice influence the experiment? ✂️

<ul>
<li>Fragment size histogram</li>
<li>HiC sequencing quality and digestion/ligation deconvolution</li>
</ul>

<hr>

<h2>Finding the intersection of mapped reads (parse)</h2>


Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end).
We identify pairs and build fragment associations with <code>tadbit parse</code>.


⚠️ Note: The chromosome prefixes to filter have to be defined in the reference genome FASTA file beforehand.
It will only match chromosomes that start with the string in <code>--filter_chrom</code>.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver" # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input;
</pre>
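Before parsing, it can help to check which sequence names in the FASTA would match the pattern. The following is a minimal sketch on an invented toy FASTA; the header names are hypothetical, and the anchored regular expression is only an approximation of the "starts with" behaviour described above, not TADbit's own implementation.

```shell
# Toy FASTA with invented headers (illustration only)
fa=$(mktemp)
printf '>chr1 hypothetical toy sequence\nACGT\n>chr2\nGGCC\n>chrW\nTTAA\n>NW_000001.1 unplaced scaffold\nCCGG\n' > "$fa"

# Extract sequence names: drop ">" and everything after the first whitespace
names=$(grep '^>' "$fa" | sed 's/^>//; s/[[:space:]].*//')

# Count all names, and those that start with "chr"
n_total=$(printf '%s\n' "$names" | grep -c .)
n_match=$(printf '%s\n' "$names" | grep -c -E '^chr.*')

echo "$n_match of $n_total sequence names match 'chr.*'"
rm -f "$fa"
```

If no names match in your own reference, the headers probably use accession-style names and the pattern needs to be adapted.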


⏰ It should take ~35 min to complete with 10 CPUs.


Question: Is it possible to retrieve multiple contacting regions?

<hr>

<h2>Filtering interactions</h2>


TADbit allows flexible filtering of non-wanted interactions. In many cases, the defaults work well across datasets.


Run filtering:

<pre>
tadbit filter \
--workdir ${wd} \
--apply 1 2 3 4 6 7 8 9 10 \
--cpus 6 \
--tmpdb ${TMPDIR}
</pre>

<hr>

<h2>Check the amount of filtered data and past commands</h2>


<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.


<pre>
# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less
</pre>

Question: How many valid pairs do we keep?
Question: The total number of filtered reads is not equal to the initial number of reads… Why?

<hr>

<h2>To normalize or to not normalize</h2>


In the filter step we have catalogued all the reads into categories — so it actually didn’t filter anything yet.
It is during normalization that we specify which categories to include/exclude so the normalization is performed accordingly.



Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.



Important: It is during normalization that bad columns (low counts, low mappability, etc.) are removed from the matrix.



Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
A simple and commonly used option is to filter based on a minimum number of counts per bin.



If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# First time we define the resolution
res="100000" # 100 kb

# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100
</pre>

<pre>
tadbit normalize -w ${wd} \
-r ${res} \
--tmpdb ${TMPDIR} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
</pre>


⏰ It should take ~2 min to complete with 6 CPUs.



⚠️ Run another version with <code>norm="raw"</code> to compare later.



Use <code>tadbit describe</code> to check how many bins were removed.
A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.
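The idea behind a minimum-count threshold can be sketched with toy numbers. The per-bin totals below are invented, and the loop only mimics the principle of flagging low-count bins; it is not how TADbit computes bad columns internally.

```shell
# Threshold as in the normalization step
min_count=100

# Invented per-bin interaction totals (illustration only)
counts="250 90 400 120 0 310 99 180 500 150"

n_bins=0
n_bad=0
for c in $counts; do
  n_bins=$((n_bins + 1))
  # A bin with fewer counts than the threshold is flagged as bad
  if [ "$c" -lt "$min_count" ]; then
    n_bad=$((n_bad + 1))
  fi
done

# Integer percentage of flagged bins
pct_bad=$((100 * n_bad / n_bins))
echo "$n_bad of $n_bins bins below min_count (${pct_bad}%)"
```

In this toy case 30% of bins are flagged, far above the ~3–4% rule of thumb, which in real data would suggest lowering the threshold or checking the earlier steps.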



Each job is assigned a <code><job_id></code>. This helps retrieve results from specific runs (especially when testing parameters).



If you want, you can take a quick look at the different normalization strategies and extract your own conclusions:



https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105


<hr>

<h2>Binning and viewing matrices</h2>


Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.


<pre>
# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
</pre>

<pre>
tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm "norm" \
--format "png" \
--cpus 6;
</pre>
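To get a feel for what a 100 kb resolution means, the sketch below maps a genomic position to a bin and counts the bins needed for a chromosome. The position and chromosome length are hypothetical, and the 0-based bin convention is an assumption for illustration rather than a statement about TADbit's internals.

```shell
res=100000            # 100 kb resolution, as above
pos=1250000           # hypothetical 1-based genomic position
chrom_len=1050000     # hypothetical chromosome length (illustration only)

# 0-based index of the bin that contains pos
bin_index=$(( (pos - 1) / res ))

# Number of bins needed to cover the chromosome (ceiling division)
n_bins=$(( (chrom_len + res - 1) / res ))

echo "position $pos falls in bin $bin_index"
echo "a chromosome of $chrom_len bp needs $n_bins bins at $res bp resolution"
```

Halving the resolution value doubles the number of bins per chromosome and quadruples the number of matrix cells, which is why finer resolutions need deeper sequencing.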

<hr>

Congratulations, you finished the exercise!

Exercise and answers

2026-01-07T10:17:00Z

Mick:

<h2>Overview</h2>


In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017):
from FASTQ files to contact matrix and beyond.



A Primer into 3D Genomics: A Mini-Workshop 
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 
9 January 2026, DTU


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index reference genome</li>
<li>Use TADbit to:
<ol>
<li>Map reads to reference genome (<code>map</code>)</li>
<li>Get intersection (<code>parse</code>)</li>
<li>Filter reads (<code>filter</code>)</li>
<li>Normalize (<code>normalize</code>)</li>
<li>Generate matrices (<code>bin</code>)</li>
<li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
</ol>
</li>
</ol>

<hr>

<h2>Set up the conda environment to run TADbit later</h2>
You will be notified that something is missing; just accept.

<pre>
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh
</pre>


You should get (as the only output) the help from the program — this means the environment is up and running.



Make yourself familiar with the directory structure. Inside
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:


<ul>
<li><code>fastq</code> – raw data</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.



Remember to activate the tadbit conda environment.


<pre>
# Move to your home
cd;

# Activate TADbit environment
eval "$(/home/ctools/miniconda3/bin/conda shell.bash hook)"
conda activate "$HOME/envs/tadbit_course"

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean
</pre>

Putting things into an SBATCH script


A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:


<pre>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch
</pre>


Open the template with your favorite editor, paste the following into the file, and save it.
For example: <code>emacs 00_index.sbatch</code>


<pre>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
</pre>


Submit the job:


<pre>
sbatch 00_index.sbatch;
</pre>

⚠️ NO NEED TO RUN THIS. WE WILL GENERATE A SYMBOLIC LINK.


We can make a symlink to the reference genome in our folder so that we do not have to copy it:


<pre>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
</pre>


⏰ If you do run the indexing job, it should take ~5–10 min to complete.



A prepared script is also available:


<pre>
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch
</pre>

<hr>

<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>


While genome indexing runs, start looking at the data and pre-process it.
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.



Copy the template and create <code>01_fastp.sbatch</code>:


<pre>
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;
</pre>


Put the following into the SBATCH script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Options used below (comments cannot be placed between "\" continuation lines):
# -i / -I : read the raw R1 / R2 fastq from the course folder
# -o / -O : store the clean R1 / R2 fastq in your folder
# --trim_front1 5 : trim the first 5 bases of read 1 (often lower quality)
# -w 10 : threads
# -l 30 : minimal read length (reads shorter than this after trimming are removed)
# -h : HTML report
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 10 \
-l 30 \
-h ${sample}.html
</pre>


Copy the HTML report to your local computer and open it in a browser:


<pre>
USER="juanrod" # replace with your own username
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


⏰ It should take ~1 min to complete with 6 CPUs.


Question: Check the HTML report. What percentage of reads are kept?

<hr>

<h2>Mapping to the reference genome</h2>


TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.



Put the following into your mapping script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem" # relative to the course folder (symlink created earlier)
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}

# Two enzymes used in this experiment
enz="MboI HinfI" # Double digestion (relevant for Arima/Phase Genomics)

# Map read 1
rd=1;

tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--tmpdb ${TMPDIR} \
--renz ${enz} \
-C 6

# Map read 2
rd=2
# >>> Just change the script to take that as a parameter.
</pre>


⏰ It should take ~5 min to complete with 6 CPUs.



Note: We are not using iterative mapping. Fragment-based mapping is the default in TADbit.



After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.


Question: How may restriction enzyme choice influence the experiment? ✂️

<ul>
<li>Fragment size histogram</li>
<li>HiC sequencing quality and digestion/ligation deconvolution</li>
</ul>

<hr>

<h2>Finding the intersection of mapped reads (parse)</h2>


Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end).
We identify pairs and build fragment associations with <code>tadbit parse</code>.


⚠️ Note: The chromosome prefixes to filter have to be defined in the reference genome FASTA file beforehand.
It will only match chromosomes that start with the string in <code>--filter_chrom</code>.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver" # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input;
</pre>


⏰ It should take ~35 min to complete with 10 CPUs.


Question: Is it possible to retrieve multiple contacting regions?

<hr>

<h2>Filtering interactions</h2>


TADbit allows flexible filtering of non-wanted interactions. In many cases, the defaults work well across datasets.


Run filtering:

<pre>
tadbit filter \
--workdir ${wd} \
--apply 1 2 3 4 6 7 8 9 10 \
--cpus 6 \
--tmpdb ${TMPDIR}
</pre>

<hr>

<h2>Check the amount of filtered data and past commands</h2>


<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.


<pre>
# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less
</pre>

Question: How many valid pairs do we keep?
Question: The total number of filtered reads is not equal to the initial number of reads… Why?

<hr>

<h2>To normalize or to not normalize</h2>


In the filter step we have catalogued all the reads into categories — so it actually didn’t filter anything yet.
It is during normalization that we specify which categories to include/exclude so the normalization is performed accordingly.



Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.



Important: It is during normalization that bad columns (low counts, low mappability, etc.) are removed from the matrix.



Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
A simple and commonly used option is to filter based on a minimum number of counts per bin.



If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# First time we define the resolution
res="100000" # 100 kb

# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100
</pre>

<pre>
tadbit normalize -w ${wd} \
-r ${res} \
--tmpdb ${TMPDIR} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
</pre>


⏰ It should take ~2 min to complete with 6 CPUs.



⚠️ Run another version with <code>norm="raw"</code> to compare later.



Use <code>tadbit describe</code> to check how many bins were removed.
A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.



Each job is assigned a <code><job_id></code>. This helps retrieve results from specific runs (especially when testing parameters).



If you want, you can take a quick look at the different normalization strategies and extract your own conclusions:



https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105


<hr>

<h2>Binning and viewing matrices</h2>


Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.


<pre>
# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
</pre>

<pre>
tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm "norm" \
--format "png" \
--cpus 6;
</pre>

<hr>

Congratulations, you finished the exercise!

Exercise and answers

2026-01-07T10:12:57Z

Mick: Created page with "<h2>Overview</h2> In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017): from FASTQ files to contact matrix and beyond. A Primer into 3D Genomics: A Mini-Workshop Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 9 January 2026, DTU <hr> <h2>Outline of the exercises</h2> <ol> <li>Preprocess Hi-C FASTQ data</li> <li>Index reference genome</li> <li>Use TADbit to: <ol>..."

<h2>Overview</h2>


In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017):
from FASTQ files to contact matrix and beyond.



A Primer into 3D Genomics: A Mini-Workshop 
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 
9 January 2026, DTU


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index reference genome</li>
<li>Use TADbit to:
<ol>
<li>Map reads to reference genome (<code>map</code>)</li>
<li>Get intersection (<code>parse</code>)</li>
<li>Filter reads (<code>filter</code>)</li>
<li>Normalize (<code>normalize</code>)</li>
<li>Generate matrices (<code>bin</code>)</li>
<li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
</ol>
</li>
</ol>

<hr>

<h2>Setup conda environment to run TADbit later</h2>

<pre>
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
./setup_TADbit.sh
</pre>


You should get (as the only output) the help from the program — this means the environment is up and running.



Make yourself familiar with the directory structure. Inside
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:


<ul>
<li><code>fastq</code> – raw data</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.



Remember to activate the tadbit conda environment.


<pre>
# Move to your home
cd;

# Activate TADbit environment
conda activate /home/people/${USER}/envs/tadbit_course
# $USER is your user; it's an environment variable so no need to change it.

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean
</pre>

<h3>Putting things into an SBATCH script</h3>


A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:


<pre>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch
</pre>


Open the template with your favorite editor, paste the following into the file, and save it.
For example: <code>emacs 00_index.sbatch</code>


<pre>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
</pre>


Submit the job:


<pre>
sbatch 00_index.sbatch;
</pre>

⚠️ You do not need to run the indexing job yourself: the course folder already contains the index, and we will create a symbolic link to it instead.


We can make a symlink to the reference genome in our folder so that we do not have to copy it:


<pre>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
</pre>


⏰ If you do run the indexing job, it should take ~5–10 min to complete.



A prepared script is also available:


<pre>
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch
</pre>

<hr>

<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>


While genome indexing runs, start looking at the data and pre-process it.
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.



Copy the template and create <code>01_fastp.sbatch</code>:


<pre>
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;
</pre>


Put the following into the SBATCH script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

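# Options used below:
#   -i / -I                  raw R1 / R2 FASTQ files, read from the course folder
#   -o / -O                  clean R1 / R2 FASTQ files, written to your clean/ folder
#   --detect_adapter_for_pe  detect adapters automatically for paired-end data
#   --trim_front1 5          trim the first 5 bases (often lower quality)
#   -w 10                    number of threads
#   -l 30                    minimum read length (shorter reads are removed after trimming)
#   -h                       name of the HTML report
# Note: comments cannot be placed between backslash-continued lines,
# otherwise the command is cut at the first comment.
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 10 \
-l 30 \
-h ${sample}.html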
</pre>


Copy the HTML report to your local computer and open it in a browser:


<pre>
# Run this on your local computer; replace "juanrod" with your own username on the server
USER="juanrod"
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


⏰ It should take ~1 min to complete with 6 CPUs.


Question: Check the HTML report. What percentage of reads are kept?

<hr>

<h2>Mapping to the reference genome</h2>


TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.



Put the following into your mapping script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"
# Relative path: the symlinked index inside your 3D_GENOMICS_COURSE folder
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}

# Two enzymes used in this experiment
enz="MboI HinfI" # Double digestion (relevant for Arima/Phase Genomics)

# Map read 1
rd=1;

tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--tmpdb ${TMPDIR} \
--renz ${enz} \
-C 6

# Map read 2
# Set rd=2 and run the same tadbit map command again
# (or change the script so that rd is taken as a parameter).
rd=2
</pre>
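
One way to map both reads without duplicating the command is to loop over the read number. The following is a minimal dry-run sketch: it only prints the two <code>tadbit map</code> commands so they can be checked first. The helper name <code>print_map_cmds</code>, the <code>echo</code> prefix and the <code>/tmp</code> fallback for <code>TMPDIR</code> are illustrative assumptions; all options are the ones used in the script above.

```shell
# Dry-run sketch: print one tadbit map command per read of the pair.
# print_map_cmds is a hypothetical helper name used only for illustration.
print_map_cmds() {
    local sample="$1" ref="$2" wd="$3"
    local enz="MboI HinfI"   # two enzymes, passed as two separate words
    local rd
    for rd in 1 2; do
        # ${enz} is intentionally unquoted so that --renz receives two values
        echo tadbit map \
            --fastq "fastq/clean/${sample}_R${rd}.clean.fastq.gz" \
            --workdir "${wd}" \
            --index "${ref}" \
            --read "${rd}" \
            --tmpdb "${TMPDIR:-/tmp}" \
            --renz ${enz} \
            -C 6
    done
}

print_map_cmds liver refGenome/GCF_000002315.6_GRCg6a_genomic.gem tadbit_dirs/liver
```

Once the printed commands look right, removing the <code>echo</code> runs them for real, one read after the other.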


⏰ It should take ~5 min to complete with 6 CPUs.



Note: We are not using iterative mapping. Fragment-based mapping is the default in TADbit.



After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.


Question: How may restriction enzyme choice influence the experiment? ✂️

<ul>
<li>Fragment size histogram</li>
<li>HiC sequencing quality and digestion/ligation deconvolution</li>
</ul>

<hr>

<h2>Finding the intersection of mapped reads (parse)</h2>


Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end).
We identify pairs and build fragment associations with <code>tadbit parse</code>.


⚠️ Note: The chromosome prefixes to filter have to be defined in the reference genome FASTA file beforehand.
It will only match chromosomes that start with the string in <code>--filter_chrom</code>.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver" # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input;
</pre>


⏰ It should take ~35 min to complete with 10 CPUs.


Question: Is it possible to retrieve multiple contacting regions?

<hr>

<h2>Filtering interactions</h2>


TADbit allows flexible filtering of unwanted interactions. In many cases, the defaults work well across datasets.


Run filtering:

<pre>
tadbit filter \
--workdir ${wd} \
--apply 1 2 3 4 6 7 8 9 10 \
--cpus 6 \
--tmpdb ${TMPDIR}
</pre>

<hr>

<h2>Check the amount of filtered data and past commands</h2>


<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.


<pre>
# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less
</pre>

Question: How many valid pairs do we keep?
Question: The total number of filtered reads is not equal to the initial number of reads… Why?

<hr>

<h2>To normalize or to not normalize</h2>


The filter step only assigned the reads to categories; nothing has actually been removed yet.
It is during normalization that we specify which categories to include or exclude, and the normalization is performed accordingly.



Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.



Important: Normalization is also the step where bad columns (low counts, low mappability, etc.) are removed from the matrix.



Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
A simple and commonly used option is to filter based on a minimum number of counts per bin.



If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# First time we define the resolution
res="100000" # 100 kb

# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100
</pre>

<pre>
tadbit normalize -w ${wd} \
-r ${res} \
--tmpdb ${TMPDIR} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
</pre>


⏰ It should take ~2 min to complete with 6 CPUs.



⚠️ Run another version with <code>norm="raw"</code> to compare later.



Use <code>tadbit describe</code> to check how many bins were removed.
A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.
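
To compare a run against this rule of thumb, the percentage can be computed from two counts read off the <code>tadbit describe</code> output. This is a small sketch only: <code>pct_bins_removed</code> is a hypothetical helper, and the numbers in the example call are made up.

```shell
# Sketch: percentage of bins removed during normalization.
# pct_bins_removed is a hypothetical helper; both counts have to be read
# manually from the `tadbit describe` output.
pct_bins_removed() {
    local removed="$1" total="$2"
    awk -v r="$removed" -v t="$total" 'BEGIN { printf "%.1f\n", 100 * r / t }'
}

# Made-up example: 350 of 10000 bins removed
pct_bins_removed 350 10000
```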



Each job is assigned a <code><job_id></code>. This helps retrieve results from specific runs (especially when testing parameters).



If you want, you can take a quick look at the different normalization strategies and draw your own conclusions:



https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105


<hr>

<h2>Binning and viewing matrices</h2>


Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.


<pre>
# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
</pre>

<pre>
tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm "norm" \
--format "png" \
--cpus 6;
</pre>

<hr>

Congratulations, you finished the exercise!

SNP calling exercise part 1

2026-01-07T10:00:14Z

Mick:

<h2>Overview</h2>

In this exercise you will perform basic variant calling and start exploring VCF files. You will:

<ol>
<li>Run a germline variant caller on whole-genome sequencing data</li>
<li>Get acquainted with VCF and gVCF formats</li>
<li>Count and subset variants using command-line tools</li>
<li>Compare “known” vs “novel” variants</li>
</ol>

First:
<ol>
<li>Navigate to your home directory</li>
<li>Create a directory called <code>variant_call</code></li>
<li>Enter the <code>variant_call</code> directory</li>
</ol>

<hr>

<h2>Genotyping</h2>

We will genotype chromosome 20 from a BAM file that has been pre-processed (sorted, duplicate-marked, etc.).

The sample is a [https://en.wikipedia.org/wiki/Han_Chinese Han Chinese] male, sequenced to approximately 24.6x coverage.

<pre>
/home/projects/22126_NGS/exercises/snp_calling/NA24694.bam
</pre>

The BAM file is already indexed.

We will use GATK HaplotypeCaller to generate a gVCF file. A typical command looks like this:

<pre>
/home/ctools/gatk-4.6.2.0/gatk --java-options "-Xmx10g" HaplotypeCaller \
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-I [INPUT_BAM] \
-L chr20 \
-O [OUTPUT_GVCF] \
--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
-ERC GVCF
</pre>

Explanation of the key options:
<ul>
<li><code>-R</code> – reference genome (GRCh38)</li>
<li><code>-I</code> – input BAM file (here: <code>NA24694.bam</code>)</li>
<li><code>-L chr20</code> – restrict calling to chromosome 20</li>
<li><code>--dbsnp</code> – annotate with known variants from dbSNP</li>
<li><code>-ERC GVCF</code> – emit a gVCF (includes both variant and non-variant blocks)</li>
</ul>

Suggested output name:

<pre>
NA24694.gvcf.gz
</pre>

This command may take a while to complete. If you are short on time, you can use the precomputed file instead:

<pre>
/home/projects/22126_NGS/exercises/snp_calling/NA24694.gvcf.gz
</pre>

Take a quick look at the gVCF:

<pre>
zcat NA24694.gvcf.gz | less -S
</pre>

Notes:
<ul>
<li>Lines starting with <code>#</code> form the header.</li>
<li>Data lines have at least 10 columns. The first five are:
<ul>
<li>CHROM – chromosome name</li>
<li>POS – genomic coordinate</li>
<li>ID – variant identifier (e.g. dbSNP ID, or <code>.</code> if unknown)</li>
<li>REF – reference allele</li>
<li>ALT – alternate allele(s)</li>
</ul>
</li>
<li>In gVCFs, you will often see <code><NON_REF></code> as the ALT allele for invariant blocks.</li>
</ul>

<h3>Indexing the gVCF</h3>

Before using the gVCF as input for other tools, index it with tabix:

<pre>
/home/ctools/htslib-1.20/tabix -f -p vcf [INPUT_GVCF]
</pre>

This creates an index file with extension <code>.tbi</code>, allowing fast random access by position.

<h3>Genotyping the gVCF (producing a standard VCF)</h3>

Next, we convert the gVCF into a standard VCF with genotypes only at variant sites using GATK GenotypeGVCFs:

<pre>
/home/ctools/gatk-4.6.2.0/gatk GenotypeGVCFs \
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-V [INPUT_GVCF] \
-O [OUTPUT_VCF] \
-L chr20 \
--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
</pre>

Suggested output name:

<pre>
NA24694.vcf.gz
</pre>

This step is usually faster than HaplotypeCaller.

Index the VCF with tabix:

<pre>
/home/ctools/htslib-1.20/tabix -f -p vcf NA24694.vcf.gz
</pre>

As with BAM indices, the VCF index allows fast region-based queries, but here using tabix rather than samtools.

<hr>

<h3>Getting Acquainted with VCF Files</h3>

Q1. Using:

<pre>
/home/ctools/bcftools-1.23/bcftools stats [INPUT_VCF]
</pre>

Find the line that starts with:

<pre>
SN 0 number of SNPs
</pre>

How many SNPs are present in your VCF?

<hr>

You can query specific regions of the VCF using tabix:

<pre>
/home/ctools/htslib-1.20/tabix [INPUT_VCF] [CHROM]:[START]-[END]
</pre>

For a single coordinate:

<pre>
/home/ctools/htslib-1.20/tabix [INPUT_VCF] [CHROM]:[POS]-[POS]
</pre>

<hr>

Q2. Using tabix and <code>wc -l</code>, how many total variants are present in the 1 Mb region:

<pre>
chr20:32000000-33000000
</pre>

(Remember that tabix only returns data lines, so you can safely count lines with <code>wc -l</code>.)

<hr>

Q3. <code>bcftools</code> can subset and filter VCF files.

Type:

<pre>
/home/ctools/bcftools-1.23/bcftools view
</pre>

and look at the help text. Using <code>bcftools view</code>, determine how many SNPs (excluding indels and multi-allelic variants) are present in the same region:

<pre>
chr20:32000000-33000000
</pre>

Hints:
<ul>
<li>Filter for variant type (SNPs only).</li>
<li>Use <code>-H</code> to avoid counting header lines.</li>
<li>Pipe the result to <code>wc -l</code> to count variants.</li>
</ul>

<hr>

Q4. Retrieve the variants at:

<pre>
chr20:32011209
chr20:32044279
</pre>

You can use either tabix or bcftools, for example:

<pre>
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32011209-32011209
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32044279-32044279
</pre>

For each site, answer:
<ol>
<li>What is the genotype (e.g. 0/0, 0/1, 1/1)?</li>
<li>What is the allele depth (AD) – how many reads support each allele?</li>
<li>What is the total depth of coverage (DP) at this site?</li>
<li>What is the genotype quality (GQ)?</li>
<li>What are the genotype likelihoods (PL)?</li>
</ol>

Use the VCF specification ([https://samtools.github.io/hts-specs/VCFv4.2.pdf VCFv4.2], especially section 1.4 “Data lines”) and GATK’s VCF documentation ([https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format GATK VCF Format]) to interpret the FORMAT fields.
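
The FORMAT column lists the keys, and the sample column lists the matching values in the same order, both separated by <code>:</code>. The sketch below pairs them up for easier reading. It assumes a single-sample VCF (FORMAT in column 9, the sample in column 10); <code>show_format_fields</code> is a hypothetical helper and the data line is made up for illustration.

```shell
# Sketch: print each FORMAT key next to the sample's value for VCF data lines.
# show_format_fields is a hypothetical helper; it assumes a single-sample VCF
# (FORMAT in column 9, the sample in column 10) and skips header lines.
show_format_fields() {
    awk -F'\t' '!/^#/ {
        n = split($9, key, ":")
        split($10, val, ":")
        for (i = 1; i <= n; i++) print key[i] "=" val[i]
    }'
}

# Made-up data line for illustration (not taken from NA24694.vcf.gz)
printf 'chr20\t100\t.\tA\tG\t500\t.\tDP=30\tGT:AD:DP:GQ:PL\t0/1:14,16:30:99:450,0,400\n' \
    | show_format_fields
```

The same function can be fed with the output of the tabix commands above, which return data lines only.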

<hr>

Q5. Inspect the SNPs at positions:

<pre>
chr20 32974911
chr20 64291638
</pre>

One of these SNPs has poor quality, the other has good quality.

<ul>
<li>Which is which?</li>
<li>Why do you think this is the case? (Hint: think about depth, allele balance, and overall evidence.)</li>
</ul>

<hr>

Q6. Using the same region as in Q2/Q3:

<pre>
chr20:32000000-33000000
</pre>

How many SNPs in this region are novel, i.e. do not have an ID in dbSNP?

Hints:
<ul>
<li>The 3rd column (ID) contains dbSNP IDs (typically starting with <code>rs</code>) or <code>.</code> for novel variants.</li>
<li>You can use <code>cut</code> to extract the ID column.</li>
<li><code>grep rs</code> keeps only the lines containing <code>rs</code>; adding <code>-v</code> inverts the match and keeps only the lines without it.</li>
</ul>
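
The hints above can be sketched on toy data as follows. <code>count_novel</code> is a hypothetical helper that reads headerless, tab-separated VCF lines on standard input; the three input lines are made up.

```shell
# Sketch: count data lines whose ID column (3rd column) does not start with "rs".
# count_novel is a hypothetical helper; it expects headerless VCF lines on stdin.
count_novel() {
    cut -f 3 | grep -c -v '^rs'
}

# Made-up input: one known variant (rs1) and two without a dbSNP ID
printf 'chr20\t1\trs1\tA\tG\nchr20\t2\t.\tC\tT\nchr20\t3\t.\tG\tA\n' | count_novel
```

To answer the question, pipe the headerless SNP lines you selected in Q3 into the same function.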

<hr>

Q7. Compare your result from Q6 (number of novel SNPs) to the number of SNPs you found in Q3 (total SNPs in the region).

<ul>
<li>What fraction of SNPs in this region are novel?</li>
<li>Does this fraction seem reasonable, given that human variation databases are large but still incomplete?</li>
</ul>

<hr>

Congratulations, you finished the exercise!

Alignment exercise

2026-01-07T09:47:53Z

Mick:

<h2>Overview</h2>

In this exercise you will practice aligning NGS data and working with alignment files.

<ol>
<li>Navigate to your home directory.</li>
<li>Create a directory called <code>align</code>.</li>
<li>Navigate to the <code>align</code> directory.</li>
</ol>

We will align two types of NGS data:
<ol>
<li>Pseudomonas single-end Illumina reads</li>
<li>Human paired-end Illumina reads</li>
</ol>

<hr>

<h2>P. aeruginosa single-end Illumina reads</h2>

<h3>Alignment using bwa mem</h3>

We will align trimmed single-end reads from P. aeruginosa.

Raw data:
<pre>/home/projects/22126_NGS/exercises/alignment/SRR8002634_1.fastq.gz</pre>

Trimmed data:
<pre>/home/projects/22126_NGS/exercises/alignment/SRR8002634_1_trimmed.fq.gz</pre>

Reference genome:
<pre>/home/databases/references/P_aeruginosa/GCF_000006765.1_ASM676v1_genomic.fasta</pre>

The basic <code>bwa mem</code> command to align single-end reads is:
<pre>bwa mem [reference.fasta] [reads.fastq.gz] > [output.sam]</pre>

Remember: the <code>></code> operator redirects standard output (STDOUT) to a file.

We have discussed multiplexing and read groups. It is good practice to add a read group ID and sample name during alignment. For example, if the read group is <code>RG38</code> and the sample is <code>SMPL96</code>:

<pre>bwa mem -R "@RG\tID:RG38\tSM:SMPL96" [reference.fasta] [reads.fastq.gz] > [output.sam]</pre>

This information is crucial when you later merge multiple BAM files, so you can trace which reads came from which library or sample.

Task: Align the trimmed FASTQ file using the command above.

Q1: If you were not told which FASTQ file contains the trimmed reads, how could you determine it from the files themselves? (Hint: think of at least three different ways.)

<hr>

<h3>Inspecting the alignment</h3>

Assume you named your output file <code>SRR8002634_1.sam</code>. You can view it as:
<pre>less -S SRR8002634_1.sam</pre>

The <code>-S</code> option prevents line wrapping; press <code>q</code> to quit. Use the slides and the
[https://samtools.github.io/hts-specs/SAMv1.pdf official SAM specification] to interpret each field.

Answer the following:

Q2: How many lines does the header have (lines starting with <code>@</code>)?

Q3: What is the genomic coordinate (reference name and position) of the first read <code>SRR8002634.1</code>?

Q4: What is the mapping quality of the third read <code>SRR8002634.3</code>? What does that mapping quality tell you about this read?

Q5: Using the SAM flag definitions (see
[https://broadinstitute.github.io/picard/explain-flags.html Picard flag explanation]), determine among the first 8 reads how many map to the forward (+) strand and how many to the reverse (–) strand.

Q6: Is the 10th read <code>SRR8002634.11</code> unmapped? (Note: <code>SRR8002634.9</code> was removed by trimming, so numbering skips.) How did you determine this from the SAM fields?

To get basic alignment statistics, use:
<pre>samtools flagstat [input.sam]</pre>

Below is a brief explanation of the fields reported by <code>flagstat</code>:

<table class="wikitable">
<tr>
<th>Category</th>
<th>Meaning</th>
</tr>
<tr>
<td>mapQ</td>
<td>Mapping quality</td>
</tr>
<tr>
<td>QC-passed reads</td>
<td>Reads not marked as QC-failed; these are typically used for analysis.</td>
</tr>
<tr>
<td>QC-failed reads</td>
<td>Reads flagged as having problems by the processing pipeline; downstream tools usually ignore them.</td>
</tr>
<tr>
<td>total</td>
<td>Total number of alignments reported.</td>
</tr>
<tr>
<td>secondary</td>
<td>Additional alignments for reads that map equally well to multiple locations.</td>
</tr>
<tr>
<td>supplementary</td>
<td>Alignments for chimeric or split reads where different parts map to different locations.</td>
</tr>
<tr>
<td>duplicates</td>
<td>Reads marked as duplicates (e.g. PCR duplicates); will be discussed in the next class.</td>
</tr>
<tr>
<td>mapped</td>
<td>Number of reads with at least one reported alignment (not unmapped).</td>
</tr>
<tr>
<td>paired in sequencing</td>
<td>Reads that were sequenced as part of a pair (not single-end).</td>
</tr>
<tr>
<td>read1</td>
<td>First read in the pair (forward).</td>
</tr>
<tr>
<td>read2</td>
<td>Second read in the pair (reverse).</td>
</tr>
<tr>
<td>properly paired</td>
<td>Pairs that face each other and are within the expected insert size range.</td>
</tr>
<tr>
<td>with itself and mate mapped</td>
<td>Both the read and its mate are mapped (whether or not properly paired).</td>
</tr>
<tr>
<td>singletons</td>
<td>Reads that are mapped but whose mate is unmapped.</td>
</tr>
<tr>
<td>with mate mapped to a different chr</td>
<td>Reads whose mate is mapped to a different chromosome.</td>
</tr>
</table>

Q7: According to <code>samtools flagstat</code>, what fraction of reads did not align to the reference?

<hr>

<h3>Working with alignments</h3>

<h4>Format conversion</h4>

This should be the first and hopefully last time you work directly with SAM for large files.

First, check the SAM file size:
<pre>ls -lh SRR8002634_1.sam</pre>

Convert SAM to BAM:
<pre>samtools view -bS [input.sam] > [output.bam]</pre>

Check the BAM file size:
<pre>ls -lh SRR8002634_1.bam</pre>

<code>-l</code> gives a detailed listing (permissions, size, date). <code>-h</code> shows file sizes in human-readable form (e.g. 2.4M instead of 2469134 bytes).

The BAM file contains exactly the same alignments as the SAM file, but in binary form. To view it as SAM:
<pre>samtools view [input.bam] | less -S</pre>

You can filter reads based on SAM flags. For example, to include only unmapped reads:
<pre>samtools view -f 0x4 [input.bam]</pre>

To exclude unmapped reads:
<pre>samtools view -F 0x4 [input.bam]</pre>

The flag <code>0x4</code> corresponds to “read unmapped” (see the
[https://broadinstitute.github.io/picard/explain-flags.html flag documentation]).
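
The FLAG field is a bit field, so a single property can be tested with a bitwise AND. The sketch below checks two bits only: 4 (<code>0x4</code>, read unmapped) and 16 (<code>0x10</code>, read on the reverse strand). <code>describe_flag</code> is a hypothetical helper written for illustration, not a samtools command.

```shell
# Sketch: decode two bits of a SAM FLAG value with shell arithmetic.
# describe_flag is a hypothetical helper, not part of samtools.
describe_flag() {
    local flag="$1"
    local unmapped_bit=4    # 0x4  : read unmapped
    local reverse_bit=16    # 0x10 : read mapped to the reverse strand
    if [ $(( flag & unmapped_bit )) -ne 0 ]; then
        echo "unmapped"
    elif [ $(( flag & reverse_bit )) -ne 0 ]; then
        echo "mapped, reverse strand"
    else
        echo "mapped, forward strand"
    fi
}

describe_flag 0
describe_flag 16
describe_flag 4
```

The FLAG value is the second column of each SAM line, so the values to test can be read directly from <code>samtools view</code> output.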

Q8: What is the size ratio of SAM to BAM (SAM size divided by BAM size)?

Now convert BAM to CRAM, which compresses further using the reference:
<pre>samtools view -C -T [reference.fasta] [input.bam] > [output.cram]</pre>

Use the same reference FASTA you used for mapping. Check the CRAM file size with <code>ls -lh</code>.

To view CRAM as SAM:
<pre>samtools view -T [reference.fasta] [input.cram] | less -S</pre>

Q9: What is the size ratio of BAM to CRAM?

To save space, please remove the SAM and CRAM files (we will work with BAM only):
<pre>rm [file]</pre>

<h4>Sorting</h4>

Sort the BAM file by genomic coordinate:
<pre>samtools sort [input.bam] > [output.sorted.bam]</pre>

Be careful not to overwrite the original BAM file; for example:
<ul>
<li><code>input.bam</code> = <code>SRR8002634_1.bam</code></li>
<li><code>output.sorted.bam</code> = <code>SRR8002634_1.sorted.bam</code></li>
</ul>

Use <code>samtools view</code> and <code>less -S</code> to confirm that reads are ordered by reference and coordinate.

Index the sorted BAM file:
<pre>samtools index [input.sorted.bam]</pre>

Note: you cannot index an unsorted BAM file.

<h4>Retrieving a particular region</h4>

Once sorted and indexed, you can retrieve reads from a specific region:
<pre>samtools view [input.sorted.bam] [regionID]:[start]-[end]</pre>

For example, to get reads overlapping positions 1,000,000–1,000,100 on chromosome <code>NC_002516.2</code>:
<pre>samtools view SRR8002634_1.sorted.bam NC_002516.2:1000000-1000100</pre>

Q10: How many reads are aligned between positions 2,000,000 and 3,000,000 on the reference <code>NC_002516.2</code>? 
Hint: do not save to a file; instead use:
<pre>samtools view [options] | wc -l</pre>

Q11: How many reads with mapping quality ≥ 30 are aligned between positions 2,000,000 and 3,000,000 on <code>NC_002516.2</code>? 
Hint: run <code>samtools view</code> without arguments to see its options.

<h4>Average coverage</h4>

We will use <code>mosdepth</code> to measure the average coverage (mean number of reads covering each base in the genome):
<pre>/home/ctools/mosdepth/mosdepth [output_prefix] [input.sorted.bam]</pre>

Use any prefix you like (e.g. <code>SRR8002634_1</code>). <code>mosdepth</code> will write a summary file named <code>[output_prefix].mosdepth.summary.txt</code>.

Check the summary and see if the reported coverage makes sense given the data.
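
A rough sanity check is to compare the reported mean with the expected coverage, approximately (number of mapped reads × read length) / genome size. The sketch below does this arithmetic; <code>expected_coverage</code> is a hypothetical helper and the numbers in the example call are made up, so replace them with the values for your own data.

```shell
# Sketch: expected mean coverage = mapped reads * read length / genome size.
# expected_coverage is a hypothetical helper; it assumes all reads have the
# same length, which is only approximately true for trimmed reads.
expected_coverage() {
    local n_reads="$1" read_len="$2" genome_size="$3"
    awk -v n="$n_reads" -v l="$read_len" -v g="$genome_size" \
        'BEGIN { printf "%.1f\n", n * l / g }'
}

# Made-up example: 1,000,000 reads of 100 bp on a 5,000,000 bp genome
expected_coverage 1000000 100 5000000
```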

Q12: On average, how many reads cover a base in the genome? What is the maximum coverage (maximum number of reads covering a single position)?

You can also inspect per-position coverage using:
<pre>samtools mpileup [input.sorted.bam] | less -S</pre>

<hr>

<h3>The wrong reference genome?</h3>

Q13: Suppose you accidentally aligned the reads to a different bacterial reference genome, e.g. Yersinia pestis (the plague bacterium), a distant relative of Pseudomonas aeruginosa. Would the number of aligned reads go up or down compared to the correct reference? Why? What if the other species were very closely related, e.g. Pseudomonas alcaligenes: would you expect more or fewer reads to align than with Yersinia pestis?

Optional bonus: Try it.

Yersinia pestis reference:
<pre>/home/databases/references/Y_pestis/GCF_000222975.1_ASM22297v1_genomic.fasta</pre>

Pseudomonas alcaligenes reference:
<pre>/home/databases/references/P_alcaligenes/GCF_001597285.1_ASM159728v1_genomic.fasta</pre>

<hr>

<h2>Human paired-end Illumina reads</h2>

<h3>Aligning</h3>

We will align exome-seq reads from a [https://en.wikipedia.org/wiki/Yoruba_people Yoruba] female.

Raw data:
<pre>/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz
</pre>

<code>NA19201_1.fastq.gz</code> contains the forward reads; <code>NA19201_2.fastq.gz</code> contains the reverse reads. These reads are already trimmed.

Your goal is to write a single command line that:
<ol>
<li>Uses <code>bwa mem</code> to align the paired-end reads and produce SAM output.</li>
<li>Converts SAM to BAM.</li>
<li>Sorts the BAM file.</li>
</ol>

<code>bwa mem</code> syntax for paired-end reads:
<pre>bwa mem [reference.fasta] [forward.fastq.gz] [reverse.fastq.gz]</pre>

Human reference (GRCh38):
<pre>/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa</pre>

Note: There are multiple versions of the human reference genome (e.g. hg18, hg19, hg38). Coordinates and even sequences can differ between versions. Always make sure to use the same reference version consistently in all steps of your analysis.

If possible, add a read group and sample name. For example, if read group is <code>RG26</code> and sample is <code>YRB42</code>:
<pre>-R "@RG\tID:RG26\tSM:YRB42"</pre>

Converting and sorting:
<ul>
<li><code>samtools view -bS [input.sam]</code> converts SAM to BAM.</li>
<li>When reading from STDIN, use <code>/dev/stdin</code> as the input.</li>
<li><code>samtools sort [input.bam]</code> sorts the BAM file.</li>
</ul>

Your combined command should:
<ul>
<li>Run <code>bwa mem</code>,</li>
<li>Pipe SAM output to <code>samtools view</code>,</li>
<li>Pipe BAM output to <code>samtools sort</code>,</li>
<li>Redirect the final sorted BAM to <code>NA19201.bam</code>.</li>
</ul>

The alignment may take around 10 minutes.

Q14: Write the full one-line command that performs alignment, SAM->BAM conversion, and sorting using pipes, and saves output as <code>NA19201.bam</code>.

Q15: What are two major advantages of using UNIX pipes instead of running each command separately and writing intermediate files?

Note: For speed, the provided reads only contain sequences mapping to <code>chr20</code>.

<hr>

<h3>Alignment statistics</h3>

<h4>flagstat</h4>

Q16: Using <code>samtools flagstat</code>, what proportion of reads aligned to the reference?

Q17: Using the same output, how many read pairs are marked as properly paired?

Q18: Index the BAM file, then run:
<pre>samtools view [input.bam] [chromosome]</pre>

Count how many reads align to <code>chr20</code> (hint: pipe to <code>wc -l</code>). How many total reads are aligned to <code>chr20</code>?

<h4>stats</h4>

Generate additional alignment statistics using <code>samtools stats</code>:

<pre>
samtools stats [input.bam] > NA19201.stat
</pre>

Generate plots from the statistics file:

<pre>
plot-bamstats -p NA19201 NA19201.stat
</pre>

Viewing the BAM statistics report:

The command above generates a set of PNG image files containing
various BAM statistics (e.g. insert size, base composition, quality by cycle).
The plots are created in the current directory.

If you are using MobaXterm, you can open the PNG files directly from the
left-hand file panel.

If you are using macOS (or a standard terminal), copy the PNG files to your
local computer and open them using any image viewer. For example:

<pre>
# Quote the remote path so that the wildcard is expanded on the server, not by your local shell
scp "stud0XX@pupilX.healthtech.dtu.dk:path/to/NA19201*.png" .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on.

Q19: Look at the insert size distribution plot. What is the most
common insert size (approximately)?

<hr>

<h3>Inspecting the alignment with <code>samtools tview</code></h3>

We will use the text-based viewer <code>samtools tview</code> to inspect the human alignment around a potential variant.

First, make sure your BAM file is indexed:
<pre>samtools index NA19201.bam</pre>

Then start <code>samtools tview</code>:
<pre>samtools tview NA19201.bam /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa</pre>

This opens the alignment in your terminal.

To jump to the region of interest (<code>chr20:35,581,362</code>):
<ol>
<li>Press <code>g</code> (for “goto”).</li>
<li>Type <code>chr20:35581362</code> and press Enter.</li>
</ol>

You should now see:
<ul>
<li>The reference sequence on the top line.</li>
<li>Aligned reads below. Matching bases are often shown as <code>.</code> or <code>,</code>, while mismatches are shown as the actual base (A/C/G/T).</li>
</ul>

Useful keys:
<ul>
<li><code>?</code> – show help.</li>
<li>Arrow keys – move left/right/up/down.</li>
<li><code>q</code> – quit <code>tview</code>.</li>
</ul>

Use this view to answer the following:

Q20: At position <code>chr20:35,581,362</code>, what bases are present in the sample reads?

Q21: How many reads support the non-reference base at this position? (Count the reads showing the alternative base in <code>tview</code>.)

Q22: Based on the fraction of reads supporting the non-reference base, does this variant look more like a heterozygous or a homozygous variant? Explain briefly.
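
As a rough aid for Q22, the fraction of reads carrying the non-reference base can be computed from your own counts. The sketch below uses made-up counts (<code>ref_reads</code>, <code>alt_reads</code> and the helper <code>alt_fraction</code> are hypothetical placeholders, not the answer). As a rule of thumb, a fraction near 0.5 suggests a heterozygous site and a fraction near 1.0 a homozygous one, although low coverage makes this noisy.

```shell
# Hypothetical read counts; replace with the numbers you counted in tview.
ref_reads=12
alt_reads=10

# Fraction of reads carrying the non-reference base.
alt_fraction() {
    awk -v r="$1" -v a="$2" 'BEGIN { printf "%.2f\n", a / (r + a) }'
}

frac=$(alt_fraction "$ref_reads" "$alt_reads")
echo "alt fraction: ${frac}"
```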

<hr>

Please find the answers [[Alignment_exercise_answers|here]].

Congratulations, you finished the exercise!

Alignment exercise

2026-01-07T09:29:03Z

Mick:

<h2>Overview</h2>


In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017):
from FASTQ files to contact matrix and beyond.



A Primer into 3D Genomics: A Mini-Workshop 
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 
9 January 2026, DTU


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index reference genome</li>
<li>Use TADbit to:
<ol>
<li>Map reads to reference genome (<code>map</code>)</li>
<li>Get intersection (<code>parse</code>)</li>
<li>Filter reads (<code>filter</code>)</li>
<li>Normalize (<code>normalize</code>)</li>
<li>Generate matrices (<code>bin</code>)</li>
<li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
</ol>
</li>
</ol>

<hr>

<h2>Setup conda environment to run TADbit later</h2>

<pre>
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh
</pre>


You should get (as the only output) the help from the program — this means the environment is up and running.



Make yourself familiar with the directory structure. Inside
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:


<ul>
<li><code>fastq</code> – raw data</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.



Remember to activate the tadbit conda environment.


<pre>
# Move to your home
cd;

# Activate TADbit environment
eval "$(/home/ctools/miniconda3/bin/conda shell.bash hook)"
conda activate "$HOME/envs/tadbit_course"

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean
</pre>

Putting things into an SBATCH script


A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:


<pre>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch
</pre>


Open the template with your favorite editor, paste the following into the file, and save it.
For example: <code>emacs 00_index.sbatch</code>


<pre>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
</pre>


Submit the job:


<pre>
sbatch 00_index.sbatch;
</pre>

⚠️ You do not need to run this job: a pre-built index is provided, and we will create symbolic links to it instead.


We can make a symlink to the reference genome in our folder so that we do not have to copy it:


<pre>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
</pre>


⏰ If you do run the indexing job, it should take ~5–10 min to complete.



A prepared script is also available:


<pre>
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch
</pre>

<hr>

<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>


While genome indexing runs, start looking at the data and pre-process it.
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.



Copy the template and create <code>01_fastp.sbatch</code>:


<pre>
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;
</pre>


Put the following into the SBATCH script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Options used below:
#   -i / -I           raw FASTQ files (R1 / R2) read from the course folder
#   -o / -O           cleaned FASTQ files (R1 / R2) written to your clean/ folder
#   --trim_front1 5   trim the first 5 bases (often lower quality)
#   -w 10             number of threads
#   -l 30             minimum read length (shorter reads are removed after trimming)
#   -h                name of the HTML report
# Note: comments cannot be placed between lines joined with "\",
# so they are collected here instead.
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 10 \
-l 30 \
-h ${sample}.html
</pre>


Copy the HTML report to your local computer and open it in a browser:


<pre>
USER="juanrod" # replace with your own username on the course server
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


⏰ It should take ~1 min to complete with 6 CPUs.


Question: Check the HTML report. What percentage of reads are kept?

<hr>

<h2>Mapping to the reference genome</h2>


TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.



Put the following into your mapping script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem" # relative to 3D_GENOMICS_COURSE (the symlink created earlier)
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}

# Two enzymes used in this experiment
enz="MboI HinfI" # Double digestion (relevant for Arima/Phase Genomics)

# Map read 1
rd=1;

tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--tmpdb ${TMPDIR} \
--renz ${enz} \
-C 6

# Map read 2
rd=2
# >>> Just change the script to take that as a parameter.
</pre>
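
One possible way to handle read 2 without duplicating the script is to wrap the command in a function that takes the read number as its argument. The sketch below is a dry run: it only prints the two commands so you can check them first. The function name <code>map_cmd</code> and the <code>/tmp</code> fallback for <code>TMPDIR</code> are assumptions made for illustration.

```shell
# Dry-run sketch: prints the tadbit map command for each read instead of running it.
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"
wd="tadbit_dirs/${sample}"
enz="MboI HinfI"

map_cmd() {
    local rd="$1"   # read number: 1 or 2
    # ${enz} is left unquoted on purpose so that both enzymes become separate arguments.
    echo tadbit map \
        --fastq "fastq/clean/${sample}_R${rd}.clean.fastq.gz" \
        --workdir "${wd}" \
        --index "${ref}" \
        --read "${rd}" \
        --tmpdb "${TMPDIR:-/tmp}" \
        --renz ${enz} \
        -C 6
}

for rd in 1 2; do
    map_cmd "${rd}"
done
```

Once the printed commands look right, removing the <code>echo</code> inside <code>map_cmd</code> runs the actual mapping for both reads.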


⏰ It should take ~5 min to complete with 6 CPUs.



Note: We are not using iterative mapping. Fragment-based mapping is the default in TADbit.



After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.


Question: How may restriction enzyme choice influence the experiment? ✂️

<ul>
<li>Fragment size histogram</li>
<li>HiC sequencing quality and digestion/ligation deconvolution</li>
</ul>

<hr>

<h2>Finding the intersection of mapped reads (parse)</h2>


Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end).
We identify pairs and build fragment associations with <code>tadbit parse</code>.


⚠️ Note: The chromosome names to keep must already appear in the headers of the reference genome FASTA file.
Only chromosomes whose names start with the pattern given to <code>--filter_chrom</code> are kept.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver" # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input;
</pre>
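
Before running the parse step, it can help to check which sequence names in a FASTA file a given pattern would keep. The sketch below uses a tiny made-up FASTA and plain <code>grep</code>, so it only approximates the matching done by <code>--filter_chrom</code>. The helper <code>matching_chroms</code> and the sequence names are hypothetical.

```shell
# Tiny made-up FASTA, only to illustrate the check.
fa=$(mktemp)
printf '>chr1\nACGT\n>chrZ\nACGT\n>NW_020109737.1\nACGT\n' > "$fa"

# List sequence names (first word of each header) that start with the pattern.
matching_chroms() {
    grep '^>' "$1" | sed 's/^>//; s/ .*//' | grep -E "^$2"
}

matching_chroms "$fa" 'chr.*'
```

Pointing the function at the real reference FASTA instead of the temporary file shows which names would survive the filter.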


⏰ It should take ~35 min to complete with 10 CPUs.


Question: Is it possible to retrieve multiple contacting regions?

<hr>

<h2>Filtering interactions</h2>


TADbit allows flexible filtering of unwanted interactions. In many cases, the defaults work well across datasets.


Run filtering:

<pre>
tadbit filter \
--workdir ${wd} \
--apply 1 2 3 4 6 7 8 9 10 \
--cpus 6 \
--tmpdb ${TMPDIR}
</pre>

<hr>

<h2>Check the amount of filtered data and past commands</h2>


<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.


<pre>
# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less
</pre>

Question: How many valid pairs do we keep?

Question: The total number of filtered reads is not equal to the initial number of reads… Why?

<hr>

<h2>To normalize or to not normalize</h2>


The filter step only sorted the reads into categories; nothing has actually been removed yet.
The categories to exclude are specified during normalization, and the normalization is then computed on the remaining reads.



Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.
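
To make the idea of a bias vector concrete, here is a toy sketch of a single coverage-based ("vanilla"-style) correction on a made-up 3-bin matrix. It assumes <code>bias[i] = rowsum[i] / sqrt(total)</code> and <code>norm[i][j] = raw[i][j] / (bias[i] * bias[j])</code>, which is one common formulation and not necessarily the exact scaling TADbit applies. The helper <code>normalize_toy</code> is hypothetical.

```shell
# Toy 3-bin raw contact matrix (made-up counts); bin 1 has twice the coverage of bins 2 and 3.
raw=$(mktemp)
cat > "$raw" <<'EOF'
4 2 2
2 1 1
2 1 1
EOF

# One pass of coverage-based normalization:
#   bias[i] = rowsum[i] / sqrt(total),  norm[i][j] = raw[i][j] / (bias[i] * bias[j])
normalize_toy() {
    awk '{
            for (j = 1; j <= NF; j++) { m[NR, j] = $j; rowsum[NR] += $j; total += $j }
            n = NR
         }
         END {
            for (i = 1; i <= n; i++) bias[i] = rowsum[i] / sqrt(total)
            for (i = 1; i <= n; i++) {
                line = ""
                for (j = 1; j <= n; j++)
                    line = line sprintf("%.2f%s", m[i, j] / (bias[i] * bias[j]), (j < n ? " " : ""))
                print line
            }
         }' "$1"
}

normalize_toy "$raw"
```

In this toy case the differences between bins are entirely explained by coverage, so every normalized value comes out as 1.00.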



Important: Bad columns (low counts, low mappability, etc.) are removed from the matrix during normalization.



Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
A simple and commonly used option is to filter based on a minimum number of counts per bin.



If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# First time we define the resolution
res="100000" # 100 kb

# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100
</pre>

<pre>
tadbit normalize -w ${wd} \
-r ${res} \
--tmpdb ${TMPDIR} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
</pre>
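
To get a feeling for how much of the matrix the <code>--badcols</code> regions cover, the number of bins they span at the chosen resolution can be estimated with simple integer arithmetic. This is a sketch under the assumption that bins are consecutive, fixed-size windows starting at position 1; the helper <code>bins_in_region</code> is hypothetical.

```shell
res=100000   # bin size used above (100 kb)

# Number of bins overlapped by a region given as chrom:start-end (1-based, inclusive).
bins_in_region() {
    local region="$1" start end
    start=${region#*:}; start=${start%-*}
    end=${region##*-}
    echo $(( (end - 1) / res - (start - 1) / res + 1 ))
}

for region in chrW:1-7000000 chrZ:1-83000000; do
    echo "${region}: $(bins_in_region "${region}") bins"
done
```

At 100 kb resolution this gives 70 bins for the chrW region and 830 bins for the chrZ region.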


⏰ It should take ~2 min to complete with 6 CPUs.



⚠️ Run another version with <code>norm="raw"</code> to compare later.



Use <code>tadbit describe</code> to check how many bins were removed.
As a rule of thumb, expect ~3–4% of bins to be removed. If many more are removed, something may be wrong.



Each job is assigned a <code>&lt;job_id&gt;</code>. This helps retrieve results from specific runs (especially when testing parameters).



If you want, you can take a quick look at the different normalization strategies and draw your own conclusions:



https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105


<hr>

<h2>Binning and viewing matrices</h2>


Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.


<pre>
# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
</pre>

<pre>
tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm "norm" \
--format "png" \
--cpus 6;
</pre>

<hr>

Congratulations, you finished the exercise!

Alignment exercise

2026-01-07T09:22:51Z

Mick:

<h2>Overview</h2>


In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017):
from FASTQ files to contact matrix and beyond.



A Primer into 3D Genomics: A Mini-Workshop 
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen 
9 January 2026, DTU


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index reference genome</li>
<li>Use TADbit to:
<ol>
<li>Map reads to reference genome (<code>map</code>)</li>
<li>Get intersection (<code>parse</code>)</li>
<li>Filter reads (<code>filter</code>)</li>
<li>Normalize (<code>normalize</code>)</li>
<li>Generate matrices (<code>bin</code>)</li>
<li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
</ol>
</li>
</ol>

<hr>

<h2>Setup conda environment to run TADbit later</h2>

<pre>
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh
</pre>


You should get (as the only output) the help from the program — this means the environment is up and running.



Make yourself familiar with the directory structure. Inside
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:


<ul>
<li><code>fastq</code> – raw data</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.



Remember to activate the tadbit conda environment.


<pre>
# Move to your home
cd;

# Activate TADbit environment
conda activate /home/people/${USER}/envs/tadbit_course
# $USER is your user; it's an environment variable so no need to change it.

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean
</pre>

Putting things into an SBATCH script


A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:


<pre>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:


<pre>
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch
</pre>


Open the template with your favorite editor, paste the following into the file, and save it.
For example: <code>emacs 00_index.sbatch</code>


<pre>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
</pre>


Submit the job:


<pre>
sbatch 00_index.sbatch;
</pre>

⚠️ You do not need to run this job: a pre-built index is provided, and we will create symbolic links to it instead.


We can make a symlink to the reference genome in our folder so that we do not have to copy it:


<pre>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
</pre>


⏰ If you do run the indexing job, it should take ~5–10 min to complete.



A prepared script is also available:


<pre>
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch
</pre>

<hr>

<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>


While genome indexing runs, start looking at the data and pre-process it.
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.



Copy the template and create <code>01_fastp.sbatch</code>:


<pre>
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;
</pre>


Put the following into the SBATCH script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Options used below:
#   -i / -I           raw FASTQ files (R1 / R2) read from the course folder
#   -o / -O           cleaned FASTQ files (R1 / R2) written to your clean/ folder
#   --trim_front1 5   trim the first 5 bases (often lower quality)
#   -w 10             number of threads
#   -l 30             minimum read length (shorter reads are removed after trimming)
#   -h                name of the HTML report
# Note: comments cannot be placed between lines joined with "\",
# so they are collected here instead.
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 10 \
-l 30 \
-h ${sample}.html
</pre>


Copy the HTML report to your local computer and open it in a browser:


<pre>
USER="juanrod" # replace with your own username on the course server
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


⏰ It should take ~1 min to complete with 6 CPUs.


Question: Check the HTML report. What percentage of reads are kept?

<hr>

<h2>Mapping to the reference genome</h2>


TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.



Put the following into your mapping script:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem" # relative to 3D_GENOMICS_COURSE (the symlink created earlier)
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}

# Two enzymes used in this experiment
enz="MboI HinfI" # Double digestion (relevant for Arima/Phase Genomics)

# Map read 1
rd=1;

tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--tmpdb ${TMPDIR} \
--renz ${enz} \
-C 6

# Map read 2
rd=2
# >>> Just change the script to take that as a parameter.
</pre>


⏰ It should take ~5 min to complete with 6 CPUs.



Note: We are not using iterative mapping. Fragment-based mapping is the default in TADbit.



After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.


Question: How may restriction enzyme choice influence the experiment? ✂️

<ul>
<li>Fragment size histogram</li>
<li>HiC sequencing quality and digestion/ligation deconvolution</li>
</ul>

<hr>

<h2>Finding the intersection of mapped reads (parse)</h2>


Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end).
We identify pairs and build fragment associations with <code>tadbit parse</code>.


⚠️ Note: The chromosome names to keep must already appear in the headers of the reference genome FASTA file.
Only chromosomes whose names start with the pattern given to <code>--filter_chrom</code> are kept.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver" # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input;
</pre>


⏰ It should take ~35 min to complete with 10 CPUs.


Question: Is it possible to retrieve multiple contacting regions?

<hr>

<h2>Filtering interactions</h2>


TADbit allows flexible filtering of unwanted interactions. In many cases, the defaults work well across datasets.


Run filtering:

<pre>
tadbit filter \
--workdir ${wd} \
--apply 1 2 3 4 6 7 8 9 10 \
--cpus 6 \
--tmpdb ${TMPDIR}
</pre>

<hr>

<h2>Check the amount of filtered data and past commands</h2>


<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.


<pre>
# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less
</pre>

Question: How many valid pairs do we keep?

Question: The total number of filtered reads is not equal to the initial number of reads… Why?

<hr>

<h2>To normalize or to not normalize</h2>


The filter step only sorted the reads into categories; nothing has actually been removed yet.
The categories to exclude are specified during normalization, and the normalization is then computed on the remaining reads.



Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.



Important: Bad columns (low counts, low mappability, etc.) are removed from the matrix during normalization.



Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
A simple and commonly used option is to filter based on a minimum number of counts per bin.



If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/"${sample} # workdir (auto-created by TADbit)

# First time we define the resolution
res="100000" # 100 kb

# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100
</pre>

<pre>
tadbit normalize -w ${wd} \
-r ${res} \
--tmpdb ${TMPDIR} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
</pre>


⏰ It should take ~2 min to complete with 6 CPUs.



⚠️ Run another version with <code>norm="raw"</code> to compare later.



Use <code>tadbit describe</code> to check how many bins were removed.
As a rule of thumb, expect ~3–4% of bins to be removed. If many more are removed, something may be wrong.



Each job is assigned a <code>&lt;job_id&gt;</code>. This helps retrieve results from specific runs (especially when testing parameters).



If you want, you can take a quick look at the different normalization strategies and draw your own conclusions:



https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105


<hr>

<h2>Binning and viewing matrices</h2>


Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.


<pre>
# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
</pre>

<pre>
tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm "norm" \
--format "png" \
--cpus 6;
</pre>

<hr>

Congratulations, you finished the exercise!

Alignment exercise

2026-01-07T09:18:43Z

Mick:

<h2>Overview</h2>


In this exercise you will explore Hi-C data analysis using TADbit,
from raw FASTQ files to normalized contact matrices and domain-level
interpretation.



The goal is to understand what each step of the pipeline does, which
parameters matter, and how choices affect downstream interpretation.


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index a reference genome</li>
<li>Map reads to the reference genome</li>
<li>Parse and filter read pairs</li>
<li>Normalize Hi-C contact matrices</li>
<li>Generate and inspect contact matrices</li>
</ol>

<hr>

<h2>Setup conda environment to run TADbit</h2>


Before starting, set up a conda environment with all required dependencies.


<pre>
cd; # Home directory
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .
bash ./setup_TADbit.sh
</pre>


If successful, the command should print the <code>tadbit</code> help message.
This confirms that the environment is correctly installed.



Inside <code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> you will find:


<ul>
<li><code>fastq</code> – raw Hi-C FASTQ files</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome FASTA and index files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before mapping Hi-C reads, the reference genome must be indexed for the
GEM mapper used by TADbit.



Use the provided reference genome in the <code>refGenome</code> directory.
This step only needs to be done once per reference.


<hr>

<h2>Mapping Hi-C reads</h2>


Hi-C reads are paired-end and must be mapped with special care to preserve
pairing information.



Mapping assigns each read to a genomic coordinate in the reference genome.
Unmapped and ambiguously mapped reads will be handled in later steps.


<hr>

<h2>Parsing mapped reads</h2>


After mapping, TADbit parses the mapped reads to identify valid Hi-C read pairs.



This step assigns read pairs to different categories (e.g. valid pairs,
dangling ends, self-circles, duplicates).


<hr>

<h2>Filtering reads</h2>


Filtering does not remove reads immediately. Instead, reads are
classified into categories.



These categories are later used during normalization to decide which
reads contribute to the contact matrix.



To summarize the results of mapping, parsing, and filtering:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample
tadbit describe . | less
</pre>

Q1: How many valid pairs are retained after filtering?

Q2: Why does the total number of filtered reads not equal the
initial number of read pairs?


Hint: read categories are not mutually exclusive.


<hr>

<h2>To normalize or to not normalize</h2>


Up to this point, reads have only been classified.
No reads have been excluded yet.



Normalization is the step where you decide which categories to include
and how to correct for technical biases.



Normalization in TADbit computes a bias vector (one value per bin),
which corrects interaction counts for sequencing depth, mappability, and
other systematic effects.


<blockquote>
Important: During normalization, bad columns (bins with low
counts or poor mappability) are removed from the matrix.
</blockquote>


Several normalization strategies are available; see <code>tadbit normalize --help</code> for details.



A common approach is to require a minimum number of counts per bin
and to explicitly exclude problematic genomic regions.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/${sample}" # working directory
res="100000" # resolution (100 kb)
norm="ICE"
min_count=5
</pre>


To exclude specific regions (e.g. sex chromosomes or poorly assembled
regions), use the <code>--badcols</code> option.



⏰ The normalization step should take approximately 2 minutes using 6 CPUs.


Task: Run normalization twice: once with <code>norm="ICE"</code>
and once with <code>norm="raw"</code>. Compare the results later.

<hr>

<h2>Contact matrices</h2>


After normalization, TADbit generates Hi-C contact matrices at the chosen
resolution.



These matrices represent interaction frequencies between genomic bins
and are the basis for downstream analyses such as TAD detection.


Q3: How does changing the resolution affect the appearance of the
contact matrix?

<hr>

Congratulations, you finished the TADbit exercise!

Postprocess exercise

2026-01-07T09:13:03Z

Mick:

<h2>Overview</h2>

In this exercise, you will perform essential post-alignment processing on BAM files to prepare them for reliable variant calling. Raw aligned BAM files often contain artifacts that can lead to false variants if not handled correctly. Today you will:

<ol>
<li>Mark duplicate reads in BAM files</li>
<li>Examine the effect of duplicate marking on read interpretation</li>
<li>Merge multiple sequencing libraries from the same individual into a single BAM file</li>
</ol>

First:
<ol>
<li>Navigate to your home directory</li>
<li>Create a directory called <code>postalign</code></li>
<li>Enter the <code>postalign</code> directory</li>
</ol>

<hr>

<h2>Duplicate Marking</h2>

We will work with data from a Han Chinese individual (HG00418), sequenced to approximately 40× coverage using Illumina paired-end sequencing. For speed, we only use reads mapping to chromosome 20 and only two sequencing libraries.

Library 1 BAM file:
<pre>
/home/projects/22126_NGS/exercises/dupremoval/ERR016028_chr20_sort.bam
</pre>

The file is already trimmed, aligned, and sorted.

We will mark duplicate reads using Picard MarkDuplicates. The general command is:

<pre>
java -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates \
-I [input.bam] \
-M [metrics.txt (this is part of the output, call it what you wish)] \
-O [output.bam]
</pre>

Suggested output name: <code>ERR016028_chr20_sort_markdup.bam</code>

Q1: After running Picard, how many reads were marked as duplicates?
(Hint: this number is printed in the Picard metrics output file.)

<hr>

<h3>Inspecting the Effect of Duplicate Marking</h3>

To view reads in a specific genomic region, use:

<pre>
samtools view [input.sorted.bam] [chrom]:[start]-[end]
</pre>

The BAM file must be indexed. Picard preserves sorting, so you do not need to re-sort it.

Inspect reads in the following region for both the original file and your duplicate-marked file:

<pre>
chr20:45996339-45996839
</pre>

Identify the two reads:
<ul>
<li><code>ERR016028.5947720</code></li>
<li><code>ERR016028.18808080</code></li>
</ul>

Q2: Why did MarkDuplicates consider these reads to be duplicates?

Q3: Which of the two reads was marked as a duplicate, and how can you tell from the SAM flag or tags?
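
The SAM flag is a sum of bit values, so individual properties can be tested with a bitwise AND. The sketch below checks the duplicate bit (0x400 = 1024 in the SAM specification) for two made-up flag values. The helper name <code>is_duplicate</code> is hypothetical, and the flags are not taken from the exercise data.

```shell
# Return success (0) if the duplicate bit (0x400 = 1024) is set in a SAM flag.
is_duplicate() {
    [ $(( $1 & 1024 )) -ne 0 ]
}

# Made-up example flags: 163 and 1187 (= 163 + 1024).
for flag in 163 1187; do
    if is_duplicate "$flag"; then
        echo "flag ${flag}: marked as duplicate"
    else
        echo "flag ${flag}: not marked as duplicate"
    fi
done
```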

<hr>

<h2>Merging BAM Files</h2>

Often, multiple sequencing libraries (or sequencing runs) exist for the same biological sample. Before variant calling, these must be merged into a single BAM file.

You will merge:

<ul>
<li>Your duplicate-marked file</li>
<li>The second library file:</li>
</ul>

<pre>
/home/projects/22126_NGS/exercises/dupremoval/ERR016025_chr20_sort_markdup.bam
</pre>

Run:

<pre>samtools</pre>

and find the command capable of merging multiple BAM files while preserving read groups and writing an index automatically. The command should keep the file sorted and generate the <code>.bai</code> index in the same step.

Q4: Which <code>samtools</code> command performs merging, with options to keep read groups and write the index?

Use the options:

<pre>
-c --write-index
</pre>

Your merged BAM file should be named:

<pre>
HG00418_chr20_sort_markdup.bam
</pre>

Inspect your merged file using:

<pre>
samtools view HG00418_chr20_sort_markdup.bam | less -S
</pre>

Q5: Which SAM/BAM field indicates the sample or library of origin for each read?

Q6: What is the term for pooling multiple samples together into a single sequencing run?

Q7: What is the computational step where we separate pooled reads back into individual samples?

<hr>

You can find the answers here: <a href="Postprocess_exercise_answers">Postprocess_exercise_answers</a>

Congratulations—you have completed the exercise!

SNP calling exercise part 1

2026-01-07T09:11:02Z

Mick:

<h2>Overview</h2>

In this exercise you will perform basic variant calling and start exploring VCF files. You will:

<ol>
<li>Run a germline variant caller on whole-genome sequencing data</li>
<li>Get acquainted with VCF and gVCF formats</li>
<li>Count and subset variants using command-line tools</li>
<li>Compare “known” vs “novel” variants</li>
</ol>

First:
<ol>
<li>Navigate to your home directory</li>
<li>Create a directory called <code>variant_call</code></li>
<li>Enter the <code>variant_call</code> directory</li>
</ol>

<hr>

<h2>Genotyping</h2>

We will genotype chromosome 20 from a BAM file that has been pre-processed (sorted, duplicate-marked, etc.).

The sample is a [https://en.wikipedia.org/wiki/Han_Chinese Han Chinese] male, sequenced to approximately 24.6x coverage.

<pre>
/home/projects/22126_NGS/exercises/snp_calling/NA24694.bam
</pre>

The BAM file is already indexed.

We will use GATK HaplotypeCaller to generate a gVCF file. A typical command looks like this:

<pre>
/home/ctools/gatk-4.6.2.0/gatk --java-options "-Xmx10g" HaplotypeCaller \
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-I [INPUT_BAM] \
-L chr20 \
-O [OUTPUT_GVCF] \
--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
-ERC GVCF
</pre>

Explanation of the key options:
<ul>
<li><code>-R</code> – reference genome (GRCh38)</li>
<li><code>-I</code> – input BAM file (here: <code>NA24694.bam</code>)</li>
<li><code>-L chr20</code> – restrict calling to chromosome 20</li>
<li><code>--dbsnp</code> – annotate with known variants from dbSNP</li>
<li><code>-ERC GVCF</code> – emit a gVCF (includes both variant and non-variant blocks)</li>
</ul>

Suggested output name:

<pre>
NA24694.gvcf.gz
</pre>

This command may take a while to complete. If you are short on time, you can use the precomputed file instead:

<pre>
/home/projects/22126_NGS/exercises/snp_calling/NA24694.gvcf.gz
</pre>

Take a quick look at the gVCF:

<pre>
zcat NA24694.gvcf.gz | less -S
</pre>

Notes:
<ul>
<li>Lines starting with <code>#</code> form the header.</li>
<li>Data lines have at least 10 columns. The first five are:
<ul>
<li>CHROM – chromosome name</li>
<li>POS – genomic coordinate</li>
<li>ID – variant identifier (e.g. dbSNP ID, or <code>.</code> if unknown)</li>
<li>REF – reference allele</li>
<li>ALT – alternate allele(s)</li>
</ul>
</li>
<li>In gVCFs, you will often see <code>&lt;NON_REF&gt;</code> as the ALT allele for invariant blocks.</li>
</ul>
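As a rough illustration of the column layout described above, the following Python sketch splits one data line into named fields. The helper name <code>parse_vcf_line</code> and the example line are made up for illustration; only the column order is taken from the VCF specification.

```python
# Fixed VCF columns, in the order defined by the VCF specification.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Split one tab-separated VCF data line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # ALT may list several alleles
    # Column 9 holds the FORMAT keys; columns 10 and onward hold one sample each.
    record["FORMAT"] = fields[8].split(":") if len(fields) > 8 else []
    record["SAMPLES"] = fields[9:]
    return record

# Invented gVCF-style line for a non-variant block (values are not real data).
example = "chr20\t100\t.\tA\t<NON_REF>\t.\t.\tEND=200\tGT:DP:GQ\t0/0:25:60"
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["ALT"], rec["FORMAT"])
```

For real files a dedicated parser (or <code>bcftools</code>) is the safer choice; this sketch only shows how the columns line up.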

<h3>Indexing the gVCF</h3>

Before using the gVCF as input for other tools, index it with tabix:

<pre>
/home/ctools/htslib-1.20/tabix -f -p vcf [INPUT_GVCF]
</pre>

This creates an index file with extension <code>.tbi</code>, allowing fast random access by position.

<h3>Genotyping the gVCF (producing a standard VCF)</h3>

Next, we convert the gVCF into a standard VCF with genotypes only at variant sites using GATK GenotypeGVCFs:

<pre>
/home/ctools/gatk-4.6.2.0/gatk GenotypeGVCFs \
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
-V [INPUT_GVCF] \
-O [OUTPUT_VCF] \
-L chr20 \
--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
</pre>

Suggested output name:

<pre>
NA24694.vcf.gz
</pre>

This step is usually faster than HaplotypeCaller.

Index the VCF with tabix:

<pre>
/home/ctools/htslib-1.20/tabix -f -p vcf NA24694.vcf.gz
</pre>

As with BAM indices, the VCF index allows fast region-based queries, but here using tabix rather than samtools.

<hr>

<h3>Getting Acquainted with VCF Files</h3>

Q1. Using:

<pre>
bcftools stats [INPUT_VCF]
</pre>

Find the line that starts with:

<pre>
SN 0 number of SNPs
</pre>

How many SNPs are present in your VCF?

<hr>

You can query specific regions of the VCF using tabix:

<pre>
/home/ctools/htslib-1.20/tabix [INPUT_VCF] [CHROM]:[START]-[END]
</pre>

For a single coordinate:

<pre>
/home/ctools/htslib-1.20/tabix [INPUT_VCF] [CHROM]:[POS]-[POS]
</pre>

<hr>

Q2. Using tabix and <code>wc -l</code>, how many total variants are present in the 1 Mb region:

<pre>
chr20:32000000-33000000
</pre>

(Remember that tabix only returns data lines, so you can safely count lines with <code>wc -l</code>.)

<hr>

Q3. <code>bcftools</code> can subset and filter VCF files.

Type:

<pre>
bcftools view
</pre>

and look at the help text. Using <code>bcftools view</code>, determine how many SNPs (excluding indels and multi-allelic variants) are present in the same region:

<pre>
chr20:32000000-33000000
</pre>

Hints:
<ul>
<li>Filter for variant type (SNPs only).</li>
<li>Use <code>-H</code> to avoid counting header lines.</li>
<li>Pipe the result to <code>wc -l</code> to count variants.</li>
</ul>

<hr>

Q4. Retrieve the variants at:

<pre>
chr20:32011209
chr20:32044279
</pre>

You can use either tabix or bcftools, for example:

<pre>
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32011209-32011209
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32044279-32044279
</pre>

For each site, answer:
<ol>
<li>What is the genotype (e.g. 0/0, 0/1, 1/1)?</li>
<li>What is the allele depth (AD) – how many reads support each allele?</li>
<li>What is the total depth of coverage (DP) at this site?</li>
<li>What is the genotype quality (GQ)?</li>
<li>What are the genotype likelihoods (PL)?</li>
</ol>
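The FORMAT column lists the keys, and the sample column lists the matching values in the same order. A minimal Python sketch of that pairing is shown here; the helper <code>decode_sample</code> and the example call are invented for illustration and are not taken from NA24694. Missing values (<code>.</code>) are not handled.

```python
def decode_sample(format_field, sample_field):
    """Pair FORMAT keys with the values of one sample column (hypothetical helper)."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    # AD and PL are comma-separated lists of integers.
    for key in ("AD", "PL"):
        if key in values:
            values[key] = [int(x) for x in values[key].split(",")]
    # DP and GQ are single integers.
    for key in ("DP", "GQ"):
        if key in values:
            values[key] = int(values[key])
    return values

# Invented heterozygous call; these numbers are not real data.
call = decode_sample("GT:AD:DP:GQ:PL", "0/1:12,10:22:99:280,0,330")
print(call)
```

In this made-up call, the PL value of 0 sits at the second position, which corresponds to the 0/1 genotype.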

Use the VCF specification (<a href="https://samtools.github.io/hts-specs/VCFv4.2.pdf">VCFv4.2</a>, especially section 1.4 “Data lines”) and GATK’s VCF documentation (<a href="https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format">GATK VCF Format</a>) to interpret the FORMAT fields.

<hr>

Q5. Inspect the SNPs at positions:

<pre>
chr20 32974911
chr20 64291638
</pre>

One of these SNPs has poor quality, the other has good quality.

<ul>
<li>Which is which?</li>
<li>Why do you think this is the case? (Hint: think about depth, allele balance, and overall evidence.)</li>
</ul>

<hr>

Q6. Using the same region as in Q2/Q3:

<pre>
chr20:32000000-33000000
</pre>

How many SNPs in this region are novel, i.e. do not have an ID in dbSNP?

Hints:
<ul>
<li>The 3rd column (ID) contains dbSNP IDs (typically starting with <code>rs</code>) or <code>.</code> for novel variants.</li>
<li>You can use <code>cut</code> to extract the ID column.</li>
<li><code>grep -v</code> excludes lines matching a pattern (e.g. <code>rs</code>); without <code>-v</code>, <code>grep</code> keeps only the matching lines.</li>
</ul>

<hr>

Q7. Compare your result from Q6 (number of novel SNPs) to the number of SNPs you found in Q3 (total SNPs in the region).

<ul>
<li>What fraction of SNPs in this region are novel?</li>
<li>Does this fraction seem reasonable, given that human variation databases are large but still incomplete?</li>
</ul>

<hr>

Congratulations, you finished the exercise!

Alignment exercise

2026-01-06T14:53:03Z

Mick:

<h2>Overview</h2>


In this exercise you will explore Hi-C data analysis using TADbit,
from raw FASTQ files to normalized contact matrices and domain-level
interpretation.



The goal is to understand what each step of the pipeline does, which
parameters matter, and how choices affect downstream interpretation.


<hr>

<h2>Outline of the exercises</h2>

<ol>
<li>Preprocess Hi-C FASTQ data</li>
<li>Index a reference genome</li>
<li>Map reads to the reference genome</li>
<li>Parse and filter read pairs</li>
<li>Normalize Hi-C contact matrices</li>
<li>Generate and inspect contact matrices</li>
</ol>

<hr>

<h2>Setup conda environment to run TADbit</h2>


Before starting, set up a conda environment with all required dependencies.


<pre>
cd; # Home directory
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .
./setup_TADbit.sh
</pre>


If successful, the command should print the <code>tadbit</code> help message.
This confirms that the environment is correctly installed.



Inside <code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> you will find:


<ul>
<li><code>fastq</code> – raw Hi-C FASTQ files</li>
<li><code>SCRIPTS</code> – scripts to run TADbit</li>
<li><code>refGenome</code> – reference genome FASTA and index files</li>
</ul>

<hr>

<h2>Index reference genome</h2>


Before mapping Hi-C reads, the reference genome must be indexed for the
GEM mapper used by TADbit.



Use the provided reference genome in the <code>refGenome</code> directory.
This step only needs to be done once per reference.


<hr>

<h2>Mapping Hi-C reads</h2>


Hi-C reads are paired-end and must be mapped with special care to preserve
pairing information.



Mapping assigns each read to a genomic coordinate in the reference genome.
Unmapped and ambiguously mapped reads will be handled in later steps.


<hr>

<h2>Parsing mapped reads</h2>


After mapping, TADbit parses the BAM file to identify valid Hi-C read pairs.



This step assigns read pairs to different categories (e.g. valid pairs,
dangling ends, self-circles, duplicates).


<hr>

<h2>Filtering reads</h2>


Filtering does not remove reads immediately. Instead, reads are
classified into categories.



These categories are later used during normalization to decide which
reads contribute to the contact matrix.



To summarize the results of mapping, parsing, and filtering:


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample
tadbit describe . | less
</pre>

Q1: How many valid pairs are retained after filtering?

Q2: Why does the total number of filtered reads not equal the
initial number of read pairs?


Hint: read categories are not mutually exclusive.


<hr>

<h2>To normalize or to not normalize</h2>


Up to this point, reads have only been classified.
No reads have been excluded yet.



Normalization is the step where you decide which categories to include
and how to correct for technical biases.



Normalization in TADbit computes a bias vector (one value per bin),
which corrects interaction counts for sequencing depth, mappability, and
other systematic effects.
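To make the idea of a bias vector concrete, here is a minimal Python sketch of how one value per bin could be applied to a raw matrix. It assumes the common convention that each count is divided by the product of the biases of its two bins. The function name and numbers are made up, and this is a generic illustration rather than TADbit's actual code.

```python
def apply_bias(raw, bias):
    """Divide each raw count by the product of the two bin biases (assumed convention)."""
    n = len(raw)
    return [[raw[i][j] / (bias[i] * bias[j]) for j in range(n)] for i in range(n)]

# Toy 2-bin matrix: bin 1 is "over-sampled" and therefore gets a larger bias.
raw = [[10, 4],
       [4, 40]]
bias = [1.0, 2.0]

norm = apply_bias(raw, bias)
print(norm)
```

After correction both diagonal cells have the same value, which is the kind of equalisation a bias vector is meant to achieve.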


<blockquote>
Important: During normalization, bad columns (bins with low
counts or poor mappability) are removed from the matrix.
</blockquote>


Several normalization strategies are available; see <code>tadbit normalize --help</code> for details.



A common approach is to require a minimum number of counts per bin
and to explicitly exclude problematic genomic regions.


<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/${sample}" # working directory
res="100000" # resolution (100 kb)
norm="ICE"
min_count=5
</pre>


To exclude specific regions (e.g. sex chromosomes or poorly assembled
regions), use the <code>--badcols</code> option.



⏰ The normalization step should take approximately 2 minutes using 6 CPUs.


Task: Run normalization twice: once with <code>norm="ICE"</code>
and once with <code>norm="raw"</code>. Compare the results later.

<hr>

<h2>Contact matrices</h2>


After normalization, TADbit generates Hi-C contact matrices at the chosen
resolution.



These matrices represent interaction frequencies between genomic bins
and are the basis for downstream analyses such as TAD detection.


Q3: How does changing the resolution affect the appearance of the
contact matrix?

<hr>

Congratulations, you finished the TADbit exercise!

Program 2026

2026-01-06T14:36:36Z

Mick:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. We will use Discord for online classes and for collaboration with your project partners. Rather than emailing questions to the teaching staff, please post them on Discord.

The course has two main parts: the first half consists of lectures and exercises, and the second half is project work, ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop:''' if you have a secure laptop (e.g. a work laptop from Statens Serum Institut), please bring your personal laptop instead.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

 

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-10:00am</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])</DD>

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

 

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00am</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2025/dtu_adna_2025_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00am-10:15am</DT>
<DD>''Break''</DD>

<DT>10:15am-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00pm-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
 

 

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: TBA ([[ Microbial_genomics_exercise ]]) ([[ solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Questions]]) ([[Solution]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([])</DD>
<DD>David Duchene</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Exercise]]) ([[Solution]])</DD>
<DD> David Duchene</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([Lecture slides])</DD>
<DD>David Duchene</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Exercises]]) ([[Solution]]) </DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Metabarcoding Exercise]]) ([[Metabarcoding Solution]])</DD>
<DD> Luke Holman</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2024/test_2024.pdf Test 2025])([https://teaching.healthtech.dtu.dk/material/22126/2024/test_2024_withA.pdf answers])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation: prepare an outline for tomorrow. Please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
 

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consultation: check when your 3 minutes are scheduled in the [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

 

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Denovo exercise

2026-01-06T13:43:22Z

Mick:

<h2>Overview</h2>

First:
<ol>
<li>Navigate to your home directory.</li>
<li>Create a directory called <code>denovo</code>.</li>
<li>Navigate to the directory you just created.</li>
</ol>

In this exercise we will perform a de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will:

<ol>
<li>Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).</li>
<li>Count k-mers and estimate genome size.</li>
<li>Correct reads using Musket.</li>
<li>Determine insert size of paired-end reads.</li>
<li>Run de novo assembly using MEGAHIT.</li>
<li>Calculate assembly statistics.</li>
<li>Plot coverage and length histograms of the assembly.</li>
<li>Evaluate the assembly quality.</li>
<li>Visualize the assembly using Circoletto.</li>
<li>(Bonus) Try assembling the genome with SPAdes.</li>
<li>Annotation of a prokaryotic genome.</li>
</ol>

<hr>

<h3>FastQC and trimming</h3>

Make sure you are in the <code>denovo</code> directory you created. You can double-check with:

<pre>
pwd
</pre>

Copy the sequencing data:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/* .
</pre>

Run FastQC on the reads:

<pre>
mkdir fastqc
/home/ctools/FastQC/fastqc -o fastqc *.txt.gz
</pre>

Viewing FastQC HTML reports:

If you are using MobaXterm, you can open the FastQC HTML files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/fastqc/Vchol-001_6_1_sequence_fastqc.html .
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/fastqc/Vchol-001_6_2_sequence_fastqc.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:

<pre>
/home/ctools/bin/fastx_detect_fq.sh Vchol-001_6_1_sequence.txt.gz
</pre>

Q1. Which quality encoding format is used?

Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The <code>--basename</code> option defines the output prefix and <code>--gzip</code> compresses the output.

<pre>
/home/ctools/adapterremoval-2.3.4/build/AdapterRemoval \
--file1 Vchol-001_6_1_sequence.txt.gz \
--file2 Vchol-001_6_2_sequence.txt.gz \
--adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \
--adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \
--qualitybase 64 \
--basename Vchol-001_6 \
--gzip \
--trimqualities \
--minquality 20 \
--minlength 40
</pre>

When it finishes, inspect <code>Vchol-001_6.settings</code> for trimming statistics (how many reads were trimmed, discarded, etc.).

Q1A. The output includes <code>discarded.gz</code>, <code>pair1.truncated.gz</code>, <code>pair2.truncated.gz</code>, and <code>singleton.truncated.gz</code>. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)

Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:

<pre>
/home/ctools/bin/fastx_readlength.sh Vchol-001_6.pair1.truncated.gz
/home/ctools/bin/fastx_readlength.sh Vchol-001_6.pair2.truncated.gz
</pre>

<hr>

<h3>Genome size estimation</h3>

We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use jellyfish to count 15-mers, combining counts from the forward and reverse-complement strands, and then create a histogram. (This may take some time to run, so it could be a good time to practice using <code>screen</code>.)

Manual: [http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.html jellyfish]

<pre>
gzip -dc Vchol-001_6.pair*.truncated.gz \
| /home/ctools/jellyfish-2.3.1/bin/jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0

/home/ctools/jellyfish-2.3.1/bin/jellyfish histo Vchol-001 > Vchol-001.histo
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
dat <- read.table("Vchol-001.histo")

pdf("Vchol-001.histo.pdf")
barplot(dat[,2],
xlim = c(0,150),
ylim = c(0,5e5),
ylab = "No of kmers",
xlab = "Counts of a k-mer",
names.arg = dat[,1],
cex.names = 0.8)
dev.off()
</pre>

If you are using MobaXterm, you can open the PDF file directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the file to
your local computer and open it in a PDF viewer or web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/Vchol-001.histo.pdf .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

The plot shows:
<ul>
<li>x-axis: how many times a k-mer occurs (its count)</li>
<li>y-axis: number of distinct k-mers with that count</li>
</ul>

K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.
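To show what such a histogram is built from, here is a small Python sketch that counts canonical k-mers (each k-mer is merged with its reverse complement, similar in spirit to the <code>-C</code> option used above) and then tabulates how many distinct k-mers share each count. The function names and the toy read are made up, and the sketch ignores ambiguous bases such as N.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_histogram(reads, k):
    """Count canonical k-mers, then count how many distinct k-mers have each count."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts, Counter(counts.values())

# Toy input: one short read and k = 3 (the exercise uses k = 15 on real reads).
counts, histo = kmer_histogram(["ACGTACGT"], k=3)
print(sorted(histo.items()))
```

Each <code>(count, number of distinct k-mers)</code> pair in the result corresponds to one bar in the plot.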

Q2. Where is the k-mer coverage peak (approximately)?

We can estimate genome size using:

<pre>
N = (M * L) / (L - K + 1)
Genome_size = T / N
</pre>

<ul>
<li>N = depth (coverage)</li>
<li>M = k-mer peak (from the histogram)</li>
<li>K = k-mer size (here: 15)</li>
<li>L = average read length (from fastx_readlength)</li>
<li>T = total number of bases (from fastx_readlength)</li>
</ul>
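The arithmetic can be sketched in Python as follows. The input numbers are placeholders chosen only to give round results; they are not the values for this dataset, so substitute your own k-mer peak, read length, and total base count.

```python
def estimate_genome_size(kmer_peak, read_length, k, total_bases):
    """Apply N = (M * L) / (L - K + 1), then Genome_size = T / N."""
    depth = kmer_peak * read_length / (read_length - k + 1)
    genome_size = total_bases / depth
    return depth, genome_size

# Placeholder inputs (not from the Vchol data): M = 43, L = 100, K = 15, T = 100 Mb.
depth, size = estimate_genome_size(kmer_peak=43, read_length=100, k=15,
                                   total_bases=100_000_000)
print(f"depth = {depth:.1f}x, genome size = {size:,.0f} bp")
```

With these placeholder values the depth is 50x, giving an estimate of 2,000,000 bp.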

Compute the estimated genome size and compare with the known V. cholerae genome (~4 Mb). You should be within roughly ±10%.

Q3. What is your estimated genome size?

<hr>

<h3>Error correction</h3>

We will correct errors in the reads using Musket.

Musket: [http://musket.sourceforge.net/homepage.htm Musket]

First, get the number of distinct k-mers (needed for memory allocation in Musket):

<pre>
/home/ctools/jellyfish-2.3.1/bin/jellyfish stats Vchol-001
</pre>

Use the reported number of distinct k-mers (for example, <code>8423098</code>) in the Musket command:

<pre>
/home/ctools/musket-1.1/musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \
Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
</pre>

The output files are named <code>Vchol-001_6.cor.0</code> and <code>Vchol-001_6.cor.1</code>. Rename them:

<pre>
mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz
</pre>

If this takes too long, you can copy precomputed corrected reads:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .
</pre>

<hr>

<h3>De novo assembly with MEGAHIT</h3>

We will now assemble the corrected reads using MEGAHIT (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.

First, set the number of threads:

<pre>
export OMP_NUM_THREADS=4
</pre>

Run MEGAHIT with k=35:

<pre>
/home/ctools/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit \
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \
--k-list 35 \
-t 4 \
-m 2000000000 \
-o 35
</pre>

When finished, you should have <code>35/final.contigs.fa</code>. Compress it:

<pre>
gzip 35/final.contigs.fa
</pre>

To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):

<pre>
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq
</pre>

Index the assembly and map:

<pre>
bwa index 35/final.contigs.fa.gz

bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \
| samtools view -Sb - > Vchol_35bp.bam
</pre>

Extract insert sizes (TLEN field, column 9):

<pre>
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1] > 0, 1]              # keep positive TLEN values (leftmost read of each pair)
q = quantile(a.v, seq(0,1,0.05))   # quantiles at 0%, 5%, ..., 100%
mn = q[4]                          # 15% quantile
mx = q[18]                         # 85% quantile
mean(a.v[a.v >= mn & a.v <= mx]) # mean insert size
sd(a.v[a.v >= mn & a.v <= mx]) # standard deviation
</pre>
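The R snippet above trims the insert sizes to the central 15% to 85% range before computing the mean and standard deviation, so that a few extreme values do not dominate. A Python sketch of the same idea is given here for comparison. The function name and toy data are made up, and the <code>inclusive</code> quantile method is assumed to correspond to R's default quantile definition.

```python
import statistics

def trimmed_insert_stats(tlens):
    """Mean and SD of positive TLEN values between the 15% and 85% quantiles."""
    positive = sorted(t for t in tlens if t > 0)
    # 19 cut points at 5%, 10%, ..., 95% of the positive values.
    cuts = statistics.quantiles(positive, n=20, method="inclusive")
    low, high = cuts[2], cuts[16]  # 15% and 85% quantiles
    kept = [t for t in positive if low <= t <= high]
    return statistics.mean(kept), statistics.stdev(kept)

# Toy data: values 1..101 plus their negative mates (as TLEN appears in a BAM file).
toy = list(range(1, 102)) + [-t for t in range(1, 102)]
mean_is, sd_is = trimmed_insert_stats(toy)
print(mean_is, round(sd_is, 2))
```

On real data, read the values from <code>initial.insertsizes.txt</code> instead of using the toy list.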

Q4. What are the mean insert size and standard deviation of the library?

Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:

[https://docs.google.com/spreadsheets/d/1trUMlSwNLoNW67D-OkgA93iOQRp2iioyJSBYyW30P4U/edit?usp=sharing Google sheet for k-mer assignment]

Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing <code>[KMER]</code> below:

<pre>
export OMP_NUM_THREADS=4
/home/ctools/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit \
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \
--k-list [KMER] \
-t 4 \
-m 2000000000 \
-o [KMER]

gzip [KMER]/final.contigs.fa
</pre>

Compute assembly statistics using <code>QUAST</code>:

<pre>
python3 /home/ctools/quast/quast.py \
[KMER]/final.contigs.fa.gz \
--threads 1 \
-o [KMER]/quast
</pre>

Open the file <code>[KMER]/quast/report.txt</code> (or <code>report.tsv</code>) and
record the following values in the Google sheet for your k-mer:

<ul>
<li>Number of contigs (≥ 500 bp)</li>
<li>Total assembly length</li>
<li>Largest contig</li>
<li>N50</li>
</ul>
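For reference, N50 is the contig length at which the contigs of that length or longer contain at least half of the total assembly length. A minimal Python sketch with made-up contig lengths is shown here; QUAST applies its own minimum contig length before computing the statistic, so its value can differ from a calculation on all contigs.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0

# Toy contig lengths: total = 20, so half = 10; 6 + 5 = 11 reaches it.
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))  # 5
```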

As a class, compare results across k-mer sizes and discuss which k-mer produces
the most reasonable assembly and why.

Copy the best assembly to your folder, or use a precomputed multi-k assembly:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .
</pre>

Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?

Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?

<hr>

<h3>Coverage of the assembly</h3>

We will now calculate per-contig coverage and lengths, and visualize them in R.

<pre>
zcat default_final.contigs.fa.gz | /home/ctools/bin/fastx_megahit.sh --i /dev/stdin > default_finalt.cov
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
library(plotrix)
dat <- read.table("default_finalt.cov", sep = "\t")

## ---- Coverage plots (2 panels) ----
pdf("best.coverage.pdf", width = 10, height = 5)
par(mfrow = c(1, 2))

weighted.hist(w = dat[,2],
x = dat[,1],
breaks = seq(0, 100, 1),
main = "Weighted coverage",
xlab = "Contig coverage")

hist(dat[,1],
xlim = c(0, 100),
breaks = seq(0, 1000, 1),
main = "Raw coverage",
xlab = "Contig coverage")

dev.off()

## ---- Scaffold lengths (1 panel) ----
pdf("scaffold.lengths.pdf", width = 7, height = 5)
par(mfrow = c(1, 1))

barplot(rev(sort(dat[,2])),
xlab = "# Scaffold",
ylab = "Length",
main = "Scaffold Lengths")

dev.off()
</pre>

Viewing the PDF files:

If you are using MobaXterm, you can open the PDF files directly from the
left-hand file panel.

If you are using macOS (or a standard terminal), copy the PDF files to your
local computer and open them with any PDF viewer. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/best.coverage.pdf .
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/scaffold.lengths.pdf .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on.

The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.

Q7. Why might some short contigs have much higher coverage than the main assembly?

Q8. Why might some short contigs have much lower coverage than the main assembly?

<hr>

<h3>Assembly evaluation</h3>

We will use QUAST to evaluate the assembly using various reference-based metrics.

QUAST: [https://quast.sourceforge.net/quast quast]

Run QUAST against the V. cholerae reference genome:

<pre>
python3 /home/ctools/quast/quast.py \
default_final.contigs.fa.gz \
--threads 1 \
-R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa
</pre>

QUAST writes its results to <code>quast_results/latest/</code>; the main report is <code>report.html</code>.

If you are using MobaXterm, you can open the HTML files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/quast_results/latest/report.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?

<hr>

<h3>Visualization using Circoletto</h3>

We will visualize the assembly against the V. cholerae reference using Circoletto.

First, filter out contigs shorter than 500 bp:

<pre>
/home/ctools/bin/fastx_filterfasta.sh default_final.contigs.fa.gz 500 > default_final.contigs_filtered_500.fa
</pre>

On your local machine, open a browser and go to:

[https://bat.infspire.org/circoletto/ Circoletto]

Open the filtered assembly in a text editor on the server, for example:

<pre>
gedit default_final.contigs_filtered_500.fa &
</pre>

Copy–paste the FASTA content into the “Query fasta” box on the Circoletto page.

Then open the reference genome:

<pre>
gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &
</pre>

Copy–paste this into the “Subject fasta” box.

In the “Output” section, select “ONLY show the best hit per query”, then click Submit to Circoletto.

If Circoletto does not work, you can use this precomputed image:

<pre>
/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png
</pre>

You should see the two V. cholerae chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).

Q10. Does your assembled genome appear broadly similar to the reference genome?

Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?

Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the V. cholerae genome paper and search for “V. cholerae integron island”: [https://www.nature.com/articles/35020000 V. cholerae genome paper]

<hr>

<h3>Try to assemble the genome using SPAdes (bonus)</h3>

Different assemblers can perform very differently. SPAdes is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.

SPAdes: [https://ablab.github.io/spades/ SPAdes]

Check the help output:

<pre>
python3 /home/ctools/SPAdes-4.2.0-Linux/bin/spades.py -h
</pre>

Note: A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.

Link to the SPAdes assembly:

<pre>
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
# from here you can compute stats and run QUAST
</pre>

<h3>Annotation of a prokaryotic genome</h3>

We will annotate genes in <code>ecoli_pacbio.contigs.fasta</code> using prodigal.

Prodigal: [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119 prodigal]

The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:

<pre>
prodigal \
-f gff \
-i [input genome in fasta] \
-a [output proteins in fasta] \
-o [output annotations in gff]
</pre>

GFF format: [https://www.ensembl.org/info/website/upload/gff.html GFF format description]

Next, index the protein FASTA file:

<pre>
samtools faidx ecoli_pacbio.contigs.aa
</pre>

Extract the protein sequence for gene ID <code>tig00000001_4582</code>:

<pre>
samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582
</pre>

Use BLASTP against the nr database:

[https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLAST for proteins]

Paste the sequence and run BLASTP.

Q14. Which protein (function) does <code>tig00000001_4582</code> correspond to?

<hr>

Please find answers here: [[Denovo_solution|Denovo_solution]]

<hr>

Congratulations, you finished the exercise!

Denovo exercise

2026-01-06T13:41:40Z

Mick:

<h2>Overview</h2>

First:
<ol>
<li>Navigate to your home directory.</li>
<li>Create a directory called <code>denovo</code>.</li>
<li>Navigate to the directory you just created.</li>
</ol>

In this exercise we will perform a de novo assembly of Illumina paired-end reads. The data is from a Vibrio cholerae strain isolated in Nepal. You will:

<ol>
<li>Run FastQC and perform adapter/quality trimming (optional recap of pre-processing).</li>
<li>Count k-mers and estimate genome size.</li>
<li>Correct reads using Musket.</li>
<li>Determine insert size of paired-end reads.</li>
<li>Run de novo assembly using MEGAHIT.</li>
<li>Calculate assembly statistics.</li>
<li>Plot coverage and length histograms of the assembly.</li>
<li>Evaluate the assembly quality.</li>
<li>Visualize the assembly using Circoletto.</li>
<li>(Bonus) Try assembling the genome with SPAdes.</li>
<li>Annotation of a prokaryotic genome.</li>
</ol>

<hr>

<h3>FastQC and trimming</h3>

Make sure you are in the <code>denovo</code> directory you created. You can double-check with:

<pre>
pwd
</pre>

Copy the sequencing data:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/* .
</pre>

Run FastQC on the reads:

<pre>
mkdir fastqc
/home/ctools/FastQC/fastqc -o fastqc *.txt.gz
</pre>

Viewing FastQC HTML reports:

If you are using MobaXterm, you can open the FastQC HTML files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/fastqc/Vchol-001_6_1_sequence_fastqc.html .
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/fastqc/Vchol-001_6_2_sequence_fastqc.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

There are several issues with this dataset (you do not need to study the report in detail now). We will clean it up first. Let’s identify the quality encoding:

<pre>
/home/ctools/bin/fastx_detect_fq.sh Vchol-001_6_1_sequence.txt.gz
</pre>

Q1. Which quality encoding format is used?

Trim the reads using AdapterRemoval. The most frequent adapter/primer sequences are already included below. We use a minimum read length of 40 nt, trim to quality 20, and specify quality base 64. The <code>--basename</code> option defines the output prefix and <code>--gzip</code> compresses the output.

<pre>
/home/ctools/adapterremoval-2.3.4/build/AdapterRemoval \
--file1 Vchol-001_6_1_sequence.txt.gz \
--file2 Vchol-001_6_2_sequence.txt.gz \
--adapter1 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATATCGTATGC \
--adapter2 GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG \
--qualitybase 64 \
--basename Vchol-001_6 \
--gzip \
--trimqualities \
--minquality 20 \
--minlength 40
</pre>

When it finishes, inspect <code>Vchol-001_6.settings</code> for trimming statistics (how many reads were trimmed, discarded, etc.).

Q1A. The output includes <code>discarded.gz</code>, <code>pair1.truncated.gz</code>, <code>pair2.truncated.gz</code>, and <code>singleton.truncated.gz</code>. What types of reads does each file contain? (Tip: check the AdapterRemoval documentation.)

Next, compute basic read stats (average read length, min/max length, number of reads, total bases) for the trimmed paired reads. Note down the average read length and total number of bases:

<pre>
/home/ctools/bin/fastx_readlength.sh Vchol-001_6.pair1.truncated.gz
/home/ctools/bin/fastx_readlength.sh Vchol-001_6.pair2.truncated.gz
</pre>

<hr>

<h3>Genome size estimation</h3>

We will count k-mers in the data. A k-mer is simply a DNA word of length k. We use jellyfish to count 15-mers, combining counts from the forward and reverse-complement strands, and then create a histogram. (This may take some time to run, so it is a good opportunity to practice using <code>screen</code>.)

Manual: [http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual.html jellyfish]

<pre>
gzip -dc Vchol-001_6.pair*.truncated.gz \
| /home/ctools/jellyfish-2.3.1/bin/jellyfish count -t 2 -m 15 -s 1000000000 -o Vchol-001 -C /dev/fd/0

/home/ctools/jellyfish-2.3.1/bin/jellyfish histo Vchol-001 > Vchol-001.histo
</pre>
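To make the idea of k-mer counting concrete, here is a minimal sketch on a toy sequence. It is not how jellyfish works internally: the function name <code>count_kmers</code> and the input sequence are made up for illustration, and unlike the <code>-C</code> option above it counts the forward strand only.

```shell
# Count all k-mers in a single toy sequence (forward strand only).
# count_kmers is a hypothetical helper, not part of jellyfish.
count_kmers() {
    seq=$1
    k=$2
    printf '%s\n' "$seq" | awk -v k="$k" '{
        # slide a window of width k along the sequence
        for (i = 1; i <= length($0) - k + 1; i++)
            count[substr($0, i, k)]++
    }
    END {
        # print each distinct k-mer with its count
        for (kmer in count)
            print kmer, count[kmer]
    }' | sort
}

count_kmers ACGTACGT 3
```

A sequence of length 8 contains 8 - 3 + 1 = 6 overlapping 3-mers; the histogram step then tallies how many distinct k-mers share each count.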

Start R:

<pre>
R
</pre>

Then paste:

<pre>
dat <- read.table("Vchol-001.histo")

pdf("Vchol-001.histo.pdf")
barplot(dat[,2],
xlim = c(0,150),
ylim = c(0,5e5),
ylab = "No of kmers",
xlab = "Counts of a k-mer",
names.arg = dat[,1],
cex.names = 0.8)
dev.off()
</pre>

If you are using MobaXterm, you can open the PDF file directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the file to
your local computer and open it with any PDF viewer. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/Vchol-001.histo.pdf .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

The plot shows:
<ul>
<li>x-axis: how many times a k-mer occurs (its count)</li>
<li>y-axis: number of distinct k-mers with that count</li>
</ul>

K-mers that occur only a few times are typically due to sequencing errors. K-mers forming the main peak (higher counts) are likely “real” and can be used for error correction and genome size estimation.

Q2. Where is the k-mer coverage peak (approximately)?

We can estimate genome size using:

<pre>
N = (M * L) / (L - K + 1)
Genome_size = T / N
</pre>

<ul>
<li>N = depth (coverage)</li>
<li>M = k-mer peak (from the histogram)</li>
<li>K = k-mer size (here: 15)</li>
<li>L = average read length (from fastx_readlength)</li>
<li>T = total number of bases (from fastx_readlength)</li>
</ul>
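The two formulas can be evaluated directly in the shell. The sketch below uses made-up input values chosen only to give round numbers; they are not results from this dataset, and the function name <code>estimate_genome_size</code> is hypothetical. Substitute your own M, L and T.

```shell
# Hypothetical inputs for illustration only; replace with your own values.
M=76            # k-mer peak read off the histogram
K=15            # k-mer size used with jellyfish
L=90            # average read length
T=360000000     # total number of bases

estimate_genome_size() {
    awk -v M="$1" -v K="$2" -v L="$3" -v T="$4" 'BEGIN {
        N = (M * L) / (L - K + 1)      # sequencing depth
        printf "coverage=%.2f genome_size=%.0f\n", N, T / N
    }'
}

estimate_genome_size "$M" "$K" "$L" "$T"
```

With these toy values the depth is 76 × 90 / 76 = 90 and the genome size is 360,000,000 / 90 = 4,000,000 bp.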

Compute the estimated genome size and compare with the known V. cholerae genome (~4 Mb). You should be within roughly ±10%.

Q3. What is your estimated genome size?

<hr>

<h3>Error correction</h3>

We will correct errors in the reads using Musket.

Musket: [http://musket.sourceforge.net/homepage.htm Musket]

First, get the number of distinct k-mers (needed for memory allocation in Musket):

<pre>
/home/ctools/jellyfish-2.3.1/bin/jellyfish stats Vchol-001
</pre>

Use the reported number of distinct k-mers (<code>8423098</code> in this example) in the Musket command:

<pre>
/home/ctools/musket-1.1/musket -k 15 8423098 -p 1 -omulti Vchol-001_6.cor -inorder \
Vchol-001_6.pair1.truncated.gz Vchol-001_6.pair2.truncated.gz -zlib 1
</pre>

The output files are named <code>Vchol-001_6.cor.0</code> and <code>Vchol-001_6.cor.1</code>. Rename them:

<pre>
mv Vchol-001_6.cor.0 Vchol-001_6.pair1.cor.truncated.fq.gz
mv Vchol-001_6.cor.1 Vchol-001_6.pair2.cor.truncated.fq.gz
</pre>

If this takes too long, you can copy precomputed corrected reads:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/vchol/corrected/Vchol-001_6.pair*.cor.truncated.fq.gz .
</pre>

<hr>

<h3>De novo assembly with MEGAHIT</h3>

We will now assemble the corrected reads using MEGAHIT (a de Bruijn graph assembler). K-mer size is critical: MEGAHIT can test multiple k-mers by default, but here we start with a fixed k-mer size of 35.

First, set the number of threads:

<pre>
export OMP_NUM_THREADS=4
</pre>

Run MEGAHIT with k=35:

<pre>
/home/ctools/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit \
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \
--k-list 35 \
-t 4 \
-m 2000000000 \
-o 35
</pre>

When finished, you should have <code>35/final.contigs.fa</code>. Compress it:

<pre>
gzip 35/final.contigs.fa
</pre>

To estimate insert size, we will map a subset of reads back to the assembly (similar to the alignment exercise). We’ll subsample the first 100,000 read pairs (400,000 lines per FASTQ):

<pre>
zcat Vchol-001_6.pair1.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_1.fastq
zcat Vchol-001_6.pair2.cor.truncated.fq.gz | head -n 400000 > Vchol_sample_2.fastq
</pre>
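The line count works because each FASTQ record spans exactly four lines (header, sequence, <code>+</code>, qualities). A small sketch on a toy file, with hypothetical read names and a hypothetical helper <code>count_reads</code>, shows the relationship:

```shell
workdir=$(mktemp -d)

# Toy FASTQ with three reads (hypothetical data)
for i in 1 2 3; do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$workdir/toy.fastq"

# Keep the first 2 reads = 2 * 4 lines
n_reads=2
head -n $((n_reads * 4)) "$workdir/toy.fastq" > "$workdir/subset.fastq"

# Number of reads = number of lines / 4
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

count_reads "$workdir/subset.fastq"
```

Taking the same number of lines from both files keeps the pairs in sync, provided the two files list the reads in the same order.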

Index the assembly and map:

<pre>
bwa index 35/final.contigs.fa.gz

bwa mem 35/final.contigs.fa.gz Vchol_sample_1.fastq Vchol_sample_2.fastq \
| samtools view -Sb - > Vchol_35bp.bam
</pre>

Extract insert sizes (TLEN field, column 9):

<pre>
samtools view Vchol_35bp.bam | cut -f9 > initial.insertsizes.txt
</pre>
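As a rough cross-check of the R calculation that follows, the mean of the positive TLEN values can also be computed with <code>awk</code>. This is a simplified sketch on hypothetical SAM records: it does not trim the tails of the distribution as the R code does, and <code>mean_positive_tlen</code> is a made-up helper name.

```shell
workdir=$(mktemp -d)

# Toy SAM records (hypothetical); only column 9 (TLEN) matters here
{
    printf 'r1\t99\tc1\t100\t60\t4M\t=\t346\t250\tACGT\tIIII\n'
    printf 'r1\t147\tc1\t346\t60\t4M\t=\t100\t-250\tACGT\tIIII\n'
    printf 'r2\t99\tc1\t500\t60\t4M\t=\t796\t300\tACGT\tIIII\n'
    printf 'r2\t147\tc1\t796\t60\t4M\t=\t500\t-300\tACGT\tIIII\n'
    printf 'r3\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
} > "$workdir/toy.sam"

# Each pair reports TLEN twice (once positive, once negative);
# keeping only positive values counts every pair once and drops unmapped reads.
mean_positive_tlen() {
    cut -f9 "$1" | awk '$1 > 0 { sum += $1; n++ }
        END { if (n) printf "%.1f\n", sum / n }'
}

mean_positive_tlen "$workdir/toy.sam"
```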

Start R:

<pre>
R
</pre>

Then paste:

<pre>
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1] > 0, 1] # keep positive TLEN values only
q = quantile(a.v, seq(0,1,0.05)) # quantiles in 5% steps
mn = q[4] # 15% quantile
mx = q[18] # 85% quantile
mean(a.v[a.v >= mn & a.v <= mx]) # mean insert size
sd(a.v[a.v >= mn & a.v <= mx]) # standard deviation
</pre>

Q4. What are the mean insert size and standard deviation of the library?

Next, we will explore different k-mer sizes. Each student chooses a different k-mer from this Google sheet:

[https://docs.google.com/spreadsheets/d/1trUMlSwNLoNW67D-OkgA93iOQRp2iioyJSBYyW30P4U/edit?usp=sharing Google sheet for k-mer assignment]

Write your name next to the k-mer you select, then run MEGAHIT with that k-mer, replacing <code>[KMER]</code> below:

<pre>
export OMP_NUM_THREADS=4
/home/ctools/MEGAHIT-1.2.9-Linux-x86_64-static/bin/megahit \
-1 Vchol-001_6.pair1.cor.truncated.fq.gz \
-2 Vchol-001_6.pair2.cor.truncated.fq.gz \
--k-list [KMER] \
-t 4 \
-m 2000000000 \
-o [KMER]

gzip [KMER]/final.contigs.fa
</pre>

Compute assembly statistics using <code>QUAST</code>:

<pre>
python3 /home/ctools/quast/quast.py \
[KMER]/final.contigs.fa.gz \
--threads 1 \
-o [KMER]/quast
</pre>

Open the file <code>[KMER]/quast/report.txt</code> (or <code>report.tsv</code>) and
record the following values in the Google sheet for your k-mer:

<ul>
<li>Number of contigs (≥ 500 bp)</li>
<li>Total assembly length</li>
<li>Largest contig</li>
<li>N50</li>
</ul>
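To see what N50 measures, the following sketch computes it from a list of contig lengths: sort the contigs from longest to shortest and report the length at which the running total first reaches half of the assembly. The lengths are toy values and <code>n50</code> is a hypothetical helper; QUAST remains the tool to use for the real numbers.

```shell
# Read one contig length per line from stdin and print the N50.
n50() {
    sort -rn | awk '{ len[NR] = $1; total += $1 }
    END {
        for (i = 1; i <= NR; i++) {
            cum += len[i]
            # first contig at which the running sum covers half the assembly
            if (cum >= total / 2) { print len[i]; exit }
        }
    }'
}

# Toy contig lengths (hypothetical): total 300, half 150
printf '%s\n' 100 80 60 40 20 | n50
```

Here the two longest contigs sum to 180, which passes the halfway mark of 150, so the N50 is 80.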

As a class, compare results across k-mer sizes and discuss which k-mer produces
the most reasonable assembly and why.

Copy the best assembly to your folder, or use a precomputed multi-k assembly:

<pre>
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.fa.gz .
cp /home/projects/22126_NGS/exercises/denovo/best/default_final.contigs.stats .
</pre>

Q5. How does the N50 of the best assembly (multi-k or default) compare to the N50 from the fixed-k assemblies?

Q6. How does the longest contig length compare between fixed-k and multi-k/default assemblies?

<hr>

<h3>Coverage of the assembly</h3>

We will now calculate per-contig coverage and lengths, and visualize them in R.

<pre>
zcat default_final.contigs.fa.gz | /home/ctools/bin/fastx_megahit.sh --i /dev/stdin > default_finalt.cov
</pre>

Start R:

<pre>
R
</pre>

Then paste:

<pre>
library(plotrix)
dat <- read.table("default_finalt.cov", sep = "\t")

## ---- Coverage plots (2 panels) ----
pdf("best.coverage.pdf", width = 10, height = 5)
par(mfrow = c(1, 2))

weighted.hist(w = dat[,2],
x = dat[,1],
breaks = seq(0, 100, 1),
main = "Weighted coverage",
xlab = "Contig coverage")

hist(dat[,1],
xlim = c(0, 100),
breaks = seq(0, 1000, 1),
main = "Raw coverage",
xlab = "Contig coverage")

dev.off()

## ---- Scaffold lengths (1 panel) ----
pdf("scaffold.lengths.pdf", width = 7, height = 5)
par(mfrow = c(1, 1))

barplot(rev(sort(dat[,2])),
xlab = "# Scaffold",
ylab = "Length",
main = "Scaffold Lengths")

dev.off()
</pre>

Viewing the PDF files:

If you are using MobaXterm, you can open the PDF files directly from the
left-hand file panel.

If you are using macOS (or a standard terminal), copy the PDF files to your
local computer and open them with any PDF viewer. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/best.coverage.pdf .
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/scaffold.lengths.pdf .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on.

The left plot shows length-weighted coverage: long contigs contribute more to the histogram. The right plot shows the raw distribution of contig coverage. Typically, most of the assembly will cluster around the expected coverage (e.g. ~60–90×), and shorter contigs will have more variable coverage. The scaffold length plot shows that most of the assembled bases are in relatively long scaffolds.
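The difference between raw and length-weighted coverage can be illustrated with a small calculation. The sketch assumes a two-column table (coverage, then length, tab-separated) like the one read into R above; the values and the helper name <code>coverage_means</code> are hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical per-contig table: coverage <TAB> length
printf '80\t1000\n60\t600\n200\t400\n' > "$workdir/toy.cov"

coverage_means() {
    awk -F '\t' '{
        raw += $1              # every contig counts once
        weighted += $1 * $2    # every base counts once
        bases += $2
    }
    END { printf "raw=%.2f weighted=%.2f\n", raw / NR, weighted / bases }' "$1"
}

coverage_means "$workdir/toy.cov"
```

The short high-coverage contig pulls the raw mean up, while the weighted mean stays closer to the coverage of the long contigs.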

Q7. Why might some short contigs have much higher coverage than the main assembly?

Q8. Why might some short contigs have much lower coverage than the main assembly?

<hr>

<h3>Assembly evaluation</h3>

We will use QUAST to evaluate the assembly using various reference-based metrics.

QUAST: [https://quast.sourceforge.net/quast quast]

Run QUAST against the V. cholerae reference genome:

<pre>
python3 /home/ctools/quast/quast.py \
default_final.contigs.fa.gz \
--threads 1 \
-R /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa
</pre>

QUAST writes its results to <code>quast_results/latest/</code>; the main report is <code>report.html</code>.

If you are using MobaXterm, you can open the HTML files directly
from the left-hand file panel on the server.

If you are using macOS (or a standard terminal), copy the HTML files to
your local computer and open them in a web browser. For example:

<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:denovo/quast_results/latest/report.html .
</pre>

Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on. The files will be copied to your current local
directory.

Q9. The report lists several misassemblies. Can we always fully trust these “misassembly” calls? Why or why not?

<hr>

<h3>Visualization using Circoletto</h3>

We will visualize the assembly against the V. cholerae reference using Circoletto.

First, filter out contigs shorter than 500 bp:

<pre>
/home/ctools/bin/fastx_filterfasta.sh default_final.contigs.fa.gz 500 > default_final.contigs_filtered_500.fa
</pre>

On your local machine, open a browser and go to:

[https://bat.infspire.org/circoletto/ Circoletto]

Open the filtered assembly in a text editor on the server, for example:

<pre>
gedit default_final.contigs_filtered_500.fa &
</pre>

Copy–paste the FASTA content into the “Query fasta” box on the Circoletto page.

Then open the reference genome:

<pre>
gedit /home/projects/22126_NGS/exercises/denovo/reference/vibrio_cholerae_O1_N16961.fa &
</pre>

Copy–paste this into the “Subject fasta” box.

In the “Output” section, select “ONLY show the best hit per query”, then click Submit to Circoletto.

If Circoletto does not work, you can use this precomputed image:

<pre>
/home/projects/22126_NGS/exercises/denovo/circoletto_results/cl0011524231.blasted.png
</pre>

You should see the two V. cholerae chromosomes on the left (labelled with “gi|…”) and the alignment of your contigs to these chromosomes. Colours represent BLAST bitscores (red = high confidence, black = low).

Q10. Does your assembled genome appear broadly similar to the reference genome?

Q11. Are there contigs/scaffolds that do not map, or only partially map, to the reference?

Q12. On chromosome 2 (the smaller chromosome), there may be a region with many short, low-confidence hits. What might this region represent? Hint: see the V. cholerae genome paper and search for “V. cholerae integron island”: [https://www.nature.com/articles/35020000 V. cholerae genome paper]

<hr>

<h3>Try to assemble the genome using SPAdes (bonus)</h3>

Different assemblers can perform very differently. SPAdes is widely used and generally performs well. It performs error correction and uses multiple k-mer sizes internally.

SPAdes: [https://ablab.github.io/spades/ SPAdes]

Check the help output:

<pre>
python3 /home/ctools/SPAdes-4.2.0-Linux/bin/spades.py -h
</pre>

Note: A full SPAdes run may take ~45 minutes. You can use the precomputed SPAdes assembly instead and compare to MEGAHIT using QUAST and Assemblathon stats.

Link to the SPAdes assembly:

<pre>
ln -s /home/projects/22126_NGS/exercises/denovo/vchol/spades/spades.fasta spades.fasta
# from here you can compute stats and run QUAST
</pre>

<h3>Annotation of a prokaryotic genome</h3>

We will annotate genes in <code>ecoli_pacbio.contigs.fasta</code> using prodigal.

Prodigal: [https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-119 prodigal]

The output will be a GFF file with gene coordinates and a FASTA file with predicted proteins:

<pre>
prodigal \
-f gff \
-i [input genome in fasta] \
-a [output proteins in fasta] \
-o [output annotations in gff]
</pre>

GFF format: [https://www.ensembl.org/info/website/upload/gff.html GFF format description]

Next, index the protein FASTA file:

<pre>
samtools faidx ecoli_pacbio.contigs.aa
</pre>

Extract the protein sequence for gene ID <code>tig00000001_4582</code>:

<pre>
samtools faidx ecoli_pacbio.contigs.aa tig00000001_4582
</pre>

Use BLASTP against the nr database:

[https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastp&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome BLAST for proteins]

Paste the sequence and run BLASTP.

Q14. Which protein (function) does <code>tig00000001_4582</code> correspond to?

<hr>

Please find answers here: [[Denovo_solution|Denovo_solution]]

<hr>

Congratulations, you finished the exercise!

Ancient DNA exercise

2026-01-06T13:38:57Z

Mick:

<H2>Overview</H2>

Adapted from Martin Sikora.

First:
<OL>
<LI>Navigate to your home directory.</LI>
<LI>Create a directory called <code>adna</code>.</LI>
<LI>Navigate to the directory you just created.</LI>
</OL>

In this exercise we will:
# Authenticate ancient DNA.
# Do some basic population genetics.

<h2> Data authentication</h2>

Authentication means making sure that the DNA you have extracted from your fossil and sequenced is indeed from the fossil and not from a modern contaminant. A key difference between modern and ancient DNA is the chemical damage that accumulates over time.

<h3> Direct measurements of the rate of chemical damage</h3>

First, create a directory:
<pre>
mkdir 01_authentication
cd 01_authentication
</pre>

We will characterize DNA damage patterns using mapDamage, a tool that quantifies nucleotide misincorporation and fragmentation patterns in sequencing reads. In this section, we will examine some example BAM files for the damage patterns typical of ancient DNA.

We have a set of 10 modern and 26 ancient individuals (each subsampled to 100k reads):
<pre>
find /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ -name "*bam"
</pre>

First, run mapDamage on one of the modern individuals:

<pre>
mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/modern/NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.bam -r /home/databases/references/human/hs37d5.fa --no-stats
</pre>

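Examine the output (either via MobaXterm or by downloading it locally). The two plots to open are <code>Length_plot.pdf</code> and <code>Fragmisincorporation_plot.pdf</code>:

<pre>
cd NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.mapDamage/
ls Length_plot.pdf Fragmisincorporation_plot.pdf
cd ..
</pre>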

'''Q1:''' Which fragment length occurs most frequently?

'''Q2:''' What is the frequency of 5' C>T and 3' G>A substitutions?

Run mapDamage on one of the ancient individuals:
<pre>
mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ancient/allentoft_2015/RISE559.sort.rmdup.realign.md.100k.bam -r /home/databases/references/human/hs37d5.fa --no-stats
</pre>

Examine the output (either via MobaXterm or by downloading it locally). Again, open <code>Length_plot.pdf</code> and <code>Fragmisincorporation_plot.pdf</code>:
<pre>
cd RISE559.sort.rmdup.realign.md.100k.mapDamage/
ls Length_plot.pdf Fragmisincorporation_plot.pdf
</pre>

'''Q3:''' At what fragment length does the distribution show its peak?

'''Q4:''' What are the frequencies of 5' C>T (red line) and 3' G>A (blue line) substitutions?

'''Q5:''' Which bases are enriched at the 5' flanking position?

'''Q6:''' Does your sample look ancient? If not, what might be the reason?

<H2> Population genetics </H2>

Create a new subdirectory and navigate to it:
<pre>
cd ..
mkdir 02_popgen
cd 02_popgen
</pre>

<H3>Explore the reference panel dataset</H3>

Our reference panel dataset is in binary PLINK format, a widely used format in genetic studies (see documentation [https://www.cog-genomics.org/plink/1.9/ here]). We need to access the following files:

<pre>
ls /home/projects/22126_NGS/exercises/adna/02_popgen/plink/
</pre>

However, instead of copying them, we will create symbolic links using the <code>ln</code> command. A symbolic link acts as a placeholder that points to the original file, so the operating system treats it as if the file were in your directory. This saves considerable disk space compared to copying the files.

<pre>
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bed .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bim .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.cluster .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.fam .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.sampleInfo.txt .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/eur.poplist .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/modern.poplist .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/noneur.poplist .
</pre>
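If symbolic links are new to you, the following self-contained sketch shows their behaviour in a temporary directory. All file and directory names here are hypothetical and unrelated to the course data.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/shared" "$workdir/mydir"

# A hypothetical "shared" file that we do not want to copy
printf 'sample1\tpopA\n' > "$workdir/shared/panel.txt"

cd "$workdir/mydir"
ln -s "$workdir/shared/panel.txt" .   # create the link in the current directory

ls -l panel.txt    # the listing shows where the link points
cat panel.txt      # reads the original file through the link
```

Deleting the link removes only the placeholder; the original file is left untouched.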

The PLINK binary format consists of 3 files:

{| class="wikitable"
| '''file'''
| '''description'''
|-
| world.bed
| genotype data in binary format ('''not to be confused with the genomic-interval BED format, despite the shared extension''')
|-
| world.bim
| metadata for the variants, 1 line per variant
|-
| world.fam
| metadata for the samples, 1 line per sample
|}

We also have the following files that contain extra information:

{| class="wikitable"
| '''file'''
| '''description'''
|-
|world.cluster
| pre-defined population groupings for samples (for plink)
|-
| world.sampleInfo.txt
| additional sample metadata (for plotting etc)
|}

Let us explore the metadata files:

<pre>
head world.fam
head world.bim
head world.cluster
head world.sampleInfo.txt
</pre>

'''Q7:''' How many samples / SNPs are in our dataset?

'''Q8:''' Which populations are in our reference panel, and what is the sample size of each? (Hint: skip the header with <code>tail -n+2</code>, then use <code>sort</code> followed by <code>uniq -c</code>, which collapses repeated lines and prints how many times each one occurred.)
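The general pattern for counting group sizes can be tried on a toy table first. The file layout below (header line, then sample ID and population in column 2) is an assumption made for illustration; check with <code>head</code> which column holds the population in the real file. The helper name <code>count_groups</code> is hypothetical.

```shell
workdir=$(mktemp -d)

# Hypothetical sample sheet: header line, then sample ID and population
printf 'sample\tpopulation\ns1\tpopA\ns2\tpopB\ns3\tpopA\n' \
    > "$workdir/toy.sampleInfo.txt"

count_groups() {
    tail -n +2 "$1" \
        | cut -f2 \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'   # reorder to "population count"
}

count_groups "$workdir/toy.sampleInfo.txt"
```

The <code>sort</code> step is needed because <code>uniq</code> only collapses lines that are adjacent.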

Calculate basic summary statistics (a simple description of the data) for the dataset:

<pre>
/home/ctools/plink --bfile world --missing --out world
</pre>

'''Q9:''' Are you getting the same number of variants and individuals as you did with the UNIX command line?

The <code>world.imiss</code> file lists the number and fraction of missing genotypes for each sample.

'''Q10:''' What fraction of SNPs have a missing genotype for the Tyrolean Iceman?

<H3>Genotype and merge an ancient individual</H3>

In this section, we will merge our ancient data with the reference panel to prepare our dataset for downstream analysis. Genotypes for the ancient sample will be obtained by randomly sampling one read from the alignments (BAM file) at each SNP position of the reference dataset.

We are going to use a low-coverage individual (RISE507) from [https://pubmed.ncbi.nlm.nih.gov/26062507/ Allentoft et al.]. The data come from a ~5100-year-old individual of the Early Bronze Age [https://en.wikipedia.org/wiki/Afanasievo_culture Afanasievo culture] in the Altai Mountains region. The alignments are in:

<pre>
ls /home/projects/22126_NGS/exercises/adna/02_popgen/bam/
</pre>

First, we need to extract a genomic interval bed file for the SNP positions of the reference panel:
<pre>
awk '{print $1"\t"($4-1)"\t"$4}' world.bim | gzip > world.snps.bed.gz
</pre>

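awk is a small programming language for processing text line by line. Here we tell it to print, for each line of <code>world.bim</code>, the first column (chromosome), the fourth column minus 1, and the fourth column itself (position), separated by tabs. Subtracting 1 gives the 0-based start coordinate that the BED interval format expects.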
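The same <code>awk</code> program can be tried on a two-line toy <code>.bim</code> file to see the coordinate shift. The variant IDs and positions are hypothetical, and <code>bim_to_bed</code> is just a wrapper name chosen for this sketch.

```shell
workdir=$(mktemp -d)

# Toy .bim (hypothetical): chromosome, variant ID, genetic distance,
# position (1-based), allele 1, allele 2
printf '1\trs0001\t0\t1000\tA\tG\n2\trs0002\t0\t5\tC\tT\n' > "$workdir/toy.bim"

bim_to_bed() {
    # BED: chromosome, 0-based start, 1-based end
    awk '{ print $1 "\t" ($4 - 1) "\t" $4 }' "$1"
}

bim_to_bed "$workdir/toy.bim"
```

A SNP at 1-based position 1000 becomes the interval 999 to 1000, which covers exactly one base.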

Inspect the results:

<pre>
zcat world.snps.bed.gz | head
</pre>

Create a read pileup file for the reference panel SNP positions (this might take a few minutes):

<pre>
samtools mpileup -f /home/databases/references/human/hs37d5.fa -B -l world.snps.bed.gz /home/projects/22126_NGS/exercises/adna/02_popgen/bam/RISE507.sort.rmdup.realign.md.bam |gzip > RISE507.pileup.gz
</pre>

Examine the output:

<pre>
zcat RISE507.pileup.gz |head
</pre>

'''Q11:''' How many SNPs of the reference panel are covered in RISE507?

Now we will randomly sample a DNA fragment at each position and output the results in VCF format (custom python script):
<pre>
zcat RISE507.pileup.gz | /home/ctools/Python-2.7.18/bin/python2.7 /home/projects/22126_NGS/exercises/adna/02_popgen/get_haploid_vcf_from_pileup.py -r -s RISE507 |/home/ctools/htslib-1.20/bgzip -c > RISE507.vcf.gz
</pre>
We do this because the coverage is too low to call diploid genotypes reliably.
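The idea of random haploid sampling can be sketched in a few lines of <code>awk</code>. This is not the course script and makes strong simplifying assumptions: the pileup base column contains only <code>.</code>, <code>,</code> and plain base letters (no indel, read-start or read-end markers), and the output is a simple three-column table rather than VCF. The function name <code>sample_haploid</code> and the input lines are hypothetical.

```shell
# stdin: pileup-like lines (chrom, pos, ref, depth, bases, quals)
# $1:    optional random seed
sample_haploid() {
    awk -v seed="${1:-1}" 'BEGIN { srand(seed) }
    $4 > 0 {
        # pick one read base uniformly at random
        i = int(rand() * length($5)) + 1
        b = substr($5, i, 1)
        # "." and "," mean "same as reference" on either strand
        if (b == "." || b == ",") b = $3
        print $1 "\t" $2 "\t" toupper(b)
    }'
}

# Two hypothetical sites
printf '1\t1000\tA\t3\t.,G\tIII\n1\t2000\tC\t2\tTt\tII\n' | sample_haploid 42
```

Each site gets exactly one allele, so the resulting "genotype" is haploid; which allele is drawn at the first site depends on the random seed.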

Let us inspect the result:
<pre>
zcat RISE507.vcf.gz |grep -v "^#" |head
</pre>

We convert to plink binary format:
<pre>
/home/ctools/plink --vcf RISE507.vcf.gz --make-bed --double-id --out RISE507
</pre>

Try to merge the sample with the reference panel:
<pre>
/home/ctools/plink --bfile world --bmerge RISE507 --out RISE507.merge
</pre>

You should get an error.

'''Q12:''' How many SNPs failed the merge? What is the likely reason?

We will remove the failing SNPs and try again:
<pre>
/home/ctools/plink --bfile RISE507 --exclude RISE507.merge.missnp --make-bed --out RISE507.merge2
/home/ctools/plink --bfile world --bmerge RISE507.merge2 --out RISE507.world
</pre>

Make a cluster file (family ID, individual ID, cluster name) that we will use to subset the data by population:
<pre>
awk '{print $1,$2,$1}' RISE507.world.fam > RISE507.world.cluster
</pre>
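As a minimal sketch of what this command produces, the example below applies it to a toy .fam file. The file names and individual IDs are made up, and it assumes, as the commands in this exercise suggest, that the family ID column of the reference panel holds the population label. Because RISE507 was converted with --double-id, its family ID and individual ID are both RISE507.

```shell
# Toy .fam file (made-up individuals): FID, IID, father, mother, sex, phenotype
printf 'French ind1 0 0 1 -9\nHan ind2 0 0 2 -9\nRISE507 RISE507 0 0 0 -9\n' > toy.fam

# Same command as above: FID, IID, and FID again as the cluster name
awk '{print $1,$2,$1}' toy.fam > toy.cluster

cat toy.cluster
```

Each individual is thus assigned to a cluster named after its family ID, and the ancient sample forms a cluster of its own.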

<h3>Investigate the genetic affinities of the ancient sample using PCA</h3>

In this section, we will try to place our sample within a PCA of a set of modern and ancient individuals.

First, we will have a look at the modern populations in the reference panel:
<pre>
/home/ctools/plink --bfile RISE507.world --keep-clusters modern.poplist --within RISE507.world.cluster --pca header tabs --out modern
</pre>
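Besides the .eigenvec file, plink --pca also writes an .eigenval file with one eigenvalue per line. As a rough sketch, the share of each component can be computed as shown below on a toy file (made-up values, hypothetical file names). Note that this is the share among the reported components only, not of the total variance in the data.

```shell
# Toy .eigenval file with four made-up eigenvalues, one per line
printf '4\n3\n2\n1\n' > toy.eigenval

# Divide each eigenvalue by the sum of all reported eigenvalues
awk '{ v[NR] = $1; sum += $1 }
     END { for (i = 1; i <= NR; i++) printf "PC%d\t%.2f\n", i, v[i] / sum }' toy.eigenval > toy.pve.tsv

cat toy.pve.tsv
```

For the real data, replace toy.eigenval with modern.eigenval to see how much of the reported variation PC1 and PC2 account for relative to the other components.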

We can plot the first two principal components using the custom R script plotPca.R.

The three positional arguments are the eigenvector file, the sample info file and the prefix for the output (view the PDF either via MobaXterm or by downloading it locally):

<pre>
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R modern.eigenvec world.sampleInfo.txt modern
# Output file: modern.pca.plot.pdf
</pre>

'''Q13:''' Which populations are most differentiated along PC1?

'''Q14:''' Which populations are most differentiated along PC2?

We repeat the exercise on a subset of European populations (view the PDF either via MobaXterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --keep-clusters eur.poplist --within RISE507.world.cluster --pca header tabs --out eur
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R eur.eigenvec world.sampleInfo.txt eur
# Output file: eur.pca.plot.pdf
</pre>

'''Q15:''' Which populations are most differentiated along PC1?

'''Q16:''' Which populations are most differentiated along PC2?

Now, let us examine how the ancient individuals cluster compared to the modern ones (view the PDF either via MobaXterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --pca header tabs --out ancient.world
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.world.eigenvec world.sampleInfo.txt ancient.world
# Output file: ancient.world.pca.plot.pdf
</pre>

Here are some references if you want to read more about the different ancient samples:

{| class="wikitable"
| '''sample'''
| '''link'''
|-
| UstIshim
| [https://en.wikipedia.org/wiki/Ust%27-Ishim_man]
|-
| Loschbour
| [https://en.wikipedia.org/wiki/Loschbour_man] [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4170574/]
|-
| Brana
| [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4269527/]
|-
| NE1
| [https://www.pnas.org/content/113/2/368]
|-
| Stuttgart
| [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4170574/]
|-
| Iceman
| [https://www.iceman.it/en/the-iceman/]
|-
| Karelia
| [https://en.wikipedia.org/wiki/Karelians]
|-
| Samara
| [https://en.wikipedia.org/wiki/Samara_culture]
|-
| MA1
| [https://en.wikipedia.org/wiki/Mal%27ta%E2%80%93Buret%27_culture]
|-
| RISE507
| [https://pubmed.ncbi.nlm.nih.gov/26062507/]
|}

'''Q17:''' Which ancient individuals do not cluster close to any modern individuals? What could be a plausible reason?

Repeat the exercise but remove the non-European modern individuals (view the PDF either via MobaXterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --within RISE507.world.cluster --remove-clusters noneur.poplist --pca header tabs --out ancient.eur
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.eur.eigenvec world.sampleInfo.txt ancient.eur
# Output file: ancient.eur.pca.plot.pdf
</pre>

'''Q18:''' Which populations are most differentiated along PC1? What could be a plausible reason?

As a final exercise, we now project the ancient individual onto PCs inferred from modern Europeans (view the PDF either via MobaXterm or by downloading it locally):

<pre>
/home/ctools/plink --bfile RISE507.world --within RISE507.world.cluster --pca-clusters eur.poplist --remove-clusters noneur.poplist --pca header tabs --out ancient_proj.eur --maf 0.01
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient_proj.eur.eigenvec world.sampleInfo.txt ancient_proj.eur
# Output file: ancient_proj.eur.pca.plot.pdf
</pre>
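With --pca-clusters, plink computes the principal components only from individuals in the listed clusters and projects all remaining individuals onto them. The sketch below mimics that split on toy files (made-up names and IDs) so that you can check which individuals fall into which group. It assumes a list file with one cluster name per line and a cluster file with the cluster name in the third column.

```shell
# Toy cluster file (FID, IID, cluster) and toy list of clusters used to define the PCs
printf 'French ind1 French\nSardinian ind2 Sardinian\nRISE507 RISE507 RISE507\n' > toy.cluster
printf 'French\nSardinian\n' > toy.poplist

# Individuals in a listed cluster: these would define the PCs
awk 'NR == FNR { keep[$1]; next } ($3 in keep)' toy.poplist toy.cluster > toy.pc_defining.txt

# All other individuals: these would be projected onto the PCs
awk 'NR == FNR { keep[$1]; next } !($3 in keep)' toy.poplist toy.cluster > toy.projected.txt

cat toy.projected.txt
```

In the real command, the individuals in noneur.poplist are removed first, so the projected group consists of the ancient individuals that remain after this filtering.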

'''Q19:''' Where does our study individual cluster now?

'''Q20:''' How do you explain that an individual found close to the modern-day Chinese border is genetically closer to modern Europeans than to the Han Chinese?

Please find answers [[Ancient_DNA_exercise_answers|here]]