22126 - User contributions [en]

Microbial genomics exercise

2026-01-12T13:12:14Z

Rasmus:

== Introduction and getting ready ==

Dear course participants,

In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.

You can run the tools using the provided BASH executables in <code>/home/ctools/bin</code>.
The paths to the BASH executables are:

<pre>
/home/ctools/bin/kraken.sh
/home/ctools/bin/kraken_report.sh
/home/ctools/bin/gtdbtk.sh
/home/ctools/bin/mlst.sh
/home/ctools/bin/parsnp.sh
/home/ctools/bin/abricate.sh
</pre>

<code>/home/ctools/bin</code> should already be part of your standard configured $PATH, but, if not, then you can add it by the following command and you should be able to execute the tools with writing the full path:
<pre>
export PATH="/home/ctools/bin:$PATH"
</pre>

It is preferred that you should run the commands from your $HOME directory. So start by these commands that prints the path to your home directory and takes you there:
<pre>
echo $HOME
cd
</pre>

Now, let us image that we are employed at a hospital to provide diagnostics for patient care...

== Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens ==

You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".

[[Media:Nielsen et al 2025 Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections.pdf|Click here to see research article by Nielsen et al.]]

Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing tests in National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) in [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1108520/ BioProject PRJNA1108520].

You can use the prefetch and fasterq-dump tools from the [https://github.com/ncbi/sra-tools SRA Toolkit] to download the data. This has already been done and the data is in <code>/home/projects/microbial_genomics/task01_sequence_reads</code>.

Here are the commands that can be used to acquire the data (this is already done, but you are welcome to try yourselves in your home directory:
<pre>
cd $HOME
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX
</pre>

The files are named according to the SRA run accessions numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information [https://teaching.healthtech.dtu.dk/22126/images/6/64/Spectrum.03295-24-s0001.docx available here].

Table that translates run accession numbers to patient identifiers:

{| class="wikitable"
! sra_run_accession !! sample_id !! patient_id
|-
| SRR28959350 || s001 || p001
|-
| SRR28959349 || s002 || p002
|-
| SRR28959338 || s004 || p020
|-
| SRR28959334 || s006 || p028
|-
| SRR28959333 || s008 || p049
|-
| SRR28959332 || s009 || p072
|-
| SRR28959331 || s013 || p091
|-
| SRR28959330 || s014 || p098
|-
| SRR28959329 || s015 || p104
|-
| SRR28959328 || s016 || p105
|-
| SRR28959348 || s029 || p019
|-
| SRR28959347 || s031 || p092
|-
| SRR28959346 || s033 || p127
|-
| SRR28959345 || s034 || p128
|-
| SRR28959344 || s041 || p139
|-
| SRR28959343 || s042 || p140
|-
| SRR28959342 || s043 || p141
|-
| SRR28959341 || s044 || p143
|-
| SRR28959335 || s057 || p164
|-
| SRR28959340 || s058 || p172
|-
| SRR28959339 || s059 || p173
|-
| SRR28959337 || s060 || p175
|-
| SRR28959336 || s061 || p183
|}

The article does not mention how many microbial DNA sequencing reads that there are in each of the 23 positive metagenomic next-generation sequencing tests.

==== Question 1 ====
Check how many reads are in the FASTQ file for each of the run accessions, .e.g., by using the following command ?

<pre>
wc -l /home/projects/microbial_genomics/task01_sequence_reads/*.fastq | awk 'NF==2 {print $2 ": " $1/4}'
</pre>

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959328.fastq: 174
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq: 6489
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959330.fastq: 43
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959331.fastq: 56
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959332.fastq: 188
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959333.fastq: 9
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959334.fastq: 1426
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959335.fastq: 288
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959336.fastq: 839
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959337.fastq: 899
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959338.fastq: 6
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq: 217
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959340.fastq: 1879
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959341.fastq: 12
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959342.fastq: 45
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959343.fastq: 251
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959344.fastq: 2572
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959345.fastq: 69
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959346.fastq: 180
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959347.fastq: 37
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959348.fastq: 224
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959349.fastq: 303
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959350.fastq: 16
</pre>

</div>

Now, you want to try to determine the species the DNA that has been sequenced. For this, you can use Kraken that is a tool for taxonomic classification of DNA sequnece reads. Kraken does this by splitting the sequence read into k-mers that are then matched to a reference database with information about the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assign the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, the Kraken performance relies on the sequence and taxonomic classification of the genomes that has been used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relative small MiniKraken DB_8GB database.

Run Kraken for two of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/microbial_exercise/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/SRR28959339.kraken.out.txt > ~/microbial_exercise/SRR28959339.kraken.report.txt
</pre>

The output files contain:
* .kraken.out: A classification result for each read
* .kraken.report: A summary report of taxonomic assignments

==== Question 2 ====
How does your result align with the mNGS result in Table S3 from Nielsen et al.?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

You can use the following command to see if a considerable percentage (1st column) of reads classified to the species-level (6th column) corresponds to the species names presented in Table S3 from Nielsen et al.
<pre>
cat ~/microbial_exercise/SRR28959339.kraken.report.txt | grep -w S
</pre>

</div>

The size of the MiniKraken DB_8GB database used in this exercise, is around 8 gigabytes. Nielsen et al. uses another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 and virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.

==== Question 3 ====
Check [https://benlangmead.github.io/aws-indexes/k2 this webpage] to see what is the size of a database for Kraken 2 that is based on GTDB relases 226 ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

500,9 gigabytes

</div>

Your hospital have not yet implemented mNGS, but the hospital already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a ''Stapholycoccus aureus'' isolate and have made 100,000 of the sequence reads available for you in <code>/home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz</code>.

The hospital laboratory asks you to run Kraken on the 100,000 reads with these commands:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.report.txt
</pre>

The vast majority of reads that could be classified to species-level are assigned to ''Stapholycoccus aureus'', but also a minority of reads are assigned to other species.

==== Question 4 ====
Do you think that these reads represent contamination of DNA from other species ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

The reads could represent contamination contamination, but they could also represent misclassifications by Kraken.
</div>

== Task02: Species-level taxonomic classification of bacterial genomes ==

The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from clinical specimens. The sequence reads have been ''de novo'' assembled with [https://github.com/tseemann/shovill Shovill}], and the genome assemblies are stored in FASTA-formatted files available in <code>/home/projects/microbial_genomics/genome_assemblies</code>.

You want to determine the bacterial species of the assembled genomes.

We can use the [https://github.com/Ecogenomics/GTDBTk GTDB-Tk tool] to assign taxonomic classifications to bacterial genomes based on the [https://gtdb.ecogenomic.org Genome Taxonomy Database (GTDB)].

Run <code>gtdbtk.sh -h</code> to get help information on how to use GTDB-Tk.

Unfortunately, GTDK-Tk uses a lot of memory, so it is not feasible to run the tool at the course servers. Instead, the following command has already been to determine the species of the genomes and the results have been moved to <code>/home/projects/microbial_genomics/premade_task_results_from_teacher</code>:
<pre>
DO NOT RUN THIS gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/task02_assemblies --out_dir ~/microbial_exercise/output_gtdb DO NOT RUN THIS
</pre>

==== Question 1 ====
What species are the genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH031220x1x74_210120_A00559_AHYKKMDRXX Escherichia ruysiae
SH090321x2x74_210615_A01176_AHFKVKDRXY Escherichia marmotae
SH178x24_220708_NS500314_AHTFHHAFX3 Klebsiella michiganensis
SH179x37_220805_A01176_AHG77VDRX2 Citrobacter braakii
SH182x46_220812_A00559_BHG7JLDRX2 Klebsiella grimontii
SH185x39_220812_A00559_BHG7JLDRX2 Acinetobacter geminorum
SH189x45_220812_A00559_BHG7JLDRX2 Mycobacterium intracellulare
SH204x81_220930_NS500314_AHW7FKAFX3 No species called
SH204x82_220930_NS500314_AHW7FKAFX3 No species called
SH214x27_221104_A00559_AHKYCVDRX2 Citrobacter europaeus
SH214x54_221104_A00559_AHKYCVDRX2 Klebsiella huaxiensis
SH232x83_230120_A01411_AHKYCLDRX2 Enterococcus_B lactis
SH249x46_230306_A01411_AHMF5GDRX2 Klebsiella africana
SH257x03_230317_A00559_AHM5H3DRX2 Enterobacter mori
SH272x52_230512_A00559_AH5MNFDRX3 Stenotrophomonas maltophilia_AJ
SH274x36_230421_A00559_AHYGVWDRX2 Klebsiella michiganensis
SH275x46_230526_A01411_BHC33YDRX3 Klebsiella variicola
SH283x25_230721_A00559_BHGVVNDRX3 Citrobacter braakii
SH292x11_230714_A01411_AHGW7MDRX3 Stenotrophomonas maltophilia_G
SH315x12_230929_A01176_AHKHK3DRX3 No species called
SH333x35_231006_NB501792_AHMVFTAFX5 Enterobacter intestinihominis
SH338x33_231110_A00559_AHMTNJDRX3 Klebsiella quasipneumoniae
SH339x55_231020_NB501792_AHMTK7AFX5 Citrobacter_A sedlakii
SH344x10_231103_A00559_BHMN2KDRX3 Escherichia marmotae
SH344x63_231103_A00559_BHMN2KDRX3 Bacillus_A cereus
SH351x67_231211_A00559_AHMTCTDRX3 Acinetobacter pittii
SH372x20_240212_A01961_AHVFV3DRX3 Acinetobacter fasciculus
SH406x40_240426_NB501792_AH5LK3AFX7 Acinetobacter nosocomialis
SH420x32_240607_A01961_AH7HW2DRX5 Achromobacter ruhlandii
SH429x69_240628_A01961_AHC7NCDRX5 No species called
SH432x68_240705_A00559_AHFKFHDRX5 No species called
SH437x71_240906_A01411_AHGVFVDRX5 Klebsiella planticola
SH443x95_240809_A00559_AHFL5YDRX5 No species called
SH463x12_240927_NS500314_AHHJHMAFX7 Citrobacter portucalensis
SH485x65_241122_A01411_BHM2VFDRX5 No species called
SH490x72_241206_A01411_AHM2CHDRX5 No species called
SH512x30_250321_A01961_BHTMCKDRX5 Acinetobacter fasciculus
SH529x38_250613_LH00793_B22C72LLT1 Klebsiella grimontii
SH545x01_250516_A01411_AHYJHCDRX5 Serratia nevei
SH567x14_250926_LH00255_A22YTGHLT3 Klebsiella variicola
SH623x49_251128_LH00255_B22FYTTLT1 No species called
SH623x50_251128_LH00255_B22FYTTLT1 No species called
</pre>

</div>

GTDB-Tk is unable to call a species for 11 of the genomes. Two of these genomes are SH315x12_230929_A01176_AHKHK3DRX3 and SH490x72_241206_A01411_AHM2CHDRX5.

==== Question 2 ====
What is the average nucleotide identity for the best match for each of these genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH315x12_230929_A01176_AHKHK3DRX3 GCA_023446155.1, s__Flavobacterium sp023446155, 90.92%
SH490x72_241206_A01411_AHM2CHDRX5 GCA_021654635.1, s__Mycobacterium kiyosense, 82.85%
</pre>

</div>

== Task03: Subspecies-level genetic typing of bacterial genomes ==

After species identification, it is most often relevant to further genetically group isolates at a subspecies-level. Such groups may be referred to as genotypes, lineages, clone types, clonal complexes, strains, depending on the organism and method used, and methods may differ the approach and the genetic markers that they use for grouping.

Bacterial genomes are commonly grouped using multilocus sequence typing (MLST), which uses the sequences of typically seven gene loci present in all members of a species. The combination of allele numbers assigned to sequence variants at each locus defines the sequence type (ST).

The laboratory has sequenced and de novo assembled the genomes of 36 bacterial isolates of species ''Pseudomonas aeruginosa''. The assembled genomes are available in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

Use tool [https://github.com/tseemann/mlst mlst] by Torsten Seemann to determine the sequence types of the 36 genomes.

Start by inspecting the available options and then run the tool on the genomes:

<pre>
mlst.sh -h
mlst.sh --scheme paeruginosa /home/projects/microbial_genomics/task03_assemblies/*fna > ~/microbial_exercise/output_mlst.txt
</pre>

You will find 3 different allele combinations among the 36 genomes. E.g., use this command to see the allele combinations:
<pre>
cat output_mlst.txt | cut -f4-10 |sort | uniq -c
</pre>

==== Question 1 ====
What allele combinations did you find and what are the correspoonding sequence types ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
Unknown sequence type 6 that has allele combination acsA(38) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
Sequence type 2326 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(202)
Sequence type 207 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
</pre>

</div>

==== Question 2 ====
How many alleles differ between the sequence types ? (ST207 versus ST2326, ST207 versus the unknown ST, ST2326 versus the unknown ST)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
ST207 versus ST2326: 1 allele differs
ST207 versus the unknown ST: 1 allele differs
ST2326 versus the unknown ST: 2 allele differs
</pre>

</div>

== Task04: Whole-genome based determination of genetic relatedness ==

You used sequence information from seven genes in the previous task; nonetheless, we can improve the resolution of our genomic analysis if we make use of the whole genome sequence. Tool [https://github.com/marbl/parsnp Parsnp] can be used align whole genome sequences and to detect single nucleotide substitution variants in the sequence alignments. In the scope of this task, we refer to such substitution variants as single nucleotide polymorphisms (SNPs).

First, view the Parsnp help information:
<pre>
parsnp.sh -h
</pre>

You now want to use Parsnp to compare the 36 ''Pseudomonas aeruginosa'' genomes in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

You can use the following command to compare the genomes with Parsnp:
<pre>
parsnp.sh -r ! -d /home/projects/microbial_genomics/task03_assemblies -o ~/microbial_exercise/output_parsnp -p 10 -c --use-fasttree
</pre>

Above command produces an output file ''parsnp.ggr'' whcih is a compressed representation of the alignment result.

You can use [https://github.com/marbl/harvest-tools Harvesttools] to convert alignment into an alignment represented in a multi-FASTA format that include only SNPs positions:
<pre>
harvesttools.sh -i ~/microbial_exercise/output_parsnp/parsnp.ggr -S ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta
</pre>

You can use [https://github.com/tseemann/snp-dists snp-dists] to convert the multi-FASTA alignment into SNP-distances, i.e. a measure of the genetic relationship (distance) between the genomes. snp-dists finds pairwise SNP-distances that can be presented in a square matrix (standard representation) or in a list (option ''-m''). You can test both with the following commands:
<pre>
snp-dists.sh ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
</pre>

Note, all pairwise SNP-distances are outputted twice in both the matrix and the list (A versus B and B versus A, respectively).

Parsnp also uses the aligment to make a phylogenetic tree (file ''parsnp.tree'' in output). The tree is in Newick format, and you can, e.g., use [https://itol.embl.de/ iTol] to visualize the tree. Below, I have inserted a visualization of the tree:

[[File:Full_tree.png|1000px]]

The large genetic distances genomes of the unknown ST to both ST207 and ST2326 makes difficult to see the structure of the tree for genomes of the same ST. Below, I have inserted the same tree but without showing genomes of the unknown ST:

[[File:STunknown_removed_tree.png|1200px]]

The genomes are named according their ST and the origin of the bacterial isolate. The origin can either be a person (person 1 to 19) or the hospital environment. You may have noticed that the structure of the phylogenetic tree corresponds well to the ST and origin of genomes.

Now, you want to parse the listed pairwise genetic distances to find out how they are distributed dependent on whether the genomes are from the same ST and/or origin.

We already saw from the phylogenetic tree that genomes of the unknown ST are distant from genomes of ST207 and ST2326. You can use the following command to find the minimum and maximum distances between genomes of the unknown ST and genomes of either ST207 or ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

You can use the following command to find the minimum and maximum distances between genomes of ST207 and ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

==== Question 1 ====
Do you think that the grouping by MLST reflects well the genetic distances between genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Genomes of the unknown ST have minimum 16,339 SNPs to genomes of ST207 or ST2326, whereas genomes og ST207 are maximum different by 897 SNPs.

Accordingly, the unknown ST is genetically more distant from ST207 and ST2326 than it is apparent from the MLST.

</div>

Below is a slide from today's lecture that shows the distribution of genetic distances (pairwise SNP distances) in a collection of 446 ''Pseudomonas aeruginosa'' genomes.

[[File:Slide_pa_distribution.png|1000px]]

==== Question 2 ====

Based on the shown distribution, do you think that the following statements are true or false:
* A: ST207 and ST2326 are the same strain ?
* B: ST207 and the unknown ST are the same phylogroup ?
* C: ST2326 and the unknown ST are different strains ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
* A: True
* B: True
* C: True

</div>

You can use the following command to find the distances between genomes from the same person:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v environment | sed 's/_/\t/2; s/_/\t/6' | awk '$1==$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5
</pre>

==== Question 3 ====
If we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2, what is the typical SNP-distance between genomes from bacteria sampled from the same person ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
There are 7 to 122 SNPs between genomes from the same person (if we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2).

</div>

If two genomes differ by only a small number of SNPs, this suggests they diverged recently from a common ancestor. When such closely related genomes are identified in samples from different persons (patients) within a hospital, it raises the possibility that transmission occurred within the hospital.

You can find the minimum SNP-distance between genomes from different persons with this command:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | sed 's/_/\t/2; s/_/\t/6' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | grep -v environment | tail -n1
</pre>

==== Question 4 ====
Do some of the observed SNP-distances indicate that there has been transmission of ''Pseudomonas aeruginosa'' at the hospital ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Yes, some genomes from different persons have pairwise SNP-distances that are low relative to the the distances between genomes from the same person. A quick search for information on the world wide web may suggest that for ''Pseudomonas aeruginosa'', SNP-distances less than around 30 SNPs indicates transmission at the hospital.

</div>

== Task05: Detection of antimicrobial resistance genes ==

Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.

The tool [https://github.com/tseemann/abricate Abricate] can be used to screen genome assemblies against curated resistance gene databases.

Inspect the available options and databases with the following commands:
<pre>
abricate.sh -h
abricate.sh --list
</pre>

Run Abricate on the five ST2326 genome assemblies using the Resfinder database:

<pre>
abricate.sh --db resfinder /home/projects/microbial_genomics/task03_assemblies/ST2326*.fna > ~/microbial_exercise/output_abricate.txt
</pre>

==== Question 1 ====

Are the same AMR genes found in the five genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No

</div>

==== Question 2 ====

Do all found AMR genes have a good match to a database reference gene ? (sequence coverage of > 90% and sequence identity > 95%)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No, three gene calls fails the criteria for a good match. Following command can be used the find the three gene calls:
<pre>
cat ~/microbial_exercise/output_abricate.txt | cut -f1,6,10,11 | awk '$3<=90 || $4<=95'
</pre>

</div>

== Supplementary information and files ==

[https://teaching.healthtech.dtu.dk/22126/images/c/cd/Gtdbtk.pdf Article describing GTDB-Tk can be found here]

The BASH executables in <code>/home/ctools/bin</code> uses Apptainer container images (.sif file format) that are located <code>/home/projects/microbial_genomics/singularity_image_files</code>.

Microbial genomics exercise

2026-01-12T12:58:32Z

Rasmus:

== Introduction and getting ready ==

Dear course participants,

In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.

You can run the tools using the provided BASH executables in <code>/home/ctools/bin</code>.
The paths to the BASH executables are:

<pre>
/home/ctools/bin/kraken.sh
/home/ctools/bin/kraken_report.sh
/home/ctools/bin/gtdbtk.sh
/home/ctools/bin/mlst.sh
/home/ctools/bin/parsnp.sh
/home/ctools/bin/abricate.sh
</pre>

<code>/home/ctools/bin</code> should already be part of your standard configured $PATH, but, if not, then you can add it by the following command and you should be able to execute the tools with writing the full path:
<pre>
export PATH="/home/ctools/bin:$PATH"
</pre>

It is preferred that you should run the commands from your $HOME directory. So start by these commands that prints the path to your home directory and takes you there:
<pre>
echo $HOME
cd
</pre>

Now, let us image that we are employed at a hospital to provide diagnostics for patient care...

== Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens ==

You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".

[[Media:Nielsen et al 2025 Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections.pdf|Click here to see research article by Nielsen et al.]]

Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing tests in National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) in [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1108520/ BioProject PRJNA1108520].

You can use the prefetch and fasterq-dump tools from the [https://github.com/ncbi/sra-tools SRA Toolkit] to download the data. This has already been done and the data is in <code>/home/projects/microbial_genomics/task01_sequence_reads</code>.

Here are the commands that can be used to acquire the data (this is already done, but you are welcome to try yourselves in your home directory:
<pre>
cd $HOME
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX
</pre>

The files are named according to the SRA run accessions numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information [https://teaching.healthtech.dtu.dk/22126/images/6/64/Spectrum.03295-24-s0001.docx available here].

Table that translates run accession numbers to patient identifiers:

{| class="wikitable"
! sra_run_accession !! sample_id !! patient_id
|-
| SRR28959350 || s001 || p001
|-
| SRR28959349 || s002 || p002
|-
| SRR28959338 || s004 || p020
|-
| SRR28959334 || s006 || p028
|-
| SRR28959333 || s008 || p049
|-
| SRR28959332 || s009 || p072
|-
| SRR28959331 || s013 || p091
|-
| SRR28959330 || s014 || p098
|-
| SRR28959329 || s015 || p104
|-
| SRR28959328 || s016 || p105
|-
| SRR28959348 || s029 || p019
|-
| SRR28959347 || s031 || p092
|-
| SRR28959346 || s033 || p127
|-
| SRR28959345 || s034 || p128
|-
| SRR28959344 || s041 || p139
|-
| SRR28959343 || s042 || p140
|-
| SRR28959342 || s043 || p141
|-
| SRR28959341 || s044 || p143
|-
| SRR28959335 || s057 || p164
|-
| SRR28959340 || s058 || p172
|-
| SRR28959339 || s059 || p173
|-
| SRR28959337 || s060 || p175
|-
| SRR28959336 || s061 || p183
|}

The article does not mention how many microbial DNA sequencing reads that there are in each of the 23 positive metagenomic next-generation sequencing tests.

==== Question 1 ====
Check how many reads are in the FASTQ file for each of the run accessions, .e.g., by using the following command ?

<pre>
wc -l /home/projects/microbial_genomics/task01_sequence_reads/*.fastq | awk 'NF==2 {print $2 ": " $1/4}'
</pre>

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959328.fastq: 174
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq: 6489
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959330.fastq: 43
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959331.fastq: 56
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959332.fastq: 188
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959333.fastq: 9
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959334.fastq: 1426
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959335.fastq: 288
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959336.fastq: 839
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959337.fastq: 899
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959338.fastq: 6
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq: 217
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959340.fastq: 1879
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959341.fastq: 12
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959342.fastq: 45
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959343.fastq: 251
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959344.fastq: 2572
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959345.fastq: 69
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959346.fastq: 180
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959347.fastq: 37
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959348.fastq: 224
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959349.fastq: 303
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959350.fastq: 16
</pre>

</div>

Now, you want to try to determine the species the DNA that has been sequenced. For this, you can use Kraken that is a tool for taxonomic classification of DNA sequnece reads. Kraken does this by splitting the sequence read into k-mers that are then matched to a reference database with information about the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assign the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, the Kraken performance relies on the sequence and taxonomic classification of the genomes that has been used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relative small MiniKraken DB_8GB database.

Run Kraken for two of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/microbial_exercise/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/SRR28959339.kraken.out.txt > ~/microbial_exercise/SRR28959339.kraken.report.txt
</pre>

The output files contain:
* .kraken.out: A classification result for each read
* .kraken.report: A summary report of taxonomic assignments

==== Question 2 ====
How does your result align with the mNGS result in Table S3 from Nielsen et al.?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

You can use the following command to see if a considerable percentage (1st column) of reads classified to the species-level (6th column) corresponds to the species names presented in Table S3 from Nielsen et al.
<pre>
cat ~/microbial_exercise/SRR28959339.kraken.report.txt | grep -w S
</pre>

</div>

The size of the MiniKraken DB_8GB database used in this exercise, is around 8 gigabytes. Nielsen et al. uses another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 and virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.

==== Question 3 ====
Check [https://benlangmead.github.io/aws-indexes/k2 this webpage] to see what is the size of a database for Kraken 2 that is based on GTDB relases 226 ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

500,9 gigabytes

</div>

Your hospital have not yet implemented mNGS, but the hospital already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a ''Stapholycoccus aureus'' isolate and have made 100,000 of the sequence reads available for you in <code>/home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz</code>.

The hospital laboratory asks you to run Kraken on the 100,000 reads with these commands:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.report.txt
</pre>

The vast majority of reads that could be classified to species-level are assigned to ''Stapholycoccus aureus'', but also a minority of reads are assigned to other species.

==== Question 4 ====
Do you think that these reads represent contamination of DNA from other species ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

The reads could represent contamination contamination, but they could also represent misclassifications by Kraken.
</div>

== Task02: Species-level taxonomic classification of bacterial genomes ==

The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from clinical specimens. The sequence reads have been ''de novo'' assembled with [https://github.com/tseemann/shovill Shovill}], and the genome assemblies are stored in FASTA-formatted files available in <code>/home/projects/microbial_genomics/genome_assemblies</code>.

You want to determine the bacterial species of the assembled genomes.

We can use the [https://github.com/Ecogenomics/GTDBTk GTDB-Tk tool] to assign taxonomic classifications to bacterial genomes based on the [https://gtdb.ecogenomic.org Genome Taxonomy Database (GTDB)].

Run <code>gtdbtk.sh -h</code> to get help information on how to use GTDB-Tk.

Unfortunately, GTDK-Tk uses a lot of memory, so it is not feasible to run the tool at the course servers. Instead, the following command has already been to determine the species of the genomes and the results have been moved to <code>/home/projects/microbial_genomics/premade_task_results_from_teacher</code>:
<pre>
DO NOT RUN THIS gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/task02_assemblies --out_dir ~/microbial_exercise/output_gtdb DO NOT RUN THIS
</pre>

==== Question 1 ====
What species are the genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH031220x1x74_210120_A00559_AHYKKMDRXX Escherichia ruysiae
SH090321x2x74_210615_A01176_AHFKVKDRXY Escherichia marmotae
SH178x24_220708_NS500314_AHTFHHAFX3 Klebsiella michiganensis
SH179x37_220805_A01176_AHG77VDRX2 Citrobacter braakii
SH182x46_220812_A00559_BHG7JLDRX2 Klebsiella grimontii
SH185x39_220812_A00559_BHG7JLDRX2 Acinetobacter geminorum
SH189x45_220812_A00559_BHG7JLDRX2 Mycobacterium intracellulare
SH204x81_220930_NS500314_AHW7FKAFX3 No species called
SH204x82_220930_NS500314_AHW7FKAFX3 No species called
SH214x27_221104_A00559_AHKYCVDRX2 Citrobacter europaeus
SH214x54_221104_A00559_AHKYCVDRX2 Klebsiella huaxiensis
SH232x83_230120_A01411_AHKYCLDRX2 Enterococcus_B lactis
SH249x46_230306_A01411_AHMF5GDRX2 Klebsiella africana
SH257x03_230317_A00559_AHM5H3DRX2 Enterobacter mori
SH272x52_230512_A00559_AH5MNFDRX3 Stenotrophomonas maltophilia_AJ
SH274x36_230421_A00559_AHYGVWDRX2 Klebsiella michiganensis
SH275x46_230526_A01411_BHC33YDRX3 Klebsiella variicola
SH283x25_230721_A00559_BHGVVNDRX3 Citrobacter braakii
SH292x11_230714_A01411_AHGW7MDRX3 Stenotrophomonas maltophilia_G
SH315x12_230929_A01176_AHKHK3DRX3 No species called
SH333x35_231006_NB501792_AHMVFTAFX5 Enterobacter intestinihominis
SH338x33_231110_A00559_AHMTNJDRX3 Klebsiella quasipneumoniae
SH339x55_231020_NB501792_AHMTK7AFX5 Citrobacter_A sedlakii
SH344x10_231103_A00559_BHMN2KDRX3 Escherichia marmotae
SH344x63_231103_A00559_BHMN2KDRX3 Bacillus_A cereus
SH351x67_231211_A00559_AHMTCTDRX3 Acinetobacter pittii
SH372x20_240212_A01961_AHVFV3DRX3 Acinetobacter fasciculus
SH406x40_240426_NB501792_AH5LK3AFX7 Acinetobacter nosocomialis
SH420x32_240607_A01961_AH7HW2DRX5 Achromobacter ruhlandii
SH429x69_240628_A01961_AHC7NCDRX5 No species called
SH432x68_240705_A00559_AHFKFHDRX5 No species called
SH437x71_240906_A01411_AHGVFVDRX5 Klebsiella planticola
SH443x95_240809_A00559_AHFL5YDRX5 No species called
SH463x12_240927_NS500314_AHHJHMAFX7 Citrobacter portucalensis
SH485x65_241122_A01411_BHM2VFDRX5 No species called
SH490x72_241206_A01411_AHM2CHDRX5 No species called
SH512x30_250321_A01961_BHTMCKDRX5 Acinetobacter fasciculus
SH529x38_250613_LH00793_B22C72LLT1 Klebsiella grimontii
SH545x01_250516_A01411_AHYJHCDRX5 Serratia nevei
SH567x14_250926_LH00255_A22YTGHLT3 Klebsiella variicola
SH623x49_251128_LH00255_B22FYTTLT1 No species called
SH623x50_251128_LH00255_B22FYTTLT1 No species called
</pre>

</div>

GTDB-Tk is unable to call a species for 11 of the genomes. Two of these genomes are SH315x12_230929_A01176_AHKHK3DRX3 and SH490x72_241206_A01411_AHM2CHDRX5.

==== Question 2 ====
What is the average nucleotide identity for the best match for each of these genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH315x12_230929_A01176_AHKHK3DRX3 GCA_023446155.1, s__Flavobacterium sp023446155, 90.92%
SH490x72_241206_A01411_AHM2CHDRX5 GCA_021654635.1, s__Mycobacterium kiyosense, 82.85%
</pre>

</div>

== Task03: Subspecies-level genetic typing of bacterial genomes ==

After species identification, it is most often relevant to further genetically group isolates at a subspecies-level. Such groups may be referred to as genotypes, lineages, clone types, clonal complexes, strains, depending on the organism and method used, and methods may differ the approach and the genetic markers that they use for grouping.

Bacterial genomes are commonly grouped using multilocus sequence typing (MLST), which uses the sequences of typically seven gene loci present in all members of a species. The combination of allele numbers assigned to sequence variants at each locus defines the sequence type (ST).

The laboratory has sequenced and de novo assembled the genomes of 36 bacterial isolates of species ''Pseudomonas aeruginosa''. The assembled genomes are available in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

Use tool [https://github.com/tseemann/mlst mlst] by Torsten Seemann to determine the sequence types of the 36 genomes.

Start by inspecting the available options and then run the tool on the genomes:

<pre>
mlst.sh -h
mlst.sh --scheme paeruginosa /home/projects/microbial_genomics/task03_assemblies/*fna > ~/microbial_exercise/output_mlst.txt
</pre>

You will find 3 different allele combinations among the 36 genomes. E.g., use this command to see the allele combinations:
<pre>
cat output_mlst.txt | cut -f4-10 |sort | uniq -c
</pre>

==== Question 1 ====
What allele combinations did you find and what are the correspoonding sequence types ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
Unknown sequence type 6 that has allele combination acsA(38) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
Sequence type 2326 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(202)
Sequence type 207 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
</pre>

</div>

==== Question 2 ====
How many alleles differ between the sequence types ? (ST207 versus ST2326, ST207 versus the unknown ST, ST2326 versus the unknown ST)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
ST207 versus ST2326: 1 allele differs
ST207 versus the unknown ST: 1 allele differs
ST2326 versus the unknown ST: 2 allele differs
</pre>

</div>

== Task04: Whole-genome based determination of genetic relatedness ==

You used sequence information from seven genes in the previous task; nonetheless, we can improve the resolution of our genomic analysis if we make use of the whole genome sequence. Tool [https://github.com/marbl/parsnp Parsnp] can be used align whole genome sequences and to detect single nucleotide substitution variants in the sequence alignments. In the scope of this task, we refer to such substitution variants as single nucleotide polymorphisms (SNPs).

First, view the Parsnp help information:
<pre>
parsnp.sh -h
</pre>

You now want to use Parsnp to compare the 36 ''Pseudomonas aeruginosa'' genomes in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

You can use the following command to compare the genomes with Parsnp:
<pre>
parsnp.sh -r ! -d /home/projects/microbial_genomics/task03_assemblies -o ~/microbial_exercise/output_parsnp -p 10 -c --use-fasttree
</pre>

Above command produces an output file ''parsnp.ggr'' whcih is a compressed representation of the alignment result.

You can use [https://github.com/marbl/harvest-tools Harvesttools] to convert alignment into an alignment represented in a multi-FASTA format that include only SNPs positions:
<pre>
harvesttools.sh -i ~/microbial_exercise/output_parsnp/parsnp.ggr -S ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta
</pre>

You can use [https://github.com/tseemann/snp-dists snp-dists] to convert the multi-FASTA alignment into SNP-distances, i.e. a measure of the genetic relationship (distance) between the genomes. snp-dists finds pairwise SNP-distances that can be presented in a square matrix (standard representation) or in a list (option ''-m''). You can test both with the following commands:
<pre>
snp-dists.sh ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
</pre>

Note, all pairwise SNP-distances are outputted twice in both the matrix and the list (A versus B and B versus A, respectively).

Parsnp also uses the aligment to make a phylogenetic tree (file ''parsnp.tree'' in output). The tree is in Newick format, and you can, e.g., use [https://itol.embl.de/ iTol] to visualize the tree. Below, I have inserted a visualization of the tree:

[[File:Full_tree.png|1000px]]

The large genetic distances genomes of the unknown ST to both ST207 and ST2326 makes difficult to see the structure of the tree for genomes of the same ST. Below, I have inserted the same tree but without showing genomes of the unknown ST:

[[File:STunknown_removed_tree.png|1200px]]

The genomes are named according their ST and the origin of the bacterial isolate. The origin can either be a person (person 1 to 19) or the hospital environment. You may have noticed that the structure of the phylogenetic tree corresponds well to the ST and origin of genomes.

Now, you want to parse the listed pairwise genetic distances to find out how they are distributed dependent on whether the genomes are from the same ST and/or origin.

We already saw from the phylogenetic tree that genomes of the unknown ST are distant from genomes of ST207 and ST2326. You can use the following command to find the minimum and maximum distances between genomes of the unknown ST and genomes of either ST207 or ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

You can use the following command to find the minimum and maximum distances between genomes of ST207 and ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

==== Question 1 ====
Do you think that the grouping by MLST reflects well the genetic distances between genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Genomes of the unknown ST have minimum 16,339 SNPs to genomes of ST207 or ST2326, whereas genomes og ST207 are maximum different by 897 SNPs.

Accordingly, the unknown ST is genetically more distant from ST207 and ST2326 than it is apparent from the MLST.

</div>

Below is a slide from today's lecture that shows the distribution of genetic distances (pairwise SNP distances) in a collection of 446 ''Pseudomonas aeruginosa'' genomes.

[[File:Slide_pa_distribution.png|1000px]]

==== Question 2 ====

Based on the shown distribution, do you think that the following statements are true or false:
* A: ST207 and ST2326 are the same strain ?
* B: ST207 and the unknown ST are the same phylogroup ?
* C: ST2326 and the unknown ST are different strains ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
* A: True
* B: True
* C: True

</div>

You can use the following command to find the distances between genomes from the same person:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v environment | sed 's/_/\t/2; s/_/\t/6' | awk '$1==$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5
</pre>

==== Question 3 ====
If we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2, what is the typical SNP-distance between genomes from bacteria sampled from the same person ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
There are 7 to 122 SNPs between genomes from the same person (if we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2).

</div>

If two genomes differ by only a small number of SNPs, this suggests they diverged recently from a common ancestor. When such closely related genomes are identified in samples from different persons (patients) within a hospital, it raises the possibility that transmission occurred within the hospital.

You can find the minimum SNP-distance between genomes from different persons with this command:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | sed 's/_/\t/2; s/_/\t/6' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | grep -v environment | tail -n1
</pre>

==== Question 4 ====
Do some of the observed SNP-distances indicate that there has been transmission of ''Pseudomonas aeruginosa'' at the hospital ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Yes, some genomes from different persons have pairwise SNP-distances that are low relative the the distances between genomes from the same person. A quick search for information on the world wide web may suggest that for ''Pseudomonas aeruginosa'', SNP-distances less than around 30 SNPs indicates transmission at the hospital.

</div>

== Task05: Detection of antimicrobial resistance genes ==

Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.

The tool [https://github.com/tseemann/abricate Abricate] can be used to screen genome assemblies against curated resistance gene databases.

Inspect the available options and databases with the following commands:
<pre>
abricate.sh -h
abricate.sh --list
</pre>

Run Abricate on the five ST2326 genome assemblies using the Resfinder database:

<pre>
abricate.sh --db resfinder /home/projects/microbial_genomics/task03_assemblies/ST2326*.fna > ~/microbial_exercise/output_abricate.txt
</pre>

==== Question 1 ====

Are the same AMR genes found in the five genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No

</div>

==== Question 2 ====

Do all found AMR genes have a good match to a database reference gene ? (sequence coverage of > 90% and sequence identity > 95%)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No, three gene calls fails the criteria for a good match. Following command can be used the find the three gene calls:
<pre>
cat ~/microbial_exercise/output_abricate.txt | cut -f1,6,10,11 | awk '$3<=90 || $4<=95'
</pre>

</div>

== Supplementary information and files ==

[https://teaching.healthtech.dtu.dk/22126/images/c/cd/Gtdbtk.pdf Article describing GTDB-Tk can be found here]

The BASH executables in <code>/home/ctools/bin</code> uses Apptainer container images (.sif file format) that are located <code>/home/projects/microbial_genomics/singularity_image_files</code>.

Program 2026

2026-01-12T11:52:45Z

Rasmus:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. Will use Discord for online classes and collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on discord.

The course has two main parts, the first half is lectures and exercises and the last half is project work ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop ''' if you have a secure laptop (e.g. work laptop from Statens Serum Institut for instance), please bring your personal laptop.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

<BR>

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<DL>

<BR>

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45pm</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45pm-10:00pm</DT>
<DD>''Break''</DD>

<DT>10:00pm-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<BR>

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00pm</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00pm-10:15pm</DT>
<DD>''Break''</DD>

<DT>10:15pm-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00am-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<BR>

<BR>

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/7/77/Microbial_genomics_course22126_slide1-30_compressed.pdf Lecture slides 1-30])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-11:10am</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:10pm-12:00pm</DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/6/6d/Microbial_genomics_course22126_slide31-75_compressed.pdf Lecture slides 31-75])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00am-1:55am</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00am-2:55am</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00am-3:55am</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation, prepare an outline for tomorrow. please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
<BR>

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consolation, check when your 3 minutes are [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Program 2026

2026-01-12T11:51:47Z

Rasmus:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. Will use Discord for online classes and collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on discord.

The course has two main parts, the first half is lectures and exercises and the last half is project work ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop ''' if you have a secure laptop (e.g. work laptop from Statens Serum Institut for instance), please bring your personal laptop.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

<BR>

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<DL>

<BR>

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45pm</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45pm-10:00pm</DT>
<DD>''Break''</DD>

<DT>10:00pm-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<BR>

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00pm</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00pm-10:15pm</DT>
<DD>''Break''</DD>

<DT>10:15pm-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00am-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<BR>

<BR>

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/7/77/Microbial_genomics_course22126_slide1-30_compressed.pdf Lecture slides 1-30])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-11:10am</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:10pm-11:55am</DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/6/6d/Microbial_genomics_course22126_slide31-75_compressed.pdf Lecture slides 31-75])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00am-1:55am</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00am-2:55am</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00am-3:55am</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation, prepare an outline for tomorrow. please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
<BR>

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consolation, check when your 3 minutes are [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Microbial genomics exercise

2026-01-12T10:05:30Z

Rasmus: /* Task02: Species-level taxonomic classification of bacterial genomes */

== Introduction and getting ready ==

Dear course participants,

In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.

You can run the tools using the provided BASH executables in <code>/home/ctools/bin</code>.
The paths to the BASH executables are:

<pre>
/home/ctools/bin/kraken.sh
/home/ctools/bin/kraken_report.sh
/home/ctools/bin/gtdbtk.sh
/home/ctools/bin/mlst.sh
/home/ctools/bin/parsnp.sh
/home/ctools/bin/abricate.sh
</pre>

<code>/home/ctools/bin</code> should already be part of your standard configured $PATH, but, if not, then you can add it by the following command and you should be able to execute the tools with writing the full path:
<pre>
export PATH="/home/ctools/bin:$PATH"
</pre>

It is preferred that you should run the commands from your $HOME directory. So start by these commands that prints the path to your home directory and takes you there:
<pre>
echo $HOME
cd
</pre>

Now, let us image that we are employed at a hospital to provide diagnostics for patient care...

== Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens ==

You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".

[[Media:Nielsen et al 2025 Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections.pdf|Click here to see research article by Nielsen et al.]]

Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing tests in National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) in [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1108520/ BioProject PRJNA1108520].

You can use the prefetch and fasterq-dump tools from the [https://github.com/ncbi/sra-tools SRA Toolkit] to download the data. This has already been done and the data is in <code>/home/projects/microbial_genomics/task01_sequence_reads</code>.

Here are the commands that can be used to acquire the data (this is already done, but you are welcome to try yourselves in your home directory:
<pre>
cd $HOME
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX
</pre>

The files are named according to the SRA run accessions numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information [https://teaching.healthtech.dtu.dk/22126/images/6/64/Spectrum.03295-24-s0001.docx available here].

Table that translates run accession numbers to patient identifiers:

{| class="wikitable"
! sra_run_accession !! sample_id !! patient_id
|-
| SRR28959350 || s001 || p001
|-
| SRR28959349 || s002 || p002
|-
| SRR28959338 || s004 || p020
|-
| SRR28959334 || s006 || p028
|-
| SRR28959333 || s008 || p049
|-
| SRR28959332 || s009 || p072
|-
| SRR28959331 || s013 || p091
|-
| SRR28959330 || s014 || p098
|-
| SRR28959329 || s015 || p104
|-
| SRR28959328 || s016 || p105
|-
| SRR28959348 || s029 || p019
|-
| SRR28959347 || s031 || p092
|-
| SRR28959346 || s033 || p127
|-
| SRR28959345 || s034 || p128
|-
| SRR28959344 || s041 || p139
|-
| SRR28959343 || s042 || p140
|-
| SRR28959342 || s043 || p141
|-
| SRR28959341 || s044 || p143
|-
| SRR28959335 || s057 || p164
|-
| SRR28959340 || s058 || p172
|-
| SRR28959339 || s059 || p173
|-
| SRR28959337 || s060 || p175
|-
| SRR28959336 || s061 || p183
|}

The article does not mention how many microbial DNA sequencing reads that there are in each of the 23 positive metagenomic next-generation sequencing tests.

==== Question 1 ====
Check how many reads are in the FASTQ file for each of the run accessions, .e.g., by using the following command ?

<pre>
wc -l /home/projects/microbial_genomics/task01_sequence_reads/*.fastq | awk 'NF==2 {print $2 ": " $1/4}'
</pre>

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959328.fastq: 174
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq: 6489
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959330.fastq: 43
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959331.fastq: 56
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959332.fastq: 188
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959333.fastq: 9
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959334.fastq: 1426
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959335.fastq: 288
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959336.fastq: 839
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959337.fastq: 899
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959338.fastq: 6
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq: 217
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959340.fastq: 1879
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959341.fastq: 12
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959342.fastq: 45
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959343.fastq: 251
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959344.fastq: 2572
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959345.fastq: 69
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959346.fastq: 180
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959347.fastq: 37
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959348.fastq: 224
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959349.fastq: 303
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959350.fastq: 16
</pre>

</div>

Now, you want to try to determine the species the DNA that has been sequenced. For this, you can use Kraken that is a tool for taxonomic classification of DNA sequnece reads. Kraken does this by splitting the sequence read into k-mers that are then matched to a reference database with information about the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assign the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, the Kraken performance relies on the sequence and taxonomic classification of the genomes that has been used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relative small MiniKraken DB_8GB database.

Run Kraken for two of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/microbial_exercise/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/SRR28959339.kraken.out.txt > ~/microbial_exercise/SRR28959339.kraken.report.txt
</pre>

The output files contain:
* .kraken.out: A classification result for each read
* .kraken.report: A summary report of taxonomic assignments

==== Question 2 ====
How does your result align with the mNGS result in Table S3 from Nielsen et al.?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

You can use the following command to see if a considerable percentage (1st column) of reads classified to the species-level (6th column) corresponds to the species names presented in Table S3 from Nielsen et al.
<pre>
cat ~/microbial_exercise/SRR28959339.kraken.report.txt | grep -w S
</pre>

</div>

The size of the MiniKraken DB_8GB database used in this exercise, is around 8 gigabytes. Nielsen et al. uses another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 and virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.

==== Question 3 ====
Check [https://benlangmead.github.io/aws-indexes/k2 this webpage] to see what is the size of a database for Kraken 2 that is based on GTDB relases 226 ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

500,9 gigabytes

</div>

Your hospital have not yet implemented mNGS, but the hospital already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a ''Stapholycoccus aureus'' isolate and have made 100,000 of the sequence reads available for you in <code>/home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz</code>.

The hospital laboratory asks you to run Kraken on the 100,000 reads with these commands:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.report.txt
</pre>

The vast majority of reads that could be classified to species-level are assigned to ''Stapholycoccus aureus'', but also a minority of reads are assigned to other species.

==== Question 4 ====
Do you think that these reads represent contamination of DNA from other species ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

The reads could represent contamination contamination, but they could also represent misclassifications by Kraken.
</div>

== Task02: Species-level taxonomic classification of bacterial genomes ==

The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from clinical specimens. The sequence reads have been ''de novo'' assembled with [https://github.com/tseemann/shovill Shovill}], and the genome assemblies are stored in FASTA-formatted files available in <code>/home/projects/microbial_genomics/genome_assemblies</code>.

You want to determine the bacterial species of the assembled genomes.

We can use the [https://github.com/Ecogenomics/GTDBTk GTDB-Tk tool] to assign taxonomic classifications to bacterial genomes based on the [https://gtdb.ecogenomic.org Genome Taxonomy Database (GTDB)].

Run <code>gtdbtk.sh -h</code> to get help information on how to use GTDB-Tk.

Unfortunately, GTDK-Tk uses a lot of memory, so it is not feasible to run the tool at the course servers. Instead, the following command has already been to determine the species of the genomes and the results have been moved to <code>/home/projects/microbial_genomics/premade_task_results_from_teacher</code>:
<pre>
gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/task02_assemblies --out_dir ~/microbial_exercise/output_gtdb
</pre>

==== Question 1 ====
What species are the genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH031220x1x74_210120_A00559_AHYKKMDRXX Escherichia ruysiae
SH090321x2x74_210615_A01176_AHFKVKDRXY Escherichia marmotae
SH178x24_220708_NS500314_AHTFHHAFX3 Klebsiella michiganensis
SH179x37_220805_A01176_AHG77VDRX2 Citrobacter braakii
SH182x46_220812_A00559_BHG7JLDRX2 Klebsiella grimontii
SH185x39_220812_A00559_BHG7JLDRX2 Acinetobacter geminorum
SH189x45_220812_A00559_BHG7JLDRX2 Mycobacterium intracellulare
SH204x81_220930_NS500314_AHW7FKAFX3 No species called
SH204x82_220930_NS500314_AHW7FKAFX3 No species called
SH214x27_221104_A00559_AHKYCVDRX2 Citrobacter europaeus
SH214x54_221104_A00559_AHKYCVDRX2 Klebsiella huaxiensis
SH232x83_230120_A01411_AHKYCLDRX2 Enterococcus_B lactis
SH249x46_230306_A01411_AHMF5GDRX2 Klebsiella africana
SH257x03_230317_A00559_AHM5H3DRX2 Enterobacter mori
SH272x52_230512_A00559_AH5MNFDRX3 Stenotrophomonas maltophilia_AJ
SH274x36_230421_A00559_AHYGVWDRX2 Klebsiella michiganensis
SH275x46_230526_A01411_BHC33YDRX3 Klebsiella variicola
SH283x25_230721_A00559_BHGVVNDRX3 Citrobacter braakii
SH292x11_230714_A01411_AHGW7MDRX3 Stenotrophomonas maltophilia_G
SH315x12_230929_A01176_AHKHK3DRX3 No species called
SH333x35_231006_NB501792_AHMVFTAFX5 Enterobacter intestinihominis
SH338x33_231110_A00559_AHMTNJDRX3 Klebsiella quasipneumoniae
SH339x55_231020_NB501792_AHMTK7AFX5 Citrobacter_A sedlakii
SH344x10_231103_A00559_BHMN2KDRX3 Escherichia marmotae
SH344x63_231103_A00559_BHMN2KDRX3 Bacillus_A cereus
SH351x67_231211_A00559_AHMTCTDRX3 Acinetobacter pittii
SH372x20_240212_A01961_AHVFV3DRX3 Acinetobacter fasciculus
SH406x40_240426_NB501792_AH5LK3AFX7 Acinetobacter nosocomialis
SH420x32_240607_A01961_AH7HW2DRX5 Achromobacter ruhlandii
SH429x69_240628_A01961_AHC7NCDRX5 No species called
SH432x68_240705_A00559_AHFKFHDRX5 No species called
SH437x71_240906_A01411_AHGVFVDRX5 Klebsiella planticola
SH443x95_240809_A00559_AHFL5YDRX5 No species called
SH463x12_240927_NS500314_AHHJHMAFX7 Citrobacter portucalensis
SH485x65_241122_A01411_BHM2VFDRX5 No species called
SH490x72_241206_A01411_AHM2CHDRX5 No species called
SH512x30_250321_A01961_BHTMCKDRX5 Acinetobacter fasciculus
SH529x38_250613_LH00793_B22C72LLT1 Klebsiella grimontii
SH545x01_250516_A01411_AHYJHCDRX5 Serratia nevei
SH567x14_250926_LH00255_A22YTGHLT3 Klebsiella variicola
SH623x49_251128_LH00255_B22FYTTLT1 No species called
SH623x50_251128_LH00255_B22FYTTLT1 No species called
</pre>

</div>

GTDB-Tk is unable to call a species for 11 of the genomes. Two of these genomes are SH315x12_230929_A01176_AHKHK3DRX3 and SH490x72_241206_A01411_AHM2CHDRX5.

==== Question 2 ====
What is the average nucleotide identity for the best match for each of these genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH315x12_230929_A01176_AHKHK3DRX3 GCA_023446155.1, s__Flavobacterium sp023446155, 90.92%
SH490x72_241206_A01411_AHM2CHDRX5 GCA_021654635.1, s__Mycobacterium kiyosense, 82.85%
</pre>

</div>

== Task03: Subspecies-level genetic typing of bacterial genomes ==

After species identification, it is most often relevant to further genetically group isolates at a subspecies-level. Such groups may be referred to as genotypes, lineages, clone types, clonal complexes, strains, depending on the organism and method used, and methods may differ the approach and the genetic markers that they use for grouping.

Bacterial genomes are commonly grouped using multilocus sequence typing (MLST), which uses the sequences of typically seven gene loci present in all members of a species. The combination of allele numbers assigned to sequence variants at each locus defines the sequence type (ST).

The laboratory has sequenced and de novo assembled the genomes of 36 bacterial isolates of species ''Pseudomonas aeruginosa''. The assembled genomes are available in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

Use tool [https://github.com/tseemann/mlst mlst] by Torsten Seemann to determine the sequence types of the 36 genomes.

Start by inspecting the available options and then run the tool on the genomes:

<pre>
mlst.sh -h
mlst.sh --scheme paeruginosa /home/projects/microbial_genomics/task03_assemblies/*fna > ~/microbial_exercise/output_mlst.txt
</pre>

You will find 3 different allele combinations among the 36 genomes. E.g., use this command to see the allele combinations:
<pre>
cat output_mlst.txt | cut -f4-10 |sort | uniq -c
</pre>

==== Question 1 ====
What allele combinations did you find and what are the correspoonding sequence types ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
Unknown sequence type 6 that has allele combination acsA(38) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
Sequence type 2326 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(202)
Sequence type 207 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
</pre>

</div>

==== Question 2 ====
How many alleles differ between the sequence types ? (ST207 versus ST2326, ST207 versus the unknown ST, ST2326 versus the unknown ST)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
ST207 versus ST2326: 1 allele differs
ST207 versus the unknown ST: 1 allele differs
ST2326 versus the unknown ST: 2 allele differs
</pre>

</div>

== Task04: Whole-genome based determination of genetic relatedness ==

You used sequence information from seven genes in the previous task; nonetheless, we can improve the resolution of our genomic analysis if we make use of the whole genome sequence. Tool [https://github.com/marbl/parsnp Parsnp] can be used align whole genome sequences and to detect single nucleotide substitution variants in the sequence alignments. In the scope of this task, we refer to such substitution variants as single nucleotide polymorphisms (SNPs).

First, view the Parsnp help information:
<pre>
parsnp.sh -h
</pre>

You now want to use Parsnp to compare the 36 ''Pseudomonas aeruginosa'' genomes in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

You can use the following command to compare the genomes with Parsnp:
<pre>
parsnp.sh -r ! -d /home/projects/microbial_genomics/task03_assemblies -o ~/microbial_exercise/output_parsnp -p 10 -c --use-fasttree
</pre>

Above command produces an output file ''parsnp.ggr'' whcih is a compressed representation of the alignment result.

You can use [https://github.com/marbl/harvest-tools Harvesttools] to convert alignment into an alignment represented in a multi-FASTA format that include only SNPs positions:
<pre>
harvesttools.sh -i ~/microbial_exercise/output_parsnp/parsnp.ggr -S ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta
</pre>

You can use [https://github.com/tseemann/snp-dists snp-dists] to convert the multi-FASTA alignment into SNP-distances, i.e. a measure of the genetic relationship (distance) between the genomes. snp-dists finds pairwise SNP-distances that can be presented in a square matrix (standard representation) or in a list (option ''-m''). You can test both with the following commands:
<pre>
snp-dists.sh ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
</pre>

Note, all pairwise SNP-distances are outputted twice in both the matrix and the list (A versus B and B versus A, respectively).

Parsnp also uses the aligment to make a phylogenetic tree (file ''parsnp.tree'' in output). The tree is in Newick format, and you can, e.g., use [https://itol.embl.de/ iTol] to visualize the tree. Below, I have inserted a visualization of the tree:

[[File:Full_tree.png|1000px]]

The large genetic distances genomes of the unknown ST to both ST207 and ST2326 makes difficult to see the structure of the tree for genomes of the same ST. Below, I have inserted the same tree but without showing genomes of the unknown ST:

[[File:STunknown_removed_tree.png|1200px]]

The genomes are named according their ST and the origin of the bacterial isolate. The origin can either be a person (person 1 to 19) or the hospital environment. You may have noticed that the structure of the phylogenetic tree corresponds well to the ST and origin of genomes.

Now, you want to parse the listed pairwise genetic distances to find out how they are distributed dependent on whether the genomes are from the same ST and/or origin.

We already saw from the phylogenetic tree that genomes of the unknown ST are distant from genomes of ST207 and ST2326. You can use the following command to find the minimum and maximum distances between genomes of the unknown ST and genomes of either ST207 or ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

You can use the following command to find the minimum and maximum distances between genomes of ST207 and ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

==== Question 1 ====
Do you think that the grouping by MLST reflects well the genetic distances between genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Genomes of the unknown ST have minimum 16,339 SNPs to genomes of ST207 or ST2326, whereas genomes og ST207 are maximum different by 897 SNPs.

Accordingly, the unknown ST is genetically more distant from ST207 and ST2326 than it is apparent from the MLST.

</div>

Below is a slide from today's lecture that shows the distribution of genetic distances (pairwise SNP distances) in a collection of 446 ''Pseudomonas aeruginosa'' genomes.

[[File:Slide_pa_distribution.png|1000px]]

==== Question 2 ====

Based on the shown distribution, do you think that the following statements are true or false:
* A: ST207 and ST2326 are the same strain ?
* B: ST207 and the unknown ST are the same phylogroup ?
* C: ST2326 and the unknown ST are different strains ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
* A: True
* B: True
* C: True

</div>

You can use the following command to find the distances between genomes from the same person:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v environment | sed 's/_/\t/2; s/_/\t/6' | awk '$1==$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5
</pre>

==== Question 3 ====
If we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2, what is the typical SNP-distance between genomes from bacteria sampled from the same person ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
There are 7 to 122 SNPs between genomes from the same person (if we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2).

</div>

If two genomes differ by only a small number of SNPs, this suggests they diverged recently from a common ancestor. When such closely related genomes are identified in samples from different persons (patients) within a hospital, it raises the possibility that transmission occurred within the hospital.

You can find the minimum SNP-distance between genomes from different persons with this command:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | sed 's/_/\t/2; s/_/\t/6' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | grep -v environment | tail -n1
</pre>

==== Question 4 ====
Do some of the observed SNP-distances indicate that there has been transmission of ''Pseudomonas aeruginosa'' at the hospital ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Yes, some genomes from different persons have pairwise SNP-distances that are low relative the the distances between genomes from the same person. A quick search for information on the world wide web may suggest that for ''Pseudomonas aeruginosa'', SNP-distances less than around 30 SNPs indicates transmission at the hospital.

</div>

== Task05: Detection of antimicrobial resistance genes ==

Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.

The tool [https://github.com/tseemann/abricate Abricate] can be used to screen genome assemblies against curated resistance gene databases.

Inspect the available options and databases with the following commands:
<pre>
abricate.sh -h
abricate.sh --list
</pre>

Run Abricate on the five ST2326 genome assemblies using the Resfinder database:

<pre>
abricate.sh --db resfinder /home/projects/microbial_genomics/task03_assemblies/ST2326*.fna > ~/microbial_exercise/output_abricate.txt
</pre>

==== Question 1 ====

Are the same AMR genes found in the five genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No

</div>

==== Question 2 ====

Do all found AMR genes have a good match to a database reference gene ? (sequence coverage of > 90% and sequence identity > 95%)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No, three gene calls fails the criteria for a good match. Following command can be used the find the three gene calls:
<pre>
cat ~/microbial_exercise/output_abricate.txt | cut -f1,6,10,11 | awk '$3<=90 || $4<=95'
</pre>

</div>

== Supplementary information and files ==

[https://teaching.healthtech.dtu.dk/22126/images/c/cd/Gtdbtk.pdf Article describing GTDB-Tk can be found here]

The BASH executables in <code>/home/ctools/bin</code> uses Apptainer container images (.sif file format) that are located <code>/home/projects/microbial_genomics/singularity_image_files</code>.

Program 2026

2026-01-12T09:56:19Z

Rasmus:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. Will use Discord for online classes and collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on discord.

The course has two main parts, the first half is lectures and exercises and the last half is project work ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop ''' if you have a secure laptop (e.g. work laptop from Statens Serum Institut for instance), please bring your personal laptop.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

<BR>

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<DL>

<BR>

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45pm</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45pm-10:00pm</DT>
<DD>''Break''</DD>

<DT>10:00pm-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<BR>

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00pm</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00pm-10:15pm</DT>
<DD>''Break''</DD>

<DT>10:15pm-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00am-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<BR>

<BR>

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/7/77/Microbial_genomics_course22126_slide1-30_compressed.pdf Lecture slides 1-30])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-11:10am</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:10pm-11:55am</DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/6/6d/Microbial_genomics_course22126_slide31-75_compressed.pdf Lecture slides 31-75])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00am-1:55am</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00am-2:55am</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00am-3:55am</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation, prepare an outline for tomorrow. please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
<BR>

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consolation, check when your 3 minutes are [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Program 2026

2026-01-12T09:55:18Z

Rasmus:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. Will use Discord for online classes and collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on discord.

The course has two main parts, the first half is lectures and exercises and the last half is project work ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop ''' if you have a secure laptop (e.g. work laptop from Statens Serum Institut for instance), please bring your personal laptop.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

<BR>

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<DL>

<BR>

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45pm</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45pm-10:00pm</DT>
<DD>''Break''</DD>

<DT>10:00pm-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<BR>

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00pm</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00pm-10:15pm</DT>
<DD>''Break''</DD>

<DT>10:15pm-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00am-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<BR>

<BR>

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/7/77/Microbial_genomics_course22126_slide1-30_compressed.pdf Lecture slides 1-30])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-11:10am</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:10pm-11:55am</DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/6/6d/Microbial_genomics_course22126_slide31-75_compressed.pdf Lecture slides 31-75])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00am-1:55am</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00am-2:55am</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00am-3:55am</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation, prepare an outline for tomorrow. please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
<BR>

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consolation, check when your 3 minutes are [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Program 2026

2026-01-12T09:51:49Z

Rasmus:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. Will use Discord for online classes and collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on discord.

The course has two main parts, the first half is lectures and exercises and the last half is project work ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop ''' if you have a secure laptop (e.g. work laptop from Statens Serum Institut for instance), please bring your personal laptop.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

<BR>

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<DL>

<BR>

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45pm</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45pm-10:00pm</DT>
<DD>''Break''</DD>

<DT>10:00pm-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<BR>

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00pm</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00pm-10:15pm</DT>
<DD>''Break''</DD>

<DT>10:15pm-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00am-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<BR>

<BR>

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/7/77/Microbial_genomics_course22126_slide1-30_compressed.pdf Lecture slides 1-30])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/6/6d/Microbial_genomics_course22126_slide31-75_compressed.pdf Lecture slides 31-75])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00am-1:55am</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00am-2:55am</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00am-3:55am</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation, prepare an outline for tomorrow. please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
<BR>

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consolation, check when your 3 minutes are [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

File:Microbial genomics course22126 slide31-75 compressed.pdf

2026-01-12T09:50:54Z

Rasmus:

Microbial genomics exercise

2026-01-12T09:46:17Z

Rasmus:

== Introduction and getting ready ==

Dear course participants,

In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.

You can run the tools using the provided BASH executables in <code>/home/ctools/bin</code>.
The paths to the BASH executables are:

<pre>
/home/ctools/bin/kraken.sh
/home/ctools/bin/kraken_report.sh
/home/ctools/bin/gtdbtk.sh
/home/ctools/bin/mlst.sh
/home/ctools/bin/parsnp.sh
/home/ctools/bin/abricate.sh
</pre>

<code>/home/ctools/bin</code> should already be part of your standard configured $PATH, but, if not, then you can add it by the following command and you should be able to execute the tools with writing the full path:
<pre>
export PATH="/home/ctools/bin:$PATH"
</pre>

It is preferred that you should run the commands from your $HOME directory. So start by these commands that prints the path to your home directory and takes you there:
<pre>
echo $HOME
cd
</pre>

Now, let us image that we are employed at a hospital to provide diagnostics for patient care...

== Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens ==

You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".

[[Media:Nielsen et al 2025 Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections.pdf|Click here to see research article by Nielsen et al.]]

Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing tests in National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) in [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1108520/ BioProject PRJNA1108520].

You can use the prefetch and fasterq-dump tools from the [https://github.com/ncbi/sra-tools SRA Toolkit] to download the data. This has already been done and the data is in <code>/home/projects/microbial_genomics/task01_sequence_reads</code>.

Here are the commands that can be used to acquire the data (this is already done, but you are welcome to try yourselves in your home directory:
<pre>
cd $HOME
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX
</pre>

The files are named according to the SRA run accessions numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information [https://teaching.healthtech.dtu.dk/22126/images/6/64/Spectrum.03295-24-s0001.docx available here].

Table that translates run accession numbers to patient identifiers:

{| class="wikitable"
! sra_run_accession !! sample_id !! patient_id
|-
| SRR28959350 || s001 || p001
|-
| SRR28959349 || s002 || p002
|-
| SRR28959338 || s004 || p020
|-
| SRR28959334 || s006 || p028
|-
| SRR28959333 || s008 || p049
|-
| SRR28959332 || s009 || p072
|-
| SRR28959331 || s013 || p091
|-
| SRR28959330 || s014 || p098
|-
| SRR28959329 || s015 || p104
|-
| SRR28959328 || s016 || p105
|-
| SRR28959348 || s029 || p019
|-
| SRR28959347 || s031 || p092
|-
| SRR28959346 || s033 || p127
|-
| SRR28959345 || s034 || p128
|-
| SRR28959344 || s041 || p139
|-
| SRR28959343 || s042 || p140
|-
| SRR28959342 || s043 || p141
|-
| SRR28959341 || s044 || p143
|-
| SRR28959335 || s057 || p164
|-
| SRR28959340 || s058 || p172
|-
| SRR28959339 || s059 || p173
|-
| SRR28959337 || s060 || p175
|-
| SRR28959336 || s061 || p183
|}

The article does not mention how many microbial DNA sequencing reads that there are in each of the 23 positive metagenomic next-generation sequencing tests.

==== Question 1 ====
Check how many reads are in the FASTQ file for each of the run accessions, .e.g., by using the following command ?

<pre>
wc -l /home/projects/microbial_genomics/task01_sequence_reads/*.fastq | awk 'NF==2 {print $2 ": " $1/4}'
</pre>

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959328.fastq: 174
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq: 6489
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959330.fastq: 43
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959331.fastq: 56
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959332.fastq: 188
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959333.fastq: 9
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959334.fastq: 1426
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959335.fastq: 288
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959336.fastq: 839
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959337.fastq: 899
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959338.fastq: 6
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq: 217
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959340.fastq: 1879
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959341.fastq: 12
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959342.fastq: 45
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959343.fastq: 251
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959344.fastq: 2572
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959345.fastq: 69
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959346.fastq: 180
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959347.fastq: 37
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959348.fastq: 224
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959349.fastq: 303
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959350.fastq: 16
</pre>

</div>

Now, you want to try to determine the species the DNA that has been sequenced. For this, you can use Kraken that is a tool for taxonomic classification of DNA sequnece reads. Kraken does this by splitting the sequence read into k-mers that are then matched to a reference database with information about the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assign the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, the Kraken performance relies on the sequence and taxonomic classification of the genomes that has been used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relative small MiniKraken DB_8GB database.

Run Kraken for two of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/microbial_exercise/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/SRR28959339.kraken.out.txt > ~/microbial_exercise/SRR28959339.kraken.report.txt
</pre>

The output files contain:
* .kraken.out: A classification result for each read
* .kraken.report: A summary report of taxonomic assignments

==== Question 2 ====
How does your result align with the mNGS result in Table S3 from Nielsen et al.?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

You can use the following command to see if a considerable percentage (1st column) of reads classified to the species-level (6th column) corresponds to the species names presented in Table S3 from Nielsen et al.
<pre>
cat ~/microbial_exercise/SRR28959339.kraken.report.txt | grep -w S
</pre>

</div>

The size of the MiniKraken DB_8GB database used in this exercise, is around 8 gigabytes. Nielsen et al. uses another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 and virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.

==== Question 3 ====
Check [https://benlangmead.github.io/aws-indexes/k2 this webpage] to see what is the size of a database for Kraken 2 that is based on GTDB relases 226 ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

500,9 gigabytes

</div>

Your hospital have not yet implemented mNGS, but the hospital already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a ''Stapholycoccus aureus'' isolate and have made 100,000 of the sequence reads available for you in <code>/home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz</code>.

The hospital laboratory asks you to run Kraken on the 100,000 reads with these commands:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.report.txt
</pre>

The vast majority of reads that could be classified to species-level are assigned to ''Stapholycoccus aureus'', but also a minority of reads are assigned to other species.

==== Question 4 ====
Do you think that these reads represent contamination of DNA from other species ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

The reads could represent contamination contamination, but they could also represent misclassifications by Kraken.
</div>

== Task02: Species-level taxonomic classification of bacterial genomes ==

The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from clinical specimens. The sequence reads have been ''de novo'' assembled with [https://github.com/tseemann/shovill Shovill}], and the genome assemblies are stored in FASTA-formatted files available in <code>/home/projects/microbial_genomics/genome_assemblies</code>.

You want to determine the bacterial species of the assembled genomes.

We can use the [https://github.com/Ecogenomics/GTDBTk GTDB-Tk tool] to assign taxonomic classifications to bacterial genomes based on the [https://gtdb.ecogenomic.org Genome Taxonomy Database (GTDB)].

Run <code>gtdbtk.sh -h</code> to get help information on how to use GTDB-Tk.

Unfortunately, GTDK-Tk uses a lot of memory, so it is not feasible to run the tool at the course servers. Instead, the following command has already been to determine the species of the genomes and the results have been moved to <code>/home/projects/microbial_genomics/premade_task_results/output_gtdb_from_teacher</code>:
<pre>
gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/task02_assemblies --out_dir ~/microbial_exercise/output_gtdb
mv ~/microbial_exercise/output_gtdb /home/projects/microbial_genomics/premade_task_results/output_gtdb_from_teacher
</pre>

==== Question 1 ====
What species are the genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH031220x1x74_210120_A00559_AHYKKMDRXX Escherichia ruysiae
SH090321x2x74_210615_A01176_AHFKVKDRXY Escherichia marmotae
SH178x24_220708_NS500314_AHTFHHAFX3 Klebsiella michiganensis
SH179x37_220805_A01176_AHG77VDRX2 Citrobacter braakii
SH182x46_220812_A00559_BHG7JLDRX2 Klebsiella grimontii
SH185x39_220812_A00559_BHG7JLDRX2 Acinetobacter geminorum
SH189x45_220812_A00559_BHG7JLDRX2 Mycobacterium intracellulare
SH204x81_220930_NS500314_AHW7FKAFX3 No species called
SH204x82_220930_NS500314_AHW7FKAFX3 No species called
SH214x27_221104_A00559_AHKYCVDRX2 Citrobacter europaeus
SH214x54_221104_A00559_AHKYCVDRX2 Klebsiella huaxiensis
SH232x83_230120_A01411_AHKYCLDRX2 Enterococcus_B lactis
SH249x46_230306_A01411_AHMF5GDRX2 Klebsiella africana
SH257x03_230317_A00559_AHM5H3DRX2 Enterobacter mori
SH272x52_230512_A00559_AH5MNFDRX3 Stenotrophomonas maltophilia_AJ
SH274x36_230421_A00559_AHYGVWDRX2 Klebsiella michiganensis
SH275x46_230526_A01411_BHC33YDRX3 Klebsiella variicola
SH283x25_230721_A00559_BHGVVNDRX3 Citrobacter braakii
SH292x11_230714_A01411_AHGW7MDRX3 Stenotrophomonas maltophilia_G
SH315x12_230929_A01176_AHKHK3DRX3 No species called
SH333x35_231006_NB501792_AHMVFTAFX5 Enterobacter intestinihominis
SH338x33_231110_A00559_AHMTNJDRX3 Klebsiella quasipneumoniae
SH339x55_231020_NB501792_AHMTK7AFX5 Citrobacter_A sedlakii
SH344x10_231103_A00559_BHMN2KDRX3 Escherichia marmotae
SH344x63_231103_A00559_BHMN2KDRX3 Bacillus_A cereus
SH351x67_231211_A00559_AHMTCTDRX3 Acinetobacter pittii
SH372x20_240212_A01961_AHVFV3DRX3 Acinetobacter fasciculus
SH406x40_240426_NB501792_AH5LK3AFX7 Acinetobacter nosocomialis
SH420x32_240607_A01961_AH7HW2DRX5 Achromobacter ruhlandii
SH429x69_240628_A01961_AHC7NCDRX5 No species called
SH432x68_240705_A00559_AHFKFHDRX5 No species called
SH437x71_240906_A01411_AHGVFVDRX5 Klebsiella planticola
SH443x95_240809_A00559_AHFL5YDRX5 No species called
SH463x12_240927_NS500314_AHHJHMAFX7 Citrobacter portucalensis
SH485x65_241122_A01411_BHM2VFDRX5 No species called
SH490x72_241206_A01411_AHM2CHDRX5 No species called
SH512x30_250321_A01961_BHTMCKDRX5 Acinetobacter fasciculus
SH529x38_250613_LH00793_B22C72LLT1 Klebsiella grimontii
SH545x01_250516_A01411_AHYJHCDRX5 Serratia nevei
SH567x14_250926_LH00255_A22YTGHLT3 Klebsiella variicola
SH623x49_251128_LH00255_B22FYTTLT1 No species called
SH623x50_251128_LH00255_B22FYTTLT1 No species called
</pre>

</div>

GTDB-Tk is unable to call a species for 11 of the genomes. Two of these genomes are SH315x12_230929_A01176_AHKHK3DRX3 and SH490x72_241206_A01411_AHM2CHDRX5.

==== Question 2 ====
What is the average nucleotide identity for the best match for each of these genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH315x12_230929_A01176_AHKHK3DRX3 GCA_023446155.1, s__Flavobacterium sp023446155, 90.92%
SH490x72_241206_A01411_AHM2CHDRX5 GCA_021654635.1, s__Mycobacterium kiyosense, 82.85%
</pre>

</div>

== Task03: Subspecies-level genetic typing of bacterial genomes ==

After species identification, it is most often relevant to further genetically group isolates at a subspecies-level. Such groups may be referred to as genotypes, lineages, clone types, clonal complexes, strains, depending on the organism and method used, and methods may differ the approach and the genetic markers that they use for grouping.

Bacterial genomes are commonly grouped using multilocus sequence typing (MLST), which uses the sequences of typically seven gene loci present in all members of a species. The combination of allele numbers assigned to sequence variants at each locus defines the sequence type (ST).

The laboratory has sequenced and de novo assembled the genomes of 36 bacterial isolates of species ''Pseudomonas aeruginosa''. The assembled genomes are available in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

Use tool [https://github.com/tseemann/mlst mlst] by Torsten Seemann to determine the sequence types of the 36 genomes.

Start by inspecting the available options and then run the tool on the genomes:

<pre>
mlst.sh -h
mlst.sh --scheme paeruginosa /home/projects/microbial_genomics/task03_assemblies/*fna > ~/microbial_exercise/output_mlst.txt
</pre>

You will find 3 different allele combinations among the 36 genomes. E.g., use this command to see the allele combinations:
<pre>
cat output_mlst.txt | cut -f4-10 |sort | uniq -c
</pre>

==== Question 1 ====
What allele combinations did you find and what are the correspoonding sequence types ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
Unknown sequence type 6 that has allele combination acsA(38) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
Sequence type 2326 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(202)
Sequence type 207 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
</pre>

</div>

==== Question 2 ====
How many alleles differ between the sequence types ? (ST207 versus ST2326, ST207 versus the unknown ST, ST2326 versus the unknown ST)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
ST207 versus ST2326: 1 allele differs
ST207 versus the unknown ST: 1 allele differs
ST2326 versus the unknown ST: 2 allele differs
</pre>

</div>

== Task04: Whole-genome based determination of genetic relatedness ==

You used sequence information from seven genes in the previous task; nonetheless, we can improve the resolution of our genomic analysis if we make use of the whole genome sequence. Tool [https://github.com/marbl/parsnp Parsnp] can be used align whole genome sequences and to detect single nucleotide substitution variants in the sequence alignments. In the scope of this task, we refer to such substitution variants as single nucleotide polymorphisms (SNPs).

First, view the Parsnp help information:
<pre>
parsnp.sh -h
</pre>

You now want to use Parsnp to compare the 36 ''Pseudomonas aeruginosa'' genomes in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

You can use the following command to compare the genomes with Parsnp:
<pre>
parsnp.sh -r ! -d /home/projects/microbial_genomics/task03_assemblies -o ~/microbial_exercise/output_parsnp -p 10 -c --use-fasttree
</pre>

Above command produces an output file ''parsnp.ggr'' whcih is a compressed representation of the alignment result.

You can use [https://github.com/marbl/harvest-tools Harvesttools] to convert alignment into an alignment represented in a multi-FASTA format that include only SNPs positions:
<pre>
harvesttools.sh -i ~/microbial_exercise/output_parsnp/parsnp.ggr -S ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta
</pre>

You can use [https://github.com/tseemann/snp-dists snp-dists] to convert the multi-FASTA alignment into SNP-distances, i.e. a measure of the genetic relationship (distance) between the genomes. snp-dists finds pairwise SNP-distances that can be presented in a square matrix (standard representation) or in a list (option ''-m''). You can test both with the following commands:
<pre>
snp-dists.sh ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
</pre>

Note, all pairwise SNP-distances are outputted twice in both the matrix and the list (A versus B and B versus A, respectively).

Parsnp also uses the aligment to make a phylogenetic tree (file ''parsnp.tree'' in output). The tree is in Newick format, and you can, e.g., use [https://itol.embl.de/ iTol] to visualize the tree. Below, I have inserted a visualization of the tree:

[[File:Full_tree.png|1000px]]

The large genetic distances genomes of the unknown ST to both ST207 and ST2326 makes difficult to see the structure of the tree for genomes of the same ST. Below, I have inserted the same tree but without showing genomes of the unknown ST:

[[File:STunknown_removed_tree.png|1200px]]

The genomes are named according their ST and the origin of the bacterial isolate. The origin can either be a person (person 1 to 19) or the hospital environment. You may have noticed that the structure of the phylogenetic tree corresponds well to the ST and origin of genomes.

Now, you want to parse the listed pairwise genetic distances to find out how they are distributed dependent on whether the genomes are from the same ST and/or origin.

We already saw from the phylogenetic tree that genomes of the unknown ST are distant from genomes of ST207 and ST2326. You can use the following command to find the minimum and maximum distances between genomes of the unknown ST and genomes of either ST207 or ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

You can use the following command to find the minimum and maximum distances between genomes of ST207 and ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

==== Question 1 ====
Do you think that the grouping by MLST reflects well the genetic distances between genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Genomes of the unknown ST have minimum 16,339 SNPs to genomes of ST207 or ST2326, whereas genomes og ST207 are maximum different by 897 SNPs.

Accordingly, the unknown ST is genetically more distant from ST207 and ST2326 than it is apparent from the MLST.

</div>

Below is a slide from today's lecture that shows the distribution of genetic distances (pairwise SNP distances) in a collection of 446 ''Pseudomonas aeruginosa'' genomes.

[[File:Slide_pa_distribution.png|1000px]]

==== Question 2 ====

Based on the shown distribution, do you think that the following statements are true or false:
* A: ST207 and ST2326 are the same strain ?
* B: ST207 and the unknown ST are the same phylogroup ?
* C: ST2326 and the unknown ST are different strains ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
* A: True
* B: True
* C: True

</div>

You can use the following command to find the distances between genomes from the same person:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v environment | sed 's/_/\t/2; s/_/\t/6' | awk '$1==$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5
</pre>

==== Question 3 ====
If we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2, what is the typical SNP-distance between genomes from bacteria sampled from the same person ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
There are 7 to 122 SNPs between genomes from the same person (if we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2).

</div>

If two genomes differ by only a small number of SNPs, this suggests they diverged recently from a common ancestor. When such closely related genomes are identified in samples from different persons (patients) within a hospital, it raises the possibility that transmission occurred within the hospital.

You can find the minimum SNP-distance between genomes from different persons with this command:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | sed 's/_/\t/2; s/_/\t/6' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | grep -v environment | tail -n1
</pre>

==== Question 4 ====
Do some of the observed SNP-distances indicate that there has been transmission of ''Pseudomonas aeruginosa'' at the hospital ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Yes, some genomes from different persons have pairwise SNP-distances that are low relative the the distances between genomes from the same person. A quick search for information on the world wide web may suggest that for ''Pseudomonas aeruginosa'', SNP-distances less than around 30 SNPs indicates transmission at the hospital.

</div>

== Task05: Detection of antimicrobial resistance genes ==

Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.

The tool [https://github.com/tseemann/abricate Abricate] can be used to screen genome assemblies against curated resistance gene databases.

Inspect the available options and databases with the following commands:
<pre>
abricate.sh -h
abricate.sh --list
</pre>

Run Abricate on the five ST2326 genome assemblies using the Resfinder database:

<pre>
abricate.sh --db resfinder /home/projects/microbial_genomics/task03_assemblies/ST2326*.fna > ~/microbial_exercise/output_abricate.txt
</pre>

==== Question 1 ====

Are the same AMR genes found in the five genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No

</div>

==== Question 2 ====

Do all found AMR genes have a good match to a database reference gene ? (sequence coverage of > 90% and sequence identity > 95%)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No, three gene calls fails the criteria for a good match. Following command can be used the find the three gene calls:
<pre>
cat ~/microbial_exercise/output_abricate.txt | cut -f1,6,10,11 | awk '$3<=90 || $4<=95'
</pre>

</div>

== Supplementary information and files ==

[https://teaching.healthtech.dtu.dk/22126/images/c/cd/Gtdbtk.pdf Article describing GTDB-Tk can be found here]

The BASH executables in <code>/home/ctools/bin</code> uses Apptainer container images (.sif file format) that are located <code>/home/projects/microbial_genomics/singularity_image_files</code>.

Program 2026

2026-01-12T09:38:40Z

Rasmus:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. Will use Discord for online classes and collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on discord.

The course has two main parts, the first half is lectures and exercises and the last half is project work ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop ''' if you have a secure laptop (e.g. work laptop from Statens Serum Institut for instance), please bring your personal laptop.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

<BR>

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<DL>

<BR>

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45pm</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45pm-10:00pm</DT>
<DD>''Break''</DD>

<DT>10:00pm-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<BR>

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00pm</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00pm-10:15pm</DT>
<DD>''Break''</DD>

<DT>10:15pm-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00am-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<BR>

<BR>

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>Use of next-generation (genome) sequencing in clinical microbiology ([https://teaching.healthtech.dtu.dk/22126/images/7/77/Microbial_genomics_course22126_slide1-30_compressed.pdf Lecture slides 1-30])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DD> The same exercise page will be used throughout the day: ([[Exercise]]) ([[Solution]])</DD>

<DL>
<DT>9:00am-9:55am</DT>
<DD>Lecture 1 + exercise: Tree thinking ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1_Tree_thinking.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>10:00am-10:55am</DT>
<DD>Lecture 2 + exercise: Data ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2_Data.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>11:00am-11:55am</DT>
<DD>Lecture 3 + exercise: Basic methods ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3_Basic_methods.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00am-1:55am</DT>
<DD>Lecture 4 + exercise: Models and support ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4_Models_and_support.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:00am-2:55am</DT>
<DD>Lecture 5 + exercise: Gene trees ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_5_Gene_trees.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>3:00am-3:55am</DT>
<DD>Lecture 6 + exercise: Molecular dating ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_6_MolecularDating.pdf Lecture slides])</DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([https://github.com/leholman/25.DTUmetabarcodingExercise/blob/main/README.md Metabarcoding Exercises]) </DD>
<DD> Luke Holman, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury</DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation, prepare an outline for tomorrow. please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
<BR>

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consolation, check when your 3 minutes are [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

File:Microbial genomics course22126 slide1-30 compressed.pdf

2026-01-12T09:34:18Z

Rasmus:

Program 2026

2026-01-11T21:45:52Z

Rasmus:

'''NOTE: THIS PAGE IS UNDER CONSTRUCTION WITH A NEW TEACHER IN 2026'''

'''REMEMBER TO BRING A LAPTOP FOR EXERCISES'''

Lectures will be in person in building [https://maps.app.goo.gl/wH5EW199wrChCmWK7 341] in auditorium 23.

Lectures and exercises will take place on Discord (https://discord.gg/Qgw9M3SZA5). Please register with your full name. Will use Discord for online classes and collaboration with your project partners. Rather than emailing questions to the teaching staff, I encourage you to post your questions on discord.

The course has two main parts, the first half is lectures and exercises and the last half is project work ending with the exam on '''Friday 23rd of January 2026'''.

'''For the laptop ''' if you have a secure laptop (e.g. work laptop from Statens Serum Institut for instance), please bring your personal laptop.

=== Course Program - January 2026 ===

<HR>
'''Monday, January 5 (Day 1)'''
<HR>
''Introduction - Next Generation Sequencing''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Introduction to course
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-1-Intro.pdf Lecture slides])
</DD>
<DD>Mick Westbury</DD>

<DT>9:30am-10:00am</DT>
<dd>Introduction to NGS
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-2-NGS_basics.pdf Lecture slides]) </DD>
<DD>Mick Westbury</DD>

<DT>10:00am-10:45am</DT>
<DD>The NGS revolution
([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_1-3-NGS_revolution.pdf Lecture slides])</DD>
<DD>Mick Westbury</DD>

<DT>10:45am-11:00am</DT>
<DD>''Break''</DD>

<DT>11:00am-12:00pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>12:00pm-1:00pm</DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-1:30pm</DT>
<DD>Exercise: Logging on to our pupil servers ([[Logging on to pupil system]])</DD>
<DD>Mick Westbury , Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-2:15pm </DT>
<DD>Introduction to UNIX </DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>2:15pm-2:30pm</DT>
<DD>''Break''</DD>

<DT>2:30pm-3:30pm </DT>
<DD>Introduction to UNIX (continued)</DD>
<DD>([https://teaching.healthtech.dtu.dk/22113/index.php/Unix Video lectures to watch from "Unix intro.." to "Touching upon..."])</DD>
<DD>([[Unix Exercises|Unix exercises]] – possible answers [[Unix_answers|here]])
([[Basic UNIX notes]])
([[Advanced UNIX and Pipes]])</DD>
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>3:30pm-4:00pm </DT>
<DD>First look at data
([[First look exercise]])
<DD>Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

</DL>

<BR>

<HR>
'''Tuesday, January 6 (Day 2)'''
<HR>
''Data pre-processing & Alignment''

<DL>
<DT>9:00am-9:45am </DT>
<DD>Data basics ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-1-Data_basics.pdf Lecture slides]) ([[Data basics exercise]]) ([[Data basics exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm </DT>
<DD>Data pre-processing ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-2-QC_preprocessing.pdf Lecture slides]) ([[Data Preprocess exercise]]) ([[Data Preprocess exercise answers]])</DD>
<DD> Mick Westbury </DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Alignment ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_2-3-Alignment.pdf Lecture slides]) </DD>
<DD> Mick Westbury </DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break'' </DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: Alignment ([[Alignment exercise]]) ([[Alignment exercise answers]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<DL>

<BR>

<HR>
'''Wednesday, January 7 (Day 3)'''
<HR>

''Variant calling ''
<DL>
<DT>9:00am-9:30am</DT>
<DD>Functional Variation</DD>
<DD> Mick Westbury, ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-1-Functional_variation.pdf Lecture slides])</DD>

<DT>9:30am-10:15am</DT>
<DD>Variant calling part 1 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-2-Preprocessing-variant_calling.pdf Lecture slides])</DD>

<DD> Mick Westbury</DD>

<DT>10:15am-10:30am</DT>
<DD>''Break''</DD>

<DT>10:30am-12:00pm</DT>
<DD>Exercise: Preprocessing ([[Postprocess exercise]]) ([[Postprocess_exercise_answers]])</DD>
<DD>Exercise: variant calling part 1 ([[SNP calling exercise part 1]]) ([[SNP_calling_exercise_answers part 1]])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm </DT>
<DD>Lecture: variant calling part 2 ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_3-3-Variant_filtering.pdf Lecture slides])</DD>
<DD> Mick Westbury</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: variant calling part 2 ([[SNP calling exercise part 2]]) ([[SNP_calling_exercise_answers part 2]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 8 (Day 4)'''
<HR>
''Assembly, annotation and RNA-seq''

<DL>

<DT>9:00am-9:45pm</DT>
<DD>Lecture: de novo assembly and genomic annotations ([https://teaching.healthtech.dtu.dk/material/22126/2026/Lecture_4-1-Denovo.pdf Lecture slides]) </DD>
<DD> Mick Westbury</DD>

<DT>9:45pm-10:00pm</DT>
<DD>''Break''</DD>

<DT>10:00pm-12:00pm</DT>
<DD>Exercise: de novo assembly ([[denovo exercise]]) ([[denovo solution]]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: RNAseq ([https://teaching.healthtech.dtu.dk/material/22126/2024/ngs_transcriptomics_kvs_2023_without_solutions_v2.pdf Lecture slides])

<DD>Kristoffer Vitting-Seerup</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: RNAseq ([[Rnaseq_exercise]])  </DD>
<DD>Kristoffer Vitting-Seerup, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<BR>

<HR>
'''Friday, January 9 (Day 5)'''
<HR>
''Ancient DNA and 3D genomics''

<DT>9:00am-10:00pm</DT>
<DD>Ancient DNA ([https://teaching.healthtech.dtu.dk/material/22126/2026/dtu_adna_2026_red.pdf Lecture slides])</DD>
<DD>Martin Sikora</DD>

<DT>10:00pm-10:15pm</DT>
<DD>''Break''</DD>

<DT>10:15pm-12:00pm</DT>
<DD>Exercise: Ancient DNA ([[Ancient DNA exercise]]) ([[Ancient_DNA_exercise_answers]])</DD>
<DD>Martin Sikora, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DL>
<DT>1:00pm-2:00pm </DT>
<DD>Lecture: 3D Genomics with Hi-C ([https://teaching.healthtech.dtu.dk/material/22126/2026/3D_Genomics_Workshop.pdf Lecture slides])</DD>
<DD>Juan Rodríguez</DD>

<DT>2:00am-2:15pm</DT>
<DD>''Break''</DD>

<DT>2:15pm-4:00pm</DT>
<DD>Exercise: 3D Genomics with Hi-C ([[Exercise and answers]])</DD>
<DD> Juan Rodríguez, Amanda Gammelby Qvesel, Mads Hartmann</DD>
<BR>

<BR>

<HR>
'''Monday, January 12 (Day 6)'''
<HR>

''Microbial genomics''
<DL>
<DT>9:00am-9:45am </DT>
<DD>TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>9:45am-10:00am</DT>
<DD>''Break''</DD>

<DT>10:00am-12:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([ Lecture slides])</DD>
<DD>Rasmus Lykke Marvig</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: ([[ Microbial_genomics_exercise ]]) </DD>
<DD>Rasmus Lykke Marvig, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 13 (Day 7)'''
<HR>

''Phylogenomics''

<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([])</DD>
<DD>David Duchene</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Exercise]]) ([[Solution]])</DD>
<DD> David Duchene</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:45pm</DT>
<DD>Lecture: TBA ([Lecture slides])</DD>
<DD>David Duchene</DD>

<DT>1:45pm-2:00pm</DT>
<DD>''Break''</DD>

<DT>2:00pm-4:00pm</DT>
<DD>Exercise: TBA ([[Exercises]]) ([[Solution]]) </DD>
<DD>David Duchene, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 14 (Day 8)'''
<HR>
''Metabarcoding and group project''
<DL>
<DT>9:00am-9:55am</DT>
<DD>TBA ([ Lecture])</DD>
<DD>Luke Holman</DD>

<DT>9:55am-10:10am</DT>
<DD>''Break''</DD>

<DT>10:10am-12:00pm</DT>
<DD>Exercise: TBA ([[Metabarcoding Exercise]]) ([[Metabarcoding Solution]])</DD>
<DD> Luke Holman</DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break'' </DD>

<DT>1:00pm-1:30pm </DT>
<DD> Recap Test ([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test.pdf Test])([https://teaching.healthtech.dtu.dk/material/22126/2026/Recap_test_Answers.pdf Answers])</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann </DD>

<DT>1:30pm-1:45pm</DT>
<DD>''Break''</DD>

<DT>1:45pm-2:30pm </DT>
<DD>Projects & Group formation ([https://teaching.healthtech.dtu.dk/material/22126/2026/Poster.pdf Lecture slides] [http://teaching.healthtech.dtu.dk/material/22126/2023/posters.tar.gz Examples from previous courses]) </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>

<DT>2:30pm-4:00pm </DT>
<DD>Projects & Group formation, prepare an outline for tomorrow. please write group names in the [https://docs.google.com/document/d/1W5HzThk4zSi2xAE4dwmtgw35JtyNbwhuizseiLrxLr0/edit?usp=sharing document for 2026]</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

</DL>
<BR>

<HR>
'''Thursday, January 15 (Day 9)'''
<HR>
''Project work''
<DL>
<DT>10:00am-12:00pm</DT>
<DD>Project consolation, check when your 3 minutes are [https://docs.google.com/spreadsheets/d/1eZeAo0jtpUcJpd7ti8h2ofjVJD8wYOUws9QMZwp0fQ8/edit?usp=sharing Timesheet]</DD>

<DD></DD>

<DT>12:00pm-1:00pm </DT>
<DD>''Lunch Break''</DD>

<DT>1:00pm-4:00pm </DT>
<DD>Project </DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 16 (Day 10)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Monday, January 19 (Day 11)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Tuesday, January 20 (Day 12)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Wednesday, January 21 (Day 13)'''
<HR>
''Project work''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Project work</DD>

<DT>1:00pm-3:00pm</DT>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Thursday, January 22 (Day 14)'''
<HR>
''Project Work & Submit poster''
<DL>

<DT>10:00am-12:00pm</DT>
<DD>Q&A: Practical information about the exam</DD>
<DD>Project work/Office hours</DD>
<DD> Mick Westbury, Amanda Gammelby Qvesel, Mads Hartmann</DD>
</DL>

<BR>

<HR>
'''Friday, January 23 (Day 15)'''
<HR>
''Exam''
<DL>
<DT>9:00am-4:00pm</DT>
<DD>Written Exam</DD>
</DL>

Microbial genomics exercise

2026-01-11T21:43:33Z

Rasmus: /* Task05: Detection of antimicrobial resistance genes */

== Introduction and getting ready ==

Dear course participants,

In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.

You can run the tools using the provided BASH executables in <code>/home/ctools/bin</code>.
The paths to the BASH executables are:

<pre>
/home/ctools/bin/kraken.sh
/home/ctools/bin/kraken_report.sh
/home/ctools/bin/gtdbtk.sh
/home/ctools/bin/mlst.sh
/home/ctools/bin/parsnp.sh
/home/ctools/bin/abricate.sh
</pre>

<code>/home/ctools/bin</code> should already be part of your standard configured $PATH, but, if not, then you can add it by the following command and you should be able to execute the tools with writing the full path:
<pre>
export PATH="/home/ctools/bin:$PATH"
</pre>

It is preferred that you should run the commands from your $HOME directory. So start by these commands that prints the path to your home directory and takes you there:
<pre>
echo $HOME
cd
</pre>

Now, let us image that we are employed at a hospital to provide diagnostics for patient care...

== Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens ==

You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".

[[Media:Nielsen et al 2025 Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections.pdf|Click here to see research article by Nielsen et al.]]

Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing tests in National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) in [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1108520/ BioProject PRJNA1108520].

You can use the prefetch and fasterq-dump tools from the [https://github.com/ncbi/sra-tools SRA Toolkit] to download the data. This has already been done and the data is in <code>/home/projects/microbial_genomics/task01_sequence_reads</code>.

Here are the commands that can be used to acquire the data (this is already done, but you are welcome to try yourselves in your home directory:
<pre>
cd $HOME
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX
</pre>

The files are named according to the SRA run accessions numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information [https://teaching.healthtech.dtu.dk/22126/images/6/64/Spectrum.03295-24-s0001.docx available here].

Table that translates run accession numbers to patient identifiers:

{| class="wikitable"
! sra_run_accession !! sample_id !! patient_id
|-
| SRR28959350 || s001 || p001
|-
| SRR28959349 || s002 || p002
|-
| SRR28959338 || s004 || p020
|-
| SRR28959334 || s006 || p028
|-
| SRR28959333 || s008 || p049
|-
| SRR28959332 || s009 || p072
|-
| SRR28959331 || s013 || p091
|-
| SRR28959330 || s014 || p098
|-
| SRR28959329 || s015 || p104
|-
| SRR28959328 || s016 || p105
|-
| SRR28959348 || s029 || p019
|-
| SRR28959347 || s031 || p092
|-
| SRR28959346 || s033 || p127
|-
| SRR28959345 || s034 || p128
|-
| SRR28959344 || s041 || p139
|-
| SRR28959343 || s042 || p140
|-
| SRR28959342 || s043 || p141
|-
| SRR28959341 || s044 || p143
|-
| SRR28959335 || s057 || p164
|-
| SRR28959340 || s058 || p172
|-
| SRR28959339 || s059 || p173
|-
| SRR28959337 || s060 || p175
|-
| SRR28959336 || s061 || p183
|}

The article does not mention how many microbial DNA sequencing reads that there are in each of the 23 positive metagenomic next-generation sequencing tests.

==== Question 1 ====
Check how many reads are in the FASTQ file for each of the run accessions, .e.g., by using the following command ?

<pre>
wc -l /home/projects/microbial_genomics/task01_sequence_reads/*.fastq | awk 'NF==2 {print $2 ": " $1/4}'
</pre>

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959328.fastq: 174
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq: 6489
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959330.fastq: 43
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959331.fastq: 56
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959332.fastq: 188
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959333.fastq: 9
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959334.fastq: 1426
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959335.fastq: 288
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959336.fastq: 839
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959337.fastq: 899
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959338.fastq: 6
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq: 217
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959340.fastq: 1879
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959341.fastq: 12
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959342.fastq: 45
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959343.fastq: 251
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959344.fastq: 2572
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959345.fastq: 69
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959346.fastq: 180
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959347.fastq: 37
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959348.fastq: 224
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959349.fastq: 303
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959350.fastq: 16
</pre>

</div>

Now, you want to try to determine the species the DNA that has been sequenced. For this, you can use Kraken that is a tool for taxonomic classification of DNA sequnece reads. Kraken does this by splitting the sequence read into k-mers that are then matched to a reference database with information about the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assign the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, the Kraken performance relies on the sequence and taxonomic classification of the genomes that has been used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relative small MiniKraken DB_8GB database.

Run Kraken for two of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/microbial_exercise/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/SRR28959339.kraken.out.txt > ~/microbial_exercise/SRR28959339.kraken.report.txt
</pre>

The output files contain:
* .kraken.out: A classification result for each read
* .kraken.report: A summary report of taxonomic assignments

==== Question 2 ====
How does your result align with the mNGS result in Table S3 from Nielsen et al.?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

You can use the following command to see if a considerable percentage (1st column) of reads classified to the species-level (6th column) corresponds to the species names presented in Table S3 from Nielsen et al.
<pre>
cat ~/microbial_exercise/SRR28959339.kraken.report.txt | grep -w S
</pre>

</div>

The size of the MiniKraken DB_8GB database used in this exercise, is around 8 gigabytes. Nielsen et al. uses another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 and virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.

==== Question 3 ====
Check [https://benlangmead.github.io/aws-indexes/k2 this webpage] to see what is the size of a database for Kraken 2 that is based on GTDB relases 226 ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

500,9 gigabytes

</div>

Your hospital have not yet implemented mNGS, but the hospital already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a ''Stapholycoccus aureus'' isolate and have made 100,000 of the sequence reads available for you in <code>/home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz</code>.

The hospital laboratory asks you to run Kraken on the 100,000 reads with these commands:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.report.txt
</pre>

The vast majority of reads that could be classified to species-level are assigned to ''Stapholycoccus aureus'', but also a minority of reads are assigned to other species.

==== Question 4 ====
Do you think that these reads represent contamination of DNA from other species ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

The reads could represent contamination contamination, but they could also represent misclassifications by Kraken.
</div>

== Task02: Species-level taxonomic classification of bacterial genomes ==

The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from clinical specimens. The sequence reads have been ''de novo'' assembled with [https://github.com/tseemann/shovill Shovill}], and the genome assemblies are stored in FASTA-formatted files available in <code>/home/projects/microbial_genomics/genome_assemblies</code>.

You want to determine the bacterial species of the assembled genomes.

We can use the [https://github.com/Ecogenomics/GTDBTk GTDB-Tk tool] to assign taxonomic classifications to bacterial genomes based on the [https://gtdb.ecogenomic.org Genome Taxonomy Database (GTDB)].

Run <code>gtdbtk.sh -h</code> to get help information on how to use GTDB-Tk.

Unfortunately, GTDK-Tk uses a lot of memory, so it is not feasible to run the tool at the course servers. Instead, the following command has already been to determine the species of the genomes and the results have been moved to <code>/home/projects/microbial_genomics/premade_task_results</code>:
<pre>
gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/task02_assemblies --out_dir ~/microbial_exercise/output_gtdb
mv ~/microbial_exercise/output_gtdb /home/projects/microbial_genomics/premade_task_results/.
</pre>

==== Question 1 ====
What species are the genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH031220x1x74_210120_A00559_AHYKKMDRXX Escherichia ruysiae
SH090321x2x74_210615_A01176_AHFKVKDRXY Escherichia marmotae
SH178x24_220708_NS500314_AHTFHHAFX3 Klebsiella michiganensis
SH179x37_220805_A01176_AHG77VDRX2 Citrobacter braakii
SH182x46_220812_A00559_BHG7JLDRX2 Klebsiella grimontii
SH185x39_220812_A00559_BHG7JLDRX2 Acinetobacter geminorum
SH189x45_220812_A00559_BHG7JLDRX2 Mycobacterium intracellulare
SH204x81_220930_NS500314_AHW7FKAFX3 No species called
SH204x82_220930_NS500314_AHW7FKAFX3 No species called
SH214x27_221104_A00559_AHKYCVDRX2 Citrobacter europaeus
SH214x54_221104_A00559_AHKYCVDRX2 Klebsiella huaxiensis
SH232x83_230120_A01411_AHKYCLDRX2 Enterococcus_B lactis
SH249x46_230306_A01411_AHMF5GDRX2 Klebsiella africana
SH257x03_230317_A00559_AHM5H3DRX2 Enterobacter mori
SH272x52_230512_A00559_AH5MNFDRX3 Stenotrophomonas maltophilia_AJ
SH274x36_230421_A00559_AHYGVWDRX2 Klebsiella michiganensis
SH275x46_230526_A01411_BHC33YDRX3 Klebsiella variicola
SH283x25_230721_A00559_BHGVVNDRX3 Citrobacter braakii
SH292x11_230714_A01411_AHGW7MDRX3 Stenotrophomonas maltophilia_G
SH315x12_230929_A01176_AHKHK3DRX3 No species called
SH333x35_231006_NB501792_AHMVFTAFX5 Enterobacter intestinihominis
SH338x33_231110_A00559_AHMTNJDRX3 Klebsiella quasipneumoniae
SH339x55_231020_NB501792_AHMTK7AFX5 Citrobacter_A sedlakii
SH344x10_231103_A00559_BHMN2KDRX3 Escherichia marmotae
SH344x63_231103_A00559_BHMN2KDRX3 Bacillus_A cereus
SH351x67_231211_A00559_AHMTCTDRX3 Acinetobacter pittii
SH372x20_240212_A01961_AHVFV3DRX3 Acinetobacter fasciculus
SH406x40_240426_NB501792_AH5LK3AFX7 Acinetobacter nosocomialis
SH420x32_240607_A01961_AH7HW2DRX5 Achromobacter ruhlandii
SH429x69_240628_A01961_AHC7NCDRX5 No species called
SH432x68_240705_A00559_AHFKFHDRX5 No species called
SH437x71_240906_A01411_AHGVFVDRX5 Klebsiella planticola
SH443x95_240809_A00559_AHFL5YDRX5 No species called
SH463x12_240927_NS500314_AHHJHMAFX7 Citrobacter portucalensis
SH485x65_241122_A01411_BHM2VFDRX5 No species called
SH490x72_241206_A01411_AHM2CHDRX5 No species called
SH512x30_250321_A01961_BHTMCKDRX5 Acinetobacter fasciculus
SH529x38_250613_LH00793_B22C72LLT1 Klebsiella grimontii
SH545x01_250516_A01411_AHYJHCDRX5 Serratia nevei
SH567x14_250926_LH00255_A22YTGHLT3 Klebsiella variicola
SH623x49_251128_LH00255_B22FYTTLT1 No species called
SH623x50_251128_LH00255_B22FYTTLT1 No species called
</pre>

</div>

GTDB-Tk is unable to call a species for 11 of the genomes. Two of these genomes are SH315x12_230929_A01176_AHKHK3DRX3 and SH490x72_241206_A01411_AHM2CHDRX5.

==== Question 2 ====
What is the average nucleotide identity for the best match for each of these genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH315x12_230929_A01176_AHKHK3DRX3 GCA_023446155.1, s__Flavobacterium sp023446155, 90.92%
SH490x72_241206_A01411_AHM2CHDRX5 GCA_021654635.1, s__Mycobacterium kiyosense, 82.85%
</pre>

</div>

== Task03: Subspecies-level genetic typing of bacterial genomes ==

After species identification, it is most often relevant to further genetically group isolates at a subspecies-level. Such groups may be referred to as genotypes, lineages, clone types, clonal complexes, strains, depending on the organism and method used, and methods may differ the approach and the genetic markers that they use for grouping.

Bacterial genomes are commonly grouped using multilocus sequence typing (MLST), which uses the sequences of typically seven gene loci present in all members of a species. The combination of allele numbers assigned to sequence variants at each locus defines the sequence type (ST).

The laboratory has sequenced and de novo assembled the genomes of 36 bacterial isolates of species ''Pseudomonas aeruginosa''. The assembled genomes are available in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

Use tool [https://github.com/tseemann/mlst mlst] by Torsten Seemann to determine the sequence types of the 36 genomes.

Start by inspecting the available options and then run the tool on the genomes:

<pre>
mlst.sh -h
mlst.sh --scheme paeruginosa /home/projects/microbial_genomics/task03_assemblies/*fna > ~/microbial_exercise/output_mlst.txt
</pre>

You will find 3 different allele combinations among the 36 genomes. E.g., use this command to see the allele combinations:
<pre>
cat output_mlst.txt | cut -f4-10 |sort | uniq -c
</pre>

==== Question 1 ====
What allele combinations did you find and what are the correspoonding sequence types ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
Unknown sequence type 6 that has allele combination acsA(38) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
Sequence type 2326 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(202)
Sequence type 207 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
</pre>

</div>

==== Question 2 ====
How many alleles differ between the sequence types ? (ST207 versus ST2326, ST207 versus the unknown ST, ST2326 versus the unknown ST)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
ST207 versus ST2326: 1 allele differs
ST207 versus the unknown ST: 1 allele differs
ST2326 versus the unknown ST: 2 allele differs
</pre>

</div>

== Task04: Whole-genome based determination of genetic relatedness ==

You used sequence information from seven genes in the previous task; nonetheless, we can improve the resolution of our genomic analysis if we make use of the whole genome sequence. Tool [https://github.com/marbl/parsnp Parsnp] can be used align whole genome sequences and to detect single nucleotide substitution variants in the sequence alignments. In the scope of this task, we refer to such substitution variants as single nucleotide polymorphisms (SNPs).

First, view the Parsnp help information:
<pre>
parsnp.sh -h
</pre>

You now want to use Parsnp to compare the 36 ''Pseudomonas aeruginosa'' genomes in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

You can use the following command to compare the genomes with Parsnp:
<pre>
parsnp.sh -r ! -d /home/projects/microbial_genomics/task03_assemblies -o ~/microbial_exercise/output_parsnp -p 10 -c --use-fasttree
</pre>

Above command produces an output file ''parsnp.ggr'' whcih is a compressed representation of the alignment result.

You can use [https://github.com/marbl/harvest-tools Harvesttools] to convert alignment into an alignment represented in a multi-FASTA format that include only SNPs positions:
<pre>
harvesttools.sh -i ~/microbial_exercise/output_parsnp/parsnp.ggr -S ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta
</pre>

You can use [https://github.com/tseemann/snp-dists snp-dists] to convert the multi-FASTA alignment into SNP-distances, i.e. a measure of the genetic relationship (distance) between the genomes. snp-dists finds pairwise SNP-distances that can be presented in a square matrix (standard representation) or in a list (option ''-m''). You can test both with the following commands:
<pre>
snp-dists.sh ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
</pre>

Note, all pairwise SNP-distances are outputted twice in both the matrix and the list (A versus B and B versus A, respectively).

Parsnp also uses the aligment to make a phylogenetic tree (file ''parsnp.tree'' in output). The tree is in Newick format, and you can, e.g., use [https://itol.embl.de/ iTol] to visualize the tree. Below, I have inserted a visualization of the tree:

[[File:Full_tree.png|1000px]]

The large genetic distances genomes of the unknown ST to both ST207 and ST2326 makes difficult to see the structure of the tree for genomes of the same ST. Below, I have inserted the same tree but without showing genomes of the unknown ST:

[[File:STunknown_removed_tree.png|1200px]]

The genomes are named according their ST and the origin of the bacterial isolate. The origin can either be a person (person 1 to 19) or the hospital environment. You may have noticed that the structure of the phylogenetic tree corresponds well to the ST and origin of genomes.

Now, you want to parse the listed pairwise genetic distances to find out how they are distributed dependent on whether the genomes are from the same ST and/or origin.

We already saw from the phylogenetic tree that genomes of the unknown ST are distant from genomes of ST207 and ST2326. You can use the following command to find the minimum and maximum distances between genomes of the unknown ST and genomes of either ST207 or ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

You can use the following command to find the minimum and maximum distances between genomes of ST207 and ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

==== Question 1 ====
Do you think that the grouping by MLST reflects well the genetic distances between genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Genomes of the unknown ST have minimum 16,339 SNPs to genomes of ST207 or ST2326, whereas genomes og ST207 are maximum different by 897 SNPs.

Accordingly, the unknown ST is genetically more distant from ST207 and ST2326 than it is apparent from the MLST.

</div>

Below is a slide from today's lecture that shows the distribution of genetic distances (pairwise SNP distances) in a collection of 446 ''Pseudomonas aeruginosa'' genomes.

[[File:Slide_pa_distribution.png|1000px]]

==== Question 2 ====

Based on the shown distribution, do you think that the following statements are true or false:
* A: ST207 and ST2326 are the same strain ?
* B: ST207 and the unknown ST are the same phylogroup ?
* C: ST2326 and the unknown ST are different strains ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
* A: True
* B: True
* C: True

</div>

You can use the following command to find the distances between genomes from the same person:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v environment | sed 's/_/\t/2; s/_/\t/6' | awk '$1==$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5
</pre>

==== Question 3 ====
If we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2, what is the typical SNP-distance between genomes from bacteria sampled from the same person ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
There are 7 to 122 SNPs between genomes from the same person (if we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2).

</div>

If two genomes differ by only a small number of SNPs, this suggests they diverged recently from a common ancestor. When such closely related genomes are identified in samples from different persons (patients) within a hospital, it raises the possibility that transmission occurred within the hospital.

You can find the minimum SNP-distance between genomes from different persons with this command:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | sed 's/_/\t/2; s/_/\t/6' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | grep -v environment | tail -n1
</pre>

==== Question 4 ====
Do some of the observed SNP-distances indicate that there has been transmission of ''Pseudomonas aeruginosa'' at the hospital ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Yes, some genomes from different persons have pairwise SNP-distances that are low relative the the distances between genomes from the same person. A quick search for information on the world wide web may suggest that for ''Pseudomonas aeruginosa'', SNP-distances less than around 30 SNPs indicates transmission at the hospital.

</div>

== Task05: Detection of antimicrobial resistance genes ==

Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.

The tool [https://github.com/tseemann/abricate Abricate] can be used to screen genome assemblies against curated resistance gene databases.

Inspect the available options and databases with the following commands:
<pre>
abricate.sh -h
abricate.sh --list
</pre>

Run Abricate on the five ST2326 genome assemblies using the Resfinder database:

<pre>
abricate.sh --db resfinder /home/projects/microbial_genomics/task03_assemblies/ST2326*.fna > ~/microbial_exercise/output_abricate.txt
</pre>

==== Question 1 ====

Are the same AMR genes found in the five genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No

</div>

==== Question 2 ====

Do all found AMR genes have a good match to a database reference gene ? (sequence coverage of > 90% and sequence identity > 95%)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No, three gene calls fails the criteria for a good match. Following command can be used the find the three gene calls:
<pre>
cat ~/microbial_exercise/output_abricate.txt | cut -f1,6,10,11 | awk '$3<=90 || $4<=95'
</pre>

</div>

== Supplementary information and files ==

[https://teaching.healthtech.dtu.dk/22126/images/c/cd/Gtdbtk.pdf Article describing GTDB-Tk can be found here]

The BASH executables in <code>/home/ctools/bin</code> uses Apptainer container images (.sif file format) that are located <code>/home/projects/microbial_genomics/singularity_image_files</code>.

Microbial genomics exercise

2026-01-11T21:42:52Z

Rasmus:

== Introduction and getting ready ==

Dear course participants,

In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.

You can run the tools using the provided BASH executables in <code>/home/ctools/bin</code>.
The paths to the BASH executables are:

<pre>
/home/ctools/bin/kraken.sh
/home/ctools/bin/kraken_report.sh
/home/ctools/bin/gtdbtk.sh
/home/ctools/bin/mlst.sh
/home/ctools/bin/parsnp.sh
/home/ctools/bin/abricate.sh
</pre>

<code>/home/ctools/bin</code> should already be part of your standard configured $PATH, but, if not, then you can add it by the following command and you should be able to execute the tools with writing the full path:
<pre>
export PATH="/home/ctools/bin:$PATH"
</pre>

It is preferred that you should run the commands from your $HOME directory. So start by these commands that prints the path to your home directory and takes you there:
<pre>
echo $HOME
cd
</pre>

Now, let us image that we are employed at a hospital to provide diagnostics for patient care...

== Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens ==

You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".

[[Media:Nielsen et al 2025 Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections.pdf|Click here to see research article by Nielsen et al.]]

Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing tests in National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) in [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1108520/ BioProject PRJNA1108520].

You can use the prefetch and fasterq-dump tools from the [https://github.com/ncbi/sra-tools SRA Toolkit] to download the data. This has already been done and the data is in <code>/home/projects/microbial_genomics/task01_sequence_reads</code>.

Here are the commands that can be used to acquire the data (this is already done, but you are welcome to try yourselves in your home directory:
<pre>
cd $HOME
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX
</pre>

The files are named according to the SRA run accessions numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information [https://teaching.healthtech.dtu.dk/22126/images/6/64/Spectrum.03295-24-s0001.docx available here].

Table that translates run accession numbers to patient identifiers:

{| class="wikitable"
! sra_run_accession !! sample_id !! patient_id
|-
| SRR28959350 || s001 || p001
|-
| SRR28959349 || s002 || p002
|-
| SRR28959338 || s004 || p020
|-
| SRR28959334 || s006 || p028
|-
| SRR28959333 || s008 || p049
|-
| SRR28959332 || s009 || p072
|-
| SRR28959331 || s013 || p091
|-
| SRR28959330 || s014 || p098
|-
| SRR28959329 || s015 || p104
|-
| SRR28959328 || s016 || p105
|-
| SRR28959348 || s029 || p019
|-
| SRR28959347 || s031 || p092
|-
| SRR28959346 || s033 || p127
|-
| SRR28959345 || s034 || p128
|-
| SRR28959344 || s041 || p139
|-
| SRR28959343 || s042 || p140
|-
| SRR28959342 || s043 || p141
|-
| SRR28959341 || s044 || p143
|-
| SRR28959335 || s057 || p164
|-
| SRR28959340 || s058 || p172
|-
| SRR28959339 || s059 || p173
|-
| SRR28959337 || s060 || p175
|-
| SRR28959336 || s061 || p183
|}

The article does not mention how many microbial DNA sequencing reads that there are in each of the 23 positive metagenomic next-generation sequencing tests.

==== Question 1 ====
Check how many reads are in the FASTQ file for each of the run accessions, .e.g., by using the following command ?

<pre>
wc -l /home/projects/microbial_genomics/task01_sequence_reads/*.fastq | awk 'NF==2 {print $2 ": " $1/4}'
</pre>

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959328.fastq: 174
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq: 6489
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959330.fastq: 43
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959331.fastq: 56
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959332.fastq: 188
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959333.fastq: 9
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959334.fastq: 1426
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959335.fastq: 288
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959336.fastq: 839
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959337.fastq: 899
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959338.fastq: 6
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq: 217
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959340.fastq: 1879
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959341.fastq: 12
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959342.fastq: 45
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959343.fastq: 251
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959344.fastq: 2572
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959345.fastq: 69
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959346.fastq: 180
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959347.fastq: 37
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959348.fastq: 224
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959349.fastq: 303
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959350.fastq: 16
</pre>

</div>

Now, you want to try to determine the species the DNA that has been sequenced. For this, you can use Kraken that is a tool for taxonomic classification of DNA sequnece reads. Kraken does this by splitting the sequence read into k-mers that are then matched to a reference database with information about the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assign the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, the Kraken performance relies on the sequence and taxonomic classification of the genomes that has been used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relative small MiniKraken DB_8GB database.

Run Kraken for two of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/microbial_exercise/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/SRR28959339.kraken.out.txt > ~/microbial_exercise/SRR28959339.kraken.report.txt
</pre>

The output files contain:
* .kraken.out: A classification result for each read
* .kraken.report: A summary report of taxonomic assignments

==== Question 2 ====
How does your result align with the mNGS result in Table S3 from Nielsen et al.?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

You can use the following command to see if a considerable percentage (1st column) of reads classified to the species-level (6th column) corresponds to the species names presented in Table S3 from Nielsen et al.
<pre>
cat ~/microbial_exercise/SRR28959339.kraken.report.txt | grep -w S
</pre>

</div>

The size of the MiniKraken DB_8GB database used in this exercise, is around 8 gigabytes. Nielsen et al. uses another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 and virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.

==== Question 3 ====
Check [https://benlangmead.github.io/aws-indexes/k2 this webpage] to see what is the size of a database for Kraken 2 that is based on GTDB relases 226 ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

500,9 gigabytes

</div>

Your hospital have not yet implemented mNGS, but the hospital already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a ''Stapholycoccus aureus'' isolate and have made 100,000 of the sequence reads available for you in <code>/home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz</code>.

The hospital laboratory asks you to run Kraken on the 100,000 reads with these commands:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.report.txt
</pre>

The vast majority of reads that could be classified to species-level are assigned to ''Stapholycoccus aureus'', but also a minority of reads are assigned to other species.

==== Question 4 ====
Do you think that these reads represent contamination of DNA from other species ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

The reads could represent contamination contamination, but they could also represent misclassifications by Kraken.
</div>

== Task02: Species-level taxonomic classification of bacterial genomes ==

The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from clinical specimens. The sequence reads have been ''de novo'' assembled with [https://github.com/tseemann/shovill Shovill}], and the genome assemblies are stored in FASTA-formatted files available in <code>/home/projects/microbial_genomics/genome_assemblies</code>.

You want to determine the bacterial species of the assembled genomes.

We can use the [https://github.com/Ecogenomics/GTDBTk GTDB-Tk tool] to assign taxonomic classifications to bacterial genomes based on the [https://gtdb.ecogenomic.org Genome Taxonomy Database (GTDB)].

Run <code>gtdbtk.sh -h</code> to get help information on how to use GTDB-Tk.

Unfortunately, GTDK-Tk uses a lot of memory, so it is not feasible to run the tool at the course servers. Instead, the following command has already been to determine the species of the genomes and the results have been moved to <code>/home/projects/microbial_genomics/premade_task_results</code>:
<pre>
gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/task02_assemblies --out_dir ~/microbial_exercise/output_gtdb
mv ~/microbial_exercise/output_gtdb /home/projects/microbial_genomics/premade_task_results/.
</pre>

==== Question 1 ====
What species are the genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH031220x1x74_210120_A00559_AHYKKMDRXX Escherichia ruysiae
SH090321x2x74_210615_A01176_AHFKVKDRXY Escherichia marmotae
SH178x24_220708_NS500314_AHTFHHAFX3 Klebsiella michiganensis
SH179x37_220805_A01176_AHG77VDRX2 Citrobacter braakii
SH182x46_220812_A00559_BHG7JLDRX2 Klebsiella grimontii
SH185x39_220812_A00559_BHG7JLDRX2 Acinetobacter geminorum
SH189x45_220812_A00559_BHG7JLDRX2 Mycobacterium intracellulare
SH204x81_220930_NS500314_AHW7FKAFX3 No species called
SH204x82_220930_NS500314_AHW7FKAFX3 No species called
SH214x27_221104_A00559_AHKYCVDRX2 Citrobacter europaeus
SH214x54_221104_A00559_AHKYCVDRX2 Klebsiella huaxiensis
SH232x83_230120_A01411_AHKYCLDRX2 Enterococcus_B lactis
SH249x46_230306_A01411_AHMF5GDRX2 Klebsiella africana
SH257x03_230317_A00559_AHM5H3DRX2 Enterobacter mori
SH272x52_230512_A00559_AH5MNFDRX3 Stenotrophomonas maltophilia_AJ
SH274x36_230421_A00559_AHYGVWDRX2 Klebsiella michiganensis
SH275x46_230526_A01411_BHC33YDRX3 Klebsiella variicola
SH283x25_230721_A00559_BHGVVNDRX3 Citrobacter braakii
SH292x11_230714_A01411_AHGW7MDRX3 Stenotrophomonas maltophilia_G
SH315x12_230929_A01176_AHKHK3DRX3 No species called
SH333x35_231006_NB501792_AHMVFTAFX5 Enterobacter intestinihominis
SH338x33_231110_A00559_AHMTNJDRX3 Klebsiella quasipneumoniae
SH339x55_231020_NB501792_AHMTK7AFX5 Citrobacter_A sedlakii
SH344x10_231103_A00559_BHMN2KDRX3 Escherichia marmotae
SH344x63_231103_A00559_BHMN2KDRX3 Bacillus_A cereus
SH351x67_231211_A00559_AHMTCTDRX3 Acinetobacter pittii
SH372x20_240212_A01961_AHVFV3DRX3 Acinetobacter fasciculus
SH406x40_240426_NB501792_AH5LK3AFX7 Acinetobacter nosocomialis
SH420x32_240607_A01961_AH7HW2DRX5 Achromobacter ruhlandii
SH429x69_240628_A01961_AHC7NCDRX5 No species called
SH432x68_240705_A00559_AHFKFHDRX5 No species called
SH437x71_240906_A01411_AHGVFVDRX5 Klebsiella planticola
SH443x95_240809_A00559_AHFL5YDRX5 No species called
SH463x12_240927_NS500314_AHHJHMAFX7 Citrobacter portucalensis
SH485x65_241122_A01411_BHM2VFDRX5 No species called
SH490x72_241206_A01411_AHM2CHDRX5 No species called
SH512x30_250321_A01961_BHTMCKDRX5 Acinetobacter fasciculus
SH529x38_250613_LH00793_B22C72LLT1 Klebsiella grimontii
SH545x01_250516_A01411_AHYJHCDRX5 Serratia nevei
SH567x14_250926_LH00255_A22YTGHLT3 Klebsiella variicola
SH623x49_251128_LH00255_B22FYTTLT1 No species called
SH623x50_251128_LH00255_B22FYTTLT1 No species called
</pre>

</div>

GTDB-Tk is unable to call a species for 11 of the genomes. Two of these genomes are SH315x12_230929_A01176_AHKHK3DRX3 and SH490x72_241206_A01411_AHM2CHDRX5.

==== Question 2 ====
What is the average nucleotide identity for the best match for each of these genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH315x12_230929_A01176_AHKHK3DRX3 GCA_023446155.1, s__Flavobacterium sp023446155, 90.92%
SH490x72_241206_A01411_AHM2CHDRX5 GCA_021654635.1, s__Mycobacterium kiyosense, 82.85%
</pre>

</div>

== Task03: Subspecies-level genetic typing of bacterial genomes ==

After species identification, it is most often relevant to further genetically group isolates at a subspecies-level. Such groups may be referred to as genotypes, lineages, clone types, clonal complexes, strains, depending on the organism and method used, and methods may differ the approach and the genetic markers that they use for grouping.

Bacterial genomes are commonly grouped using multilocus sequence typing (MLST), which uses the sequences of typically seven gene loci present in all members of a species. The combination of allele numbers assigned to sequence variants at each locus defines the sequence type (ST).

The laboratory has sequenced and de novo assembled the genomes of 36 bacterial isolates of species ''Pseudomonas aeruginosa''. The assembled genomes are available in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

Use tool [https://github.com/tseemann/mlst mlst] by Torsten Seemann to determine the sequence types of the 36 genomes.

Start by inspecting the available options and then run the tool on the genomes:

<pre>
mlst.sh -h
mlst.sh --scheme paeruginosa /home/projects/microbial_genomics/task03_assemblies/*fna > ~/microbial_exercise/output_mlst.txt
</pre>

You will find 3 different allele combinations among the 36 genomes. E.g., use this command to see the allele combinations:
<pre>
cat output_mlst.txt | cut -f4-10 |sort | uniq -c
</pre>

==== Question 1 ====
What allele combinations did you find and what are the correspoonding sequence types ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
Unknown sequence type 6 that has allele combination acsA(38) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
Sequence type 2326 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(202)
Sequence type 207 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
</pre>

</div>

==== Question 2 ====
How many alleles differ between the sequence types ? (ST207 versus ST2326, ST207 versus the unknown ST, ST2326 versus the unknown ST)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
ST207 versus ST2326: 1 allele differs
ST207 versus the unknown ST: 1 allele differs
ST2326 versus the unknown ST: 2 allele differs
</pre>

</div>

== Task04: Whole-genome based determination of genetic relatedness ==

You used sequence information from seven genes in the previous task; nonetheless, we can improve the resolution of our genomic analysis if we make use of the whole genome sequence. Tool [https://github.com/marbl/parsnp Parsnp] can be used align whole genome sequences and to detect single nucleotide substitution variants in the sequence alignments. In the scope of this task, we refer to such substitution variants as single nucleotide polymorphisms (SNPs).

First, view the Parsnp help information:
<pre>
parsnp.sh -h
</pre>

You now want to use Parsnp to compare the 36 ''Pseudomonas aeruginosa'' genomes in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

You can use the following command to compare the genomes with Parsnp:
<pre>
parsnp.sh -r ! -d /home/projects/microbial_genomics/task03_assemblies -o ~/microbial_exercise/output_parsnp -p 10 -c --use-fasttree
</pre>

Above command produces an output file ''parsnp.ggr'' whcih is a compressed representation of the alignment result.

You can use [https://github.com/marbl/harvest-tools Harvesttools] to convert alignment into an alignment represented in a multi-FASTA format that include only SNPs positions:
<pre>
harvesttools.sh -i ~/microbial_exercise/output_parsnp/parsnp.ggr -S ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta
</pre>

You can use [https://github.com/tseemann/snp-dists snp-dists] to convert the multi-FASTA alignment into SNP-distances, i.e. a measure of the genetic relationship (distance) between the genomes. snp-dists finds pairwise SNP-distances that can be presented in a square matrix (standard representation) or in a list (option ''-m''). You can test both with the following commands:
<pre>
snp-dists.sh ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
</pre>

Note, all pairwise SNP-distances are outputted twice in both the matrix and the list (A versus B and B versus A, respectively).

Parsnp also uses the aligment to make a phylogenetic tree (file ''parsnp.tree'' in output). The tree is in Newick format, and you can, e.g., use [https://itol.embl.de/ iTol] to visualize the tree. Below, I have inserted a visualization of the tree:

[[File:Full_tree.png|1000px]]

The large genetic distances genomes of the unknown ST to both ST207 and ST2326 makes difficult to see the structure of the tree for genomes of the same ST. Below, I have inserted the same tree but without showing genomes of the unknown ST:

[[File:STunknown_removed_tree.png|1200px]]

The genomes are named according their ST and the origin of the bacterial isolate. The origin can either be a person (person 1 to 19) or the hospital environment. You may have noticed that the structure of the phylogenetic tree corresponds well to the ST and origin of genomes.

Now, you want to parse the listed pairwise genetic distances to find out how they are distributed dependent on whether the genomes are from the same ST and/or origin.

We already saw from the phylogenetic tree that genomes of the unknown ST are distant from genomes of ST207 and ST2326. You can use the following command to find the minimum and maximum distances between genomes of the unknown ST and genomes of either ST207 or ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

You can use the following command to find the minimum and maximum distances between genomes of ST207 and ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

==== Question 1 ====
Do you think that the grouping by MLST reflects well the genetic distances between genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Genomes of the unknown ST have minimum 16,339 SNPs to genomes of ST207 or ST2326, whereas genomes og ST207 are maximum different by 897 SNPs.

Accordingly, the unknown ST is genetically more distant from ST207 and ST2326 than it is apparent from the MLST.

</div>

Below is a slide from today's lecture that shows the distribution of genetic distances (pairwise SNP distances) in a collection of 446 ''Pseudomonas aeruginosa'' genomes.

[[File:Slide_pa_distribution.png|1000px]]

==== Question 2 ====

Based on the shown distribution, do you think that the following statements are true or false:
* A: ST207 and ST2326 are the same strain ?
* B: ST207 and the unknown ST are the same phylogroup ?
* C: ST2326 and the unknown ST are different strains ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
* A: True
* B: True
* C: True

</div>

You can use the following command to find the distances between genomes from the same person:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v environment | sed 's/_/\t/2; s/_/\t/6' | awk '$1==$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5
</pre>

==== Question 3 ====
If we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2, what is the typical SNP-distance between genomes from bacteria sampled from the same person ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
There are 7 to 122 SNPs between genomes from the same person (if we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2).

</div>

If two genomes differ by only a small number of SNPs, this suggests they diverged recently from a common ancestor. When such closely related genomes are identified in samples from different persons (patients) within a hospital, it raises the possibility that transmission occurred within the hospital.

You can find the minimum SNP-distance between genomes from different persons with this command:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | sed 's/_/\t/2; s/_/\t/6' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | grep -v environment | tail -n1
</pre>

==== Question 4 ====
Do some of the observed SNP-distances indicate that there has been transmission of ''Pseudomonas aeruginosa'' at the hospital ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Yes, some genomes from different persons have pairwise SNP-distances that are low relative the the distances between genomes from the same person. A quick search for information on the world wide web may suggest that for ''Pseudomonas aeruginosa'', SNP-distances less than around 30 SNPs indicates transmission at the hospital.

</div>

== Task05: Detection of antimicrobial resistance genes ==

Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.

The tool [https://github.com/tseemann/abricate Abricate] can be used to screen genome assemblies against curated resistance gene databases.

Inspect the available options and databases with the following commands:
<pre>
abricate.sh -h
abricate.sh --list
</pre>

Run Abricate on the five ST2326 genome assemblies using the Resfinder database:

<pre>
abricate.sh --db resfinder /home/projects/microbial_genomics/task03_assemblies/ST2326*.fna > ~/microbial_exercise/output_abricate.txt
</pre>

==== Question 1 ====

Are the same AMR genes found in the five genomes ?

==== Solution ====
No

</div>

==== Question 2 ====

Do all found AMR genes have a good match to a database reference gene ? (sequence coverage of > 90% and sequence identity > 95%)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
No, three gene calls fails the criteria for a good match. Following command can be used the find the three gene calls:
<pre>
cat ~/microbial_exercise/output_abricate.txt | cut -f1,6,10,11 | awk '$3<=90 || $4<=95'
</pre>

</div>

== Supplementary information and files ==

[https://teaching.healthtech.dtu.dk/22126/images/c/cd/Gtdbtk.pdf Article describing GTDB-Tk can be found here]

The BASH executables in <code>/home/ctools/bin</code> uses Apptainer container images (.sif file format) that are located <code>/home/projects/microbial_genomics/singularity_image_files</code>.

Microbial genomics exercise

2026-01-11T21:39:17Z

Rasmus: /* Task05: Detection of antimicrobial resistance genes */

== Introduction and getting ready ==

Dear course participants,

In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.

You can run the tools using the provided BASH executables in <code>/home/ctools/bin</code>.
The paths to the BASH executables are:

<pre>
/home/ctools/bin/kraken.sh
/home/ctools/bin/kraken_report.sh
/home/ctools/bin/gtdbtk.sh
/home/ctools/bin/mlst.sh
/home/ctools/bin/parsnp.sh
/home/ctools/bin/abricate.sh
</pre>

<code>/home/ctools/bin</code> should already be part of your standard configured $PATH, but, if not, then you can add it by the following command and you should be able to execute the tools with writing the full path:
<pre>
export PATH="/home/ctools/bin:$PATH"
</pre>

It is preferred that you should run the commands from your $HOME directory. So start by these commands that prints the path to your home directory and takes you there:
<pre>
echo $HOME
cd
</pre>

Now, let us image that we are employed at a hospital to provide diagnostics for patient care...

== Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens ==

You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".

[[Media:Nielsen et al 2025 Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections.pdf|Click here to see research article by Nielsen et al.]]

Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing tests in National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) in [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1108520/ BioProject PRJNA1108520].

You can use the prefetch and fasterq-dump tools from the [https://github.com/ncbi/sra-tools SRA Toolkit] to download the data. This has already been done and the data is in <code>/home/projects/microbial_genomics/task01_sequence_reads</code>.

Here are the commands that can be used to acquire the data (this is already done, but you are welcome to try yourselves in your home directory:
<pre>
cd $HOME
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX
</pre>

The files are named according to the SRA run accessions numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information [https://teaching.healthtech.dtu.dk/22126/images/6/64/Spectrum.03295-24-s0001.docx available here].

Table that translates run accession numbers to patient identifiers:

{| class="wikitable"
! sra_run_accession !! sample_id !! patient_id
|-
| SRR28959350 || s001 || p001
|-
| SRR28959349 || s002 || p002
|-
| SRR28959338 || s004 || p020
|-
| SRR28959334 || s006 || p028
|-
| SRR28959333 || s008 || p049
|-
| SRR28959332 || s009 || p072
|-
| SRR28959331 || s013 || p091
|-
| SRR28959330 || s014 || p098
|-
| SRR28959329 || s015 || p104
|-
| SRR28959328 || s016 || p105
|-
| SRR28959348 || s029 || p019
|-
| SRR28959347 || s031 || p092
|-
| SRR28959346 || s033 || p127
|-
| SRR28959345 || s034 || p128
|-
| SRR28959344 || s041 || p139
|-
| SRR28959343 || s042 || p140
|-
| SRR28959342 || s043 || p141
|-
| SRR28959341 || s044 || p143
|-
| SRR28959335 || s057 || p164
|-
| SRR28959340 || s058 || p172
|-
| SRR28959339 || s059 || p173
|-
| SRR28959337 || s060 || p175
|-
| SRR28959336 || s061 || p183
|}

The article does not mention how many microbial DNA sequencing reads that there are in each of the 23 positive metagenomic next-generation sequencing tests.

==== Question 1 ====
Check how many reads are in the FASTQ file for each of the run accessions, .e.g., by using the following command ?

<pre>
wc -l /home/projects/microbial_genomics/task01_sequence_reads/*.fastq | awk 'NF==2 {print $2 ": " $1/4}'
</pre>

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959328.fastq: 174
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq: 6489
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959330.fastq: 43
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959331.fastq: 56
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959332.fastq: 188
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959333.fastq: 9
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959334.fastq: 1426
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959335.fastq: 288
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959336.fastq: 839
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959337.fastq: 899
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959338.fastq: 6
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq: 217
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959340.fastq: 1879
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959341.fastq: 12
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959342.fastq: 45
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959343.fastq: 251
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959344.fastq: 2572
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959345.fastq: 69
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959346.fastq: 180
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959347.fastq: 37
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959348.fastq: 224
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959349.fastq: 303
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959350.fastq: 16
</pre>

</div>

Now, you want to try to determine the species the DNA that has been sequenced. For this, you can use Kraken that is a tool for taxonomic classification of DNA sequnece reads. Kraken does this by splitting the sequence read into k-mers that are then matched to a reference database with information about the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assign the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, the Kraken performance relies on the sequence and taxonomic classification of the genomes that has been used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relative small MiniKraken DB_8GB database.

Run Kraken for two of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/microbial_exercise/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/SRR28959339.kraken.out.txt > ~/microbial_exercise/SRR28959339.kraken.report.txt
</pre>

The output files contain:
* .kraken.out: A classification result for each read
* .kraken.report: A summary report of taxonomic assignments

==== Question 2 ====
How does your result align with the mNGS result in Table S3 from Nielsen et al.?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

You can use the following command to see if a considerable percentage (1st column) of reads classified to the species-level (6th column) corresponds to the species names presented in Table S3 from Nielsen et al.
<pre>
cat ~/microbial_exercise/SRR28959339.kraken.report.txt | grep -w S
</pre>

</div>

The size of the MiniKraken DB_8GB database used in this exercise, is around 8 gigabytes. Nielsen et al. uses another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 and virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.

==== Question 3 ====
Check [https://benlangmead.github.io/aws-indexes/k2 this webpage] to see what is the size of a database for Kraken 2 that is based on GTDB relases 226 ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

500,9 gigabytes

</div>

Your hospital have not yet implemented mNGS, but the hospital already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a ''Stapholycoccus aureus'' isolate and have made 100,000 of the sequence reads available for you in <code>/home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz</code>.

The hospital laboratory asks you to run Kraken on the 100,000 reads with these commands:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.report.txt
</pre>

The vast majority of reads that could be classified to species-level are assigned to ''Stapholycoccus aureus'', but also a minority of reads are assigned to other species.

==== Question 4 ====
Do you think that these reads represent contamination of DNA from other species ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

The reads could represent contamination contamination, but they could also represent misclassifications by Kraken.
</div>

== Task02: Species-level taxonomic classification of bacterial genomes ==

The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from clinical specimens. The sequence reads have been ''de novo'' assembled with [https://github.com/tseemann/shovill Shovill}], and the genome assemblies are stored in FASTA-formatted files available in <code>/home/projects/microbial_genomics/genome_assemblies</code>.

You want to determine the bacterial species of the assembled genomes.

We can use the [https://github.com/Ecogenomics/GTDBTk GTDB-Tk tool] to assign taxonomic classifications to bacterial genomes based on the [https://gtdb.ecogenomic.org Genome Taxonomy Database (GTDB)].

Run <code>gtdbtk.sh -h</code> to get help information on how to use GTDB-Tk.

Unfortunately, GTDK-Tk uses a lot of memory, so it is not feasible to run the tool at the course servers. Instead, the following command has already been to determine the species of the genomes and the results have been moved to <code>/home/projects/microbial_genomics/premade_task_results</code>:
<pre>
gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/task02_assemblies --out_dir ~/microbial_exercise/output_gtdb
mv ~/microbial_exercise/output_gtdb /home/projects/microbial_genomics/premade_task_results/.
</pre>

==== Question 1 ====
What species are the genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH031220x1x74_210120_A00559_AHYKKMDRXX Escherichia ruysiae
SH090321x2x74_210615_A01176_AHFKVKDRXY Escherichia marmotae
SH178x24_220708_NS500314_AHTFHHAFX3 Klebsiella michiganensis
SH179x37_220805_A01176_AHG77VDRX2 Citrobacter braakii
SH182x46_220812_A00559_BHG7JLDRX2 Klebsiella grimontii
SH185x39_220812_A00559_BHG7JLDRX2 Acinetobacter geminorum
SH189x45_220812_A00559_BHG7JLDRX2 Mycobacterium intracellulare
SH204x81_220930_NS500314_AHW7FKAFX3 No species called
SH204x82_220930_NS500314_AHW7FKAFX3 No species called
SH214x27_221104_A00559_AHKYCVDRX2 Citrobacter europaeus
SH214x54_221104_A00559_AHKYCVDRX2 Klebsiella huaxiensis
SH232x83_230120_A01411_AHKYCLDRX2 Enterococcus_B lactis
SH249x46_230306_A01411_AHMF5GDRX2 Klebsiella africana
SH257x03_230317_A00559_AHM5H3DRX2 Enterobacter mori
SH272x52_230512_A00559_AH5MNFDRX3 Stenotrophomonas maltophilia_AJ
SH274x36_230421_A00559_AHYGVWDRX2 Klebsiella michiganensis
SH275x46_230526_A01411_BHC33YDRX3 Klebsiella variicola
SH283x25_230721_A00559_BHGVVNDRX3 Citrobacter braakii
SH292x11_230714_A01411_AHGW7MDRX3 Stenotrophomonas maltophilia_G
SH315x12_230929_A01176_AHKHK3DRX3 No species called
SH333x35_231006_NB501792_AHMVFTAFX5 Enterobacter intestinihominis
SH338x33_231110_A00559_AHMTNJDRX3 Klebsiella quasipneumoniae
SH339x55_231020_NB501792_AHMTK7AFX5 Citrobacter_A sedlakii
SH344x10_231103_A00559_BHMN2KDRX3 Escherichia marmotae
SH344x63_231103_A00559_BHMN2KDRX3 Bacillus_A cereus
SH351x67_231211_A00559_AHMTCTDRX3 Acinetobacter pittii
SH372x20_240212_A01961_AHVFV3DRX3 Acinetobacter fasciculus
SH406x40_240426_NB501792_AH5LK3AFX7 Acinetobacter nosocomialis
SH420x32_240607_A01961_AH7HW2DRX5 Achromobacter ruhlandii
SH429x69_240628_A01961_AHC7NCDRX5 No species called
SH432x68_240705_A00559_AHFKFHDRX5 No species called
SH437x71_240906_A01411_AHGVFVDRX5 Klebsiella planticola
SH443x95_240809_A00559_AHFL5YDRX5 No species called
SH463x12_240927_NS500314_AHHJHMAFX7 Citrobacter portucalensis
SH485x65_241122_A01411_BHM2VFDRX5 No species called
SH490x72_241206_A01411_AHM2CHDRX5 No species called
SH512x30_250321_A01961_BHTMCKDRX5 Acinetobacter fasciculus
SH529x38_250613_LH00793_B22C72LLT1 Klebsiella grimontii
SH545x01_250516_A01411_AHYJHCDRX5 Serratia nevei
SH567x14_250926_LH00255_A22YTGHLT3 Klebsiella variicola
SH623x49_251128_LH00255_B22FYTTLT1 No species called
SH623x50_251128_LH00255_B22FYTTLT1 No species called
</pre>

</div>

GTDB-Tk is unable to call a species for 11 of the genomes. Two of these genomes are SH315x12_230929_A01176_AHKHK3DRX3 and SH490x72_241206_A01411_AHM2CHDRX5.

==== Question 2 ====
What is the average nucleotide identity for the best match for each of these genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH315x12_230929_A01176_AHKHK3DRX3 GCA_023446155.1, s__Flavobacterium sp023446155, 90.92%
SH490x72_241206_A01411_AHM2CHDRX5 GCA_021654635.1, s__Mycobacterium kiyosense, 82.85%
</pre>

</div>

== Task03: Subspecies-level genetic typing of bacterial genomes ==

After species identification, it is most often relevant to further genetically group isolates at a subspecies-level. Such groups may be referred to as genotypes, lineages, clone types, clonal complexes, strains, depending on the organism and method used, and methods may differ the approach and the genetic markers that they use for grouping.

Bacterial genomes are commonly grouped using multilocus sequence typing (MLST), which uses the sequences of typically seven gene loci present in all members of a species. The combination of allele numbers assigned to sequence variants at each locus defines the sequence type (ST).

The laboratory has sequenced and de novo assembled the genomes of 36 bacterial isolates of species ''Pseudomonas aeruginosa''. The assembled genomes are available in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

Use tool [https://github.com/tseemann/mlst mlst] by Torsten Seemann to determine the sequence types of the 36 genomes.

Start by inspecting the available options and then run the tool on the genomes:

<pre>
mlst.sh -h
mlst.sh --scheme paeruginosa /home/projects/microbial_genomics/task03_assemblies/*fna > ~/microbial_exercise/output_mlst.txt
</pre>

You will find 3 different allele combinations among the 36 genomes. E.g., use this command to see the allele combinations:
<pre>
cat output_mlst.txt | cut -f4-10 |sort | uniq -c
</pre>

==== Question 1 ====
What allele combinations did you find and what are the correspoonding sequence types ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
Unknown sequence type 6 that has allele combination acsA(38) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
Sequence type 2326 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(202)
Sequence type 207 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
</pre>

</div>

==== Question 2 ====
How many alleles differ between the sequence types ? (ST207 versus ST2326, ST207 versus the unknown ST, ST2326 versus the unknown ST)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
ST207 versus ST2326: 1 allele differs
ST207 versus the unknown ST: 1 allele differs
ST2326 versus the unknown ST: 2 allele differs
</pre>

</div>

== Task04: Whole-genome based determination of genetic relatedness ==

You used sequence information from seven genes in the previous task; nonetheless, we can improve the resolution of our genomic analysis if we make use of the whole genome sequence. Tool [https://github.com/marbl/parsnp Parsnp] can be used align whole genome sequences and to detect single nucleotide substitution variants in the sequence alignments. In the scope of this task, we refer to such substitution variants as single nucleotide polymorphisms (SNPs).

First, view the Parsnp help information:
<pre>
parsnp.sh -h
</pre>

You now want to use Parsnp to compare the 36 ''Pseudomonas aeruginosa'' genomes in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

You can use the following command to compare the genomes with Parsnp:
<pre>
parsnp.sh -r ! -d /home/projects/microbial_genomics/task03_assemblies -o ~/microbial_exercise/output_parsnp -p 10 -c --use-fasttree
</pre>

Above command produces an output file ''parsnp.ggr'' whcih is a compressed representation of the alignment result.

You can use [https://github.com/marbl/harvest-tools Harvesttools] to convert alignment into an alignment represented in a multi-FASTA format that include only SNPs positions:
<pre>
harvesttools.sh -i ~/microbial_exercise/output_parsnp/parsnp.ggr -S ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta
</pre>

You can use [https://github.com/tseemann/snp-dists snp-dists] to convert the multi-FASTA alignment into SNP-distances, i.e. a measure of the genetic relationship (distance) between the genomes. snp-dists finds pairwise SNP-distances that can be presented in a square matrix (standard representation) or in a list (option ''-m''). You can test both with the following commands:
<pre>
snp-dists.sh ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
</pre>

Note, all pairwise SNP-distances are outputted twice in both the matrix and the list (A versus B and B versus A, respectively).

Parsnp also uses the aligment to make a phylogenetic tree (file ''parsnp.tree'' in output). The tree is in Newick format, and you can, e.g., use [https://itol.embl.de/ iTol] to visualize the tree. Below, I have inserted a visualization of the tree:

[[File:Full_tree.png|1000px]]

The large genetic distances genomes of the unknown ST to both ST207 and ST2326 makes difficult to see the structure of the tree for genomes of the same ST. Below, I have inserted the same tree but without showing genomes of the unknown ST:

[[File:STunknown_removed_tree.png|1200px]]

The genomes are named according their ST and the origin of the bacterial isolate. The origin can either be a person (person 1 to 19) or the hospital environment. You may have noticed that the structure of the phylogenetic tree corresponds well to the ST and origin of genomes.

Now, you want to parse the listed pairwise genetic distances to find out how they are distributed dependent on whether the genomes are from the same ST and/or origin.

We already saw from the phylogenetic tree that genomes of the unknown ST are distant from genomes of ST207 and ST2326. You can use the following command to find the minimum and maximum distances between genomes of the unknown ST and genomes of either ST207 or ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

You can use the following command to find the minimum and maximum distances between genomes of ST207 and ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

==== Question 1 ====
Do you think that the grouping by MLST reflects well the genetic distances between genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Genomes of the unknown ST have minimum 16,339 SNPs to genomes of ST207 or ST2326, whereas genomes og ST207 are maximum different by 897 SNPs.

Accordingly, the unknown ST is genetically more distant from ST207 and ST2326 than it is apparent from the MLST.

</div>

Below is a slide from today's lecture that shows the distribution of genetic distances (pairwise SNP distances) in a collection of 446 ''Pseudomonas aeruginosa'' genomes.

[[File:Slide_pa_distribution.png|1000px]]

==== Question 2 ====

Based on the shown distribution, do you think that the following statements are true or false:
* A: ST207 and ST2326 are the same strain ?
* B: ST207 and the unknown ST are the same phylogroup ?
* C: ST2326 and the unknown ST are different strains ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
* A: True
* B: True
* C: True

</div>

You can use the following command to find the distances between genomes from the same person:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v environment | sed 's/_/\t/2; s/_/\t/6' | awk '$1==$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5
</pre>

==== Question 3 ====
If we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2, what is the typical SNP-distance between genomes from bacteria sampled from the same person ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
There are 7 to 122 SNPs between genomes from the same person (if we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2).

</div>

If two genomes differ by only a small number of SNPs, this suggests they diverged recently from a common ancestor. When such closely related genomes are identified in samples from different persons (patients) within a hospital, it raises the possibility that transmission occurred within the hospital.

You can find the minimum SNP-distance between genomes from different persons with this command:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | sed 's/_/\t/2; s/_/\t/6' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | grep -v environment | tail -n1
</pre>

==== Question 4 ====
Do some of the observed SNP-distances indicate that there has been transmission of ''Pseudomonas aeruginosa'' at the hospital ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Yes, some genomes from different persons have pairwise SNP-distances that are low relative the the distances between genomes from the same person. A quick search for information on the world wide web may suggest that for ''Pseudomonas aeruginosa'', SNP-distances less than around 30 SNPs indicates transmission at the hospital.

</div>

== Task05: Detection of antimicrobial resistance genes ==

Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.

The tool [https://github.com/tseemann/abricate Abricate] can be used to screen genome assemblies against curated resistance gene databases.

Inspect the available options and databases with the following commands:
<pre>
abricate.sh -h
abricate.sh --list
</pre>

Run Abricate on the five ST2326 genome assemblies using the Resfinder database:

<pre>
abricate.sh --db resfinder /home/projects/microbial_genomics/task03_assemblies/ST2326*.fna > ~/microbial_exercise/output_abricate.txt
</pre>

==== Question 1 ====

Are the AMR genes found in the five genomes ?

==== Question 2 ====

Do all found AMR genes have a good match to a database reference gene ? (sequence coverage of > 90% and sequence identity > 95%)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
fef
</pre>

</div>

== Supplementary information and files ==

[https://teaching.healthtech.dtu.dk/22126/images/c/cd/Gtdbtk.pdf Article describing GTDB-Tk can be found here]

The BASH executables in <code>/home/ctools/bin</code> uses Apptainer container images (.sif file format) that are located <code>/home/projects/microbial_genomics/singularity_image_files</code>.

Microbial genomics exercise

2026-01-11T21:32:49Z

Rasmus: /* Task05: Detection of antimicrobial resistance genes */

== Introduction and getting ready ==

Dear course participants,

In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.

You can run the tools using the provided BASH executables in <code>/home/ctools/bin</code>.
The paths to the BASH executables are:

<pre>
/home/ctools/bin/kraken.sh
/home/ctools/bin/kraken_report.sh
/home/ctools/bin/gtdbtk.sh
/home/ctools/bin/mlst.sh
/home/ctools/bin/parsnp.sh
/home/ctools/bin/abricate.sh
</pre>

<code>/home/ctools/bin</code> should already be part of your standard configured $PATH, but, if not, then you can add it by the following command and you should be able to execute the tools with writing the full path:
<pre>
export PATH="/home/ctools/bin:$PATH"
</pre>

It is preferred that you should run the commands from your $HOME directory. So start by these commands that prints the path to your home directory and takes you there:
<pre>
echo $HOME
cd
</pre>

Now, let us image that we are employed at a hospital to provide diagnostics for patient care...

== Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens ==

You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".

[[Media:Nielsen et al 2025 Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections.pdf|Click here to see research article by Nielsen et al.]]

Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing tests in National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) in [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1108520/ BioProject PRJNA1108520].

You can use the prefetch and fasterq-dump tools from the [https://github.com/ncbi/sra-tools SRA Toolkit] to download the data. This has already been done and the data is in <code>/home/projects/microbial_genomics/task01_sequence_reads</code>.

Here are the commands that can be used to acquire the data (this is already done, but you are welcome to try yourselves in your home directory:
<pre>
cd $HOME
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX
</pre>

The files are named according to the SRA run accessions numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information [https://teaching.healthtech.dtu.dk/22126/images/6/64/Spectrum.03295-24-s0001.docx available here].

Table that translates run accession numbers to patient identifiers:

{| class="wikitable"
! sra_run_accession !! sample_id !! patient_id
|-
| SRR28959350 || s001 || p001
|-
| SRR28959349 || s002 || p002
|-
| SRR28959338 || s004 || p020
|-
| SRR28959334 || s006 || p028
|-
| SRR28959333 || s008 || p049
|-
| SRR28959332 || s009 || p072
|-
| SRR28959331 || s013 || p091
|-
| SRR28959330 || s014 || p098
|-
| SRR28959329 || s015 || p104
|-
| SRR28959328 || s016 || p105
|-
| SRR28959348 || s029 || p019
|-
| SRR28959347 || s031 || p092
|-
| SRR28959346 || s033 || p127
|-
| SRR28959345 || s034 || p128
|-
| SRR28959344 || s041 || p139
|-
| SRR28959343 || s042 || p140
|-
| SRR28959342 || s043 || p141
|-
| SRR28959341 || s044 || p143
|-
| SRR28959335 || s057 || p164
|-
| SRR28959340 || s058 || p172
|-
| SRR28959339 || s059 || p173
|-
| SRR28959337 || s060 || p175
|-
| SRR28959336 || s061 || p183
|}

The article does not mention how many microbial DNA sequencing reads that there are in each of the 23 positive metagenomic next-generation sequencing tests.

==== Question 1 ====
Check how many reads are in the FASTQ file for each of the run accessions, .e.g., by using the following command ?

<pre>
wc -l /home/projects/microbial_genomics/task01_sequence_reads/*.fastq | awk 'NF==2 {print $2 ": " $1/4}'
</pre>

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959328.fastq: 174
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq: 6489
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959330.fastq: 43
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959331.fastq: 56
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959332.fastq: 188
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959333.fastq: 9
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959334.fastq: 1426
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959335.fastq: 288
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959336.fastq: 839
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959337.fastq: 899
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959338.fastq: 6
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq: 217
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959340.fastq: 1879
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959341.fastq: 12
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959342.fastq: 45
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959343.fastq: 251
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959344.fastq: 2572
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959345.fastq: 69
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959346.fastq: 180
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959347.fastq: 37
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959348.fastq: 224
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959349.fastq: 303
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959350.fastq: 16
</pre>

</div>

Now, you want to try to determine the species the DNA that has been sequenced. For this, you can use Kraken that is a tool for taxonomic classification of DNA sequnece reads. Kraken does this by splitting the sequence read into k-mers that are then matched to a reference database with information about the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assign the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, the Kraken performance relies on the sequence and taxonomic classification of the genomes that has been used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relative small MiniKraken DB_8GB database.

Run Kraken for two of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/microbial_exercise/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/SRR28959339.kraken.out.txt > ~/microbial_exercise/SRR28959339.kraken.report.txt
</pre>

The output files contain:
* .kraken.out: A classification result for each read
* .kraken.report: A summary report of taxonomic assignments

==== Question 2 ====
How does your result align with the mNGS result in Table S3 from Nielsen et al.?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

You can use the following command to see if a considerable percentage (1st column) of reads classified to the species-level (6th column) corresponds to the species names presented in Table S3 from Nielsen et al.
<pre>
cat ~/microbial_exercise/SRR28959339.kraken.report.txt | grep -w S
</pre>

</div>

The size of the MiniKraken DB_8GB database used in this exercise, is around 8 gigabytes. Nielsen et al. uses another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 and virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.

==== Question 3 ====
Check [https://benlangmead.github.io/aws-indexes/k2 this webpage] to see what is the size of a database for Kraken 2 that is based on GTDB relases 226 ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

500,9 gigabytes

</div>

Your hospital have not yet implemented mNGS, but the hospital already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a ''Stapholycoccus aureus'' isolate and have made 100,000 of the sequence reads available for you in <code>/home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz</code>.

The hospital laboratory asks you to run Kraken on the 100,000 reads with these commands:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.report.txt
</pre>

The vast majority of reads that could be classified to species-level are assigned to ''Stapholycoccus aureus'', but also a minority of reads are assigned to other species.

==== Question 4 ====
Do you think that these reads represent contamination of DNA from other species ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

The reads could represent contamination contamination, but they could also represent misclassifications by Kraken.
</div>

== Task02: Species-level taxonomic classification of bacterial genomes ==

The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from clinical specimens. The sequence reads have been ''de novo'' assembled with [https://github.com/tseemann/shovill Shovill}], and the genome assemblies are stored in FASTA-formatted files available in <code>/home/projects/microbial_genomics/genome_assemblies</code>.

You want to determine the bacterial species of the assembled genomes.

We can use the [https://github.com/Ecogenomics/GTDBTk GTDB-Tk tool] to assign taxonomic classifications to bacterial genomes based on the [https://gtdb.ecogenomic.org Genome Taxonomy Database (GTDB)].

Run <code>gtdbtk.sh -h</code> to get help information on how to use GTDB-Tk.

Unfortunately, GTDK-Tk uses a lot of memory, so it is not feasible to run the tool at the course servers. Instead, the following command has already been to determine the species of the genomes and the results have been moved to <code>/home/projects/microbial_genomics/premade_task_results</code>:
<pre>
gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/task02_assemblies --out_dir ~/microbial_exercise/output_gtdb
mv ~/microbial_exercise/output_gtdb /home/projects/microbial_genomics/premade_task_results/.
</pre>

==== Question 1 ====
What species are the genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH031220x1x74_210120_A00559_AHYKKMDRXX Escherichia ruysiae
SH090321x2x74_210615_A01176_AHFKVKDRXY Escherichia marmotae
SH178x24_220708_NS500314_AHTFHHAFX3 Klebsiella michiganensis
SH179x37_220805_A01176_AHG77VDRX2 Citrobacter braakii
SH182x46_220812_A00559_BHG7JLDRX2 Klebsiella grimontii
SH185x39_220812_A00559_BHG7JLDRX2 Acinetobacter geminorum
SH189x45_220812_A00559_BHG7JLDRX2 Mycobacterium intracellulare
SH204x81_220930_NS500314_AHW7FKAFX3 No species called
SH204x82_220930_NS500314_AHW7FKAFX3 No species called
SH214x27_221104_A00559_AHKYCVDRX2 Citrobacter europaeus
SH214x54_221104_A00559_AHKYCVDRX2 Klebsiella huaxiensis
SH232x83_230120_A01411_AHKYCLDRX2 Enterococcus_B lactis
SH249x46_230306_A01411_AHMF5GDRX2 Klebsiella africana
SH257x03_230317_A00559_AHM5H3DRX2 Enterobacter mori
SH272x52_230512_A00559_AH5MNFDRX3 Stenotrophomonas maltophilia_AJ
SH274x36_230421_A00559_AHYGVWDRX2 Klebsiella michiganensis
SH275x46_230526_A01411_BHC33YDRX3 Klebsiella variicola
SH283x25_230721_A00559_BHGVVNDRX3 Citrobacter braakii
SH292x11_230714_A01411_AHGW7MDRX3 Stenotrophomonas maltophilia_G
SH315x12_230929_A01176_AHKHK3DRX3 No species called
SH333x35_231006_NB501792_AHMVFTAFX5 Enterobacter intestinihominis
SH338x33_231110_A00559_AHMTNJDRX3 Klebsiella quasipneumoniae
SH339x55_231020_NB501792_AHMTK7AFX5 Citrobacter_A sedlakii
SH344x10_231103_A00559_BHMN2KDRX3 Escherichia marmotae
SH344x63_231103_A00559_BHMN2KDRX3 Bacillus_A cereus
SH351x67_231211_A00559_AHMTCTDRX3 Acinetobacter pittii
SH372x20_240212_A01961_AHVFV3DRX3 Acinetobacter fasciculus
SH406x40_240426_NB501792_AH5LK3AFX7 Acinetobacter nosocomialis
SH420x32_240607_A01961_AH7HW2DRX5 Achromobacter ruhlandii
SH429x69_240628_A01961_AHC7NCDRX5 No species called
SH432x68_240705_A00559_AHFKFHDRX5 No species called
SH437x71_240906_A01411_AHGVFVDRX5 Klebsiella planticola
SH443x95_240809_A00559_AHFL5YDRX5 No species called
SH463x12_240927_NS500314_AHHJHMAFX7 Citrobacter portucalensis
SH485x65_241122_A01411_BHM2VFDRX5 No species called
SH490x72_241206_A01411_AHM2CHDRX5 No species called
SH512x30_250321_A01961_BHTMCKDRX5 Acinetobacter fasciculus
SH529x38_250613_LH00793_B22C72LLT1 Klebsiella grimontii
SH545x01_250516_A01411_AHYJHCDRX5 Serratia nevei
SH567x14_250926_LH00255_A22YTGHLT3 Klebsiella variicola
SH623x49_251128_LH00255_B22FYTTLT1 No species called
SH623x50_251128_LH00255_B22FYTTLT1 No species called
</pre>

</div>

GTDB-Tk is unable to call a species for 11 of the genomes. Two of these genomes are SH315x12_230929_A01176_AHKHK3DRX3 and SH490x72_241206_A01411_AHM2CHDRX5.

==== Question 2 ====
What is the average nucleotide identity for the best match for each of these genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH315x12_230929_A01176_AHKHK3DRX3 GCA_023446155.1, s__Flavobacterium sp023446155, 90.92%
SH490x72_241206_A01411_AHM2CHDRX5 GCA_021654635.1, s__Mycobacterium kiyosense, 82.85%
</pre>

</div>

== Task03: Subspecies-level genetic typing of bacterial genomes ==

After species identification, it is most often relevant to further genetically group isolates at a subspecies-level. Such groups may be referred to as genotypes, lineages, clone types, clonal complexes, strains, depending on the organism and method used, and methods may differ the approach and the genetic markers that they use for grouping.

Bacterial genomes are commonly grouped using multilocus sequence typing (MLST), which uses the sequences of typically seven gene loci present in all members of a species. The combination of allele numbers assigned to sequence variants at each locus defines the sequence type (ST).

The laboratory has sequenced and de novo assembled the genomes of 36 bacterial isolates of species ''Pseudomonas aeruginosa''. The assembled genomes are available in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

Use tool [https://github.com/tseemann/mlst mlst] by Torsten Seemann to determine the sequence types of the 36 genomes.

Start by inspecting the available options and then run the tool on the genomes:

<pre>
mlst.sh -h
mlst.sh --scheme paeruginosa /home/projects/microbial_genomics/task03_assemblies/*fna > ~/microbial_exercise/output_mlst.txt
</pre>

You will find 3 different allele combinations among the 36 genomes. E.g., use this command to see the allele combinations:
<pre>
cat output_mlst.txt | cut -f4-10 |sort | uniq -c
</pre>

==== Question 1 ====
What allele combinations did you find and what are the correspoonding sequence types ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
Unknown sequence type 6 that has allele combination acsA(38) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
Sequence type 2326 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(202)
Sequence type 207 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
</pre>

</div>

==== Question 2 ====
How many alleles differ between the sequence types ? (ST207 versus ST2326, ST207 versus the unknown ST, ST2326 versus the unknown ST)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
ST207 versus ST2326: 1 allele differs
ST207 versus the unknown ST: 1 allele differs
ST2326 versus the unknown ST: 2 allele differs
</pre>

</div>

== Task04: Whole-genome based determination of genetic relatedness ==

You used sequence information from seven genes in the previous task; nonetheless, we can improve the resolution of our genomic analysis if we make use of the whole genome sequence. Tool [https://github.com/marbl/parsnp Parsnp] can be used align whole genome sequences and to detect single nucleotide substitution variants in the sequence alignments. In the scope of this task, we refer to such substitution variants as single nucleotide polymorphisms (SNPs).

First, view the Parsnp help information:
<pre>
parsnp.sh -h
</pre>

You now want to use Parsnp to compare the 36 ''Pseudomonas aeruginosa'' genomes in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

You can use the following command to compare the genomes with Parsnp:
<pre>
parsnp.sh -r ! -d /home/projects/microbial_genomics/task03_assemblies -o ~/microbial_exercise/output_parsnp -p 10 -c --use-fasttree
</pre>

Above command produces an output file ''parsnp.ggr'' whcih is a compressed representation of the alignment result.

You can use [https://github.com/marbl/harvest-tools Harvesttools] to convert alignment into an alignment represented in a multi-FASTA format that include only SNPs positions:
<pre>
harvesttools.sh -i ~/microbial_exercise/output_parsnp/parsnp.ggr -S ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta
</pre>

You can use [https://github.com/tseemann/snp-dists snp-dists] to convert the multi-FASTA alignment into SNP-distances, i.e. a measure of the genetic relationship (distance) between the genomes. snp-dists finds pairwise SNP-distances that can be presented in a square matrix (standard representation) or in a list (option ''-m''). You can test both with the following commands:
<pre>
snp-dists.sh ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
</pre>

Note, all pairwise SNP-distances are outputted twice in both the matrix and the list (A versus B and B versus A, respectively).

Parsnp also uses the aligment to make a phylogenetic tree (file ''parsnp.tree'' in output). The tree is in Newick format, and you can, e.g., use [https://itol.embl.de/ iTol] to visualize the tree. Below, I have inserted a visualization of the tree:

[[File:Full_tree.png|1000px]]

The large genetic distances genomes of the unknown ST to both ST207 and ST2326 makes difficult to see the structure of the tree for genomes of the same ST. Below, I have inserted the same tree but without showing genomes of the unknown ST:

[[File:STunknown_removed_tree.png|1200px]]

The genomes are named according their ST and the origin of the bacterial isolate. The origin can either be a person (person 1 to 19) or the hospital environment. You may have noticed that the structure of the phylogenetic tree corresponds well to the ST and origin of genomes.

Now, you want to parse the listed pairwise genetic distances to find out how they are distributed dependent on whether the genomes are from the same ST and/or origin.

We already saw from the phylogenetic tree that genomes of the unknown ST are distant from genomes of ST207 and ST2326. You can use the following command to find the minimum and maximum distances between genomes of the unknown ST and genomes of either ST207 or ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

You can use the following command to find the minimum and maximum distances between genomes of ST207 and ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

==== Question 1 ====
Do you think that the grouping by MLST reflects well the genetic distances between genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Genomes of the unknown ST have minimum 16,339 SNPs to genomes of ST207 or ST2326, whereas genomes og ST207 are maximum different by 897 SNPs.

Accordingly, the unknown ST is genetically more distant from ST207 and ST2326 than it is apparent from the MLST.

</div>

Below is a slide from today's lecture that shows the distribution of genetic distances (pairwise SNP distances) in a collection of 446 ''Pseudomonas aeruginosa'' genomes.

[[File:Slide_pa_distribution.png|1000px]]

==== Question 2 ====

Based on the shown distribution, do you think that the following statements are true or false:
* A: ST207 and ST2326 are the same strain ?
* B: ST207 and the unknown ST are the same phylogroup ?
* C: ST2326 and the unknown ST are different strains ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
* A: True
* B: True
* C: True

</div>

You can use the following command to find the distances between genomes from the same person:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v environment | sed 's/_/\t/2; s/_/\t/6' | awk '$1==$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5
</pre>

==== Question 3 ====
If we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2, what is the typical SNP-distance between genomes from bacteria sampled from the same person ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
There are 7 to 122 SNPs between genomes from the same person (if we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2).

</div>

If two genomes differ by only a small number of SNPs, this suggests they diverged recently from a common ancestor. When such closely related genomes are identified in samples from different persons (patients) within a hospital, it raises the possibility that transmission occurred within the hospital.

You can find the minimum SNP-distance between genomes from different persons with this command:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | sed 's/_/\t/2; s/_/\t/6' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | grep -v environment | tail -n1
</pre>

==== Question 4 ====
Do some of the observed SNP-distances indicate that there has been transmission of ''Pseudomonas aeruginosa'' at the hospital ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Yes, some genomes from different persons have pairwise SNP-distances that are low relative the the distances between genomes from the same person. A quick search for information on the world wide web may suggest that for ''Pseudomonas aeruginosa'', SNP-distances less than around 30 SNPs indicates transmission at the hospital.

</div>

== Task05: Detection of antimicrobial resistance genes ==

Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.

The tool [https://github.com/tseemann/abricate Abricate] can be used to screen genome assemblies against curated resistance gene databases.

Inspect the available options and databases with the following commands:
<pre>
abricate.sh -h
abricate.sh --list
</pre>

Run Abricate on the assembled genomes using the Resfinder database:

<pre>
abricate.sh --db resfinder /home/projects/microbial_genomics/task03_assemblies/ST2326*.fna > ~/microbial_exercise/output_abricate.txt
</pre>

==== Question 1 ====

==== Question 2 ====

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
fef
</pre>

</div>

== Supplementary information and files ==

[https://teaching.healthtech.dtu.dk/22126/images/c/cd/Gtdbtk.pdf Article describing GTDB-Tk can be found here]

The BASH executables in <code>/home/ctools/bin</code> uses Apptainer container images (.sif file format) that are located <code>/home/projects/microbial_genomics/singularity_image_files</code>.

Microbial genomics exercise

2026-01-11T21:31:27Z

Rasmus: /* Task05: Detection of antimicrobial resistance genes */

== Introduction and getting ready ==

Dear course participants,

In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.

You can run the tools using the provided BASH executables in <code>/home/ctools/bin</code>.
The paths to the BASH executables are:

<pre>
/home/ctools/bin/kraken.sh
/home/ctools/bin/kraken_report.sh
/home/ctools/bin/gtdbtk.sh
/home/ctools/bin/mlst.sh
/home/ctools/bin/parsnp.sh
/home/ctools/bin/abricate.sh
</pre>

<code>/home/ctools/bin</code> should already be part of your standard configured $PATH, but, if not, then you can add it by the following command and you should be able to execute the tools with writing the full path:
<pre>
export PATH="/home/ctools/bin:$PATH"
</pre>

It is preferred that you should run the commands from your $HOME directory. So start by these commands that prints the path to your home directory and takes you there:
<pre>
echo $HOME
cd
</pre>

Now, let us image that we are employed at a hospital to provide diagnostics for patient care...

== Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens ==

You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".

[[Media:Nielsen et al 2025 Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections.pdf|Click here to see research article by Nielsen et al.]]

Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing tests in National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) in [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1108520/ BioProject PRJNA1108520].

You can use the prefetch and fasterq-dump tools from the [https://github.com/ncbi/sra-tools SRA Toolkit] to download the data. This has already been done and the data is in <code>/home/projects/microbial_genomics/task01_sequence_reads</code>.

Here are the commands that can be used to acquire the data (this is already done, but you are welcome to try yourselves in your home directory:
<pre>
cd $HOME
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX
</pre>

The files are named according to the SRA run accessions numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information [https://teaching.healthtech.dtu.dk/22126/images/6/64/Spectrum.03295-24-s0001.docx available here].

Table that translates run accession numbers to patient identifiers:

{| class="wikitable"
! sra_run_accession !! sample_id !! patient_id
|-
| SRR28959350 || s001 || p001
|-
| SRR28959349 || s002 || p002
|-
| SRR28959338 || s004 || p020
|-
| SRR28959334 || s006 || p028
|-
| SRR28959333 || s008 || p049
|-
| SRR28959332 || s009 || p072
|-
| SRR28959331 || s013 || p091
|-
| SRR28959330 || s014 || p098
|-
| SRR28959329 || s015 || p104
|-
| SRR28959328 || s016 || p105
|-
| SRR28959348 || s029 || p019
|-
| SRR28959347 || s031 || p092
|-
| SRR28959346 || s033 || p127
|-
| SRR28959345 || s034 || p128
|-
| SRR28959344 || s041 || p139
|-
| SRR28959343 || s042 || p140
|-
| SRR28959342 || s043 || p141
|-
| SRR28959341 || s044 || p143
|-
| SRR28959335 || s057 || p164
|-
| SRR28959340 || s058 || p172
|-
| SRR28959339 || s059 || p173
|-
| SRR28959337 || s060 || p175
|-
| SRR28959336 || s061 || p183
|}

The article does not mention how many microbial DNA sequencing reads that there are in each of the 23 positive metagenomic next-generation sequencing tests.

==== Question 1 ====
Check how many reads are in the FASTQ file for each of the run accessions, .e.g., by using the following command ?

<pre>
wc -l /home/projects/microbial_genomics/task01_sequence_reads/*.fastq | awk 'NF==2 {print $2 ": " $1/4}'
</pre>

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959328.fastq: 174
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq: 6489
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959330.fastq: 43
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959331.fastq: 56
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959332.fastq: 188
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959333.fastq: 9
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959334.fastq: 1426
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959335.fastq: 288
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959336.fastq: 839
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959337.fastq: 899
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959338.fastq: 6
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq: 217
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959340.fastq: 1879
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959341.fastq: 12
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959342.fastq: 45
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959343.fastq: 251
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959344.fastq: 2572
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959345.fastq: 69
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959346.fastq: 180
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959347.fastq: 37
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959348.fastq: 224
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959349.fastq: 303
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959350.fastq: 16
</pre>

</div>

Now, you want to try to determine the species the DNA that has been sequenced. For this, you can use Kraken that is a tool for taxonomic classification of DNA sequnece reads. Kraken does this by splitting the sequence read into k-mers that are then matched to a reference database with information about the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assign the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, the Kraken performance relies on the sequence and taxonomic classification of the genomes that has been used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relative small MiniKraken DB_8GB database.

Run Kraken for two of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/microbial_exercise/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/SRR28959339.kraken.out.txt > ~/microbial_exercise/SRR28959339.kraken.report.txt
</pre>

The output files contain:
* .kraken.out: A classification result for each read
* .kraken.report: A summary report of taxonomic assignments

==== Question 2 ====
How does your result align with the mNGS result in Table S3 from Nielsen et al.?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

You can use the following command to see if a considerable percentage (1st column) of reads classified to the species-level (6th column) corresponds to the species names presented in Table S3 from Nielsen et al.
<pre>
cat ~/microbial_exercise/SRR28959339.kraken.report.txt | grep -w S
</pre>

</div>

The size of the MiniKraken DB_8GB database used in this exercise, is around 8 gigabytes. Nielsen et al. uses another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 and virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.

==== Question 3 ====
Check [https://benlangmead.github.io/aws-indexes/k2 this webpage] to see what is the size of a database for Kraken 2 that is based on GTDB relases 226 ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

500,9 gigabytes

</div>

Your hospital have not yet implemented mNGS, but the hospital already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a ''Stapholycoccus aureus'' isolate and have made 100,000 of the sequence reads available for you in <code>/home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz</code>.

The hospital laboratory asks you to run Kraken on the 100,000 reads with these commands:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.report.txt
</pre>

The vast majority of reads that could be classified to species-level are assigned to ''Stapholycoccus aureus'', but also a minority of reads are assigned to other species.

==== Question 4 ====
Do you think that these reads represent contamination of DNA from other species ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

The reads could represent contamination contamination, but they could also represent misclassifications by Kraken.
</div>

== Task02: Species-level taxonomic classification of bacterial genomes ==

The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from clinical specimens. The sequence reads have been ''de novo'' assembled with [https://github.com/tseemann/shovill Shovill}], and the genome assemblies are stored in FASTA-formatted files available in <code>/home/projects/microbial_genomics/genome_assemblies</code>.

You want to determine the bacterial species of the assembled genomes.

We can use the [https://github.com/Ecogenomics/GTDBTk GTDB-Tk tool] to assign taxonomic classifications to bacterial genomes based on the [https://gtdb.ecogenomic.org Genome Taxonomy Database (GTDB)].

Run <code>gtdbtk.sh -h</code> to get help information on how to use GTDB-Tk.

Unfortunately, GTDK-Tk uses a lot of memory, so it is not feasible to run the tool at the course servers. Instead, the following command has already been to determine the species of the genomes and the results have been moved to <code>/home/projects/microbial_genomics/premade_task_results</code>:
<pre>
gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/task02_assemblies --out_dir ~/microbial_exercise/output_gtdb
mv ~/microbial_exercise/output_gtdb /home/projects/microbial_genomics/premade_task_results/.
</pre>

==== Question 1 ====
What species are the genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH031220x1x74_210120_A00559_AHYKKMDRXX Escherichia ruysiae
SH090321x2x74_210615_A01176_AHFKVKDRXY Escherichia marmotae
SH178x24_220708_NS500314_AHTFHHAFX3 Klebsiella michiganensis
SH179x37_220805_A01176_AHG77VDRX2 Citrobacter braakii
SH182x46_220812_A00559_BHG7JLDRX2 Klebsiella grimontii
SH185x39_220812_A00559_BHG7JLDRX2 Acinetobacter geminorum
SH189x45_220812_A00559_BHG7JLDRX2 Mycobacterium intracellulare
SH204x81_220930_NS500314_AHW7FKAFX3 No species called
SH204x82_220930_NS500314_AHW7FKAFX3 No species called
SH214x27_221104_A00559_AHKYCVDRX2 Citrobacter europaeus
SH214x54_221104_A00559_AHKYCVDRX2 Klebsiella huaxiensis
SH232x83_230120_A01411_AHKYCLDRX2 Enterococcus_B lactis
SH249x46_230306_A01411_AHMF5GDRX2 Klebsiella africana
SH257x03_230317_A00559_AHM5H3DRX2 Enterobacter mori
SH272x52_230512_A00559_AH5MNFDRX3 Stenotrophomonas maltophilia_AJ
SH274x36_230421_A00559_AHYGVWDRX2 Klebsiella michiganensis
SH275x46_230526_A01411_BHC33YDRX3 Klebsiella variicola
SH283x25_230721_A00559_BHGVVNDRX3 Citrobacter braakii
SH292x11_230714_A01411_AHGW7MDRX3 Stenotrophomonas maltophilia_G
SH315x12_230929_A01176_AHKHK3DRX3 No species called
SH333x35_231006_NB501792_AHMVFTAFX5 Enterobacter intestinihominis
SH338x33_231110_A00559_AHMTNJDRX3 Klebsiella quasipneumoniae
SH339x55_231020_NB501792_AHMTK7AFX5 Citrobacter_A sedlakii
SH344x10_231103_A00559_BHMN2KDRX3 Escherichia marmotae
SH344x63_231103_A00559_BHMN2KDRX3 Bacillus_A cereus
SH351x67_231211_A00559_AHMTCTDRX3 Acinetobacter pittii
SH372x20_240212_A01961_AHVFV3DRX3 Acinetobacter fasciculus
SH406x40_240426_NB501792_AH5LK3AFX7 Acinetobacter nosocomialis
SH420x32_240607_A01961_AH7HW2DRX5 Achromobacter ruhlandii
SH429x69_240628_A01961_AHC7NCDRX5 No species called
SH432x68_240705_A00559_AHFKFHDRX5 No species called
SH437x71_240906_A01411_AHGVFVDRX5 Klebsiella planticola
SH443x95_240809_A00559_AHFL5YDRX5 No species called
SH463x12_240927_NS500314_AHHJHMAFX7 Citrobacter portucalensis
SH485x65_241122_A01411_BHM2VFDRX5 No species called
SH490x72_241206_A01411_AHM2CHDRX5 No species called
SH512x30_250321_A01961_BHTMCKDRX5 Acinetobacter fasciculus
SH529x38_250613_LH00793_B22C72LLT1 Klebsiella grimontii
SH545x01_250516_A01411_AHYJHCDRX5 Serratia nevei
SH567x14_250926_LH00255_A22YTGHLT3 Klebsiella variicola
SH623x49_251128_LH00255_B22FYTTLT1 No species called
SH623x50_251128_LH00255_B22FYTTLT1 No species called
</pre>

</div>

GTDB-Tk is unable to call a species for 11 of the genomes. Two of these genomes are SH315x12_230929_A01176_AHKHK3DRX3 and SH490x72_241206_A01411_AHM2CHDRX5.

==== Question 2 ====
What is the average nucleotide identity for the best match for each of these genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH315x12_230929_A01176_AHKHK3DRX3 GCA_023446155.1, s__Flavobacterium sp023446155, 90.92%
SH490x72_241206_A01411_AHM2CHDRX5 GCA_021654635.1, s__Mycobacterium kiyosense, 82.85%
</pre>

</div>

== Task03: Subspecies-level genetic typing of bacterial genomes ==

After species identification, it is most often relevant to further genetically group isolates at a subspecies-level. Such groups may be referred to as genotypes, lineages, clone types, clonal complexes, strains, depending on the organism and method used, and methods may differ the approach and the genetic markers that they use for grouping.

Bacterial genomes are commonly grouped using multilocus sequence typing (MLST), which uses the sequences of typically seven gene loci present in all members of a species. The combination of allele numbers assigned to sequence variants at each locus defines the sequence type (ST).

The laboratory has sequenced and de novo assembled the genomes of 36 bacterial isolates of species ''Pseudomonas aeruginosa''. The assembled genomes are available in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

Use tool [https://github.com/tseemann/mlst mlst] by Torsten Seemann to determine the sequence types of the 36 genomes.

Start by inspecting the available options and then run the tool on the genomes:

<pre>
mlst.sh -h
mlst.sh --scheme paeruginosa /home/projects/microbial_genomics/task03_assemblies/*fna > ~/microbial_exercise/output_mlst.txt
</pre>

You will find 3 different allele combinations among the 36 genomes. E.g., use this command to see the allele combinations:
<pre>
cat output_mlst.txt | cut -f4-10 |sort | uniq -c
</pre>

==== Question 1 ====
What allele combinations did you find and what are the correspoonding sequence types ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
Unknown sequence type 6 that has allele combination acsA(38) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
Sequence type 2326 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(202)
Sequence type 207 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
</pre>

</div>

==== Question 2 ====
How many alleles differ between the sequence types ? (ST207 versus ST2326, ST207 versus the unknown ST, ST2326 versus the unknown ST)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
ST207 versus ST2326: 1 allele differs
ST207 versus the unknown ST: 1 allele differs
ST2326 versus the unknown ST: 2 allele differs
</pre>

</div>

== Task04: Whole-genome based determination of genetic relatedness ==

You used sequence information from seven genes in the previous task; nonetheless, we can improve the resolution of our genomic analysis if we make use of the whole genome sequence. Tool [https://github.com/marbl/parsnp Parsnp] can be used align whole genome sequences and to detect single nucleotide substitution variants in the sequence alignments. In the scope of this task, we refer to such substitution variants as single nucleotide polymorphisms (SNPs).

First, view the Parsnp help information:
<pre>
parsnp.sh -h
</pre>

You now want to use Parsnp to compare the 36 ''Pseudomonas aeruginosa'' genomes in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

You can use the following command to compare the genomes with Parsnp:
<pre>
parsnp.sh -r ! -d /home/projects/microbial_genomics/task03_assemblies -o ~/microbial_exercise/output_parsnp -p 10 -c --use-fasttree
</pre>

Above command produces an output file ''parsnp.ggr'' whcih is a compressed representation of the alignment result.

You can use [https://github.com/marbl/harvest-tools Harvesttools] to convert alignment into an alignment represented in a multi-FASTA format that include only SNPs positions:
<pre>
harvesttools.sh -i ~/microbial_exercise/output_parsnp/parsnp.ggr -S ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta
</pre>

You can use [https://github.com/tseemann/snp-dists snp-dists] to convert the multi-FASTA alignment into SNP-distances, i.e. a measure of the genetic relationship (distance) between the genomes. snp-dists finds pairwise SNP-distances that can be presented in a square matrix (standard representation) or in a list (option ''-m''). You can test both with the following commands:
<pre>
snp-dists.sh ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
</pre>

Note, all pairwise SNP-distances are outputted twice in both the matrix and the list (A versus B and B versus A, respectively).

Parsnp also uses the aligment to make a phylogenetic tree (file ''parsnp.tree'' in output). The tree is in Newick format, and you can, e.g., use [https://itol.embl.de/ iTol] to visualize the tree. Below, I have inserted a visualization of the tree:

[[File:Full_tree.png|1000px]]

The large genetic distances genomes of the unknown ST to both ST207 and ST2326 makes difficult to see the structure of the tree for genomes of the same ST. Below, I have inserted the same tree but without showing genomes of the unknown ST:

[[File:STunknown_removed_tree.png|1200px]]

The genomes are named according their ST and the origin of the bacterial isolate. The origin can either be a person (person 1 to 19) or the hospital environment. You may have noticed that the structure of the phylogenetic tree corresponds well to the ST and origin of genomes.

Now, you want to parse the listed pairwise genetic distances to find out how they are distributed dependent on whether the genomes are from the same ST and/or origin.

We already saw from the phylogenetic tree that genomes of the unknown ST are distant from genomes of ST207 and ST2326. You can use the following command to find the minimum and maximum distances between genomes of the unknown ST and genomes of either ST207 or ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

You can use the following command to find the minimum and maximum distances between genomes of ST207 and ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

==== Question 1 ====
Do you think that the grouping by MLST reflects well the genetic distances between genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Genomes of the unknown ST have minimum 16,339 SNPs to genomes of ST207 or ST2326, whereas genomes og ST207 are maximum different by 897 SNPs.

Accordingly, the unknown ST is genetically more distant from ST207 and ST2326 than it is apparent from the MLST.

</div>

Below is a slide from today's lecture that shows the distribution of genetic distances (pairwise SNP distances) in a collection of 446 ''Pseudomonas aeruginosa'' genomes.

[[File:Slide_pa_distribution.png|1000px]]

==== Question 2 ====

Based on the shown distribution, do you think that the following statements are true or false:
* A: ST207 and ST2326 are the same strain ?
* B: ST207 and the unknown ST are the same phylogroup ?
* C: ST2326 and the unknown ST are different strains ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
* A: True
* B: True
* C: True

</div>

You can use the following command to find the distances between genomes from the same person:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v environment | sed 's/_/\t/2; s/_/\t/6' | awk '$1==$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5
</pre>

==== Question 3 ====
If we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2, what is the typical SNP-distance between genomes from bacteria sampled from the same person ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
There are 7 to 122 SNPs between genomes from the same person (if we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2).

</div>

If two genomes differ by only a small number of SNPs, this suggests they diverged recently from a common ancestor. When such closely related genomes are identified in samples from different persons (patients) within a hospital, it raises the possibility that transmission occurred within the hospital.

You can find the minimum SNP-distance between genomes from different persons with this command:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | sed 's/_/\t/2; s/_/\t/6' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | grep -v environment | tail -n1
</pre>

==== Question 4 ====
Do some of the observed SNP-distances indicate that there has been transmission of ''Pseudomonas aeruginosa'' at the hospital ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Yes, some genomes from different persons have pairwise SNP-distances that are low relative the the distances between genomes from the same person. A quick search for information on the world wide web may suggest that for ''Pseudomonas aeruginosa'', SNP-distances less than around 30 SNPs indicates transmission at the hospital.

</div>

== Task05: Detection of antimicrobial resistance genes ==

Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.

The tool [https://github.com/tseemann/abricate Abricate] can be used to screen genome assemblies against curated resistance gene databases.

Inspect the available options and databases with the following commands:
<pre>
abricate.sh -h
abricate.sh --list
</pre>

Run Abricate on the assembled genomes using the Resfinder database:

<pre>
abricate.sh --db resfinder /home/projects/microbial_genomics/task03_assemblies/ST2326*.fna
</pre>

==== Question 1 ====

==== Question 2 ====

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
fef
</pre>

</div>

== Supplementary information and files ==

[https://teaching.healthtech.dtu.dk/22126/images/c/cd/Gtdbtk.pdf Article describing GTDB-Tk can be found here]

The BASH executables in <code>/home/ctools/bin</code> uses Apptainer container images (.sif file format) that are located <code>/home/projects/microbial_genomics/singularity_image_files</code>.

Microbial genomics exercise

2026-01-11T21:30:22Z

Rasmus:

== Introduction and getting ready ==

Dear course participants,

In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.

You can run the tools using the provided BASH executables in <code>/home/ctools/bin</code>.
The paths to the BASH executables are:

<pre>
/home/ctools/bin/kraken.sh
/home/ctools/bin/kraken_report.sh
/home/ctools/bin/gtdbtk.sh
/home/ctools/bin/mlst.sh
/home/ctools/bin/parsnp.sh
/home/ctools/bin/abricate.sh
</pre>

<code>/home/ctools/bin</code> should already be part of your standard configured $PATH, but, if not, then you can add it by the following command and you should be able to execute the tools with writing the full path:
<pre>
export PATH="/home/ctools/bin:$PATH"
</pre>

It is preferred that you should run the commands from your $HOME directory. So start by these commands that prints the path to your home directory and takes you there:
<pre>
echo $HOME
cd
</pre>

Now, let us image that we are employed at a hospital to provide diagnostics for patient care...

== Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens ==

You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".

[[Media:Nielsen et al 2025 Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections.pdf|Click here to see research article by Nielsen et al.]]

Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing tests in National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) in [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1108520/ BioProject PRJNA1108520].

You can use the prefetch and fasterq-dump tools from the [https://github.com/ncbi/sra-tools SRA Toolkit] to download the data. This has already been done and the data is in <code>/home/projects/microbial_genomics/task01_sequence_reads</code>.

Here are the commands that can be used to acquire the data (this is already done, but you are welcome to try yourselves in your home directory:
<pre>
cd $HOME
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX
</pre>

The files are named according to the SRA run accessions numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information [https://teaching.healthtech.dtu.dk/22126/images/6/64/Spectrum.03295-24-s0001.docx available here].

Table that translates run accession numbers to patient identifiers:

{| class="wikitable"
! sra_run_accession !! sample_id !! patient_id
|-
| SRR28959350 || s001 || p001
|-
| SRR28959349 || s002 || p002
|-
| SRR28959338 || s004 || p020
|-
| SRR28959334 || s006 || p028
|-
| SRR28959333 || s008 || p049
|-
| SRR28959332 || s009 || p072
|-
| SRR28959331 || s013 || p091
|-
| SRR28959330 || s014 || p098
|-
| SRR28959329 || s015 || p104
|-
| SRR28959328 || s016 || p105
|-
| SRR28959348 || s029 || p019
|-
| SRR28959347 || s031 || p092
|-
| SRR28959346 || s033 || p127
|-
| SRR28959345 || s034 || p128
|-
| SRR28959344 || s041 || p139
|-
| SRR28959343 || s042 || p140
|-
| SRR28959342 || s043 || p141
|-
| SRR28959341 || s044 || p143
|-
| SRR28959335 || s057 || p164
|-
| SRR28959340 || s058 || p172
|-
| SRR28959339 || s059 || p173
|-
| SRR28959337 || s060 || p175
|-
| SRR28959336 || s061 || p183
|}

The article does not mention how many microbial DNA sequencing reads that there are in each of the 23 positive metagenomic next-generation sequencing tests.

==== Question 1 ====
Check how many reads are in the FASTQ file for each of the run accessions, .e.g., by using the following command ?

<pre>
wc -l /home/projects/microbial_genomics/task01_sequence_reads/*.fastq | awk 'NF==2 {print $2 ": " $1/4}'
</pre>

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959328.fastq: 174
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq: 6489
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959330.fastq: 43
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959331.fastq: 56
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959332.fastq: 188
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959333.fastq: 9
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959334.fastq: 1426
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959335.fastq: 288
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959336.fastq: 839
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959337.fastq: 899
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959338.fastq: 6
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq: 217
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959340.fastq: 1879
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959341.fastq: 12
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959342.fastq: 45
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959343.fastq: 251
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959344.fastq: 2572
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959345.fastq: 69
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959346.fastq: 180
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959347.fastq: 37
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959348.fastq: 224
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959349.fastq: 303
/home/projects/microbial_genomics/task01_sequence_reads/SRR28959350.fastq: 16
</pre>

</div>

Now, you want to try to determine the species the DNA that has been sequenced. For this, you can use Kraken that is a tool for taxonomic classification of DNA sequnece reads. Kraken does this by splitting the sequence read into k-mers that are then matched to a reference database with information about the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assign the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, the Kraken performance relies on the sequence and taxonomic classification of the genomes that has been used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relative small MiniKraken DB_8GB database.

Run Kraken for two of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/microbial_exercise/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/SRR28959339.kraken.out.txt > ~/microbial_exercise/SRR28959339.kraken.report.txt
</pre>

The output files contain:
* .kraken.out: A classification result for each read
* .kraken.report: A summary report of taxonomic assignments

==== Question 2 ====
How does your result align with the mNGS result in Table S3 from Nielsen et al.?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

You can use the following command to see if a considerable percentage (1st column) of reads classified to the species-level (6th column) corresponds to the species names presented in Table S3 from Nielsen et al.
<pre>
cat ~/microbial_exercise/SRR28959339.kraken.report.txt | grep -w S
</pre>

</div>

The size of the MiniKraken DB_8GB database used in this exercise, is around 8 gigabytes. Nielsen et al. uses another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 and virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.

==== Question 3 ====
Check [https://benlangmead.github.io/aws-indexes/k2 this webpage] to see what is the size of a database for Kraken 2 that is based on GTDB relases 226 ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

500,9 gigabytes

</div>

Your hospital have not yet implemented mNGS, but the hospital already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a ''Stapholycoccus aureus'' isolate and have made 100,000 of the sequence reads available for you in <code>/home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz</code>.

The hospital laboratory asks you to run Kraken on the 100,000 reads with these commands:

<pre>
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.out.txt > ~/microbial_exercise/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1.kraken.report.txt
</pre>

The vast majority of reads that could be classified to species-level are assigned to ''Stapholycoccus aureus'', but also a minority of reads are assigned to other species.

==== Question 4 ====
Do you think that these reads represent contamination of DNA from other species ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====

The reads could represent contamination contamination, but they could also represent misclassifications by Kraken.
</div>

== Task02: Species-level taxonomic classification of bacterial genomes ==

The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from clinical specimens. The sequence reads have been ''de novo'' assembled with [https://github.com/tseemann/shovill Shovill}], and the genome assemblies are stored in FASTA-formatted files available in <code>/home/projects/microbial_genomics/genome_assemblies</code>.

You want to determine the bacterial species of the assembled genomes.

We can use the [https://github.com/Ecogenomics/GTDBTk GTDB-Tk tool] to assign taxonomic classifications to bacterial genomes based on the [https://gtdb.ecogenomic.org Genome Taxonomy Database (GTDB)].

Run <code>gtdbtk.sh -h</code> to get help information on how to use GTDB-Tk.

Unfortunately, GTDK-Tk uses a lot of memory, so it is not feasible to run the tool at the course servers. Instead, the following command has already been to determine the species of the genomes and the results have been moved to <code>/home/projects/microbial_genomics/premade_task_results</code>:
<pre>
gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/task02_assemblies --out_dir ~/microbial_exercise/output_gtdb
mv ~/microbial_exercise/output_gtdb /home/projects/microbial_genomics/premade_task_results/.
</pre>

==== Question 1 ====
What species are the genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH031220x1x74_210120_A00559_AHYKKMDRXX Escherichia ruysiae
SH090321x2x74_210615_A01176_AHFKVKDRXY Escherichia marmotae
SH178x24_220708_NS500314_AHTFHHAFX3 Klebsiella michiganensis
SH179x37_220805_A01176_AHG77VDRX2 Citrobacter braakii
SH182x46_220812_A00559_BHG7JLDRX2 Klebsiella grimontii
SH185x39_220812_A00559_BHG7JLDRX2 Acinetobacter geminorum
SH189x45_220812_A00559_BHG7JLDRX2 Mycobacterium intracellulare
SH204x81_220930_NS500314_AHW7FKAFX3 No species called
SH204x82_220930_NS500314_AHW7FKAFX3 No species called
SH214x27_221104_A00559_AHKYCVDRX2 Citrobacter europaeus
SH214x54_221104_A00559_AHKYCVDRX2 Klebsiella huaxiensis
SH232x83_230120_A01411_AHKYCLDRX2 Enterococcus_B lactis
SH249x46_230306_A01411_AHMF5GDRX2 Klebsiella africana
SH257x03_230317_A00559_AHM5H3DRX2 Enterobacter mori
SH272x52_230512_A00559_AH5MNFDRX3 Stenotrophomonas maltophilia_AJ
SH274x36_230421_A00559_AHYGVWDRX2 Klebsiella michiganensis
SH275x46_230526_A01411_BHC33YDRX3 Klebsiella variicola
SH283x25_230721_A00559_BHGVVNDRX3 Citrobacter braakii
SH292x11_230714_A01411_AHGW7MDRX3 Stenotrophomonas maltophilia_G
SH315x12_230929_A01176_AHKHK3DRX3 No species called
SH333x35_231006_NB501792_AHMVFTAFX5 Enterobacter intestinihominis
SH338x33_231110_A00559_AHMTNJDRX3 Klebsiella quasipneumoniae
SH339x55_231020_NB501792_AHMTK7AFX5 Citrobacter_A sedlakii
SH344x10_231103_A00559_BHMN2KDRX3 Escherichia marmotae
SH344x63_231103_A00559_BHMN2KDRX3 Bacillus_A cereus
SH351x67_231211_A00559_AHMTCTDRX3 Acinetobacter pittii
SH372x20_240212_A01961_AHVFV3DRX3 Acinetobacter fasciculus
SH406x40_240426_NB501792_AH5LK3AFX7 Acinetobacter nosocomialis
SH420x32_240607_A01961_AH7HW2DRX5 Achromobacter ruhlandii
SH429x69_240628_A01961_AHC7NCDRX5 No species called
SH432x68_240705_A00559_AHFKFHDRX5 No species called
SH437x71_240906_A01411_AHGVFVDRX5 Klebsiella planticola
SH443x95_240809_A00559_AHFL5YDRX5 No species called
SH463x12_240927_NS500314_AHHJHMAFX7 Citrobacter portucalensis
SH485x65_241122_A01411_BHM2VFDRX5 No species called
SH490x72_241206_A01411_AHM2CHDRX5 No species called
SH512x30_250321_A01961_BHTMCKDRX5 Acinetobacter fasciculus
SH529x38_250613_LH00793_B22C72LLT1 Klebsiella grimontii
SH545x01_250516_A01411_AHYJHCDRX5 Serratia nevei
SH567x14_250926_LH00255_A22YTGHLT3 Klebsiella variicola
SH623x49_251128_LH00255_B22FYTTLT1 No species called
SH623x50_251128_LH00255_B22FYTTLT1 No species called
</pre>

</div>

GTDB-Tk is unable to call a species for 11 of the genomes. Two of these genomes are SH315x12_230929_A01176_AHKHK3DRX3 and SH490x72_241206_A01411_AHM2CHDRX5.

==== Question 2 ====
What is the average nucleotide identity for the best match for each of these genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
SH315x12_230929_A01176_AHKHK3DRX3 GCA_023446155.1, s__Flavobacterium sp023446155, 90.92%
SH490x72_241206_A01411_AHM2CHDRX5 GCA_021654635.1, s__Mycobacterium kiyosense, 82.85%
</pre>

</div>

== Task03: Subspecies-level genetic typing of bacterial genomes ==

After species identification, it is most often relevant to further genetically group isolates at a subspecies-level. Such groups may be referred to as genotypes, lineages, clone types, clonal complexes, strains, depending on the organism and method used, and methods may differ the approach and the genetic markers that they use for grouping.

Bacterial genomes are commonly grouped using multilocus sequence typing (MLST), which uses the sequences of typically seven gene loci present in all members of a species. The combination of allele numbers assigned to sequence variants at each locus defines the sequence type (ST).

The laboratory has sequenced and de novo assembled the genomes of 36 bacterial isolates of species ''Pseudomonas aeruginosa''. The assembled genomes are available in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

Use tool [https://github.com/tseemann/mlst mlst] by Torsten Seemann to determine the sequence types of the 36 genomes.

Start by inspecting the available options and then run the tool on the genomes:

<pre>
mlst.sh -h
mlst.sh --scheme paeruginosa /home/projects/microbial_genomics/task03_assemblies/*fna > ~/microbial_exercise/output_mlst.txt
</pre>

You will find 3 different allele combinations among the 36 genomes. E.g., use this command to see the allele combinations:
<pre>
cat output_mlst.txt | cut -f4-10 |sort | uniq -c
</pre>

==== Question 1 ====
What allele combinations did you find and what are the correspoonding sequence types ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
Unknown sequence type 6 that has allele combination acsA(38) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
Sequence type 2326 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(202)
Sequence type 207 that has allele combination acsA(47) aroE(4) guaA(5) mutL(33) nuoD(1) ppsA(6) trpE(40)
</pre>

</div>

==== Question 2 ====
How many alleles differ between the sequence types ? (ST207 versus ST2326, ST207 versus the unknown ST, ST2326 versus the unknown ST)

<div class="mw-collapsible mw-collapsed">

==== Solution ====
<pre>
ST207 versus ST2326: 1 allele differs
ST207 versus the unknown ST: 1 allele differs
ST2326 versus the unknown ST: 2 allele differs
</pre>

</div>

== Task04: Whole-genome based determination of genetic relatedness ==

You used sequence information from seven genes in the previous task; nonetheless, we can improve the resolution of our genomic analysis if we make use of the whole genome sequence. Tool [https://github.com/marbl/parsnp Parsnp] can be used align whole genome sequences and to detect single nucleotide substitution variants in the sequence alignments. In the scope of this task, we refer to such substitution variants as single nucleotide polymorphisms (SNPs).

First, view the Parsnp help information:
<pre>
parsnp.sh -h
</pre>

You now want to use Parsnp to compare the 36 ''Pseudomonas aeruginosa'' genomes in <code>/home/projects/microbial_genomics/task03_assemblies</code>.

You can use the following command to compare the genomes with Parsnp:
<pre>
parsnp.sh -r ! -d /home/projects/microbial_genomics/task03_assemblies -o ~/microbial_exercise/output_parsnp -p 10 -c --use-fasttree
</pre>

Above command produces an output file ''parsnp.ggr'' whcih is a compressed representation of the alignment result.

You can use [https://github.com/marbl/harvest-tools Harvesttools] to convert alignment into an alignment represented in a multi-FASTA format that include only SNPs positions:
<pre>
harvesttools.sh -i ~/microbial_exercise/output_parsnp/parsnp.ggr -S ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta
</pre>

You can use [https://github.com/tseemann/snp-dists snp-dists] to convert the multi-FASTA alignment into SNP-distances, i.e. a measure of the genetic relationship (distance) between the genomes. snp-dists finds pairwise SNP-distances that can be presented in a square matrix (standard representation) or in a list (option ''-m''). You can test both with the following commands:
<pre>
snp-dists.sh ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | less -S
</pre>

Note, all pairwise SNP-distances are outputted twice in both the matrix and the list (A versus B and B versus A, respectively).

Parsnp also uses the aligment to make a phylogenetic tree (file ''parsnp.tree'' in output). The tree is in Newick format, and you can, e.g., use [https://itol.embl.de/ iTol] to visualize the tree. Below, I have inserted a visualization of the tree:

[[File:Full_tree.png|1000px]]

The large genetic distances genomes of the unknown ST to both ST207 and ST2326 makes difficult to see the structure of the tree for genomes of the same ST. Below, I have inserted the same tree but without showing genomes of the unknown ST:

[[File:STunknown_removed_tree.png|1200px]]

The genomes are named according their ST and the origin of the bacterial isolate. The origin can either be a person (person 1 to 19) or the hospital environment. You may have noticed that the structure of the phylogenetic tree corresponds well to the ST and origin of genomes.

Now, you want to parse the listed pairwise genetic distances to find out how they are distributed dependent on whether the genomes are from the same ST and/or origin.

We already saw from the phylogenetic tree that genomes of the unknown ST are distant from genomes of ST207 and ST2326. You can use the following command to find the minimum and maximum distances between genomes of the unknown ST and genomes of either ST207 or ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

You can use the following command to find the minimum and maximum distances between genomes of ST207 and ST2326:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v STunknown | sed 's/_/\t/1; s/_/\t/5' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | sed -n '1p;$p'
</pre>

==== Question 1 ====
Do you think that the grouping by MLST reflects well the genetic distances between genomes ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Genomes of the unknown ST have minimum 16,339 SNPs to genomes of ST207 or ST2326, whereas genomes og ST207 are maximum different by 897 SNPs.

Accordingly, the unknown ST is genetically more distant from ST207 and ST2326 than it is apparent from the MLST.

</div>

Below is a slide from today's lecture that shows the distribution of genetic distances (pairwise SNP distances) in a collection of 446 ''Pseudomonas aeruginosa'' genomes.

[[File:Slide_pa_distribution.png|1000px]]

==== Question 2 ====

Based on the shown distribution, do you think that the following statements are true or false:
* A: ST207 and ST2326 are the same strain ?
* B: ST207 and the unknown ST are the same phylogroup ?
* C: ST2326 and the unknown ST are different strains ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
* A: True
* B: True
* C: True

</div>

You can use the following command to find the distances between genomes from the same person:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | grep -v environment | sed 's/_/\t/2; s/_/\t/6' | awk '$1==$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5
</pre>

==== Question 3 ====
If we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2, what is the typical SNP-distance between genomes from bacteria sampled from the same person ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
There are 7 to 122 SNPs between genomes from the same person (if we disregard genome 101174x20C_201204_NB501792_AHJF5LAFX2).

</div>

If two genomes differ by only a small number of SNPs, this suggests they diverged recently from a common ancestor. When such closely related genomes are identified in samples from different persons (patients) within a hospital, it raises the possibility that transmission occurred within the hospital.

You can find the minimum SNP-distance between genomes from different persons with this command:
<pre>
snp-dists.sh -m ~/microbial_exercise/output_parsnp/multi_fasta_alignment_with_SNPs.fasta | sed 's/_/\t/2; s/_/\t/6' | awk '$1!=$3' | awk '$2!=$4' | grep -v fna.ref | sort -nrk5 | grep -v environment | tail -n1
</pre>

==== Question 4 ====
Do some of the observed SNP-distances indicate that there has been transmission of ''Pseudomonas aeruginosa'' at the hospital ?

<div class="mw-collapsible mw-collapsed">

==== Solution ====
Yes, some genomes from different persons have pairwise SNP-distances that are low relative the the distances between genomes from the same person. A quick search for information on the world wide web may suggest that for ''Pseudomonas aeruginosa'', SNP-distances less than around 30 SNPs indicates transmission at the hospital.

</div>

== Task05: Detection of antimicrobial resistance genes ==

Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.

The tool [https://github.com/tseemann/abricate Abricate] can be used to screen genome assemblies against curated resistance gene databases.

Inspect the available options and databases with the following commands:
<pre>
abricate.sh -h
abricate.sh --list
</pre>

Run Abricate on the assembled genomes using the Resfinder database:

<pre>
abricate.sh --db resfinder /home/projects/microbial_genomics/task03_assemblies/ST2326*.fna
</pre>

==== Question 1 ====

==== Question 2 ====

== Supplementary information and files ==

[https://teaching.healthtech.dtu.dk/22126/images/c/cd/Gtdbtk.pdf Article describing GTDB-Tk can be found here]

The BASH executables in <code>/home/ctools/bin</code> uses Apptainer container images (.sif file format) that are located <code>/home/projects/microbial_genomics/singularity_image_files</code>.

Microbial genomics exercise

2026-01-11T21:29:56Z