Microbial genomics exercise
Introduction
Dear course participants,
In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.
The tools are available on the server as Apptainer container images.
The image files (.sif format) are located in /home/projects/microbial_genomics/singularity_image_files, but you do not need to call them directly. Instead, run the tools through the provided Bash wrapper scripts in /home/ctools/bin, a directory that is already on your default $PATH. The wrapper scripts are:
kraken.sh kraken_report.sh gtdbtk.sh mlst.sh parsnp.sh abricate.sh
Now, let us imagine that we are employed at a hospital to provide diagnostics for patient care...
Task01: XXX
Your hospital directorship has found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University Hospital and Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".
Link to research article by Nielsen et al.
Nielsen et al. have deposited microbial DNA sequencing data from positive metagenomic next-generation sequencing tests in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under BioProject PRJNA1108520.
The hospital directorship has used the SRA Toolkit to download the data to /home/projects/microbial_genomics/task01_sequence_reads.
The hospital laboratory has sequenced genomic DNA from a clinical specimen. The paired-end sequencing reads are stored in FASTQ files that are compressed with gzip. The files can be found in /home/projects/microbial_genomics/sequence_reads.
| sra_run_accession | sample_id | patient_id |
|---|---|---|
| SRR28959350 | s001 | p001 |
| SRR28959349 | s002 | p002 |
| SRR28959338 | s004 | p020 |
| SRR28959334 | s006 | p028 |
| SRR28959333 | s008 | p049 |
| SRR28959332 | s009 | p072 |
| SRR28959331 | s013 | p091 |
| SRR28959330 | s014 | p098 |
| SRR28959329 | s015 | p104 |
| SRR28959328 | s016 | p105 |
| SRR28959348 | s029 | p019 |
| SRR28959347 | s031 | p092 |
| SRR28959346 | s033 | p127 |
| SRR28959345 | s034 | p128 |
| SRR28959344 | s041 | p139 |
| SRR28959343 | s042 | p140 |
| SRR28959342 | s043 | p141 |
| SRR28959341 | s044 | p143 |
| SRR28959335 | s057 | p164 |
| SRR28959340 | s058 | p172 |
| SRR28959339 | s059 | p173 |
| SRR28959337 | s060 | p175 |
| SRR28959336 | s061 | p183 |
Inspect the content of one of the files by reading its first lines, for example:
zcat /home/projects/microbial_genomics/sequence_reads/SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz | head
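Beyond looking at the first records, it can be useful to know how many reads a file holds and how long they are. The following is a minimal sketch, assuming the usual four-line FASTQ layout without wrapped sequence lines. The function name fastq_stats and the tiny demo file are made up for illustration and are not part of the course setup.

```shell
# Summarise a gzip-compressed FASTQ file: number of reads and mean read length.
# Assumption: every record is exactly four lines (header, sequence, +, quality).
fastq_stats() {
    gzip -dc < "$1" | awk '
        NR % 4 == 2 { reads++; bases += length($0) }   # line 2 of each record is the sequence
        END {
            if (reads > 0) printf "reads=%d mean_length=%.1f\n", reads, bases / reads
            else           print  "reads=0 mean_length=NA"
        }'
}

# Made-up two-read example so the function can be tried without the course data.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' | gzip -c > "$demo"
fastq_stats "$demo"      # prints: reads=2 mean_length=5.0
rm -f "$demo"
```

On the server, the same function could be pointed at any of the .fastq.gz files in the directory given above.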
Classify the reads of one sequencing run with Kraken, then summarise the per-read classifications in a report:
kraken.sh --db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959329.fastq > ~/SRR28959329.kraken.out.txt
kraken_report.sh --db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/SRR28959329.kraken.out.txt > ~/SRR28959329.kraken.report.txt
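To process several runs, the two Kraken steps can be generated for each run accession. The sketch below only prints the commands (a dry run) so that they can be checked before anything is executed. The function name kraken_commands is hypothetical; the script names follow the list of executables in the introduction, and the paths are those given in this task.

```shell
# Print (do not run) the Kraken commands for run accessions read from stdin.
# Assumption: each run is stored as <accession>.fastq in the task01 directory.
kraken_commands() {
    db=/home/projects/microbial_genomics/minikraken_20171019_8GB
    reads=/home/projects/microbial_genomics/task01_sequence_reads
    while read -r acc; do
        [ -n "$acc" ] || continue      # skip empty lines
        printf '%s\n' "kraken.sh --db $db $reads/$acc.fastq > \$HOME/$acc.kraken.out.txt"
        printf '%s\n' "kraken_report.sh --db $db \$HOME/$acc.kraken.out.txt > \$HOME/$acc.kraken.report.txt"
    done
}

# Two accessions from the table above as an example.
printf 'SRR28959350\nSRR28959349\n' | kraken_commands
```

Once the printed commands look right, piping the output to sh would run them one after another.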
Task02: What species is the genome?
The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from a clinical specimen. The sequence reads have been de novo assembled, and the genome assemblies are stored in FASTA-formatted files available in /home/projects/microbial_genomics/genome_assemblies.
You want to determine the bacterial species of the assembled genomes.
We can use the GTDB-Tk tool to assign taxonomic classifications to bacterial genomes based on the Genome Taxonomy Database (GTDB).
Run gtdbtk.sh -h to get help information on how to use GTDB-Tk.
Use GTDB-Tk to determine the species of the genomes:
gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/ex02_assemblies --out_dir $HOME/output
Question: What species are the genomes?
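GTDB-Tk reports the full lineage of each genome in a tab-separated summary table. A minimal sketch for pulling out just the species is shown here, assuming that column 1 holds the genome name and column 2 the semicolon-separated GTDB lineage ending in an s__ field. The function name gtdb_species and the demo row are made up for illustration.

```shell
# Print "genome <TAB> species" from a GTDB-Tk summary table.
# Assumptions: header on line 1, genome name in column 1, lineage in column 2.
gtdb_species() {
    awk -F'\t' 'NR > 1 {
        n = split($2, rank, ";")
        species = "unassigned"                      # used when the s__ field is empty
        for (i = 1; i <= n; i++)
            if (rank[i] ~ /^s__./) species = substr(rank[i], 4)
        print $1 "\t" species
    }' "$1"
}

# Made-up one-genome example.
demo=$(mktemp)
printf 'user_genome\tclassification\n' > "$demo"
printf 'isolate_A\td__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli\n' >> "$demo"
gtdb_species "$demo"
rm -f "$demo"
```

For bacterial genomes the summary is usually named gtdbtk.bac120.summary.tsv inside the output directory. That file name is an assumption about the installed GTDB-Tk version, so check the contents of $HOME/output first.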
Task03: What sequence type is the genome?
In addition to species identification, sequence typing is commonly used in clinical microbiology to compare isolates and support outbreak investigations.
Multilocus Sequence Typing (MLST) assigns isolates to a sequence type (ST) based on the allelic profiles of a defined set of housekeeping genes.
The assembled genomes are available in:
/home/projects/microbial_genomics/ex02_assemblies
Run the MLST tool to determine the sequence type of each genome.
Start by inspecting the available options:
mlst.sh -h
Then run MLST on the genome assemblies:
mlst.sh /home/projects/microbial_genomics/ex02_assemblies/*.fna
Questions:
- What MLST scheme is used for each genome?
- What sequence type (ST) is assigned to each isolate?
- Are all genomes assigned to the same ST?
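With many genomes, a tally per scheme and ST makes it easier to see whether all isolates share one sequence type. The sketch below assumes the tab-separated output of the mlst tool (file, scheme, ST, then one column per allele). The function name mlst_tally and the three example rows are made up.

```shell
# Count genomes per (scheme, ST) from mlst output read on stdin.
# Assumption: column 2 is the scheme and column 3 the sequence type.
mlst_tally() {
    awk -F'\t' '{ print $2 "\t" $3 }' | sort | uniq -c | sort -rn
}

# Made-up example: two isolates with one ST, one isolate with another.
printf 'isolate_A.fna\tecoli\t131\tadk(53)\nisolate_B.fna\tecoli\t131\tadk(53)\nisolate_C.fna\tecoli\t73\tadk(36)\n' | mlst_tally
```

In the exercise, the output of the mlst.sh command above could be piped straight into mlst_tally.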
---
Task04: Which antimicrobial resistance genes are present?
Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.
The tool ABRicate can be used to screen genome assemblies against curated resistance gene databases.
Run ABRicate on the assembled genomes using a resistance gene database.
First, inspect the available options and databases:
abricate.sh -h
abricate.sh --list
Then screen the genomes using the ResFinder database:
abricate.sh --db resfinder /home/projects/microbial_genomics/ex02_assemblies/*.fna
Questions:
- Which antimicrobial resistance genes are detected in each genome?
- Are the resistance profiles identical across the isolates?
- Based on the detected genes, which antibiotic classes might be ineffective?
---
Task05: How closely related are the genomes?
Whole-genome comparisons are frequently used to assess the relatedness of bacterial isolates, for example during suspected outbreaks.
The tool Parsnp performs core-genome alignment and identifies single nucleotide polymorphisms (SNPs) between closely related genomes.
Use Parsnp to compare the assembled genomes.
First, view the help information:
parsnp.sh -h
Then run Parsnp using one genome as the reference:
parsnp.sh -r /home/projects/microbial_genomics/ex02_assemblies/genome1.fna \
-d /home/projects/microbial_genomics/ex02_assemblies \
-o $HOME/parsnp_out
Parsnp produces a core-genome alignment and a phylogenetic tree.
Questions:
- How many SNPs separate the isolates?
- Do the genomes cluster closely together?
- Based on the results, do the isolates appear to be clonally related?
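Pairwise SNP counts can be computed from an alignment by comparing two sequences position by position. The sketch below assumes the core-genome alignment has already been exported as an aligned multi-FASTA file. How to export it depends on the Parsnp version installed and is not shown here. The function name pairwise_snps and the toy alignment are made up, and positions with a gap or N in either sequence are skipped.

```shell
# Count pairwise nucleotide differences in an aligned multi-FASTA read from stdin.
# Assumption: all sequences have the same length (they come from one alignment).
pairwise_snps() {
    awk '
        /^>/ { name[++n] = substr($0, 2); next }
        n    { seq[n] = seq[n] toupper($0) }        # join wrapped sequence lines
        END {
            for (i = 1; i < n; i++)
                for (j = i + 1; j <= n; j++) {
                    d = 0
                    len = length(seq[i])
                    for (k = 1; k <= len; k++) {
                        a = substr(seq[i], k, 1)
                        b = substr(seq[j], k, 1)
                        if (a != b && a != "-" && b != "-" && a != "N" && b != "N") d++
                    }
                    print name[i] "\t" name[j] "\t" d
                }
        }'
}

# Toy alignment of three sequences.
printf '>g1\nACGTACGT\n>g2\nACGTACGA\n>g3\nAC-TACTA\n' | pairwise_snps
```

This plain awk loop is meant to make the counting explicit. For alignments of several megabases it will be slow, and a dedicated tool is likely the better choice.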
---
Summary
In this exercise you have:
- Identified the species of bacterial genomes using GTDB-Tk
- Determined sequence types using MLST
- Screened for antimicrobial resistance genes using ABRicate
- Assessed genomic relatedness using Parsnp
Together, these analyses reflect a typical bioinformatics workflow used in microbial diagnostics and epidemiological investigations.
EX06: What organisms are present in the sequencing reads?
In some diagnostic scenarios, genome assembly is not immediately available or desirable. Instead, raw sequencing reads can be classified directly to identify the organisms present in a clinical sample.
In this exercise you will use Kraken to perform taxonomic classification of sequencing reads from a publicly available dataset.
Data source
The sequencing data originate from the NCBI BioProject:
https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA1108520
The dataset contains paired-end Illumina sequencing reads.
Downloading the sequencing data
We will use the NCBI SRA Toolkit to download the data.
First, create a working directory:
mkdir -p $HOME/ex06_kraken
cd $HOME/ex06_kraken
Download one of the sequencing runs associated with the BioProject (example run accession shown below):
prefetch SRRXXXXXXX
fasterq-dump SRRXXXXXXX --split-files
gzip SRRXXXXXXX_1.fastq SRRXXXXXXX_2.fastq
This will produce two compressed FASTQ files:
SRRXXXXXXX_1.fastq.gz
SRRXXXXXXX_2.fastq.gz
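Before classification, it is worth confirming that both mate files contain the same number of reads, since a truncated download would break the pairing. The following is a minimal sketch, assuming four-line FASTQ records. The function names count_reads and check_pairs and the demo files are made up for illustration.

```shell
# Confirm that the two mate files of a paired-end run hold the same number of reads.
# Assumption: every FASTQ record is exactly four lines.
count_reads() { gzip -dc < "$1" | awk 'END { print NR / 4 }'; }

check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" = "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 vs $n2 reads"
        return 1
    fi
}

# Made-up example with two reads in each mate file.
r1=$(mktemp); r2=$(mktemp)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n' | gzip -c > "$r1"
printf '@r1/2\nTTGA\n+\nIIII\n@r2/2\nTTGA\n+\nIIII\n' | gzip -c > "$r2"
check_pairs "$r1" "$r2"      # prints: OK: 2 read pairs
rm -f "$r1" "$r2"
```

For the downloaded run, the two arguments would be the _1.fastq.gz and _2.fastq.gz files listed above.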
Inspecting the reads
Inspect the first few lines of one of the FASTQ files:
zcat SRRXXXXXXX_1.fastq.gz | head
Question:
- What information is stored in each FASTQ record?
---
Taxonomic classification using Kraken
Kraken performs taxonomic classification by comparing k-mers from sequencing reads to a reference database.
A pre-built Kraken database is available on the system.
Start by viewing the help information:
kraken2.sh -h
Run Kraken on the paired-end reads:
kraken2.sh \
--db /home/projects/microbial_genomics/kraken_db \
--paired SRRXXXXXXX_1.fastq.gz SRRXXXXXXX_2.fastq.gz \
--report kraken_report.txt \
--output kraken_output.txt
The output consists of:
- A classification result for each read
- A summary report of taxonomic assignments
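The per-read output can be summarised directly, for example to see what share of the reads was classified at all. This sketch assumes that the first tab-separated column of each line is C (classified) or U (unclassified). The function name classified_fraction and the four example lines are made up.

```shell
# Report how many reads were classified, from the per-read Kraken output on stdin.
# Assumption: column 1 is "C" for classified and "U" for unclassified reads.
classified_fraction() {
    awk -F'\t' '
        $1 == "C" { c++ }
        $1 == "U" { u++ }
        END {
            total = c + u
            if (total == 0) { print "no reads"; exit }
            printf "classified=%d unclassified=%d fraction=%.3f\n", c, u, c / total
        }'
}

# Made-up example: three classified reads and one unclassified read.
printf 'C\tread1\t562\t150\t562:120\nC\tread2\t1280\t150\t1280:120\nU\tread3\t0\t150\t0:120\nC\tread4\t562\t150\t562:120\n' | classified_fraction
# prints: classified=3 unclassified=1 fraction=0.750
```

Applied to the exercise, the input would be the file written by --output (kraken_output.txt).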
---
Interpreting the Kraken report
Examine the Kraken report:
less kraken_report.txt
Questions:
- Which species are most abundant in the sample?
- Are multiple organisms detected?
- Do the results suggest contamination or a mixed infection?
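Scrolling through the report works for small samples, but a sorted list of species is often quicker to read. The sketch below assumes the standard six-column report (percentage, reads in clade, reads assigned directly, rank code, NCBI taxonomy ID, indented name) and keeps only rows with rank code S. The function name top_species and the example report are made up.

```shell
# Show the N species with the most reads from a Kraken report on stdin (default N = 5).
# Output columns: reads in clade, percentage of reads, species name.
top_species() {
    tab=$(printf '\t')
    awk -F'\t' '$4 == "S" {
        pct = $1;  gsub(/ /, "", pct)       # percentage is padded with spaces
        cnt = $2;  gsub(/ /, "", cnt)
        name = $6; sub(/^ +/, "", name)     # name is indented by taxonomic depth
        print cnt "\t" pct "\t" name
    }' | sort -t "$tab" -k1,1nr | head -n "${1:-5}"
}

# Made-up report with two species.
printf ' 10.00\t100\t100\tU\t0\tunclassified\n 90.00\t900\t0\tR\t1\troot\n 60.00\t600\t600\tS\t562\t        Escherichia coli\n 30.00\t300\t300\tS\t1280\t        Staphylococcus aureus\n' | top_species 2
```

For the exercise data this could be run as top_species 10 < kraken_report.txt.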
---
Optional: Estimating classification confidence
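Kraken can require a minimum level of support before it accepts a classification. With --confidence, a read keeps a taxonomic label only if at least that fraction of its k-mers maps to the labelled taxon or its descendants; otherwise the label moves up the taxonomy, or the read is left unclassified.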
Re-run Kraken with a confidence threshold:
kraken2.sh \
--db /home/projects/microbial_genomics/kraken_db \
--paired SRRXXXXXXX_1.fastq.gz SRRXXXXXXX_2.fastq.gz \
--confidence 0.1 \
--report kraken_report_confidence.txt \
--output kraken_output_confidence.txt
Questions:
- How does applying a confidence threshold affect the results?
- Which classifications are most sensitive to filtering?
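To see the effect of the threshold, the species-level read counts of the two reports can be put side by side. This is a minimal sketch under the same six-column report assumption as before. The function name compare_reports and the two small demo reports are made up, and a species missing from one report is shown with a count of 0.

```shell
# Compare species-level read counts between two Kraken reports.
# Output columns: species name, reads in first report, reads in second report.
compare_reports() {
    awk -F'\t' '
        $4 == "S" {
            name = $6; sub(/^ +/, "", name)
            if (FILENAME == ARGV[1]) a[name] = $2 + 0
            else                     b[name] = $2 + 0
            seen[name] = 1
        }
        END { for (s in seen) print s "\t" (a[s] + 0) "\t" (b[s] + 0) }
    ' "$1" "$2" | sort
}

# Made-up reports: the second one has lost all reads for one species.
before=$(mktemp); after=$(mktemp)
printf ' 60.00\t600\t600\tS\t562\t    Escherichia coli\n 30.00\t300\t300\tS\t1280\t    Staphylococcus aureus\n' > "$before"
printf ' 55.00\t550\t550\tS\t562\t    Escherichia coli\n' > "$after"
compare_reports "$before" "$after"
rm -f "$before" "$after"
```

With the files from this exercise the call would be compare_reports kraken_report.txt kraken_report_confidence.txt.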
---
Summary
In this exercise you have:
- Downloaded sequencing reads from NCBI SRA
- Inspected raw FASTQ data
- Classified sequencing reads using Kraken
- Interpreted taxonomic profiles from a clinical sample
This workflow reflects a common first step in metagenomic and diagnostic sequencing analyses.
Supplementary files