Microbial genomics exercise
Revision as of 14:40, 8 January 2026
Introduction
Dear course participants,
In this exercise you will analyse microbial genome sequences using bioinformatics tools that are commonly used for microbial diagnostics and research.
The tools are available on the server as Apptainer container images.
The image files (.sif file format) are located in /home/projects/microbial_genomics/singularity_image_files, but you do not need to call them directly. Instead, you can run the tools using the provided BASH executables in /home/ctools/bin, which is already on your default $PATH. The names of the BASH executables are:
kraken.sh kraken_report.sh gtdbtk.sh mlst.sh parsnp.sh abricate.sh
Now, let us imagine that we are employed at a hospital to provide diagnostics for patient care...
Task01: Taxonomic assignment of sequencing reads for detection of microbial pathogens
You have found a great prospective observational research study from May 2025 by Nielsen et al. from Aalborg University on the "Application of rapid Nanopore metagenomic cell-free DNA sequencing to diagnose bloodstream infections".
Link to research article by Nielsen et al.
Nielsen et al. have deposited microbial DNA sequencing data from 23 positive metagenomic next-generation sequencing (mNGS) tests in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under BioProject PRJNA1108520.
You have used the prefetch and fasterq-dump tools from the SRA Toolkit to download the data to /home/projects/microbial_genomics/task01_sequence_reads.
Here are the commands that you used to acquire the data (this has already been done, but you are welcome to try it yourself in your home directory):
cd $HOME
prefetch SRRXXXXXXXX
fasterq-dump SRRXXXXXXXX
The files are named according to the SRA run accession numbers (SRRXXXXXXXX). After email communication with Nielsen et al., you are able to translate run accession numbers to the patient identifiers used in Table S3 (Detailed metadata on the assessment of relevance of mNGS findings) in the article supplementary information available here.
Table translating run accession numbers to patient identifiers:
| sra_run_accession | sample_id | patient_id |
|---|---|---|
| SRR28959350 | s001 | p001 |
| SRR28959349 | s002 | p002 |
| SRR28959338 | s004 | p020 |
| SRR28959334 | s006 | p028 |
| SRR28959333 | s008 | p049 |
| SRR28959332 | s009 | p072 |
| SRR28959331 | s013 | p091 |
| SRR28959330 | s014 | p098 |
| SRR28959329 | s015 | p104 |
| SRR28959328 | s016 | p105 |
| SRR28959348 | s029 | p019 |
| SRR28959347 | s031 | p092 |
| SRR28959346 | s033 | p127 |
| SRR28959345 | s034 | p128 |
| SRR28959344 | s041 | p139 |
| SRR28959343 | s042 | p140 |
| SRR28959342 | s043 | p141 |
| SRR28959341 | s044 | p143 |
| SRR28959335 | s057 | p164 |
| SRR28959340 | s058 | p172 |
| SRR28959339 | s059 | p173 |
| SRR28959337 | s060 | p175 |
| SRR28959336 | s061 | p183 |
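The translation table can also be used from the command line. The following is a minimal sketch, assuming the table has been saved as a whitespace-separated text file; the variable `map_file` and the function `patient_for_run` are hypothetical names chosen for this illustration, and only three rows of the table are included.

```shell
# Hypothetical mapping file holding three rows copied from the table above.
# Add the remaining rows in the same "run sample patient" layout as needed.
map_file=$(mktemp)
cat > "$map_file" <<'EOF'
SRR28959350 s001 p001
SRR28959349 s002 p002
SRR28959339 s059 p173
EOF

# Print the patient identifier that belongs to a run accession.
# Prints nothing when the accession is not in the mapping file.
patient_for_run() {
    awk -v run="$1" '$1 == run { print $3 }' "$map_file"
}

patient_for_run SRR28959339   # prints: p173
```

With the patient identifier in hand, the corresponding row in Table S3 can be looked up directly.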
The article does not mention how many microbial DNA sequencing reads there are in each of the 23 positive mNGS tests.
Question 1
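Check how many reads are in the FASTQ file for each of the run accessions, e.g., by using the following command, which counts the lines in each file and divides by four (each FASTQ record spans four lines):
for f in /home/projects/microbial_genomics/task01_sequence_reads/*.fastq; do echo "$f $(( $(wc -l < "$f") / 4 ))"; done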
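A line-based count is a robust way to answer this question. FASTQ header lines start with "@", but quality strings may also contain "@" or ">", so counting lines that match a single character can give misleading numbers. The sketch below assumes the standard four-line FASTQ layout; the function name `count_reads` and the small demo file are made up for illustration.

```shell
# Count reads in a FASTQ file (plain or gzip-compressed).
# Assumes four lines per record: header, sequence, "+", qualities.
count_reads() {
    case "$1" in
        *.gz) lines=$(gzip -cd "$1" | wc -l) ;;
        *)    lines=$(wc -l < "$1") ;;
    esac
    echo $(( lines / 4 ))
}

# Made-up two-read FASTQ file so the function can be tried anywhere.
# Note that the first quality string starts with "@" and contains ">".
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\n@>II\n@read2\nGGCC\n+\nIIII\n' > "$demo_dir/demo.fastq"

count_reads "$demo_dir/demo.fastq"   # prints: 2
```

On the course server the same function can be called in a loop over the files in /home/projects/microbial_genomics/task01_sequence_reads.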
Now you want to determine which species the sequenced DNA comes from. For this, you can use Kraken, a tool for taxonomic classification of DNA sequence reads. Kraken splits each sequence read into k-mers and matches them against a reference database that stores, for each k-mer, the lowest common ancestor of all organisms whose genomes contain that k-mer. After the matching, Kraken assigns the read to the taxon that receives the strongest cumulative k-mer support. Accordingly, Kraken's performance depends on the sequences and the taxonomic classification of the genomes that were used to build the database. In this exercise, we use Kraken 1 (Kraken also comes in a Kraken 2 version) with the relatively small MiniKraken DB_8GB database.
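To make the k-mer idea concrete, the sketch below prints every k-mer of a short sequence. It only illustrates the splitting step, not Kraken's database lookup or scoring. The function name `kmers`, the toy read and the small value of k are chosen for readability; Kraken 1 uses a much longer k (31 by default).

```shell
# Print every k-mer (substring of length k) of a sequence, one per line.
# A sequence of length L yields L - k + 1 k-mers, or none if L < k.
kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        for (i = 1; i + k - 1 <= length($0); i++)
            print substr($0, i, k)
    }'
}

# Toy read of length 6 with k = 4 gives 3 k-mers: ACGT, CGTA, GTAC.
kmers ACGTAC 4
```

Each of these k-mers would then be looked up in the database, and the taxa they map to are combined to classify the read.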
Run Kraken for 3 of the sample files from Nielsen et al., e.g., run Kraken on SRR28959339:
kraken.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB /home/projects/microbial_genomics/task01_sequence_reads/SRR28959339.fastq > ~/SRR28959339.kraken.out.txt
kraken-report.sh -db /home/projects/microbial_genomics/minikraken_20171019_8GB ~/SRR28959339.kraken.out.txt > ~/SRR28959339.kraken.report.txt
The output files contain:
- .kraken.out.txt: the classification result for each read
- .kraken.report.txt: a summary report of the taxonomic assignments
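The report file can be long, so it helps to pull out the species-level rows with the most reads. The sketch below assumes the usual tab-separated Kraken report layout (percentage, reads in clade, reads assigned directly, rank code, NCBI taxonomy ID, indented name). The function name `top_species` is hypothetical, and the demo report contains made-up read counts purely for illustration.

```shell
# List species-level rows of a Kraken report, sorted by reads in the clade.
# Usage: top_species REPORT [N]   (N defaults to 5)
top_species() {
    awk -F '\t' '$4 == "S" {
        name = $6
        sub(/^ +/, "", name)   # remove the indentation that encodes tree depth
        print $2 "\t" name
    }' "$1" | sort -k1,1nr | head -n "${2:-5}"
}

# Made-up report with invented counts, only for trying out the function.
report=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    ' 10.00' 10 10 U 0    'unclassified' \
    ' 90.00' 90 0  - 1    'root' \
    ' 30.00' 30 30 S 1280 '  Staphylococcus aureus' \
    ' 60.00' 60 60 S 562  '  Escherichia coli' > "$report"

top_species "$report"
```

On the server, the function can be pointed at ~/SRR28959339.kraken.report.txt to get a shortlist to compare against Table S3.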
Question 2
How does your result align with the mNGS result in Table S3 from Nielsen et al.?
The MiniKraken DB_8GB database used in this exercise is around 8 gigabytes in size. Nielsen et al. use another approach to taxonomically classify reads, and their approach uses a database consisting of Genome Taxonomy Database (GTDB) representative bacterial and archaeal genomes from release 207 as well as virus (complete genomes) and fungi (all genomes) from NCBI RefSeq release 215.
Question 3
According to this webpage (https://benlangmead.github.io/aws-indexes/k2), what is the size of a Kraken 2 database based on GTDB release 226?
Your hospital has not yet implemented mNGS, but it already uses NGS to sequence the genomes of bacterial isolates cultivated from clinical specimens. Accordingly, the hospital laboratory has sequenced the genome of a Staphylococcus aureus isolate and has made 100,000 of the sequence reads available to you in /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz.
Read the first lines of the file to inspect its content, e.g., with this command:
zcat /home/projects/microbial_genomics/task01_sequence_reads/100K_cap_SH631x88_251212_LH00793_A22G7YYLT1_R1.fastq.gz | head
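Beyond looking at the first records, a quick summary of read count and mean read length gives a feel for the data. The sketch below assumes the four-line FASTQ layout, in which the sequence is the second line of every record. The function name `read_length_summary` and the demo file are made up for illustration.

```shell
# Summarise a FASTQ file (plain or gzip-compressed): number of reads
# and mean read length, taken from the second line of each record.
read_length_summary() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'NR % 4 == 2 { n++; total += length($0) }
                END { if (n > 0) printf "reads=%d mean_length=%.1f\n", n, total / n }'
}

# Made-up gzipped FASTQ file with reads of length 4 and 6.
fq_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCCAA\n+\nIIIIII\n' | gzip -c > "$fq_dir/demo.fastq.gz"

read_length_summary "$fq_dir/demo.fastq.gz"   # prints: reads=2 mean_length=5.0
```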
Task02: What species is the genome?
The laboratory has sequenced genomic DNA from single-colony isolates of bacteria cultivated from a clinical specimen. The sequence reads have been de novo assembled, and the genome assemblies are stored in FASTA-formatted files available in /home/projects/microbial_genomics/ex02_assemblies.
You want to determine the bacterial species of the assembled genomes.
We can use the GTDB-Tk tool to assign taxonomic classifications to bacterial genomes based on the Genome Taxonomy Database (GTDB).
Run gtdbtk.sh -h to get help information on how to use GTDB-Tk.
Use GTDB-Tk to determine the species of the genomes:
gtdbtk.sh classify_wf --extension .fna --cpus 10 --genome_dir /home/projects/microbial_genomics/ex02_assemblies --out_dir $HOME/output
Question: Which species do the genomes belong to?
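GTDB-Tk writes its classifications to a tab-separated summary table, and the species can be read from the last rank of the classification string. The sketch below assumes that the first two columns are user_genome and classification, and that the bacterial summary is named gtdbtk.bac120.summary.tsv inside the output directory; check your own output directory, since the name may differ between versions. The function name `species_from_summary` is hypothetical, and the demo row is an invented example.

```shell
# Print "genome<TAB>species" for every row of a GTDB-Tk summary table.
# The species is the last ";"-separated rank with its "s__" prefix removed.
species_from_summary() {
    awk -F '\t' 'NR > 1 {
        n = split($2, ranks, ";")
        species = ranks[n]
        sub(/^s__/, "", species)
        print $1 "\t" (species == "" ? "(no species assigned)" : species)
    }' "$1"
}

# Invented two-column summary, only for trying out the function.
summary=$(mktemp)
printf 'user_genome\tclassification\n' > "$summary"
printf 'genome1\td__Bacteria;p__Bacillota;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus\n' >> "$summary"
printf 'genome2\td__Bacteria;p__Bacillota;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__\n' >> "$summary"

species_from_summary "$summary"
```

An empty species rank means that GTDB-Tk could place the genome in a genus but not in a known species.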
Task03: What sequence type is the genome?
In addition to species identification, sequence typing is commonly used in clinical microbiology to compare isolates and support outbreak investigations.
Multilocus Sequence Typing (MLST) assigns isolates to a sequence type (ST) based on the allelic profiles of a defined set of housekeeping genes.
The assembled genomes are available in:
/home/projects/microbial_genomics/ex02_assemblies
Run the MLST tool to determine the sequence type of each genome.
Start by inspecting the available options:
mlst.sh -h
Then run MLST on the genome assemblies:
mlst.sh /home/projects/microbial_genomics/ex02_assemblies/*.fna
Questions:
- What MLST scheme is used for each genome?
- What sequence type (ST) is assigned to each isolate?
- Are all genomes assigned to the same ST?
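The last question can be answered directly from the MLST output. The sketch below assumes that the output has been saved to a file and is tab-separated with the file name, scheme and ST in the first three columns, followed by the alleles. The function name `count_distinct_sts` is hypothetical, and the demo rows are invented and show only the first three columns.

```shell
# Count how many distinct scheme/ST combinations occur in saved MLST output.
# A result of 1 means that all genomes share the same sequence type.
count_distinct_sts() {
    cut -f2,3 "$1" | sort -u | awk 'END { print NR }'
}

# Invented MLST rows (file, scheme, ST), only for trying out the function.
mlst_out=$(mktemp)
printf 'genome1.fna\tsaureus\t8\n'  > "$mlst_out"
printf 'genome2.fna\tsaureus\t8\n' >> "$mlst_out"
printf 'genome3.fna\tsaureus\t5\n' >> "$mlst_out"

count_distinct_sts "$mlst_out"   # prints: 2
```

To use it on real results, redirect the output of the mlst.sh command to a file in your home directory first.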
---
Task04: Which antimicrobial resistance genes are present?
Detection of antimicrobial resistance (AMR) genes is an important part of microbial diagnostics.
The tool ABRicate can be used to screen genome assemblies against curated resistance gene databases.
Run ABRicate on the assembled genomes using a resistance gene database.
First, inspect the available options and databases:
abricate.sh -h
abricate.sh --list
Then screen the genomes using the ResFinder database:
abricate.sh --db resfinder /home/projects/microbial_genomics/ex02_assemblies/*.fna
Questions:
- Which antimicrobial resistance genes are detected in each genome?
- Are the resistance profiles identical across the isolates?
- Based on the detected genes, which antibiotic classes might be ineffective?
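Comparing resistance profiles is easier with one line per genome. The sketch below assumes that the ABRicate output has been saved to a file, that it is tab-separated, and that its header line starts with "#" and contains a column named GENE. The column is located by name because its position can differ between versions. The function name `genes_per_genome` is hypothetical, and the demo table is invented and shortened to six columns.

```shell
# Print "file<TAB>gene1,gene2,..." for each genome in saved ABRicate output.
genes_per_genome() {
    awk -F '\t' '
        /^#/ {                       # header line: find the GENE column
            for (i = 1; i <= NF; i++) if ($i == "GENE") gene_col = i
            next
        }
        gene_col {                   # data line: append the gene to its genome
            genes[$1] = (genes[$1] == "" ? "" : genes[$1] ",") $gene_col
        }
        END { for (f in genes) print f "\t" genes[f] }
    ' "$1" | sort
}

# Invented, shortened ABRicate table, only for trying out the function.
abr=$(mktemp)
printf '#FILE\tSEQUENCE\tSTART\tEND\tSTRAND\tGENE\n'      > "$abr"
printf 'genome1.fna\tcontig1\t100\t900\t+\tmecA\n'       >> "$abr"
printf 'genome1.fna\tcontig2\t50\t800\t+\tblaZ\n'        >> "$abr"
printf 'genome2.fna\tcontig1\t100\t900\t+\tblaZ\n'       >> "$abr"

genes_per_genome "$abr"
```

Genomes whose lines differ in this listing have different resistance gene profiles.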
---
Task05: How closely related are the genomes?
Whole-genome comparisons are frequently used to assess the relatedness of bacterial isolates, for example during suspected outbreaks.
The tool Parsnp performs core-genome alignment and identifies single nucleotide polymorphisms (SNPs) between closely related genomes.
Use Parsnp to compare the assembled genomes.
First, view the help information:
parsnp.sh -h
Then run Parsnp using one genome as the reference:
parsnp.sh -r /home/projects/microbial_genomics/ex02_assemblies/genome1.fna \
-d /home/projects/microbial_genomics/ex02_assemblies \
-o $HOME/parsnp_out
Parsnp produces a core-genome alignment and a phylogenetic tree.
Questions:
- How many SNPs separate the isolates?
- Do the genomes cluster closely together?
- Based on the results, do the isolates appear to be clonally related?
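To illustrate what "SNPs separating the isolates" means, the sketch below counts pairwise nucleotide differences in a small aligned multi-FASTA file. It is not Parsnp's own method, and converting Parsnp output into such an alignment is not shown here. Positions with gaps or ambiguous bases are skipped. The function name `pairwise_differences` and the toy alignment are made up for illustration.

```shell
# Count differing A/C/G/T positions for every pair of aligned sequences.
# Input: multi-FASTA alignment in which all sequences have equal length.
pairwise_differences() {
    awk '
        /^>/ { n++; name[n] = substr($0, 2); next }
        { seq[n] = seq[n] toupper($0) }
        END {
            for (i = 1; i < n; i++)
                for (j = i + 1; j <= n; j++) {
                    d = 0
                    for (p = 1; p <= length(seq[i]); p++) {
                        a = substr(seq[i], p, 1)
                        b = substr(seq[j], p, 1)
                        if (a != b && a ~ /[ACGT]/ && b ~ /[ACGT]/) d++
                    }
                    print name[i] "\t" name[j] "\t" d
                }
        }
    ' "$1"
}

# Toy alignment of three sequences, one of them with a gap.
aln=$(mktemp)
printf '>genome1\nACGTACGT\n>genome2\nACGTACGA\n>genome3\nACCTAC-A\n' > "$aln"

pairwise_differences "$aln"
```

For the toy alignment this reports 1 difference between genome1 and genome2, 2 between genome1 and genome3, and 1 between genome2 and genome3. Small pairwise counts across a whole core genome would be consistent with closely related isolates.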
---
Summary
In this exercise you have:
- Classified sequencing reads taxonomically using Kraken
- Identified the species of bacterial genomes using GTDB-Tk
- Determined sequence types using MLST
- Screened for antimicrobial resistance genes using ABRicate
- Assessed genomic relatedness using Parsnp
Together, these analyses reflect a typical bioinformatics workflow used in microbial diagnostics and epidemiological investigations.