Genomic epidemiology exercise

From 22126
Jump to navigation Jump to search

Overview

You are a bioinformatician working with epidemiology in Copenhagen. In the last couple of days more and more people have been admitted to the Hospitals in the region with severe bacteremia (bacteria in the bloodstream) and several patients, especially infants and young children, have had fatal outcomes. Luckily the probable causative infectious bacteria has been isolated and sequenced on a desktop sequencer and now it is your job to find out:

  1. What is it?
  2. Have we seen it before?
  3. How can we treat it (is it drug-resistant)?

Preparation

Make a copy of the data

mkdir genomic_epi
cd genomic_epi
cp /home/projects/22126_NGS/exercises/genomic_epi/ref* .
cp /home/projects/22126_NGS/exercises/genomic_epi/unk* .
cp /home/projects/22126_NGS/exercises/genomic_epi/data/reads* .

Preprocessing

Lets look at the quality of the number of reads that we have and the quality data. For the fastx_readlength.py command the output is "avg readlength, min length, max length, no. reads and no. bases".

Q1. How many reads and bases do we have in total? (Hint: Use tools from previous lectures to figure this out)

Task1: Run quality control using fastqc to check for quality of data.

Q2. Are there any aspects of the data you would want to correct? Would you trim adapters, for example?

Task2: Lets trim the data, the sequencing adapters and primers used are (even though there are no overrepresented seqs in R2):

Adaptor for R1: ATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCC
Adaptor for R2: GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGCATCTCGTATGC

Error correction

Count kmers and let's plot the kmer distribution - again we count 15-mers because we are looking at a bacterial genome.

Q3. Where is the peak at? Do we have enough coverage of the genome to correct the reads (eg. above 15X)? Hint: use jellyfish count and jellyfish histo to generate histogram of kmer depths

Let's correct the reads using Musket, here we use the kmer distribution as above to identify "true" and "error" kmers and pass through the reads correcting true reads. First, get the number of distinct k-mers in the dataset and then run musket. Replace "XYZ" in the musket command with the number of distinct k-mers before running it.

Task 3: Use error correction command from previous exercises to do the error correction. If musket takes too long or won't run, use pre-generated files.


What is it?


Identify species

Let's identify the species. Here we can use a program developed by the Center for Genomic Epidemiology to quickly identify species from either reads or an assembly. What it is doing is to use a database of 16S rRNA genes and then map the reads to the database. The reads that map to the 16S rRNA database - enriched 16S rRNA reads - are then used to perform a denovo assembly returning a full-length 16S rRNA from the sequence. This is then compared with blast to known 16S rRNAs and output is written in ssu.out.

NB: 16S rRNA is not the best method to predict species - it has an accuracy of around 80%, we could use k-mers instead. If you are interested in this you can have a look at this paper. You can try a kmer-method here

speciesfinder.py reads.pair1.cor.truncated.gz --Mbases 20 --n 2 --tax --sample pathogenic_bacteria

When it is done open the output file (inside the pathogenic_bacteria folder called ssu.out), the first line is the best match to our assembly in the database. The second column is whether we can trust the prediction or not. Which species do we think it is? This particular species is known to cause gastroenteritis with more than 90 million cases globally each year, with 155000 deaths. Most human infections are self-limiting, however, 5% of the patients will develop bacteremia - however, our strain seems to be more pathogenic.


de novo assembly

Ok now we know what species it is - but this species is found everywhere on the planet, maybe we should see if we could identify it more precisely within the species. For this we can use MLST (Multi Locus Sequence Typing) which is a method used in clinical epidemiology to type different sub-types of bacterial species. Instead of using just one Locus (=gene) as is done with 16S rRNA, it uses multiple loci to identify the type. Luckily we have a server that can do that for us, but first, we need to assemble the genome.

let's start by making a denovo assembly to identify the insert size for the final assembly. "unknown.soap.conf" is a template configuration file for SOAPdenovo2. Open it with your favorite text editor and insert the correct location of your preprocessed reads

Task 4: Make a denovo assembly using the SOAPdenovo as we did in the denovo exercises. This task has many steps - refer back to the denovo exercises for all the details. In the second run of SOAPdenovo, use a k-mer size of 65.

Now, let us filter the assembly contigs and scaffolds for minimum length of 100. We can use fastx_filterfasta.py for this.

fastx_filterfasta.py --i asmK65.contig --min 100
fastx_filterfasta.py --i asmK65.scafSeq --min 100
assemblathon_stats.pl asmK65.contig.filtered_100.fa > asmK65.contig.stats
assemblathon_stats.pl asmK65.scafSeq.filtered_100.fa > asmK65.scafSeq.stats

Look in the asmK65.contig.stats file under the "contigs" section to find the stats for the contigs, also look in the asmK65.scafSeq.stats file under the "scaffold" section to find the stats

Q4. How many contigs and scaffolds do you get and what is the N50?


MLST

We have an assembly, let's see if we can identify a Multilocus sequence type (MLST) for our strain. This works by finding the particuar MLST alleles in our genome and comparing them to a table to see if there is any match with a known sequence type. Here we are going to use the command-line version but we could just as well have used the webserver. Figure out what to use as the species name (-d option) by running mlst.pl without any inputs and make sure to exchange the assembly.fa with the scaffold sequences from our assembly.

Remember to replace the speciesname with the appropriate species name from the list you get from mlst.pl:

mlst.pl -d speciesname -i assembly.fa > mlst.res

The run takes around 5 minutes to complete. When it is complete, open the file and look at the output - there is ab=n alignment of our assembly vs. the closest match in the MLST database. You see that for the particular species we are using there are 7 loci, and it is the combination of these that make up the sequence type. Also, we require that there is a 100% match over the entire allele to identify a sequence type.

Q5. Do we have 100% matches to a sequence type? If yes, which sequence type (ST-XXX) is our pathogen?

This particular sequence type is known to be present in sub-Saharan Africa and to have increased mortality in children - in some countries with a higher death toll than malaria. This is a dangerous sequence type.


SNV based phylogeny

We can now assume that our pathogen is coming from sub-Saharan Africa, but we do not know where the particular strain is from. To get a higher resolution we can build a phylogeny based on SNPs and compare our strain to other strains. First, let us map the reads to the reference genome of our sequence type.

Alignment and BAM-processing

Task 5: Do an alignment - use the reference.fa as the reference genome, and our paired end reads as input. Use bwa index and bwa mem.

Now let us extract all reads with a mapping quality of 30 or better, sort and index the alignments. Then answer the following question. Hint: Use samtools for the first part and use bedtools genomecov and the R script /home/projects/22126_NGS/exercises/genomic_epi/programs/plotcov.R for for calculating the coverage. Use evince to open the pdf.

Example command (substiute your filenames for MYBAM and MY_COVERAGE):

bedtools genomecov -ibam MYBAM > MY_COVERAGE
R --vanilla MY_COVERAGE MY_COVERAGE genome < /home/projects/22126_NGS/exercises/genomic_epi/programs/plotcov.R
evince MY_COVERAGE.pdf &

Q6. How much of the genome is covered at minimum 10X?

Lets process the alignments - this will be explained in detail in the lectures and exercises for alignment post-processing and variant calling. We will start with removing duplicates and indexing the de-duped bam file.

Task 6: Remove duplicates from the bam file, sort it and index it. Hint: Use postprocess exercise.

Q7. How many reads were removed as duplicates?


Calling gVCF using HaplotypeCaller

Now we are ready to call SNPs. To call variants we will use the Haplotyper in GATK. We will set the ploidy to 1 (because it is a haploid organism) and this time output it as a gVCF. This means that we will also information on sites that are non-variant (eg. same as reference genome), but write them in a condensed form. We can then use this gVCF to merge with 4 other samples that I have already processed:

samtools index aln.sort.rmdup.bam
gatk --java-options "-Xmx10g" HaplotypeCaller -R reference.fa -I aln.sort.rmdup.bam -O var.raw.g.vcf -ERC GVCF 

Take a look at the output file (var.raw.g.vcf). You will see that it contains many sites that contain "NON_REF" as the variant allele and have "END=position" - these are intervals that are non-variant at a certain threshold (Genotype Quality (GQ)). We can now genotype our sample together with 4 other samples (named C-F, our sample is called ST313) that I have already process and have as gVCFs - you might as well have processed them, but this is to speed up things a bit.

cp /home/projects/22126_NGS/exercises/genomic_epi/gVCF/*.vcf .
gatk CombineGVCFs -R reference.fa -V var.raw.g.vcf -V C.raw.g.vcf -V D.raw.g.vcf -V E.raw.g.vcf -V F.raw.g.vcf -O merged.g.vcf
gatk  GenotypeGVCFs -R reference.fa -V merged.g.vcf  -O merged.vcf

Again take a look at the output (merged.vcf). You see this is a normal VCF file, but now we have information from all samples in the 5 last columns. Because we set the -ploidy 1 when we ran HaplotypeCaller the genotype is now written as "0" or "1" and not as eg. "0/1" because we only have chromosome copy.

Q8. How many variant calls did get for all of our samples? How many of those were indels (Hint: look at the REF, ALT and QUAL columns)?

Now we need to filter the SNPs, because we don't have a catalog of variation we can't do it using the Soft filtering approach (Variant-Quality-Score-Recalibration). Let us also remove the indel calls and then use the SNPs for the phylogeny:

gzip merged.vcf
select_snvs.sh merged.vcf.gz merged.snps.vcf.gz

Q9. How many SNPs do we have in our final vcf file?


Phylogenetic reconstruction

Now we have our SNPs for our outbreak strain and for several other strains. Let's create a phylogenetic tree using phyml. Here we extract all SNPs from our samples and convert them into fasta and hereafter phylip alignment that can be input to phyml. We will run 100 bootstraps.

vcf-to-tab < merged.snps.vcf > merged.snps.tab
vcf_tab_to_fasta_alignment.pl -i merged.snps.tab > merged.snps.fasta
trimal -in merged.snps.fasta -phylip > merged.snps.phylip
phyml -i merged.snps.phylip -b 100 

Let's look at the tree in FigTree. Open the program (below) and chose "File", "Open" and select "merged.snps.phylip_phyml_tree", write bootstrap in the pop-up menu and re-root on the branch going to "F". Click "Node Labels" and select "Display->bootstrap" which will show the bootstrap values on each node in the tree. Try to choose a different layout and see how it looks. You can load annotations by Selecting "File -> Import Annotations". Then go to "/home/projects/22126_NGS/exercises/genomic_epi/strain.info", hereafter click "Tip Labels -> Display and then select "info". You will now see a description of the different strains.

java -Xmx512m -jar /home/ctools/FigTree_v1.4.4/lib/figtree.jar &

Q10. Which strain is the closest to ours?

What would you use this information for to help track down the source of the outbreak?


How can we treat it?


Now that we have the de novo assembly of our strain, let's look for resistance genes. If we find any of these then we know what not to treat patients with. Here we will use another webserver from the CGE project, the ResFinder. Download the assembly to your own computer and upload them to the server. Chose a threshold of 98% and submit the job.

Q11. Are there any resistance genes in our outbreak strain?

The next step would be to investigate the differences between these and similar but non-pathogenic strains to see if we could find the course of the increased virulence, but this will be for another time.


Congratulations you finished the exercise!

Please find answers here