<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://teaching.healthtech.dtu.dk/22126/index.php?action=history&amp;feed=atom&amp;title=Genomic_epidemiology_solution</id>
	<title>Genomic epidemiology solution - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://teaching.healthtech.dtu.dk/22126/index.php?action=history&amp;feed=atom&amp;title=Genomic_epidemiology_solution"/>
	<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22126/index.php?title=Genomic_epidemiology_solution&amp;action=history"/>
	<updated>2026-05-01T08:16:27Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22126/index.php?title=Genomic_epidemiology_solution&amp;diff=66&amp;oldid=prev</id>
		<title>WikiSysop: Created page with &quot;Q1. We can use the fastx tools to get this information &lt;pre&gt; fastx_readlength.py --i reads_R1.fastq.gz --gz fastx_readlength.py --i reads_R2.fastq.gz --gz &lt;/pre&gt; This should give you the answer.  Reads: 2*1408847 = 2,817,694; Bases: 284587094  Task1 - can be accomplished using a quality control tool like fastqc. &lt;pre&gt; mkdir fastqc fastqc -o fastqc reads_R*.fastq.gz firefox fastqc/reads_R1_fastqc.html fastqc/reads_R2_fastqc.html &amp; &lt;/pre&gt;  Q2. There are no overrepresented...&quot;</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22126/index.php?title=Genomic_epidemiology_solution&amp;diff=66&amp;oldid=prev"/>
		<updated>2024-03-19T15:56:58Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;Q1. We can use the fastx tools to get this information &amp;lt;pre&amp;gt; fastx_readlength.py --i reads_R1.fastq.gz --gz fastx_readlength.py --i reads_R2.fastq.gz --gz &amp;lt;/pre&amp;gt; This should give you the answer.  Reads: 2*1408847 = 2,817,694; Bases: 284587094  Task1 - can be accomplished using a quality control tool like fastqc. &amp;lt;pre&amp;gt; mkdir fastqc fastqc -o fastqc reads_R*.fastq.gz firefox fastqc/reads_R1_fastqc.html fastqc/reads_R2_fastqc.html &amp;amp; &amp;lt;/pre&amp;gt;  Q2. There are no overrepresented...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;Q1. We can use the fastx tools to get this information&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
fastx_readlength.py --i reads_R1.fastq.gz --gz&lt;br /&gt;
fastx_readlength.py --i reads_R2.fastq.gz --gz&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This should give you the answer. &lt;br /&gt;
Reads: 2*1408847 = 2,817,694; Bases: 284587094&lt;br /&gt;
&lt;br /&gt;
Task1 - can be accomplished using a quality control tool like fastqc.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
mkdir fastqc&lt;br /&gt;
fastqc -o fastqc reads_R*.fastq.gz&lt;br /&gt;
firefox fastqc/reads_R1_fastqc.html fastqc/reads_R2_fastqc.html &amp;amp;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Q2. There are no overrepresented sequences, but to ensure that there are no stray adapters at all, we will still perform adapter removal using AdapterRemoval - you can use any other tool to perform this task, e.g. cutadapt, leeHom etc.&lt;br /&gt;
&lt;br /&gt;
Task2 - can be accomplished using AdapterRemoval&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
AdapterRemoval --file1 reads_R1.fastq.gz --file2 reads_R2.fastq.gz --adapter1 ATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCC    --adapter2 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGCATCTCGTATGC --basename reads --gzip --trimqualities --minquality 20 --minlength 40&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Q3. You can use this command to generate the histogram of the kmer coverages. &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
gzip -dc reads.pair*.truncated.gz | jellyfish count -t 2 -m 15 -s 1000000000 -o reads_jellyfish -C /dev/fd/0&lt;br /&gt;
jellyfish histo reads_jellyfish &amp;gt; reads.histo&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Now we can use R to plot it. Start R and use these commands to plot the histogram and save it as a pdf.  &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
dat=read.table(&amp;quot;reads.histo&amp;quot;)&lt;br /&gt;
barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab=&amp;quot;No of kmers&amp;quot;, xlab=&amp;quot;Counts of a k-mer&amp;quot;, names.arg=dat[,1], cex.names=0.8)&lt;br /&gt;
dev.print(&amp;quot;reads.histo.pdf&amp;quot;, device=pdf)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
You can see from the plot that the average k-mer coverage ~ 41; so yes the depth is high enough to do error correction.&lt;br /&gt;
&lt;br /&gt;
Task 3: Use jellyfish and musket - similar to the denovo exercise - to do error correction. &amp;lt;pre&amp;gt;&lt;br /&gt;
jellyfish stats reads_jellyfish&lt;br /&gt;
&lt;br /&gt;
musket -k 15 XYZ -p 1 -omulti reads.cor -inorder reads.pair1.truncated.gz reads.pair2.truncated.gz -zlib 1&lt;br /&gt;
mv reads.cor.0 reads.pair1.cor.truncated.gz&lt;br /&gt;
mv reads.cor.1 reads.pair2.cor.truncated.gz&lt;br /&gt;
&lt;br /&gt;
# If Musket wont run copy the data to your directory:&lt;br /&gt;
# cp /home/projects/22126_NGS/exercises/genomic_epi/reads.pair*.cor.truncated.gz .&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Task 4: Make an assembly using SOAPdenovo&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
SOAPdenovo-127mer pregraph -s unknown.soap.conf -K 35 -p 2 -o initial&lt;br /&gt;
SOAPdenovo-127mer contig -g initial&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;Now you should have a file called &amp;quot;initial.contig&amp;quot;, we need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Lets only map the first 100.000 reads - this should be enough.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
zcat reads.pair1.cor.truncated.gz | head -n 100000 &amp;gt; reads_sample_1.fastq&lt;br /&gt;
zcat reads.pair2.cor.truncated.gz | head -n 100000 &amp;gt; reads_sample_2.fastq&lt;br /&gt;
&lt;br /&gt;
bwa index initial.contig&lt;br /&gt;
bwa mem initial.contig reads_sample_1.fastq reads_sample_2.fastq | samtools view -Sb - &amp;gt; initial.sample.bam&lt;br /&gt;
&lt;br /&gt;
samtools view initial.sample.bam | cut -f9 &amp;gt; initial.insertsizes.txt&lt;br /&gt;
&lt;br /&gt;
R&lt;br /&gt;
a = read.table(&amp;quot;initial.insertsizes.txt&amp;quot;)&lt;br /&gt;
a.v = a[a[,1]&amp;gt;0,1]&lt;br /&gt;
mn = quantile(a.v, seq(0,1,0.05))[4]&lt;br /&gt;
mx = quantile(a.v, seq(0,1,0.05))[18]&lt;br /&gt;
mean(a.v[a.v &amp;gt;= mn &amp;amp; a.v &amp;lt;= mx])       # mean&lt;br /&gt;
sd(a.v[a.v &amp;gt;= mn &amp;amp; a.v &amp;lt;= mx])         # sd&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;!--&lt;br /&gt;
m = 320.5364&lt;br /&gt;
sd = 7.8&lt;br /&gt;
--&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;Update the unknown.soap.conf with the correct insert size and run de novo assembly of the reads using K=65 (this a fairly good K for this data).&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
SOAPdenovo-127mer all -s unknown.soap.conf -K 65 -p 2 -o asmK65&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Q4. Scaffolds: 150, Contigs: 416, Scaffold N50: 376655, Contig N50: 70749&lt;br /&gt;
&lt;br /&gt;
Q5. Yes, ST-313.&lt;br /&gt;
&lt;br /&gt;
Task 5: Alignment to the sequence type reference genome.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
bwa index reference.fa&lt;br /&gt;
bwa mem -t 2 -R &amp;quot;@RG\tID:ST313\tSM:ST313\tPL:ILLUMINA&amp;quot; reference.fa reads.pair1.truncated.gz \&lt;br /&gt;
 reads.pair2.truncated.gz | samtools view -Sb - &amp;gt; aln.bam&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Q6. &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
samtools view -u -q 30 aln.bam | samtools sort -O BAM -o aln.sort.bam - &lt;br /&gt;
samtools index aln.sort.bam&lt;br /&gt;
&lt;br /&gt;
bedtools genomecov -ibam aln.sort.bam &amp;gt; aln.cov&lt;br /&gt;
R --vanilla aln.cov aln.cov genome &amp;lt; /home/projects/22126_NGS/exercises/genomic_epi/programs/plotcov.R&lt;br /&gt;
evince aln.cov.pdf &amp;amp;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Around 99% of the genome is at 10x or better coverage.&lt;br /&gt;
&lt;br /&gt;
Task 6: Remove duplicates, sort the resulting bam and index it. &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
java -Xmx2g -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates INPUT=aln.sort.bam OUTPUT=aln.sort.rmdup.bam    ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true&lt;br /&gt;
samtools index aln.sort.rmdup.bam &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Q7. 104517 reads are duplicates&lt;br /&gt;
&lt;br /&gt;
Q8. 62 variants was called, 13 are indels.&lt;br /&gt;
&lt;br /&gt;
Q9. 49 SNPs&lt;br /&gt;
&lt;br /&gt;
Q10. Our strain is the closest to &amp;quot;C&amp;quot;. A Congo 2002 strain of Salmonella Typhimurium ST313.&lt;br /&gt;
&lt;br /&gt;
Q11. Aminoglycoside, Beta-lactam, Sulphonamid and Trimethoprim&lt;/div&gt;</summary>
		<author><name>WikiSysop</name></author>
	</entry>
</feed>