WikiSysop: Created page with "Q1. We can use the fastx tools to get this information

 fastx_readlength.py --i reads_R1.fastq.gz --gz fastx_readlength.py --i reads_R2.fastq.gz --gz

This should give you the answer. Reads: 2*1408847 = 2,817,694; Bases: 284587094 Task1 - can be accomplished using a quality control tool like fastqc.

 mkdir fastqc fastqc -o fastqc reads_R*.fastq.gz firefox fastqc/reads_R1_fastqc.html fastqc/reads_R2_fastqc.html &

Q2. There are no overrepresented..."

2024-03-19T15:56:58Z

Created page with "Q1. We can use the fastx tools to get this information <pre> fastx_readlength.py --i reads_R1.fastq.gz --gz fastx_readlength.py --i reads_R2.fastq.gz --gz </pre> This should give you the answer. Reads: 2*1408847 = 2,817,694; Bases: 284587094 Task1 - can be accomplished using a quality control tool like fastqc. <pre> mkdir fastqc fastqc -o fastqc reads_R*.fastq.gz firefox fastqc/reads_R1_fastqc.html fastqc/reads_R2_fastqc.html & </pre> Q2. There are no overrepresented..."

New page

Q1. We can use the fastx tools to get this information
<pre>
fastx_readlength.py --i reads_R1.fastq.gz --gz
fastx_readlength.py --i reads_R2.fastq.gz --gz
</pre>
This should give you the answer.
Reads: 2*1408847 = 2,817,694; Bases: 284587094

Task1 - can be accomplished using a quality control tool like fastqc.
<pre>
mkdir fastqc
fastqc -o fastqc reads_R*.fastq.gz
firefox fastqc/reads_R1_fastqc.html fastqc/reads_R2_fastqc.html &
</pre>

Q2. There are no overrepresented sequences, but to ensure that there are no stray adapters at all, we will still perform adapter removal using AdapterRemoval - you can use any other tool to perform this task, e.g. cutadapt, leeHom etc.

Task2 - can be accomplished using AdapterRemoval
<pre>
AdapterRemoval --file1 reads_R1.fastq.gz --file2 reads_R2.fastq.gz --adapter1 ATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCC --adapter2 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGCATCTCGTATGC --basename reads --gzip --trimqualities --minquality 20 --minlength 40
</pre>

Q3. You can use this command to generate the histogram of the kmer coverages.
<pre>
gzip -dc reads.pair*.truncated.gz | jellyfish count -t 2 -m 15 -s 1000000000 -o reads_jellyfish -C /dev/fd/0
jellyfish histo reads_jellyfish > reads.histo
</pre>
Now we can use R to plot it. Start R and use these commands to plot the histogram and save it as a pdf.
<pre>
dat=read.table("reads.histo")
barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8)
dev.print("reads.histo.pdf", device=pdf)
</pre>
You can see from the plot that the average k-mer coverage ~ 41; so yes the depth is high enough to do error correction.

Task 3: Use jellyfish and musket - similar to the denovo exercise - to do error correction. <pre>
jellyfish stats reads_jellyfish

musket -k 15 XYZ -p 1 -omulti reads.cor -inorder reads.pair1.truncated.gz reads.pair2.truncated.gz -zlib 1
mv reads.cor.0 reads.pair1.cor.truncated.gz
mv reads.cor.1 reads.pair2.cor.truncated.gz

# If Musket wont run copy the data to your directory:
# cp /home/projects/22126_NGS/exercises/genomic_epi/reads.pair*.cor.truncated.gz .
</pre>

Task 4: Make an assembly using SOAPdenovo
<pre>
SOAPdenovo-127mer pregraph -s unknown.soap.conf -K 35 -p 2 -o initial
SOAPdenovo-127mer contig -g initial
</pre>
<p>Now you should have a file called "initial.contig", we need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Lets only map the first 100.000 reads - this should be enough.</p>

<pre>
zcat reads.pair1.cor.truncated.gz | head -n 100000 > reads_sample_1.fastq
zcat reads.pair2.cor.truncated.gz | head -n 100000 > reads_sample_2.fastq

bwa index initial.contig
bwa mem initial.contig reads_sample_1.fastq reads_sample_2.fastq | samtools view -Sb - > initial.sample.bam

samtools view initial.sample.bam | cut -f9 > initial.insertsizes.txt

R
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx]) # mean
sd(a.v[a.v >= mn & a.v <= mx]) # sd
</pre>


<p>Update the unknown.soap.conf with the correct insert size and run de novo assembly of the reads using K=65 (this a fairly good K for this data).</p>
<pre>
SOAPdenovo-127mer all -s unknown.soap.conf -K 65 -p 2 -o asmK65
</pre>

Q4. Scaffolds: 150, Contigs: 416, Scaffold N50: 376655, Contig N50: 70749

Q5. Yes, ST-313.

Task 5: Alignment to the sequence type reference genome.
<pre>
bwa index reference.fa
bwa mem -t 2 -R "@RG\tID:ST313\tSM:ST313\tPL:ILLUMINA" reference.fa reads.pair1.truncated.gz \
reads.pair2.truncated.gz | samtools view -Sb - > aln.bam
</pre>

Q6.
<pre>
samtools view -u -q 30 aln.bam | samtools sort -O BAM -o aln.sort.bam -
samtools index aln.sort.bam

bedtools genomecov -ibam aln.sort.bam > aln.cov
R --vanilla aln.cov aln.cov genome < /home/projects/22126_NGS/exercises/genomic_epi/programs/plotcov.R
evince aln.cov.pdf &
</pre>

Around 99% of the genome is at 10x or better coverage.

Task 6: Remove duplicates, sort the resulting bam and index it.
<pre>
java -Xmx2g -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates INPUT=aln.sort.bam OUTPUT=aln.sort.rmdup.bam ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
samtools index aln.sort.rmdup.bam
</pre>

Q7. 104517 reads are duplicates

Q8. 62 variants was called, 13 are indels.

Q9. 49 SNPs

Q10. Our strain is the closest to "C". A Congo 2002 strain of Salmonella Typhimurium ST313.

Q11. Aminoglycoside, Beta-lactam, Sulphonamid and Trimethoprim

Genomic epidemiology solution - Revision history