Genomic epidemiology solution

From 22126
Jump to navigation Jump to search

Q1. We can use the fastx tools to get this information

fastx_readlength.py --i reads_R1.fastq.gz --gz
fastx_readlength.py --i reads_R2.fastq.gz --gz

This should give you the answer. Reads: 2*1408847 = 2,817,694; Bases: 284587094

Task1 - can be accomplished using a quality control tool like fastqc.

mkdir fastqc
fastqc -o fastqc reads_R*.fastq.gz
firefox fastqc/reads_R1_fastqc.html fastqc/reads_R2_fastqc.html &

Q2. There are no overrepresented sequences, but to ensure that there are no stray adapters at all, we will still perform adapter removal using AdapterRemoval - you can use any other tool to perform this task, e.g. cutadapt, leeHom etc.

Task2 - can be accomplished using AdapterRemoval

AdapterRemoval --file1 reads_R1.fastq.gz --file2 reads_R2.fastq.gz --adapter1 ATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCC    --adapter2 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGCATCTCGTATGC --basename reads --gzip --trimqualities --minquality 20 --minlength 40

Q3. You can use this command to generate the histogram of the kmer coverages.

gzip -dc reads.pair*.truncated.gz | jellyfish count -t 2 -m 15 -s 1000000000 -o reads_jellyfish -C /dev/fd/0
jellyfish histo reads_jellyfish > reads.histo

Now we can use R to plot it. Start R and use these commands to plot the histogram and save it as a pdf.

dat=read.table("reads.histo")
barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8)
dev.print("reads.histo.pdf", device=pdf)

You can see from the plot that the average k-mer coverage ~ 41; so yes the depth is high enough to do error correction.

Task 3: Use jellyfish and musket - similar to the denovo exercise - to do error correction.

jellyfish stats reads_jellyfish

musket -k 15 XYZ -p 1 -omulti reads.cor -inorder reads.pair1.truncated.gz reads.pair2.truncated.gz -zlib 1
mv reads.cor.0 reads.pair1.cor.truncated.gz
mv reads.cor.1 reads.pair2.cor.truncated.gz

# If Musket wont run copy the data to your directory:
# cp /home/projects/22126_NGS/exercises/genomic_epi/reads.pair*.cor.truncated.gz .


Task 4: Make an assembly using SOAPdenovo

SOAPdenovo-127mer pregraph -s unknown.soap.conf -K 35 -p 2 -o initial
SOAPdenovo-127mer contig -g initial

Now you should have a file called "initial.contig", we need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Lets only map the first 100.000 reads - this should be enough.

zcat reads.pair1.cor.truncated.gz | head -n 100000 > reads_sample_1.fastq
zcat reads.pair2.cor.truncated.gz | head -n 100000 > reads_sample_2.fastq

bwa index initial.contig
bwa mem initial.contig reads_sample_1.fastq reads_sample_2.fastq | samtools view -Sb - > initial.sample.bam

samtools view initial.sample.bam | cut -f9 > initial.insertsizes.txt

R
a = read.table("initial.insertsizes.txt")
a.v = a[a[,1]>0,1]
mn = quantile(a.v, seq(0,1,0.05))[4]
mx = quantile(a.v, seq(0,1,0.05))[18]
mean(a.v[a.v >= mn & a.v <= mx])       # mean
sd(a.v[a.v >= mn & a.v <= mx])         # sd

Update the unknown.soap.conf with the correct insert size and run de novo assembly of the reads using K=65 (this a fairly good K for this data).

SOAPdenovo-127mer all -s unknown.soap.conf -K 65 -p 2 -o asmK65


Q4. Scaffolds: 150, Contigs: 416, Scaffold N50: 376655, Contig N50: 70749

Q5. Yes, ST-313.

Task 5: Alignment to the sequence type reference genome.

bwa index reference.fa
bwa mem -t 2 -R "@RG\tID:ST313\tSM:ST313\tPL:ILLUMINA" reference.fa reads.pair1.truncated.gz \
 reads.pair2.truncated.gz | samtools view -Sb - > aln.bam

Q6.

samtools view -u -q 30 aln.bam | samtools sort -O BAM -o aln.sort.bam - 
samtools index aln.sort.bam

bedtools genomecov -ibam aln.sort.bam > aln.cov
R --vanilla aln.cov aln.cov genome < /home/projects/22126_NGS/exercises/genomic_epi/programs/plotcov.R
evince aln.cov.pdf &

Around 99% of the genome is at 10x or better coverage.

Task 6: Remove duplicates, sort the resulting bam and index it.

java -Xmx2g -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates INPUT=aln.sort.bam OUTPUT=aln.sort.rmdup.bam    ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true
samtools index aln.sort.rmdup.bam 

Q7. 104517 reads are duplicates

Q8. 62 variants was called, 13 are indels.

Q9. 49 SNPs

Q10. Our strain is the closest to "C". A Congo 2002 strain of Salmonella Typhimurium ST313.

Q11. Aminoglycoside, Beta-lactam, Sulphonamid and Trimethoprim