Genomic epidemiology solution
Q1. We can use the fastx tools to get this information
fastx_readlength.py --i reads_R1.fastq.gz --gz fastx_readlength.py --i reads_R2.fastq.gz --gz
This should give you the answer. Reads: 2*1408847 = 2,817,694; Bases: 284587094
Task1 - can be accomplished using a quality control tool like fastqc.
mkdir fastqc fastqc -o fastqc reads_R*.fastq.gz firefox fastqc/reads_R1_fastqc.html fastqc/reads_R2_fastqc.html &
Q2. There are no overrepresented sequences, but to ensure that there are no stray adapters at all, we will still perform adapter removal using AdapterRemoval - you can use any other tool to perform this task, e.g. cutadapt, leeHom etc.
Task2 - can be accomplished using AdapterRemoval
AdapterRemoval --file1 reads_R1.fastq.gz --file2 reads_R2.fastq.gz --adapter1 ATCGGAAGAGCACACGTCTGAACTCCAGTCACATTCCTATCTCGTATGCC --adapter2 GATCGGAAGAGCACACGTCTGAACTCCAGTCACATGAGCATCTCGTATGC --basename reads --gzip --trimqualities --minquality 20 --minlength 40
Q3. You can use this command to generate the histogram of the kmer coverages.
gzip -dc reads.pair*.truncated.gz | jellyfish count -t 2 -m 15 -s 1000000000 -o reads_jellyfish -C /dev/fd/0 jellyfish histo reads_jellyfish > reads.histo
Now we can use R to plot it. Start R and use these commands to plot the histogram and save it as a pdf.
dat=read.table("reads.histo") barplot(dat[,2], xlim=c(0,150), ylim=c(0,5e5), ylab="No of kmers", xlab="Counts of a k-mer", names.arg=dat[,1], cex.names=0.8) dev.print("reads.histo.pdf", device=pdf)
You can see from the plot that the average k-mer coverage ~ 41; so yes the depth is high enough to do error correction.
Task 3: Use jellyfish and musket - similar to the denovo exercise - to do error correction.
jellyfish stats reads_jellyfish musket -k 15 XYZ -p 1 -omulti reads.cor -inorder reads.pair1.truncated.gz reads.pair2.truncated.gz -zlib 1 mv reads.cor.0 reads.pair1.cor.truncated.gz mv reads.cor.1 reads.pair2.cor.truncated.gz # If Musket wont run copy the data to your directory: # cp /home/projects/22126_NGS/exercises/genomic_epi/reads.pair*.cor.truncated.gz .
Task 4: Make an assembly using SOAPdenovo
SOAPdenovo-127mer pregraph -s unknown.soap.conf -K 35 -p 2 -o initial SOAPdenovo-127mer contig -g initial
Now you should have a file called "initial.contig", we need to map our reads back to the contigs to identify the insert size, just as we did in the alignment exercise. Lets only map the first 100.000 reads - this should be enough.
zcat reads.pair1.cor.truncated.gz | head -n 100000 > reads_sample_1.fastq zcat reads.pair2.cor.truncated.gz | head -n 100000 > reads_sample_2.fastq bwa index initial.contig bwa mem initial.contig reads_sample_1.fastq reads_sample_2.fastq | samtools view -Sb - > initial.sample.bam samtools view initial.sample.bam | cut -f9 > initial.insertsizes.txt R a = read.table("initial.insertsizes.txt") a.v = a[a[,1]>0,1] mn = quantile(a.v, seq(0,1,0.05))[4] mx = quantile(a.v, seq(0,1,0.05))[18] mean(a.v[a.v >= mn & a.v <= mx]) # mean sd(a.v[a.v >= mn & a.v <= mx]) # sd
Update the unknown.soap.conf with the correct insert size and run de novo assembly of the reads using K=65 (this a fairly good K for this data).
SOAPdenovo-127mer all -s unknown.soap.conf -K 65 -p 2 -o asmK65
Q4. Scaffolds: 150, Contigs: 416, Scaffold N50: 376655, Contig N50: 70749
Q5. Yes, ST-313.
Task 5: Alignment to the sequence type reference genome.
bwa index reference.fa bwa mem -t 2 -R "@RG\tID:ST313\tSM:ST313\tPL:ILLUMINA" reference.fa reads.pair1.truncated.gz \ reads.pair2.truncated.gz | samtools view -Sb - > aln.bam
Q6.
samtools view -u -q 30 aln.bam | samtools sort -O BAM -o aln.sort.bam - samtools index aln.sort.bam bedtools genomecov -ibam aln.sort.bam > aln.cov R --vanilla aln.cov aln.cov genome < /home/projects/22126_NGS/exercises/genomic_epi/programs/plotcov.R evince aln.cov.pdf &
Around 99% of the genome is at 10x or better coverage.
Task 6: Remove duplicates, sort the resulting bam and index it.
java -Xmx2g -jar /home/ctools/picard_2.23.8/picard.jar MarkDuplicates INPUT=aln.sort.bam OUTPUT=aln.sort.rmdup.bam ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true samtools index aln.sort.rmdup.bam
Q7. 104517 reads are duplicates
Q8. 62 variants was called, 13 are indels.
Q9. 49 SNPs
Q10. Our strain is the closest to "C". A Congo 2002 strain of Salmonella Typhimurium ST313.
Q11. Aminoglycoside, Beta-lactam, Sulphonamid and Trimethoprim