Read trimmer for Next-Generation-Sequencing data

From 22113
Jump to navigation Jump to search

Description

The advent of Next Generation Sequencing (NGS) technologies have transformed how biological research is being performed and today almost all biological fields use the technology for cutting edge discoveries. Today, a human genome can be sequenced in very short time for approximately $1000 giving unprecedented possibilities for investigating human traits, evolution and diseases. Similarly whole bacterial communities and their interplay with the environment can be studied, unravelling novel enzymes and organisms. These experiments produces massive amounts of data that often including a lot of noise and the program therefore has to be able to clean up NGS data.

Input and output

The input and output data is Illumina fastq files, which is very similar to a fasta file except that they also contain a quality associated to each base so that one entry is exactly four lines which is also know as a 'read':

Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters.
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

This is an example:

@ILLUMINA-3BDE4F_0027:2:1:12594:2417#TGACCA/1
GTTACTTGTGTCGTTGTAGACACTNCTGATACCTCCAGCATGCCTCACAGCACACCTTCGCAGGCTTACAGAACGCTCCCCTACCCAACAACACATAGTGT
+
a\EDZKOGNDPDIJDKFFNF]Z]`BaaaZaZEZVcaaX]accccccccccccccccccccccccc[ccacccccccccccccabcca_c_O____a_aXaV

Each of the base-qualities (fourth line) encodes the probability that the base at the same position is wrong using the ascii table for conversion between the characters and a number (probability). The input data contains millions (and up to billions) of these reads and they must be cleaned according to user specified settings.

Details

The program must be able to:

  1. Read both compressed and uncompressed fastq files and be able to write both compressed or uncompressed files according to user input (gzip only).
  2. Be able to use both Phred+33 and Phred+64 encoding according to user input.
  3. Autodetect the quality scale of a file (phred+33 or phred+64).
  4. Trim X nucleotides from the 3' of each read given user input.
  5. Trim each read from the 3' based on quality, either as minimum (single residue) or mean of moving window.
  6. These trims from 3' can be done independently of each other in above order.
  7. Trim X nucleotides from the 5' of each read given user input.
  8. Trim each read from the 5' based on quality, either as minimum (single residue) or mean of moving window.
  9. These trims from 5' can be done independently of each other in above order.
  10. Filter out reads with a mean quality lower than specified after trimming.
  11. Filter out reads that are shorter than specified after trimming.
  12. Filter out reads that have a more than a specified number of N bases (unknown bases).
  13. Must keep track of how many reads are trimmed, removed etc. Put the result in a (user specified) log file.

Optionally advanced options can be implemented into the program:

  1. Autodetect if input files are gzipped or not.
  2. Trim reads from the 3' and/or 5' based on quality as minimum of moving window, i.e. if window contains any low quality, trim and move.
  3. Calculate statistics on a fastq/fasta file without trimming such as number of entries, number of each bases, length of entries, average length of entries, quality (average, best/worst 10%), etc.
  4. NGS files are big, hence they take a long time to process. Optimize the program for performance. Requires lots of testing and rewriting code.
  5. Trim paired end reads simultaneously keeping pairs in sync between the two files. This complicated the process, see below.

Since NGS files are so big, the resulting program must be efficient in RAM usage and fast due to the millions of reads in real data sets.

The program has to be user friendly, but should only be based on command line execution, i.e. not an interactive user interface. You can use the Python standard library argparse to deal with the many options.

Paired end reads

When the a long DNA molecule is sequenced from both ends this will yield two files, one with reads from the 5' and one with the reads from the 3' of the DNA molecule. Programs will assume that paired reads will have the same position in the two files. When trimming the two files they have to be kept in sync, eg. if a read is removed from one file, then the corresponding pair needs to be removed from the other file to maintain synchronism between the files.

Here is a couple of paired fastq files. They are much smaller than normal as they only serve as an example.
BRISCOE_0069_BD18RUACXX_L2_1_pf.fastq and BRISCOE_0069_BD18RUACXX_L2_2_pf.fastq
Here is a real sample. It is still 100 times smaller than usual: Sample 1 and Sample 2

References

  1. FASTQ format
  2. Phred quality score