Data basics exercise

This is a small exercise where we will try to identify the quality encoding of some reads.


Read quality encoding table

We have seen that the FASTQ format encodes quality scores, which represent the probability of an error. Beware: several different encodings for quality scores exist. The table below, adapted from the Wikipedia article on the FASTQ format, summarizes them:

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator ('B')
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

The line beginning with an exclamation mark lists, in ASCII order (codes 33 to 126), the characters that can appear in a quality string. The rows above it show which of these characters each encoding uses. Note that many third-generation sequencing technologies produce quality scores that are not capped at 40/41 but go beyond that upper threshold.
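To make the table concrete, here is a minimal Python sketch of how a quality string maps to scores and error probabilities. The function names `decode_quality` and `phred_error_probability` are illustrative and not part of any library, and the quality string `"I5+!"` is a made-up example rather than one of the exercise reads. The sketch assumes a Phred-type encoding, where Q = -10 * log10(P); Solexa scores are defined differently and are not handled here.

```python
def decode_quality(qual, offset=33):
    """Convert quality characters to integer scores by subtracting the ASCII offset."""
    return [ord(ch) - offset for ch in qual]


def phred_error_probability(q):
    """Invert the Phred definition Q = -10 * log10(P) to get P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Hypothetical quality string, decoded as Phred+33.
scores = decode_quality("I5+!", offset=33)
print(scores)  # [40, 20, 10, 0]

# Error probability for each score.
for q in scores:
    print(q, phred_error_probability(q))
```

The same character means something different under another offset: 'I' decodes to 40 with offset 33 but only to 9 with offset 64, which is why the encoding has to be known before the scores are interpreted.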


Identify quality encoding

Use the table above to identify the quality encoding of these three reads. You only have to differentiate between Sanger (S), Solexa (X) and Illumina (I, J).


@HWUSI-EAS656_0037_FC:3:1:16637:1035#NNNNNN/1
CATATTTTGTGGCTCATCCCAAGGGAGAGGTTTTTCTATACTCAGGAGAAGTTACTCACGATAAAGAGAA
+
41?8FFF@@DAGGGEDF@FGECGGGBG@GE.EEBGBDADBBEEBEEC>ACE>CD?EEC?CAB>EB:BC##

@FC42RW0AAXX:3:1:2:1038#NNNNNN/1
GTGTTCTCTGCGACCCGTAATTCAGCTTTTTCCGGTTGCTTTGCCCTTTGCACCTTATCCTGCACCATCTCGC
+
a]baaaa`aaaV`a_aa^Y^`_`_aa___`a]U__\\`][Z_^^R]YWWW[SWZ[QFY[VVWZWBBBBBBBBB

@I330_1_FC30JM6AAXX:4:1:13:1602/1
ATGTAGAAGTGTTTGATACGGCGATTTCAAACATTGCAGGGCTT
+I330_1_FC30JM6AAXX:4:1:13:1602/1
hhhhhhhhhhhhhhhhhhYh^hhhH[I>B^AABGDK;KBP??FN

Q1. Which encoding(s) could read 1 use?
Q2. Which encoding(s) could read 2 use?
Q3. Which encoding(s) could read 3 use?
Q4. Can you think of situations where it is not possible to differentiate between Phred+33 and Solexa+64 quality encoding?
Q5. You see a base with an associated quality score of 'D'. What is the probability that this base is wrong, assuming Illumina 1.8+ Phred+33?


In general, we recommend verifying the encoding before starting to work, especially with older data.
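One way to run such a check is sketched below in Python. It compares the lowest and highest ASCII codes seen in a set of quality strings with the typical ranges read off the table above. The names `TYPICAL_RANGES` and `candidate_encodings` are hypothetical, the example strings are made up, and the ranges are only the typical ones for raw reads, so data with scores above 40/41 (for example from third-generation platforms) would match none of them.

```python
# Inclusive ASCII code ranges taken from the encoding table above.
TYPICAL_RANGES = {
    "S (Sanger, Phred+33)": (33, 73),
    "X (Solexa, Solexa+64)": (59, 104),
    "I (Illumina 1.3+, Phred+64)": (64, 104),
    "J (Illumina 1.5+, Phred+64)": (66, 104),
    "L (Illumina 1.8+, Phred+33)": (33, 74),
}


def candidate_encodings(quality_strings):
    """Return every encoding whose typical range covers all observed characters."""
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    return [
        name
        for name, (range_min, range_max) in TYPICAL_RANGES.items()
        if range_min <= lowest and highest <= range_max
    ]


# Hypothetical quality strings, not taken from the exercise.
print(candidate_encodings(["IIII5555"]))
print(candidate_encodings(["hhhhBB"]))
```

The result is a list rather than a single answer because the ranges overlap: a handful of reads may be compatible with several encodings, and the more reads are inspected, the narrower the list tends to become.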

Congratulations, you have finished the exercise!

Data_basics_exercise_answers