Data basics exercise

From 22126
Revision as of 13:09, 20 November 2025 by Mick (talk | contribs)

This exercise introduces FASTQ quality encodings. Your task is to inspect three example reads and determine which encoding system was used.


Background: FASTQ Quality Encoding

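FASTQ files store base qualities as ASCII characters. Each character encodes a Phred quality score Q, which expresses the probability P that the base call is wrong: P = 10^(-Q/10), so Q = 20 corresponds to a 1% error probability. The character is obtained by adding a fixed offset (33 or 64) to the score. Different sequencing platforms historically used different offsets and score ranges, so it is important to know how to interpret them.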
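As a minimal sketch of the relationship between a quality character, its Phred score and the error probability, the following Python snippet decodes a single character. The function names are illustrative and not part of the course material. It assumes a Phred-scaled score, P = 10^(-Q/10); Solexa scores are defined slightly differently and are not covered here.

```python
def phred_score(char, offset=33):
    """Return the Phred score encoded by one quality character."""
    return ord(char) - offset


def error_probability(char, offset=33):
    """Return the probability that the base call is wrong, P = 10^(-Q/10)."""
    return 10 ** (-phred_score(char, offset) / 10)


# '5' is ASCII 53, so with offset 33 it encodes Q = 20 (1% error probability).
print(phred_score("5"), error_probability("5"))

# The same score of 40 is written 'I' under Phred+33 and 'h' under Phred+64.
print(phred_score("I", offset=33), phred_score("h", offset=64))
```

The offset is a parameter because the same character decodes to a different score under Phred+33 and Phred+64.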

The table below (adapted from the FASTQ Wikipedia page) shows the ASCII ranges used by the most common quality encoding schemes:

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

 S - Sanger        Phred+33,  raw reads typically (0 to 40)
 X - Solexa        Solexa+64, raw reads typically (-5 to 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0 to 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3 to 40)
     (with control characters for scores 0–2)
 L - Illumina 1.8+ Phred+33,  raw reads typically (0 to 41)

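The row of characters lists every printable ASCII character (codes 33 to 126) that can encode a quality score, and the numbers beneath it give the ASCII codes at the range boundaries. The letters above the characters show which part of the ASCII range each encoding scheme occupies.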
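The table lookup can also be expressed in code. The sketch below takes the ASCII boundaries directly from the table above and reports every scheme whose range contains all characters of a quality string. The names `ENCODING_RANGES` and `compatible_encodings` and the example strings are made up for illustration. A short read may not span the full range, so several schemes can remain compatible.

```python
# (first, last) ASCII code of each scheme, read off the table above.
ENCODING_RANGES = {
    "Sanger (Phred+33)": (33, 73),
    "Solexa (Solexa+64)": (59, 104),
    "Illumina 1.3+ (Phred+64)": (64, 104),
    "Illumina 1.5+ (Phred+64)": (66, 104),  # 66 is 'B', the control character
    "Illumina 1.8+ (Phred+33)": (33, 74),
}


def compatible_encodings(quality):
    """List the schemes whose ASCII range covers every character in `quality`."""
    codes = [ord(c) for c in quality]
    lowest, highest = min(codes), max(codes)  # raises ValueError on an empty string
    return [
        name
        for name, (first, last) in ENCODING_RANGES.items()
        if first <= lowest and highest <= last
    ]


# Illustrative quality strings (not taken from the exercise reads).
print(compatible_encodings("IIIIHHGG##"))
print(compatible_encodings("hhhgggBBB"))
print(compatible_encodings("hh;;"))
```

Only the lowest and highest character matter for this check, which is why it is enough to look for extreme characters when inspecting a read by eye.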

Important: Many older Illumina datasets (pipeline versions 1.3 to 1.7) use Phred+64 encoding. Modern Illumina data (version 1.8 and later) uses Phred+33.


Identify the Quality Encoding

Using the table above, identify the encoding for the three FASTQ reads shown below. You only need to distinguish between:

  • Sanger / Illumina 1.8+ (Phred+33)
  • Solexa (Solexa+64)
  • Illumina 1.3+ / 1.5+ (Phred+64)

The three reads, in order (read 1 to read 3):
@HWUSI-EAS656_0037_FC:3:1:16637:1035#NNNNNN/1
CATATTTTGTGGCTCATCCCAAGGGAGAGGTTTTTCTATACTCAGGAGAAGTTACTCACGATAAAGAGAA
+
41?8FFF@@DAGGGEDF@FGECGGGBG@GE.EEBGBDADBBEEBEEC>ACE>CD?EEC?CAB>EB:BC##

@FC42RW0AAXX:3:1:2:1038#NNNNNN/1
GTGTTCTCTGCGACCCGTAATTCAGCTTTTTCCGGTTGCTTTGCCCTTTGCACCTTATCCTGCACCATCTCGC
+
a]baaaa`aaaV`a_aa^Y^`_`_aa___`a]U__\\`][Z_^^R]YWWW[SWZ[QFY[VVWZWBBBBBBBBB

@I330_1_FC30JM6AAXX:4:1:13:1602/1
ATGTAGAAGTGTTTGATACGGCGATTTCAAACATTGCAGGGCTT
+I330_1_FC30JM6AAXX:4:1:13:1602/1
hhhhhhhhhhhhhhhhhhYh^hhhH[I>B^AABGDK;KBP??FN

Q1. Which encoding(s) are consistent with read 1?

Q2. Which encoding(s) are consistent with read 2?

Q3. Which encoding(s) are consistent with read 3?

Q4. In which situations is it impossible to distinguish between Phred+33 and the +64 encodings (Solexa+64 and Illumina Phred+64)?

Q5. Assuming Illumina 1.8+ (Phred+33) encoding, what is the probability that a base with quality character “D” is wrong?


In general, always verify the FASTQ quality encoding before processing older datasets. Assuming the wrong offset shifts every quality score by 31 (the difference between 64 and 33), which can produce completely wrong quality scores and downstream artifacts.
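One way to perform such a check is to scan the quality lines of a file and look at the lowest and highest ASCII code. The sketch below is an illustration under stated assumptions: it expects strict four-line FASTQ records (no wrapped sequence or quality lines), and the function name and the two in-memory records are hypothetical.

```python
import io


def quality_char_range(handle):
    """Return (lowest, highest) ASCII code found on the quality lines of a FASTQ stream.

    Assumes strict 4-line records: header, sequence, '+', quality.
    Returns None if no quality characters were seen.
    """
    lowest = highest = None
    for index, line in enumerate(handle):
        if index % 4 != 3:  # only every fourth line holds qualities
            continue
        for code in map(ord, line.rstrip("\r\n")):
            lowest = code if lowest is None else min(lowest, code)
            highest = code if highest is None else max(highest, code)
    return None if lowest is None else (lowest, highest)


# Two made-up records standing in for a real file opened with open(path).
example = io.StringIO("@read_a\nACGT\n+\nhhgB\n@read_b\nACGT\n+\nffdc\n")
print(quality_char_range(example))  # (66, 104)

# Decoding the same characters with the wrong offset shifts every score by 31.
print([ord(c) - 64 for c in "hhgB"])  # [40, 40, 39, 2]
print([ord(c) - 33 for c in "hhgB"])  # [71, 71, 70, 33]
```

For a real dataset it is usually enough to scan the first few thousand reads, since the extremes of the range tend to show up quickly.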

Congratulations — you finished the exercise!

Data_basics_exercise_answers