Data basics exercise
Latest revision as of 13:09, 20 November 2025
This exercise introduces FASTQ quality encodings. Your task is to inspect three example reads and determine which encoding system was used.
Background: FASTQ Quality Encoding
FASTQ files store base qualities as ASCII characters. Each character encodes a Phred quality score Q, which is tied to the probability P that the base call is wrong by Q = -10 * log10(P). Different sequencing platforms historically used different encodings, so it is important to know which one a file uses before interpreting its scores.
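The relationship between a quality character, its Phred score and the error probability can be sketched in a few lines of Python. This is a minimal illustration, not part of the original exercise: the function names `phred_score` and `error_probability` are made up for this example, and the formula applies to Phred scores only (Solexa scores are defined differently).

```python
def phred_score(qual_char, offset=33):
    """Return the Phred score encoded by one quality character.

    offset is the ASCII offset of the encoding: 33 for Phred+33,
    64 for Phred+64. (Hypothetical helper for illustration.)
    """
    return ord(qual_char) - offset


def error_probability(qual_char, offset=33):
    """Return P = 10^(-Q/10), the probability that the base call is wrong."""
    return 10 ** (-phred_score(qual_char, offset) / 10)


# 'I' is ASCII 73, so under Phred+33 it encodes Q = 40.
print(phred_score("I"))             # 40
print(error_probability("I"))       # approximately 1e-4
# The same character read as Phred+64 encodes a much lower score.
print(phred_score("I", offset=64))  # 9
```

The last line shows why the offset matters: the same character gives a very different score depending on the assumed encoding.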
The table below (adapted from the FASTQ Wikipedia page) shows the ASCII ranges used by the most common quality encoding schemes:
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
.................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                             104                   126
S - Sanger         Phred+33,   raw reads typically (0 to 40)
X - Solexa         Solexa+64,  raw reads typically (-5 to 40)
I - Illumina 1.3+  Phred+64,   raw reads typically (0 to 40)
J - Illumina 1.5+  Phred+64,   raw reads typically (3 to 40)
                   (scores 0 and 1 are unused; score 2, character B, is the Read Segment Quality Control Indicator)
L - Illumina 1.8+  Phred+33,   raw reads typically (0 to 41)
The row of characters from ! to ~ lists every printable ASCII character (codes 33 to 126) that can encode a quality score, and the numbers below it mark selected ASCII codes. The rows of letters above it show which part of that range each encoding scheme occupies.
Important: Many older Illumina datasets (pipeline versions 1.3 to 1.7) use Phred+64 encoding. Illumina data from pipeline version 1.8 onward uses Phred+33.
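Because most current tools expect Phred+33, older Phred+64 quality strings are usually re-encoded before analysis. The sketch below shows the idea for a single quality string. It is an illustration only: the function name `phred64_to_phred33` is hypothetical, and it assumes the input really is Illumina Phred+64 (Solexa+64 scores would also need a score conversion, which is not shown).

```python
def phred64_to_phred33(quality):
    """Re-encode a Phred+64 quality string as Phred+33.

    Assumes Illumina 1.3+/1.5+ input; characters below ASCII 64
    cannot be Phred+64 and are rejected.
    """
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # decode with the +64 offset
        if score < 0:
            raise ValueError(f"{ch!r} is not a valid Phred+64 character")
        converted.append(chr(score + 33))  # re-encode with the +33 offset
    return "".join(converted)


# 'h' (Q40) becomes 'I' and 'B' (Q2) becomes '#'.
print(phred64_to_phred33("hhhB"))  # III#
```

The scores themselves are unchanged; only the characters used to write them differ.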
Identify the Quality Encoding
Using the table above, identify the encoding for the three FASTQ reads shown below. You only need to distinguish between:
- Sanger / Illumina 1.8+ (Phred+33)
- Solexa (Solexa+64)
- Illumina 1.3+ / 1.5+ (Phred+64)
@HWUSI-EAS656_0037_FC:3:1:16637:1035#NNNNNN/1
CATATTTTGTGGCTCATCCCAAGGGAGAGGTTTTTCTATACTCAGGAGAAGTTACTCACGATAAAGAGAA
+
41?8FFF@@DAGGGEDF@FGECGGGBG@GE.EEBGBDADBBEEBEEC>ACE>CD?EEC?CAB>EB:BC##
@FC42RW0AAXX:3:1:2:1038#NNNNNN/1
GTGTTCTCTGCGACCCGTAATTCAGCTTTTTCCGGTTGCTTTGCCCTTTGCACCTTATCCTGCACCATCTCGC
+
a]baaaa`aaaV`a_aa^Y^`_`_aa___`a]U__\\`][Z_^^R]YWWW[SWZ[QFY[VVWZWBBBBBBBBB
@I330_1_FC30JM6AAXX:4:1:13:1602/1
ATGTAGAAGTGTTTGATACGGCGATTTCAAACATTGCAGGGCTT
+I330_1_FC30JM6AAXX:4:1:13:1602/1
hhhhhhhhhhhhhhhhhhYh^hhhH[I>B^AABGDK;KBP??FN
Q1. Which encoding(s) could read 1 be in?
Q2. Which encoding(s) could read 2 be in?
Q3. Which encoding(s) could read 3 be in?
Q4. In which situations is it impossible to distinguish between Phred+33 and the +64 encodings (Solexa+64 and Illumina Phred+64)?
Q5. If you see a base with quality character “D”, what is the probability that the base is wrong (Illumina 1.8+, Phred+33)?
In general, always verify FASTQ quality encoding before processing older datasets. Incorrect assumptions can lead to completely wrong quality scores and downstream artifacts.
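One way to make this check routine is to compare the ASCII range of the observed quality characters with the ranges in the table. The following is a rough sketch under stated assumptions: the function name `candidate_encodings` is invented for this example, the limits (33 to 74, 59 to 104, 64 to 104) are read off the table above, and the example string is made up rather than taken from the exercise reads.

```python
def candidate_encodings(quality):
    """List the encodings whose ASCII range covers every character given.

    quality: one or more quality strings joined together.
    Range limits are taken from the table in this exercise.
    """
    if not quality:
        raise ValueError("no quality characters given")
    lowest = min(ord(ch) for ch in quality)
    highest = max(ord(ch) for ch in quality)

    candidates = []
    if lowest >= 33 and highest <= 74:
        candidates.append("Phred+33 (Sanger / Illumina 1.8+)")
    if lowest >= 59 and highest <= 104:
        candidates.append("Solexa+64")
    if lowest >= 64 and highest <= 104:
        candidates.append("Phred+64 (Illumina 1.3+ / 1.5+)")
    return candidates


# '#' is ASCII 35, which only a +33 encoding can produce.
print(candidate_encodings("IIIIHHGG#5"))
```

The function returns every encoding that is consistent with the input, so in practice the quality characters of many reads would be pooled before calling it.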
Congratulations, you finished the exercise!
[[Data_basics_exercise_answers]]