Data basics exercise

Latest revision as of 13:09, 20 November 2025

This exercise introduces FASTQ quality encodings. Your task is to inspect three example reads and determine which encoding system was used.


Background: FASTQ Quality Encoding

FASTQ files store base qualities as ASCII characters. Each character encodes a Phred quality score, which expresses the probability that the base call is wrong. Different sequencing platforms have historically used different encodings, so it is important to know which one a file uses.
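As a minimal sketch of how one character maps to a score and an error probability: a Phred score Q and the error probability P are related by Q = -10 · log10(P), and the score is the character's ASCII code minus the encoding offset (33 or 64). The function names below are illustrative and not part of any library. The sketch applies to Phred encodings only, since Solexa scores are defined differently.

```python
def phred_score(char, offset=33):
    """Return the Phred score encoded by one quality character."""
    return ord(char) - offset


def error_probability(char, offset=33):
    """Return the probability that the base call is wrong (P = 10^(-Q/10))."""
    return 10 ** (-phred_score(char, offset) / 10)


print(phred_score("I"))             # 40 under Phred+33
print(phred_score("h", offset=64))  # 40 under Phred+64
print(error_probability("I"))       # roughly 1 in 10,000
```

Note that the same character gives a very different score depending on the offset, which is why the encoding must be known first.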

The table below (adapted from the FASTQ Wikipedia page) shows the ASCII ranges used by the most common quality encoding schemes:

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

 S - Sanger        Phred+33,  raw reads typically (0 to 40)
 X - Solexa        Solexa+64, raw reads typically (-5 to 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0 to 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3 to 40)
     (scores 0 and 1 are unused; 2 is the Read Segment Quality Control Indicator)
 L - Illumina 1.8+ Phred+33,  raw reads typically (0 to 41)

The row beginning with an exclamation mark lists all the ASCII characters that can encode a quality score, and the numbers beneath it give the ASCII codes of the marked positions. The rows of letters above show which part of that range each encoding scheme uses.

Important: Many older Illumina datasets (versions 1.3+ and 1.5+) use Phred+64 encoding. Illumina data from version 1.8 onward uses Phred+33.
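Since most current tools expect Phred+33, older Phred+64 quality strings are commonly re-encoded. The following is a minimal sketch of that conversion, assuming the input really is Phred+64 (Illumina 1.3+ or 1.5+) and not Solexa+64, whose scores are on a different scale. The function name is hypothetical.

```python
def phred64_to_phred33(quality):
    """Re-encode a Phred+64 quality string as Phred+33."""
    converted = []
    for char in quality:
        score = ord(char) - 64  # decode with the old offset
        if score < 0:
            # A character below '@' cannot be a Phred+64 score.
            raise ValueError(f"{char!r} is not valid in Phred+64")
        converted.append(chr(score + 33))  # encode with the new offset
    return "".join(converted)


print(phred64_to_phred33("hgf@"))  # IHG!
```

The explicit check is a deliberate choice: silently shifting a string that was already Phred+33 would produce meaningless characters.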


Identify the Quality Encoding

Using the table above, identify the encoding of each of the three FASTQ reads shown below (read 1 to read 3, from top to bottom). You only need to distinguish between:

  • Sanger / Illumina 1.8+ (Phred+33)
  • Solexa (Solexa+64)
  • Illumina 1.3+ / 1.5+ (Phred+64)

@HWUSI-EAS656_0037_FC:3:1:16637:1035#NNNNNN/1
CATATTTTGTGGCTCATCCCAAGGGAGAGGTTTTTCTATACTCAGGAGAAGTTACTCACGATAAAGAGAA
+
41?8FFF@@DAGGGEDF@FGECGGGBG@GE.EEBGBDADBBEEBEEC>ACE>CD?EEC?CAB>EB:BC##

@FC42RW0AAXX:3:1:2:1038#NNNNNN/1
GTGTTCTCTGCGACCCGTAATTCAGCTTTTTCCGGTTGCTTTGCCCTTTGCACCTTATCCTGCACCATCTCGC
+
a]baaaa`aaaV`a_aa^Y^`_`_aa___`a]U__\\`][Z_^^R]YWWW[SWZ[QFY[VVWZWBBBBBBBBB

@I330_1_FC30JM6AAXX:4:1:13:1602/1
ATGTAGAAGTGTTTGATACGGCGATTTCAAACATTGCAGGGCTT
+I330_1_FC30JM6AAXX:4:1:13:1602/1
hhhhhhhhhhhhhhhhhhYh^hhhH[I>B^AABGDK;KBP??FN

Q1. Which encoding(s) could read 1 use?

Q2. Which encoding(s) could read 2 use?

Q3. Which encoding(s) could read 3 use?

Q4. In which situations is it impossible to distinguish between Phred+33 and the +64 encodings (Solexa+64 or Illumina Phred+64)?

Q5. If you see a base with quality character “D”, what is the probability that the base is wrong (Illumina 1.8+, Phred+33)?


In general, always verify FASTQ quality encoding before processing older datasets. Incorrect assumptions can lead to completely wrong quality scores and downstream artifacts.
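One way to run such a check is to compare the lowest and highest quality characters in a file against the ranges in the table. The sketch below is a hedged illustration, not a definitive detector: the ASCII ranges are read off the table above and describe typical raw reads only, the names are hypothetical, and the result is a list of candidates because several schemes overlap. In practice the check should be run over many reads, not just one.

```python
# Inclusive ASCII code ranges taken from the table above (typical raw reads).
SCHEMES = {
    "Sanger (Phred+33)": (33, 73),
    "Solexa (Solexa+64)": (59, 104),
    "Illumina 1.3+ (Phred+64)": (64, 104),
    "Illumina 1.5+ (Phred+64)": (66, 104),
    "Illumina 1.8+ (Phred+33)": (33, 74),
}


def candidate_encodings(quality_strings):
    """Return every scheme whose range covers all observed characters."""
    codes = [ord(char) for quality in quality_strings for char in quality]
    if not codes:
        raise ValueError("no quality characters given")
    lowest, highest = min(codes), max(codes)
    return [
        name
        for name, (start, end) in SCHEMES.items()
        if start <= lowest and highest <= end
    ]


print(candidate_encodings(["!5I", "##II"]))
```

An empty list means the observed characters fall outside every typical range listed here, so the data needs a closer look.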

Congratulations, you have finished the exercise!

Data_basics_exercise_answers