Data basics exercise answers
Latest revision as of 13:15, 20 November 2025
Data Basics – Quality Encoding Answers
Below are suggested answers with explanations. In some cases, more than one encoding is theoretically possible based on the ASCII range alone, but one is much more likely given how real data and platforms behave.
Q1. What encoding(s) is read 1?
Suggested answer: Phred+33 (Sanger / Illumina 1.8+)
Explanation:
- The quality string for read 1 is:
41?8FFF@@DAGGGEDF@FGECGGGBG@GE.EEBGBDADBBEEBEEC>ACE>CD?EEC?CAB>EB:BC##
- The characters used include #, ., digits such as 1 and 4, and letters up to G.
- The lowest characters are:
  - # (ASCII 35)
  - . (ASCII 46)
- The highest character is:
  - G (ASCII 71)
- Phred+33 encodes Q-scores 0–40 as ASCII 33–73. All observed characters fit comfortably inside this range.
- The +64 encodings use characters from ASCII 64 upwards (Illumina 1.3+/1.5+), or from ASCII 59 upwards for Solexa+64. Here we clearly see characters below 59 (e.g. #, ., digits), so it cannot be a +64 encoding.
Conclusion: this read must be Phred+33. That corresponds to Sanger or Illumina 1.8+ and later; we cannot distinguish between those two from the quality line alone, so “Sanger / Illumina 1.8+ (Phred+33)” is the most precise answer.
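The range check used for read 1 can be sketched in a few lines of Python. This is a minimal illustration rather than a validated detector: the names ENCODING_RANGES and candidate_encodings are hypothetical, and the ASCII ranges are simply the ones quoted in this answer sheet.

```python
# ASCII ranges as quoted in the text (an assumption, not an exhaustive table).
ENCODING_RANGES = {
    "Phred+33 (Sanger / Illumina 1.8+)": (33, 73),
    "Solexa+64": (59, 104),
    "Phred+64 (Illumina 1.3+ / 1.5+)": (64, 104),
}


def candidate_encodings(qual):
    """Return the encodings whose ASCII range contains every character."""
    codes = [ord(c) for c in qual]
    lo, hi = min(codes), max(codes)
    return [
        name
        for name, (range_lo, range_hi) in ENCODING_RANGES.items()
        if range_lo <= lo and hi <= range_hi
    ]


read1 = "41?8FFF@@DAGGGEDF@FGECGGGBG@GE.EEBGBDADBBEEBEEC>ACE>CD?EEC?CAB>EB:BC##"

print(min(read1), max(read1))      # lowest and highest character in the read
print(candidate_encodings(read1))  # encodings that remain possible
```

For read 1 only the Phred+33 entry survives, because # and . fall below the lower bound of both +64 ranges.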
Q2. What encoding(s) is read 2?
Suggested answer: Phred+64 (Illumina 1.3+ / 1.5+); Solexa+64 is theoretically possible but less likely.
Explanation:
- The quality string for read 2 is:
a]baaaa`aaaV`a_aa^Y^`_`_aa___`a]U__\\`][Z_^^R]YWWW[SWZ[QFY[VVWZWBBBBBBBBB
- Characters used include: a (ASCII 97), b (98), ] (93), ` (96), _ (95), ^ (94), W (87), V (86), Z (90), S (83), Q (81), F (70), B (66).
- The lowest character is B (ASCII 66), and the highest is b (ASCII 98).
- Phred+33 would map these to Q-scores 33–65 (ASCII – 33). Values above Q 41 are strongly atypical for short-read Illumina data and would usually indicate something is wrong, so Phred+33 is very unlikely.
- Phred+64 encodings (Solexa or Illumina 1.3+/1.5+) map Q 0–40 to ASCII 64–104. All observed characters (66–98) fit neatly into this range.
- Can we distinguish Solexa+64 (X) from Illumina 1.3+/1.5+ (I/J)? Not perfectly from this one read:
  - Solexa+64 allows negative Q-scores (down to –5), which would appear as characters below ASCII 64 (for example ; is 59).
  - In this read, we see only characters at 66 and above, which are consistent with Illumina 1.3+ / 1.5+ (I or J), and also with the upper part of Solexa+64.
  - However, in practice, this pattern (no low characters, many high ones) is much more typical of Illumina Phred+64 data than Solexa.
Conclusion: the read is clearly using a Phred+64 scheme. Theoretically it could be X, I or J, but for real Illumina data the most likely answer is Illumina 1.3+ or 1.5+ (Phred+64).
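To see why Phred+33 is implausible for read 2, it helps to decode the same string with both offsets and compare the resulting Q ranges. The sketch below assumes nothing beyond "Q = ASCII code minus offset"; the helper name q_range is hypothetical.

```python
def q_range(qual, offset):
    """Lowest and highest Q-score when `qual` is decoded with `offset`."""
    scores = [ord(c) - offset for c in qual]
    return min(scores), max(scores)


# Raw string so the backslashes in the quality line are taken literally.
read2 = r"a]baaaa`aaaV`a_aa^Y^`_`_aa___`a]U__\\`][Z_^^R]YWWW[SWZ[QFY[VVWZWBBBBBBBBB"

print(q_range(read2, 33))  # Q range if the read were Phred+33
print(q_range(read2, 64))  # Q range if the read were a +64 encoding
```

Only the +64 decoding stays inside the usual 0–40 window; the +33 decoding runs far above it.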
Q3. What encoding(s) is read 3?
Suggested answer: Solexa+64 (X) is the best match; Phred+33 and Illumina Phred+64 (I/J) are inconsistent with the characters used.
Explanation:
- The quality string for read 3 is:
hhhhhhhhhhhhhhhhhhYh^hhhH[I>B^AABGDK;KBP??FN
- Characters include:
  - h (ASCII 104)
  - Y (89)
  - ^ (94)
  - H (72)
  - [ (91)
  - I (73)
  - > (62)
  - B (66)
  - ; (59)
  - ? (63)
- Notice the mix of:
  - Very high characters like h (104)
  - Very low ones like ; (59) and > (62)
- If we tried Phred+33:
  - h (104) → Q = 104 – 33 = 71, which is far outside the typical Q range for short reads (normally up to about 40–41).
  - This would imply unrealistically perfect data; very unlikely.
- If we tried Illumina Phred+64 (I/J):
  - Valid Q-scores are 0–40, so ASCII 64–104.
  - Characters like ; (59), > (62) and ? (63) are below 64, and thus incompatible with Illumina Phred+64.
- Solexa+64 (X) encodes Q-scores from –5 to 40 as ASCII 59–104.
  - ; (59) → Q = –5
  - h (104) → Q = 40
  - The full range 59–104 fits Solexa+64 perfectly.
Conclusion: the only encoding consistent with both the low (;) and high (h) characters is Solexa+64 (X).
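The elimination argument for read 3 can also be run the other way round: for each candidate range, list the characters that do not fit. The following is a small sketch under the ranges quoted in the text; out_of_range is a hypothetical helper name.

```python
def out_of_range(qual, lo, hi):
    """Sorted distinct characters whose ASCII code falls outside lo..hi."""
    return sorted({c for c in qual if not lo <= ord(c) <= hi})


read3 = "hhhhhhhhhhhhhhhhhhYh^hhhH[I>B^AABGDK;KBP??FN"

print(out_of_range(read3, 33, 73))   # too high for Phred+33
print(out_of_range(read3, 64, 104))  # too low for Illumina Phred+64
print(out_of_range(read3, 59, 104))  # Solexa+64: nothing left over
```

An empty list means every character is valid for that encoding, which here happens only for the Solexa+64 range.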
Q4. When is it not possible to distinguish between Phred+33 and Phred+64 encodings?
Suggested answer: When all observed quality characters fall in the ASCII range where the +33 and +64 encodings overlap: 64–73 for Illumina Phred+64, or 59–73 if Solexa+64 is also considered.
Explanation:
- Phred+33 encodes Q 0–40 as ASCII 33–73.
- Illumina Phred+64 encodes Q 0–40 as ASCII 64–104; Solexa+64 encodes Q –5 to 40 as ASCII 59–104.
- The overlapping ASCII region is:
  - 59–73 (the characters ;, <, =, >, ?, @, A, B, C, D, E, F, G, H, I); without Solexa+64, only 64–73 (@ to I).
- If all qualities in a file are in this overlapping range, both encodings produce valid Q-scores and you cannot determine from the characters alone whether it is Phred+33 or Phred+64. The two readings differ sharply, though: the same characters mean very good quality under +33 (Q 26–40) and very poor quality under +64 (Q –5 to 9).
- In practice, you usually use metadata (sequencer model, run date) or tools like FastQC to confirm the encoding.
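The overlap itself is just the intersection of two inclusive ranges, which can be checked with a short sketch. The constants below restate the ranges from the explanation; the function name overlap is hypothetical.

```python
def overlap(range_a, range_b):
    """Intersection of two inclusive ASCII ranges, or None if disjoint."""
    lo = max(range_a[0], range_b[0])
    hi = min(range_a[1], range_b[1])
    return (lo, hi) if lo <= hi else None


PHRED33 = (33, 73)      # Q 0-40, offset 33
ILLUMINA64 = (64, 104)  # Q 0-40, offset 64
SOLEXA64 = (59, 104)    # Q -5 to 40, offset 64

lo, hi = overlap(PHRED33, SOLEXA64)
ambiguous = "".join(chr(code) for code in range(lo, hi + 1))

print(overlap(PHRED33, ILLUMINA64), overlap(PHRED33, SOLEXA64))
print(ambiguous)         # characters that are valid under both offsets
print(lo - 33, hi - 33)  # Q range of those characters under +33
print(lo - 64, hi - 64)  # Q range of those characters under +64
```

The last two lines show why the ambiguity matters: the same characters decode to high scores with offset 33 and to very low ones with offset 64.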
Q5. You see a base with quality character “D” under Illumina 1.8+ (Phred+33). What is the probability that this base is wrong?
Suggested answer: Approximately 0.000316 (about 1 in 3,162).
Explanation:
- Illumina 1.8+ uses Phred+33, so:
  - ASCII code of D is 68.
  - Phred score Q = 68 – 33 = 35.
- The Phred definition is:
  - Q = –10 × log10(p_error)
  - p_error = 10^(–Q / 10) = 10^(–35 / 10) = 10^(–3.5)
- Numerically:
  - 10^(–3.5) ≈ 0.000316
  - This is about 1 error per 3,162 bases.
This illustrates how high-quality bases (Q35+) are extremely reliable, and how misinterpreting the encoding could dramatically inflate or deflate error estimates.
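The calculation can be reproduced with a minimal Python sketch of the Phred formula. The function name error_probability and its offset parameter are hypothetical; the second call shows what happens if the same character is wrongly read with offset 64.

```python
def error_probability(char, offset=33):
    """Error probability for one quality character: 10^(-Q/10)."""
    q = ord(char) - offset
    return 10 ** (-q / 10)


p = error_probability("D")  # Q = 68 - 33 = 35
print(f"{p:.6f}", round(1 / p))

# Same character misread as a +64 encoding: Q = 68 - 64 = 4.
print(f"{error_probability('D', offset=64):.3f}")
```

With the wrong offset the same base goes from roughly one error in a few thousand to an error probability of about 0.4.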
In summary, metadata from the sequencing run and basic QC tools (e.g. FastQC, seqtk) are crucial to correctly identify quality encodings before analysis. Incorrect assumptions about encoding can lead to completely wrong quality scores and misleading downstream results.