ExJEdit-Answers


Answers to the jEdit exercise

Note: the answers here are written in text-only format, to illustrate what answers can look like when written in jEdit.

EXERCISE: jEdit
----------------
Answers by: Rasmus Wernersson (v18103)

Question 1:
-----------
The file sizes are:

453 bytes: alpha_globin_OldMac.fsa
453 bytes: alpha_globin_Unix.fsa
461 bytes: alpha_globin_Windows.fsa

The important thing to notice here is that DOS/Windows newlines actually
consist of two bytes (CR + LF), whereas UNIX and the old Mac standard use only
one byte (LF and CR, respectively).

The 8-byte difference corresponds to the 8 lines of text within the file:

001 >pigeon_alpha-globin-D
002 ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
003 GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
004 GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
005 AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
006 CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
007 CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
008 TAA
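
The same effect can be reproduced with a few lines of Python. This is only a
sketch: the function name encoded_sizes and the three-line demo record are made
up for illustration and are not part of the exercise files.

```python
def encoded_sizes(lines):
    """Return the size in bytes of the text under each newline convention."""
    conventions = {
        "Unix (LF)": "\n",
        "Old Mac (CR)": "\r",
        "Windows (CRLF)": "\r\n",
    }
    sizes = {}
    for name, newline in conventions.items():
        # Every line, including the last one, is terminated by a newline.
        text = "".join(line + newline for line in lines)
        sizes[name] = len(text.encode("ascii"))
    return sizes


# Hypothetical demo record (not the real alpha-globin file).
example_lines = [">demo_sequence", "ATGCTGACCG", "TAA"]
sizes = encoded_sizes(example_lines)
for name, size in sizes.items():
    print(f"{size:4d} bytes: {name}")
```

The Windows version comes out exactly one byte larger per line, which is the
same pattern as the 8 extra bytes for the 8 lines above.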


Question 2:
-----------
Yes - inspecting the files in the associated programs (e.g. Word and Firefox)
reveals the _textual_ contents to be the same.

The file sizes differ dramatically:

29184 bytes: alpha_globin.doc
  667 bytes: alpha_globin.html
  855 bytes: alpha_globin.rtf

Question 3:
-----------
In all three cases a LOT of extra information has been added to the files.
For both the HTML and RTF files the extra information is actually text based,
so it's possible to get an idea of what's going on by simply inspecting the
file.

Contents of the HTML file:
>>>>>>>>>>>>>>>>>>>>>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
  <title></title>
</head>
<body>
<PRE>
>pigeon_alpha-globin-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA
</PRE>
</body>
</html>
<<<<<<<<<<<<<<<<<<<<<

In this case (cleanly formatted HTML) it's easy to locate the original DNA 
sequence.
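
Because the sequence sits inside a PRE element, it can also be pulled out
automatically. The sketch below uses Python's standard html.parser module; the
class name PreExtractor, the helper extract_pre_text and the short demo page
are assumptions made for this example.

```python
from html.parser import HTMLParser


class PreExtractor(HTMLParser):
    """Collect the text found inside <pre> ... </pre> elements."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case, so <PRE> matches too.
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def extract_pre_text(html):
    parser = PreExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Hypothetical miniature page with the same layout as the file above.
demo_html = "<html><body><PRE>\n>demo_sequence\nATGCTGACCG\nTAA\n</PRE></body></html>"
pre_text = extract_pre_text(demo_html)
print(pre_text)
```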

To some degree it's possible to figure out what's going on in the RTF file -
the codes are basically about formatting:

Snippet from the file:
>>>>>>>>>>>>>>>>>>>>>
\f0\b\fs24 \cf0 >pigeon_alpha-globin-D\

\f1\b0 ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG\
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT\
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG\
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC\
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC\
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA\
<<<<<<<<<<<<<<<<<<<<<
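
To see how little of the snippet is actual content, the formatting codes can be
stripped away. The sketch below is NOT a real RTF parser: it assumes that codes
look like a backslash followed by letters and an optional number (as in the
snippet above), and that a backslash at the end of a line marks a paragraph
break. The name strip_rtf_codes and the demo input are made up.

```python
import re

# A control word: backslash, letters, optional (possibly negative) number,
# and at most one trailing space that belongs to the code itself.
CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?")


def strip_rtf_codes(snippet):
    # A backslash directly before a newline is treated as a plain line break.
    text = snippet.replace("\\\n", "\n")
    text = CONTROL_WORD.sub("", text)
    # Drop the empty lines left behind by the formatting.
    return "\n".join(line for line in text.splitlines() if line.strip())


# Hypothetical snippet in the same style as the RTF excerpt above.
demo_rtf = "\\f0\\b\\fs24 \\cf0 >demo_sequence\\\n\n\\f1\\b0 ATGCTGACCG\\\nTAA\\\n"
plain = strip_rtf_codes(demo_rtf)
print(plain)
```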

The Word file contains a HUGE amount of additional information in BINARY
form, which is why the file looks so strange when we open it in jEdit.
Opening any other non-text file, such as a JPG image, in jEdit looks much the
same: a lot of strange symbols.

Interestingly, it's actually possible to get a glimpse of a few text strings
within the mess of symbols: the DNA sequence, and the name (Rasmus Wernersson)
of the person who created the file.

These are some of the strings we can find (generated using the "strings"
command on a UNIX prompt):
>>>>>>>>>>>>>>>>>>>>>
jbjb
>pigeon_alpha-globin-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
>pigeon_alpha-globin-D
Rasmus Wernersson
Normal
Rasmus Wernersson
Microsoft Word 11.5.0
PICT20
MSWD
Courier New
Technical University of Denmark
>pigeon_alpha-globin-D
Title
Microsoft Word Document
NB6W
Word.Document.8
<<<<<<<<<<<<<<<<<<<<<
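
The idea behind "strings" is simple: scan the bytes and keep every run of
printable characters above a minimum length. The Python sketch below only
approximates the real command (it looks at printable ASCII only and uses a
minimum length of 4); the function name printable_strings and the demo bytes
are made up for illustration.

```python
import re


def printable_strings(data, min_length=4):
    """Return all runs of printable ASCII of at least min_length bytes."""
    # 0x20 (space) to 0x7e (~) covers the printable ASCII characters.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Hypothetical binary blob: readable text mixed with non-printable bytes.
demo_blob = b"\x00\x01jbjb\x00\xff>demo_sequence\x00AB\x00Courier New\x02"
found = printable_strings(demo_blob)
for text in found:
    print(text)
```

Short runs such as "AB" are skipped, which is what keeps most of the binary
noise out of the output.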

Question 4:
-----------
Cleaned up sequence:

AACGGGCACGGGACGCATGTAGCTGGAACAGTGGCAGCCGTAAATAATAATGGTATCGGA
GTTGCCGGGGTTGCAGGAGGAAACGGCTCTACCAATAGTGGAGCAAGGTTAATGTCCACA
CAAATTTTTAATAGTGATGGGGATTATACAAATAGCGAAACTCTTGTGTACAGAGCCATT
GTTTATGGTGCAGATAACGGAGCTGTGATCTCGCAAAATAGCTGGGGTAGTCAGTCTCTG
ACTATTAAGGAGTTGCAGAAAGCTGCGATCGACTATTTCATTGATTATGCAGGAATGGAC
GAAACAGGAGAAATACAGACAGGCCCTATGAGGGGAGGTATATTTATAGCTGCCGCCGGA
AACGATAACGTTTCCACTCCAAATATGCCTTCAGCTTATGAACGGGTTTTAGCTGTGGCC
TCAATGGGACCAGATTTTACTAAGGCAAGCTATAGCACTTTTGGAACATGGACTGATATT
ACTGCTCCTGGCGGAGATATTGACAAATTTGATTTGTCAGAATACGGAGTTCTCAGCACT
TATGCCGATAATTATTATGCTTATGGAGAGGGAACATCCATGGCTTGTCCACATGTCGCC
GGCGCCGCC
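
This kind of clean-up can also be scripted. The sketch below assumes that the
raw sequence was polluted with position numbers, spaces and lower-case letters
(a GenBank-like layout); that layout, the function name clean_sequence and the
demo input are assumptions, not the actual exercise file.

```python
import re


def clean_sequence(raw, width=60):
    """Keep only the letters, upper-case them, and re-wrap to a fixed width."""
    letters = re.sub(r"[^A-Za-z]", "", raw).upper()
    return "\n".join(letters[i:i + width] for i in range(0, len(letters), width))


# Hypothetical messy input: position numbers, blanks and lower-case bases.
demo_raw = "  1 aacgggcacg ggacgcatgt\n 21 agctggaaca\n"
cleaned = clean_sequence(demo_raw, width=20)
print(cleaned)
```

Note that this removes everything that is not a letter, so it should only be
applied to the sequence part of a file and not to a FASTA header line.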