ExGeany-Answers


Answers to the exercise in Plain text files and Geany

Answers by: Rasmus Wernersson and Henrik Nielsen

Question 1:

The file sizes are:

453 bytes: alpha_globin_OldMac.fsa
453 bytes: alpha_globin_Unix.fsa
461 bytes: alpha_globin_Windows.fsa

The important thing to notice here is that a DOS/Windows newline actually consists of two bytes (CR + LF), whereas the UNIX standard (LF) and the old Mac standard (CR) each use only one byte.
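As an illustration of the three newline conventions, the following is a minimal Python sketch that counts each kind of line ending in the raw bytes of a file. The function name `count_newlines` is made up for this example and is not part of the exercise.

```python
def count_newlines(data: bytes) -> dict:
    """Count the line endings of each convention in raw file content."""
    crlf = data.count(b"\r\n")        # Windows: CR immediately followed by LF
    cr = data.count(b"\r") - crlf     # old Mac: CR that is not part of CR+LF
    lf = data.count(b"\n") - crlf     # Unix: LF that is not part of CR+LF
    return {"crlf": crlf, "cr": cr, "lf": lf}


# Small hand-made examples (not the exercise files themselves)
print(count_newlines(b"line1\r\nline2\r\n"))
print(count_newlines(b"line1\nline2\n"))
print(count_newlines(b"line1\rline2\r"))
```

To try it on one of the exercise files, the content could be read with `open("alpha_globin_Windows.fsa", "rb").read()`, assuming the file is in the current directory.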

The 8-byte difference corresponds to the 8 lines of text within the file, each of which ends with one newline:

001 >pigeon_alpha-globin-D
002 ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
003 GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
004 GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
005 AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
006 CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
007 CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
008 TAA
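The arithmetic can be checked with a short Python sketch. It assumes that every line, including the last one, ends with a newline and that the files contain only ASCII characters; the helper name `file_size` is invented for this example.

```python
# The 8 lines of the file, without line numbers and without newlines
LINES = [
    ">pigeon_alpha-globin-D",
    "ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG",
    "GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT",
    "GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG",
    "AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC",
    "CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC",
    "CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA",
    "TAA",
]


def file_size(lines, newline):
    """Bytes needed when every line is followed by the given newline."""
    return sum(len(line.encode("ascii")) + len(newline) for line in lines)


for name, newline in [("Unix (LF)", "\n"),
                      ("old Mac (CR)", "\r"),
                      ("Windows (CR+LF)", "\r\n")]:
    print(name, file_size(LINES, newline), "bytes")
```

The text itself takes up 445 bytes (22 + 6 x 70 + 3), so one extra byte per line gives 453 bytes and two extra bytes per line give 461 bytes.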

Question 2:

Yes - inspecting the files in the associated programs (e.g. Word and Firefox) reveals the _textual_ contents to be the same.

The file sizes differ dramatically:

29184 bytes: alpha_globin.doc
  667 bytes: alpha_globin.html
  855 bytes: alpha_globin.rtf

Question 3:

The alpha_globin.doc file cannot be opened in Geany because it is not a text file. In other words, not every byte in the file can be interpreted as a character.

The HTML and RTF files also contain some extra information, but unlike the DOC file, the extra information is text based.

Contents of the HTML file:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
  <title></title>
</head>
<body>
<PRE>
>pigeon_alpha-globin-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA
</PRE>
</body>
</html>

In this case (cleanly formatted HTML) it's easy to locate the original DNA sequence.
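Because the extra information in the HTML file is plain text, a program can also locate the sequence. The following is a minimal sketch using the `html.parser` module from the Python standard library. It assumes the sequence sits inside a single well-formed PRE element; the class and function names are invented for this example, and the sequence line in the sample is shortened.

```python
from html.parser import HTMLParser


class PreExtractor(HTMLParser):
    """Collect the text found between <pre> and </pre>."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lowercase, so <PRE> also matches
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def extract_pre(html_text):
    parser = PreExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks).strip()


# Shortened sample, not the complete file
sample = "<html><body>\n<pre>\n>pigeon_alpha-globin-D\nATGCTGACCG\nTAA\n</pre>\n</body></html>"
print(extract_pre(sample))
```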

To some degree it's possible to figure out what's going on in the RTF file - the codes are basically about formatting:

Snippet from the file:

\f0\b\fs24 \cf0 >pigeon_alpha-globin-D\

\f1\b0 ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG\
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT\
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG\
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC\
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC\
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA\
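To show that the codes in this snippet really are just formatting, here is a rough Python sketch that strips them again. It only handles what appears in the snippet above (control words such as `\f0` or `\fs24`, and a backslash at the end of a line); it is not a general RTF parser, and a complete RTF file also contains groups in braces that this sketch ignores. The function name and the shortened sample are made up for this example.

```python
import re

# A control word: backslash, letters, optional number, optional single space
CONTROL_WORD = re.compile(r"\\[A-Za-z]+-?\d* ?")


def strip_rtf_codes(snippet):
    """Remove the formatting codes seen in the snippet (not full RTF)."""
    text = CONTROL_WORD.sub("", snippet)
    # A backslash at the end of a line marks a line break in this file
    return re.sub(r"\\(?=\n|$)", "", text)


# Shortened version of the snippet (sequence lines truncated)
sample = r"""\f0\b\fs24 \cf0 >pigeon_alpha-globin-D\

\f1\b0 ATGCTGACCG\
GAGCCGAGGC\
"""
print(strip_rtf_codes(sample))
```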

The Word file contains a HUGE amount of additional information in BINARY form, which is why Geany refuses to open it. Opening other non-text files, such as a JPG image or an MP3 sound file, will also fail in Geany. Certain text editors are less critical with regard to the files they open, but when the file is binary, the results will look very strange.
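One way to picture the difference between text and binary files is the following Python sketch. It uses a simple heuristic (no NUL bytes, and the content decodes as UTF-8). This is an assumption made for illustration and not necessarily the check Geany itself performs; the function name is invented.

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic: can every byte be interpreted as part of a character?"""
    if b"\x00" in data:
        # NUL bytes practically never occur in plain text files
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_text(b">pigeon_alpha-globin-D\nATGCTGACCG\n"))
print(looks_like_text(b"\x00\x00\x00D\x01\x00\x00\x0c"))
```

A real file could be tested with `looks_like_text(open("alpha_globin.doc", "rb").read())`, assuming the file is in the current directory.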

Here is a snippet of the alpha_globin.doc file as displayed by the Unix editor vim:

^@^@^@D^A^@^@^L^@^@^@P^A^@^@^M^@^@^@\^A^@^@^N^@^@^@h^A^@^@^O^@^@^@p^A^@^@^P^@^@^@
x^A^@^@^S^@^@^@<80>^A^@^@^Q^@^@^@<88>^A^@^@^B^@^@^@^P'^@^@^^^@^@^@^X^@^@^@>pigeon
_alpha-globin-D^@^@^^^@^@^@^D^@^@^@^@^@^@^@^^^@^@^@^T^@^@^@Rasmus Wernersson^@^@
^@^^^@^@^@^D^@^@^@^@^@^@^@^^^@^@^@^H^@^@^@Normal^@^@^^^@^@^@^T^@^@^@Rasmus Werner
sson^@^@^@^^^@^@^@^D^@^@^@1^@^@^@^^^@^@^@^X^@^@^@Microsoft Word 11.5.0^@^@^@@^@^@
^@^@FÃ#^@^@^@^@@^@^@^@^@âÄò<91><81>É^A@^@^@^@^@(<88>^V<92><81>É^A^C^@^@^@^A^@^@^@
^C^@^@^@^@^@^@^@^C^@^@^@^@^@^@^@^C^@^@^@^@^@^@^@G^@^@^@82^@^@þÿÿÿPICT20^@^@^@^@^C
I^BR^@^Q^Bÿ^L^@ÿþ^@^@^A,^@^@^A,^@^@^@^@^@^@^M´    ¯^@^@^@^@^@¡^Aò^@^DMSWD^@^^^@^A
^@^@^@^@^@^M´     ¯^@,^@^N÷@^KCourier New^@^C÷@^@^M^@%^@.^@^D^@^@^@^@^@(^AK^Aw^A>

Interestingly, it is actually possible to get a glimpse of a few text strings within the mess of symbols, including the sequence name and the name (Rasmus Wernersson) of the person who created the file.
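Finding such readable fragments can be automated, in the spirit of the Unix `strings` tool. The following Python sketch keeps every run of at least four printable ASCII characters. The function name and the minimum length are choices made for this example, and the sample bytes are a hand transcription of a small part of the vim display above, so they may not match the real file exactly.

```python
import re


def printable_strings(data: bytes, min_len: int = 4):
    """Return runs of printable ASCII characters of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


# Approximate transcription of a fragment of the vim display
sample = (b"\x00\x00\x1e\x00\x00\x00\x14\x00\x00\x00Rasmus Wernersson"
          b"\x00\x00\x00\x1e\x00\x00\x00\x08\x00\x00\x00Normal\x00\x00")
print(printable_strings(sample))
```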

Question 4:

Cleaned up sequence:

AACGGGCACGGGACGCATGTAGCTGGAACAGTGGCAGCCGTAAATAATAATGGTATCGGA
GTTGCCGGGGTTGCAGGAGGAAACGGCTCTACCAATAGTGGAGCAAGGTTAATGTCCACA
CAAATTTTTAATAGTGATGGGGATTATACAAATAGCGAAACTCTTGTGTACAGAGCCATT
GTTTATGGTGCAGATAACGGAGCTGTGATCTCGCAAAATAGCTGGGGTAGTCAGTCTCTG
ACTATTAAGGAGTTGCAGAAAGCTGCGATCGACTATTTCATTGATTATGCAGGAATGGAC
GAAACAGGAGAAATACAGACAGGCCCTATGAGGGGAGGTATATTTATAGCTGCCGCCGGA
AACGATAACGTTTCCACTCCAAATATGCCTTCAGCTTATGAACGGGTTTTAGCTGTGGCC
TCAATGGGACCAGATTTTACTAAGGCAAGCTATAGCACTTTTGGAACATGGACTGATATT
ACTGCTCCTGGCGGAGATATTGACAAATTTGATTTGTCAGAATACGGAGTTCTCAGCACT
TATGCCGATAATTATTATGCTTATGGAGAGGGAACATCCATGGCTTGTCCACATGTCGCC
GGCGCCGCC
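The cleanup was done by hand in Geany, but the same kind of operation can be sketched in Python. The sketch assumes that the unwanted characters were digits and whitespace, that the result should be in upper case, and that lines should be 60 characters wide as in the answer above. The messy input shown is a hypothetical example, not the actual exercise file, and the function name is invented.

```python
import re


def clean_sequence(raw: str, width: int = 60) -> str:
    """Keep only letters, convert to upper case, and re-wrap the lines."""
    letters = re.sub(r"[^A-Za-z]", "", raw).upper()
    return "\n".join(letters[i:i + width]
                     for i in range(0, len(letters), width))


# Hypothetical messy input with position numbers and spaces
messy = "  1 aacgggcacg ggacgcatgt\n 21 agctggaaca\n"
print(clean_sequence(messy))
```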