ExGeany-Answers
Answers to the exercise in Plain text files and Geany
Answers by: Rasmus Wernersson and Henrik Nielsen
Question 1:
The file sizes are:
453 bytes: alpha_globin_OldMac.fsa
453 bytes: alpha_globin_Unix.fsa
461 bytes: alpha_globin_Windows.fsa
The important thing to notice here is that DOS/Windows newlines actually consist of two bytes (CR + LF), whereas the UNIX and old Mac standards use only one byte each (LF and CR, respectively).
The 8-byte difference corresponds to the 8 lines of text within the file, one extra byte per line:
001 >pigeon_alpha-globin-D
002 ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
003 GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
004 GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
005 AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
006 CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
007 CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
008 TAA
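The arithmetic can be checked with a short Python sketch. The `lines` list is only a stand-in with the same line lengths as the listing above (a 22-character header, six 70-character sequence lines and the final `TAA`), and `size_in_bytes` is a hypothetical helper name introduced for this illustration.

```python
def size_in_bytes(lines, newline):
    """Size of a text file in which every line is terminated by `newline`."""
    return sum(len(line.encode("ascii")) + len(newline) for line in lines)


# Stand-in for the 8-line FASTA file: only the line lengths matter here.
lines = [">pigeon_alpha-globin-D"] + ["A" * 70] * 6 + ["TAA"]

unix_size = size_in_bytes(lines, b"\n")       # LF: one byte per line
old_mac_size = size_in_bytes(lines, b"\r")    # CR: one byte per line
windows_size = size_in_bytes(lines, b"\r\n")  # CR + LF: two bytes per line

print(unix_size, old_mac_size, windows_size)  # 453 453 461
```

The Windows version is larger by exactly one byte per line, which gives the 8-byte difference.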
Question 2:
Yes - inspecting the files in the associated programs (e.g. Word and Firefox) reveals the _textual_ contents to be the same.
The file sizes differ dramatically:
29184 bytes: alpha_globin.doc
667 bytes: alpha_globin.html
855 bytes: alpha_globin.rtf
Question 3:
The alpha_globin.doc file cannot be opened in Geany, because it is not a text file. In other words, not every byte in the file can be interpreted as a character.
The HTML and RTF files also contain some extra information, but unlike the DOC file, the extra information is text-based.
Contents of the HTML file:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body>
<PRE>
>pigeon_alpha-globin-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA
</PRE>
</body>
</html>
In this case (cleanly formatted HTML) it's easy to locate the original DNA sequence.
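Because the markup is plain text, a program can locate the sequence as well. The following is a minimal Python sketch, assuming the sequence sits in an ordinary PRE block as in the file above. `PreExtractor` and `extract_pre_text` are names made up for this example. Only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser


class PreExtractor(HTMLParser):
    """Collects the text found between <pre> and </pre> tags."""

    def __init__(self):
        super().__init__()
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case, so <PRE> arrives as "pre".
        if tag == "pre":
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def extract_pre_text(html_text):
    """Return the text inside the PRE block(s), without surrounding blank space."""
    parser = PreExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks).strip()
```

Calling `extract_pre_text` on the contents of alpha_globin.html should return the FASTA header followed by the sequence lines.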
To some degree it's possible to figure out what's going on in the RTF file - the codes are basically about formatting:
Snippet from the file:
\f0\b\fs24 \cf0 >pigeon_alpha-globin-D\
\f1\b0 ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG\
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT\
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG\
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC\
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC\
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA\
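To illustrate that the codes are only formatting, here is a rough Python sketch that strips them from a snippet like the one above. It assumes that every code is a backslash followed by letters and an optional number (such as `\fs24`), and that a backslash at the end of a line marks a line break. This is not a general RTF parser, and `strip_rtf_codes` is a hypothetical helper name.

```python
import re

# A control word: backslash, letters, optional (possibly negative) number,
# and at most one space that separates it from the following text.
CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?")


def strip_rtf_codes(snippet):
    """Remove simple RTF control words and turn backslash-newline into newline."""
    without_codes = CONTROL_WORD.sub("", snippet)
    return without_codes.replace("\\\n", "\n")


# Shortened version of the snippet (sequence lines truncated for brevity).
snippet = r"""\f0\b\fs24 \cf0 >pigeon_alpha-globin-D\
\f1\b0 ATGCTGACCG\
GAGCCGAGGC\
"""

print(strip_rtf_codes(snippet))
```

What remains after stripping is the header line and the sequence lines, i.e. the same text as in the plain FASTA file.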
The Word file contains a HUGE amount of additional information in BINARY form, which is why Geany refuses to open it. Opening other non-text files, such as a JPG image or an MP3 sound file, will also fail in Geany. Some text editors are less strict about the files they open, but when the file is binary, the result will look very strange.
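The idea that "not every byte can be interpreted as a character" can be sketched in Python. The check below is only a simple heuristic chosen for illustration (no NUL bytes, and the bytes must decode as UTF-8). It is an assumption made for this example and not a description of how Geany itself decides. `looks_like_text` is a made-up name.

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic: treat data as text if it has no NUL bytes and decodes as UTF-8."""
    if b"\x00" in data:
        # NUL bytes are common in binary formats and rare in text files.
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_text(b">pigeon_alpha-globin-D\nATGCTG\n"))  # True
print(looks_like_text(b"\x00\x00\x00D\x01\x00\x00\x0c"))     # False
```

In practice the function would be applied to the bytes read from a file opened in binary mode, e.g. `open(path, "rb").read()`.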
Here is a snippet of the alpha_globin.doc file as displayed by the Unix editor vim:
^@^@^@D^A^@^@^L^@^@^@P^A^@^@^M^@^@^@\^A^@^@^N^@^@^@h^A^@^@^O^@^@^@p^A^@^@^P^@^@^@
x^A^@^@^S^@^@^@<80>^A^@^@^Q^@^@^@<88>^A^@^@^B^@^@^@^P'^@^@^^^@^@^@^X^@^@^@>pigeon
_alpha-globin-D^@^@^^^@^@^@^D^@^@^@^@^@^@^@^^^@^@^@^T^@^@^@Rasmus Wernersson^@^@
^@^^^@^@^@^D^@^@^@^@^@^@^@^^^@^@^@^H^@^@^@Normal^@^@^^^@^@^@^T^@^@^@Rasmus Werner
sson^@^@^@^^^@^@^@^D^@^@^@1^@^@^@^^^@^@^@^X^@^@^@Microsoft Word 11.5.0^@^@^@@^@^@
^@^@FÃ#^@^@^@^@@^@^@^@^@âÄò<91><81>É^A@^@^@^@^@(<88>^V<92><81>É^A^C^@^@^@^A^@^@^@
^C^@^@^@^@^@^@^@^C^@^@^@^@^@^@^@^C^@^@^@^@^@^@^@G^@^@^@82^@^@þÿÿÿPICT20^@^@^@^@^C
I^BR^@^Q^Bÿ^L^@ÿþ^@^@^A,^@^@^A,^@^@^@^@^@^@^M´ ¯^@^@^@^@^@¡^Aò^@^DMSWD^@^^^@^A
^@^@^@^@^@^M´ ¯^@,^@^N÷@^KCourier New^@^C÷@^@^M^@%^@.^@^D^@^@^@^@^@(^AK^Aw^A>
Interestingly, it is actually possible to get a glimpse of a few text strings within the mess of symbols, including the sequence name and the name (Rasmus Wernersson) of the person who created the file.
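Readable fragments like these can be pulled out of binary data by searching for runs of printable ASCII characters, similar in spirit to the Unix `strings` tool. The Python sketch below shows the idea. The `sample` bytes are a hand-made imitation of the dump above, not the real file contents, and `printable_strings` is a hypothetical helper name.

```python
import re


def printable_strings(data: bytes, min_length: int = 4):
    """Return all runs of at least `min_length` printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Hand-made imitation of a few bytes from a binary document.
sample = (
    b"\x00\x00\x00\x18\x00\x00\x00>pigeon_alpha-globin-D"
    b"\x00\x00\x1e\x00\x00\x00\x14\x00\x00\x00Rasmus Wernersson\x00\x00"
)

print(printable_strings(sample))
# ['>pigeon_alpha-globin-D', 'Rasmus Wernersson']
```

The minimum length of 4 is an arbitrary choice that filters out most accidental matches among the binary bytes.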
Question 4:
Cleaned up sequence:
AACGGGCACGGGACGCATGTAGCTGGAACAGTGGCAGCCGTAAATAATAATGGTATCGGA
GTTGCCGGGGTTGCAGGAGGAAACGGCTCTACCAATAGTGGAGCAAGGTTAATGTCCACA
CAAATTTTTAATAGTGATGGGGATTATACAAATAGCGAAACTCTTGTGTACAGAGCCATT
GTTTATGGTGCAGATAACGGAGCTGTGATCTCGCAAAATAGCTGGGGTAGTCAGTCTCTG
ACTATTAAGGAGTTGCAGAAAGCTGCGATCGACTATTTCATTGATTATGCAGGAATGGAC
GAAACAGGAGAAATACAGACAGGCCCTATGAGGGGAGGTATATTTATAGCTGCCGCCGGA
AACGATAACGTTTCCACTCCAAATATGCCTTCAGCTTATGAACGGGTTTTAGCTGTGGCC
TCAATGGGACCAGATTTTACTAAGGCAAGCTATAGCACTTTTGGAACATGGACTGATATT
ACTGCTCCTGGCGGAGATATTGACAAATTTGATTTGTCAGAATACGGAGTTCTCAGCACT
TATGCCGATAATTATTATGCTTATGGAGAGGGAACATCCATGGCTTGTCCACATGTCGCC
GGCGCCGCC