Fun with biology - find English words

Revision as of 16:22, 6 March 2024 by WikiSysop

Description

Parse the entire UniProt database and extract the ID and the sequence of every entry. Find English words that are hidden in the sequences (they actually occur by chance). The words must be between 3 and 10 letters long, both inclusive. Display or save to a file each ID together with the words found in its sequence, but only if the total number of letters in the words found for that entry is 5 or more.
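The reporting rule can be sketched as a small helper. This is only one reading of the rule: it assumes the "total number of letters" is the summed length of all words found for an entry. The function name `report_line` and the tab-separated output layout are illustrative choices, not part of the exercise text.

```python
def report_line(entry_id, words, min_total_letters=5):
    """Return an output line for one entry, or None if it should be skipped.

    Assumption: the threshold applies to the summed length of all words
    found in the entry's sequence.
    """
    total_letters = sum(len(word) for word in words)
    if total_letters < min_total_letters:
        return None
    # Hypothetical layout: ID, a tab, then the words separated by spaces.
    return f"{entry_id}\t{' '.join(words)}"
```

With this reading, an entry containing only one 3-letter word is skipped, while an entry with two 3-letter words (6 letters in total) is reported.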

Input/output

Download the entire Swiss-Prot database. This will be the input file for your program.
Unpack it yourself with

gunzip uniprot_sprot.dat.gz

or whatever method you prefer. Be careful: the unpacked file will take up 3 GB.
Note that the file contains many Swiss-Prot entries, and your program must handle all of them.
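A minimal sketch of reading many entries from the flat file, one line at a time so the 3 GB file never has to fit in memory. It assumes the usual Swiss-Prot flat-file layout: the entry name is the second field of the `ID` line, the sequence follows the `SQ` line, and `//` ends the entry. The function name `read_entries` is illustrative.

```python
def read_entries(lines):
    """Yield (entry_id, sequence) for each entry in Swiss-Prot flat-file lines."""
    entry_id = None
    seq_parts = []
    in_sequence = False
    for line in lines:
        if line.startswith("ID "):
            # Second whitespace-separated field of the ID line is the entry name.
            entry_id = line.split()[1]
        elif line.startswith("SQ "):
            # Sequence data starts on the line after the SQ header.
            in_sequence = True
        elif line.startswith("//"):
            # End of entry: emit what was collected and reset the state.
            if entry_id is not None:
                yield entry_id, "".join(seq_parts)
            entry_id, seq_parts, in_sequence = None, [], False
        elif in_sequence:
            # Sequence lines are split into blocks by spaces; drop the spaces.
            seq_parts.append("".join(line.split()))

# Typical use (file name taken from the gunzip command above):
# with open("uniprot_sprot.dat") as handle:
#     for entry_id, sequence in read_entries(handle):
#         ...
```

Because the function is a generator, each entry is handed over as soon as its `//` line is seen.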

You must use this list of English words for your search. The list contains many words that will never match, for a variety of reasons (describe the reasons in comments in your program), and your program must remove these words from the list when loading the file.
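One possible way to filter the list while loading it is sketched below. It rests on assumptions you should check yourself: that the word file holds one word per line, that sequences are upper case, and that only the 20 standard amino-acid letters are worth keeping. Whether the rarer sequence letters (such as U or X) should also be allowed is left for you to decide. The names `AMINO_ACIDS` and `load_words` are illustrative.

```python
# Assumption: only the 20 standard one-letter amino-acid codes can match.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def load_words(lines, min_len=3, max_len=10):
    """Return the set of upper-cased words that could occur in a sequence."""
    words = set()
    for line in lines:
        word = line.strip().upper()
        # Too short or too long for the exercise.
        if not (min_len <= len(word) <= max_len):
            continue
        # Contains a character that is not an allowed sequence letter
        # (this also drops apostrophes, hyphens and digits).
        if not set(word) <= AMINO_ACIDS:
            continue
        words.add(word)
    return words
```

Storing the words in a set keeps each later lookup at roughly constant time, which matters when every position of every sequence is tested.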

If more than one word matches at the same position in the sequence, report only the longest word.

Example: SNAKE and SNAKES both match at the same position in the sequence below. Only report SNAKES.
RQEIGQIVGCSRESNAKESTRELHILKMLEDQNLI

Consider that words can overlap: a new word may start before the previous word has ended.
RQEIGQIVGCSRESNAKESTRELHILKMLEDQNLI
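Both rules (longest word per position, overlapping words allowed) can be sketched with a simple scan that tries every start position and, at each one, tests candidates from longest to shortest. This is a straightforward approach, not necessarily the intended or fastest one. The function name `find_words` and the small word set in the demo are made up for illustration; the real word list will give different hits.

```python
def find_words(sequence, words, min_len=3, max_len=10):
    """Return (start, word) pairs, keeping only the longest word per start."""
    found = []
    for start in range(len(sequence) - min_len + 1):
        longest = min(max_len, len(sequence) - start)
        # Try the longest candidate first so shorter prefixes are skipped.
        for length in range(longest, min_len - 1, -1):
            candidate = sequence[start:start + length]
            if candidate in words:
                found.append((start, candidate))
                break
        # The scan always moves on by one position, so a word that starts
        # inside a previously found word is still detected (overlap).
    return found

# Demo on the example sequence with a hand-picked, hypothetical word set.
demo_sequence = "RQEIGQIVGCSRESNAKESTRELHILKMLEDQNLI"
demo_words = {"SNAKE", "SNAKES", "RES", "ILK", "LED"}
demo_hits = find_words(demo_sequence, demo_words)
```

In the demo, SNAKE is not reported because SNAKES starts at the same position, while RES is reported even though its last letter is the first letter of SNAKES.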