Searching for motifs in sequences
Description
A sequence motif is typically a short pattern in a DNA or amino acid sequence that is conserved across gene families or organisms. A motif is recognizable and may be a promoter, a binding site or a domain that folds into a specific structure. Motifs are often found with Hidden Markov Models or neural networks, both of which need many examples of the motif to work. Here you will explore a different method, which needs only a consensus sequence (or what you think the consensus sequence is). It uses a model in which each site in the sequence is weighted by importance, and it finds the matches that are close to the consensus.
The program is given a fasta file with DNA or amino acid sequences and a file containing a description of the signal to search for. It will then display all occurrences of a match in each fasta entry.
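Reading the fasta entries could look like the following minimal sketch. The function name read_fasta is hypothetical, and the sketch assumes the usual fasta layout: a header line starting with ">" followed by one or more sequence lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines.

    Assumption: each entry starts with a '>' header line and the
    sequence may be wrapped over several lines.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# In-memory example; an open file handle can be passed in the same way.
example = [">seq1 first entry", "ACGT", "TT", "", ">seq2", "GG"]
for name, sequence in read_fasta(example):
    print(name, len(sequence))
```

Because the function is a generator, only one entry is held in memory at a time, which helps when the fasta file is large.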
Input and output
The program is given a fasta file, a signal description file and a deviation (a number) as input on the command line. The deviation is the maximum combined mismatch penalty a match may have relative to the signal description. The fasta file can contain more than one sequence/entry. The signal description file is a tab separated file. Each line consists of one of the following:
1) One or more allowed letters at this position, and a penalty for having a mismatch at that position.
2) The star character * denoting unimportant characters in the sequence, and an interval giving how many of these unimportant characters are allowed.
3) The hash character # meaning this line is a comment, which should be ignored by the program.
As an example of a signal description file, here is a prokaryotic promoter, which is a two-part signal that can be described as:
# -35 element
T	7
T	8
G	6
A	5
C	5
A	5
# intervening unimportant bases
*	15-21
# -10 element
T	8
A	8
T	6
A	6
AT	5
T	8
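One way to read such a file is sketched below. The function name parse_signal and the tuple layout are hypothetical choices, and the sketch assumes that penalties are whole numbers and that a gap interval is always written as min-max.

```python
def parse_signal(lines):
    """Turn signal description lines into a list of elements.

    Each element is either ("pos", allowed_letters, penalty)
    or ("gap", min_length, max_length).
    Assumptions: integer penalties, gap interval written as 'min-max'.
    """
    elements = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"expected two tab separated fields: {raw!r}")
        symbol, value = fields[0].strip(), fields[1].strip()
        if symbol == "*":
            low, high = value.split("-")
            elements.append(("gap", int(low), int(high)))
        else:
            elements.append(("pos", symbol, int(value)))
    return elements


PROMOTER = """\
# -35 element
T\t7
T\t8
G\t6
A\t5
C\t5
A\t5
# intervening unimportant bases
*\t15-21
# -10 element
T\t8
A\t8
T\t6
A\t6
AT\t5
T\t8
"""

elements = parse_signal(PROMOTER.splitlines())
print(len(elements))   # 13 elements: 6 positions, 1 gap, 6 positions
print(elements[6])     # ('gap', 15, 21)
```

Keeping positions and gaps in one ordered list means the search does not need to know how many parts the signal has.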
The output should list all matches in each fasta entry, clearly stating the location of the match.
Details
The deviation is an important factor. If the deviation is set to 0, then the search for the signal is reduced to an exact match. If the deviation is set to 16 in the above example, then mismatches with a combined penalty of 16 or less are allowed. In the promoter example the following signals would match (X marks a mismatching letter, * stands for the intervening bases; the list is not complete):
TTGXXX*TATAAT
TTGACA*XXTAAT
TTGXXA*TAXAAT
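The penalty arithmetic behind these examples can be checked with a short sketch. The names MINUS_35, MINUS_10 and penalty are hypothetical, the gap is left out as in the examples, and X simply stands for any letter that is not allowed at that position.

```python
# (allowed letters, mismatch penalty) per position, copied from the promoter example
MINUS_35 = [("T", 7), ("T", 8), ("G", 6), ("A", 5), ("C", 5), ("A", 5)]
MINUS_10 = [("T", 8), ("A", 8), ("T", 6), ("A", 6), ("AT", 5), ("T", 8)]


def penalty(segment, positions):
    """Sum the penalties of all positions where the letter is not allowed."""
    if len(segment) != len(positions):
        raise ValueError("segment and signal part must have the same length")
    return sum(cost for letter, (allowed, cost) in zip(segment, positions)
               if letter not in allowed)


for first, second in [("TTGXXX", "TATAAT"),
                      ("TTGACA", "XXTAAT"),
                      ("TTGXXA", "TAXAAT")]:
    total = penalty(first, MINUS_35) + penalty(second, MINUS_10)
    print(first, second, total, total <= 16)
# TTGXXX TATAAT 15 True
# TTGACA XXTAAT 16 True
# TTGXXA TAXAAT 16 True
```

A candidate such as XTGXXX*TATAAT would score 7 + 15 = 22 and is therefore rejected at a deviation of 16.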
Note: I do not consider an approach based heavily on regular expressions a good idea, but you are free to do as you like. Because the gaps make the signal flexible, several different matches can be found from a specific position in the sequence. You should find all matches, not just the first one. The program/algorithm must be completely general, so that it can search for any kind of signal in any kind of text sequence. It is not limited to life science.
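One possible way to find all matches without regular expressions is a recursive search that tries every allowed gap length and gives up on a branch as soon as the accumulated penalty exceeds the deviation. This is only a sketch under stated assumptions, not the required solution: the name find_matches and the element tuples ("pos", letters, penalty) and ("gap", min, max) are hypothetical, penalties are assumed to be non-negative, and positions are reported 0-based with an exclusive end.

```python
def find_matches(sequence, elements, deviation):
    """Return (start, end, penalty) for every way the signal fits.

    Assumptions: start is 0-based, end is exclusive, penalties are
    non-negative so a branch can be dropped once it exceeds the deviation.
    """
    matches = []

    def extend(index, pos, total, start):
        if total > deviation:
            return  # too many mismatches on this branch
        if index == len(elements):
            matches.append((start, pos, total))
            return
        kind, first, second = elements[index]
        if kind == "gap":
            # try every allowed number of unimportant characters
            for length in range(first, second + 1):
                if pos + length > len(sequence):
                    break
                extend(index + 1, pos + length, total, start)
        else:
            if pos >= len(sequence):
                return  # signal runs past the end of the sequence
            cost = 0 if sequence[pos] in first else second
            extend(index + 1, pos + 1, total + cost, start)

    for start in range(len(sequence)):
        extend(0, start, 0, start)
    return matches


# Toy signal: A, then 1-2 unimportant characters, then T
toy = [("pos", "A", 5), ("gap", 1, 2), ("pos", "T", 5)]
print(find_matches("ACTT", toy, 0))
# [(0, 3, 0), (0, 4, 0)]
print(find_matches("ACTT", toy, 5))
# [(0, 3, 0), (0, 4, 0), (1, 4, 5)]
```

The toy example shows two matches starting at the same position, one for each gap length. The recursion depth equals the number of elements in the signal, so a very long signal might call for an iterative version.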
Fasta file to use as example in the projects: motif.fsa