Exercise: Multiple Alignments Answers (Seaview version)

Latest revision as of 12:26, 15 March 2024

By: Rasmus Wernersson

Question 1

FASTA file:

>pigeon_alpha-D-globin
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA
>pigeon_alpha-A-globin
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGCCAGGCCGGTG
ACTTGGGTGGTGAAGCCCTGGAGAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTT
CGACCTGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTGAGGCT
GCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACGCCCAAAAGCTCCGTG
TGGACCCCGTCAACTTCAAACTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTTCCCCTCTCT
CCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGGCACCGTCCTTACTGCCAAG
TACCGTTAA
>duck_alpha-D-globin
ATGCTGACCGCCGAGGACAAGAAGCTCATCGTGCAGGTGTGGGAGAAGGTGGCTGGCCACCAGGAGGAAT
TCGGAAGTGAAGCTCTGCAGAGGATGTTCCTCGCCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGA
CCTGCATCCCGGCTCTGAACAGGTCCGTGGCCATGGCAAGAAAGTGGCGGCTGCCCTGGGCAATGCCGTG
AAGAGCCTGGACAACCTCAGCCAGGCCCTGTCTGAGCTCAGCAACCTGCATGCCTACAACCTGCGTGTTG
ACCCTGTCAACTTCAAGCTGCTGGCACAGTGCTTCCAGGTGGTGCTGGCCGCACACCTGGGCAAAGACTA
CAGCCCCGAGATGCATGCTGCCTTTGACAAGTTCTTGTCCGCCGTGGCTGCCGTGCTGGCTGAAAAGTAC
AGATGA
>duck_alpha-A-globin
ATGGTGCTGTCTGCGGCTGACAAGACCAACGTCAAGGGTGTCTTCTCCAAAATCGGTGGCCATGCTGAGG
AGTATGGCGCCGAGACCCTGGAGAGGATGTTCATCGCCTACCCCCAGACCAAGACCTACTTCCCCCACTT
TGACCTGCAGCACGGCTCTGCTCAGATCAAGGCCCATGGCAAGAAGGTGGCGGCTGCCCTAGTTGAAGCT
GTCAACCACATCGATGACATTGCGGGTGCTCTCTCCAAGCTCAGTGACCTCCACGCCCAAAAGCTCCGTG
TGGACCCTGTCAACTTCAAATTCCTGGGCCACTGCTTCCTGGTGGTGGTTGCCATCCACCACCCCGCTGC
CCTGACCCCAGAGGTCCACGCTTCCCTGGACAAGTTCATGTGCGCCGTGGGTGCTGTGCTGACTGCCAAG
TACCGTTAG
>Goat_alpha-i-globin
ATGGTGCTGTCTGCCGCCGACAAGTCCAATGTCAAGGCCGCCTGGGGCAAGGTTGGCGGCAACGCTGGAG
CTTATGGCGCAGAGGCTCTGGAGAGGATGTTCCTGAGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGACCTGAGCCACGGCTCGGCCCAGGTCAAGGGCCACGGCGAGAAGGTGGCCGCCGCGCTGACCAAAGCG
GTGGGCCACCTGGACGACCTGCCCGGTACTCTGTCTGATCTGAGTGACCTGCACGCCCACAAGCTGCGTG
TGGACCCGGTCAACTTTAAGCTTCTGAGCCACTCCCTGCTGGTGACCCTGGCCTGCCACCTCCCCAATGA
TTTCACCCCCGCGGTCCACGCCTCCCTGGACAAGTTCTTGGCCAACGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Goat_alpha-ii-globin
ATGGTGCTGTCTGCCGCCGACAAGTCCAATGTCAAGGCCGCCTGGGGCAAGGTTGGCAGCAACGCTGGAG
CTTATGGCGCAGAGGCTCTGGAGAGGATGTTCCTGAGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGACCTGAGCCACGGCTCGGCCCAGGTCAAGGGCCACGGCGAGAAGGTGGCCGCCGCGCTGACCAAAGCG
GTGGGCCACCTGGACGACCTGCCCGGTACTCTGTCTGATCTGAGTGACCTGCACGCCCACAAGCTGCGTG
TGGACCCGGTCAACTTTAAGCTTCTGAGCCACTCCCTGCTGGTGACCCTGGCCTGCCACCACCCCAGTGA
TTTCACCCCCGCGGTCCACGCCTCCCTGGACAAGTTCTTGGCCAACGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Horse_alpha-1_globin
ATGGTGCTGTCTGCCGCCGACAAGACCAACGTCAAGGCCGCCTGGAGTAAGGTTGGCGGCCACGCTGGCG
AGTTTGGCGCAGAGGCCCTAGAGAGGATGTTCCTGGGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGATCTGAGCCACGGCTCCGCCCAGGTCAAGGCCCACGGCAAGAAGGTGGGCGACGCGCTGACTCTCGCC
GTGGGCCACCTGGACGACCTGCCTGGCGCCCTGTCGAATCTGAGCGACCTGCACGCACACAAGCTGCGCG
TGGACCCCGTCAACTTCAAGCTTCTGAGTCATTGCCTGCTGTCCACCTTGGCCGTCCACCTCCCCAACGA
TTTCACCCCTGCCGTCCACGCCTCCCTGGACAAGTTCTTGAGCAGTGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Horse_alpha-2_globin
ATGGTGCTGTCTGCCGCCGACAAGACCAACGTCAAGGCCGCCTGGAGTAAGGTTGGCGGCCACGCTGGCG
AGTATGGCGCAGAGGCCCTAGAGAGGATGTTCCTGGGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGATCTGAGCCACGGCTCCGCCCAGGTCAAGGCCCACGGCCAGAAGGTGGGCGACGCGCTGACTCTCGCC
GTGGGCCACCTGGACGACCTGCCTGGCGCCCTGTCGAATCTGAGCGACCTGCACGCACACAAGCTGCGCG
TGGACCCCGTCAACTTCAAGCTCCTGAGTCATTGCCTGCTGTCCACCTTGGCCGTCCACCTCCCCAACGA
TTTCACCCCTGCCGTCCACGCCTCCCTGGACAAGTTCTTGAGCAGTGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Chicken_alpha-D
ATGCTGACTGCCGAGGACAAGAAGCTCATCCAGCAGGCCTGGGAGAGGGCCGCTTCCCACCAGGAGGAGT
TTGGAGCTGAGGCTCTGACTAGGATGTTCACCACCTATCCCCAGACCAAGACCTACTTCCCCCACTTCGA
CCTTTCGCCTGGCTCTGACCAGGTCCGTGGCCATGGCAAGAAGGTGTTGGGTGCCCTGGGCAACGCCGTG
AAGAACGTGGACAACCTCAGCCAGGCCATGGCTGAGCTGAGCAACCTGCATGCCTACAACCTGCGTGTTG
ACCCCGTCAATTTCAAGCTGTTGTCGCAGTGCATCCAGGTGGTGCTGGCTGTACACATGGGCAAAGACTA
CACCCCTGAAGTGCATGCTGCCTTCGACAAGTTCCTGTCTGCCGTGTCTGCTGTGCTGGCTGAGAAGTAC
AGATAA
>Chicken_alpha-A
ATGGTGCTGTCCGCTGCTGACAAGAACAACGTCAAGGGCATCTTCACCAAAATCGCCGGCCATGCTGAGG
AGTATGGCGCCGAGACCCTGGAAAGGATGTTCACCACCTACCCCCCAACCAAGACCTACTTCCCCCACTT
CGATCTGTCACACGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTAGTGGCTGCCTTGATCGAGGCT
GCCAACCACATTGATGACATCGCCGGCACCCTCTCCAAGCTCAGCGACCTCCATGCCCACAAGCTCCGCG
TGGACCCTGTCAACTTCAAACTCCTGGGCCAATGCTTCCTGGTGGTGGTGGCCATCCACCACCCTGCTGC
CCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCTTGTGCGCCGTGGGCACTGTGCTGACCGCCAAG
TACCGTTAA

NOTICE:

  • It is essential to use SHORT descriptive names. In the ClustalW format alignment, only the first 15 characters of the names are shown, so if you have very long names the output can be hard to read (see also the FASTA handout from week 2).
  • Spaces cannot be part of the names in a FASTA file. If there are spaces, only the first word after ">" counts as the name; any subsequent words are treated as comments. If I had used spaces instead of underscores ("_") in the file above, the names would not have been unique ("duck" would have been used twice, etc.).
  • Be aware that in GenBank entries containing several genes (see the GenBank handout from week 2), the name of the individual gene (CDS) is found within the feature table. When you click on a CDS containing "/gene_name=XYZ" or similar, it is therefore XYZ you need to use as the name in your FASTA file, not the collective title for the entire GenBank entry (e.g. "Alpha-A and Alpha-D genes ..." or "Yeast Chromosome 2"). See also the screenshot/handout from the exercise.
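
The naming rules above can be checked automatically before running an alignment. The following is a minimal sketch, not part of the original exercise: the helper names `fasta_names` and `check_names` and the example records are made up for illustration, and the 15-character limit simply mirrors the ClustalW display width mentioned in the first point.

```python
from collections import Counter


def fasta_names(text):
    """Return the name (first word after '>') of every record in FASTA text."""
    names = []
    for line in text.splitlines():
        if line.startswith(">"):
            words = line[1:].split()
            names.append(words[0] if words else "")
    return names


def check_names(names, max_len=15):
    """Return (duplicated names, names longer than max_len characters)."""
    counts = Counter(names)
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    too_long = [n for n in names if len(n) > max_len]
    return duplicates, too_long


# Hypothetical input: spaces instead of underscores, plus one overly long name.
example = """>duck alpha-D globin
ATGCTGACC
>duck alpha-A globin
ATGGTGCTG
>a_very_long_sequence_name
ATGCTGACC
"""

names = fasta_names(example)
duplicates, too_long = check_names(names)
print(names)       # ['duck', 'duck', 'a_very_long_sequence_name']
print(duplicates)  # ['duck']
print(too_long)    # ['a_very_long_sequence_name']
```

With spaces in the header lines, both duck records collapse to the name "duck", which is exactly the problem described above.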

When you build a "real" dataset for a research project, it is often an iterative process, where you 1) collect your data, 2) weed out outliers, 3) run an analysis, and repeat 2) and 3) until you are satisfied with the results.

Question 2

QUESTION 2a

Your screenshot of the 3' part of the alignment should look something like this: [Figure: Seaview showing aligned sequences]

QUESTION 2b

  1. Your tree should look like this: [Figure: Seaview tree]
  2. There are three clusters: one for Alpha-A (birds only), one for Alpha-D (birds only), and one for Alpha-1 + Alpha-2 (mammals). The point here is that birds and mammals are not intermixed, so the sequences are placed "naturally" in a taxonomic sense.
  3. Alpha-A and Alpha-D are clearly in two different clusters, which must mean that the split between them is old. Since both Alpha-A and Alpha-D exist in all three birds we included, the split must be older than the last common ancestor of the birds.
  4. Alpha-1 and Alpha-2 seem to be much more closely related.
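
"Closely related" in a tree ultimately reflects a pairwise distance between aligned sequences. As a rough illustration, the sketch below computes the simple p-distance (the fraction of differing positions). This is an assumption made for illustration only: it is not necessarily the distance measure Seaview uses, and the function name `p_distance` and the short toy fragments are hypothetical.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing positions, ignoring columns where either row has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned (equal length)")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


# Toy fragments, already aligned (no gaps needed here).
print(p_distance("ATGGTGCTGT", "ATGGTGCTGT"))  # 0.0
print(p_distance("ATGCTGACCG", "ATGGTGCTGT"))  # 0.5
```

A small distance puts two sequences close together in the tree; a distance of 0 means the sequences are identical over the compared columns.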

QUESTION 2c

There is a single stretch of >15 nucleotides (23 to be precise) which is perfectly conserved. Its sequence is ACCAAGACCTACTTCCCCCACTT.
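
Perfectly conserved stretches can also be found programmatically once the alignment is available as equal-length strings. The sketch below is one way to do it under that assumption; the function name `conserved_runs` and the toy alignment are made up, and positions are reported 0-based.

```python
def conserved_runs(aligned, min_len=1):
    """Return (start, sequence) for each run of columns where all rows agree.

    Columns containing a gap are never counted as conserved.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    runs = []
    start = None
    # Iterate one position past the end so a run reaching the end is closed.
    for i in range(length + 1):
        conserved = (
            i < length
            and aligned[0][i] != "-"
            and all(s[i] == aligned[0][i] for s in aligned)
        )
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                runs.append((start, aligned[0][start:i]))
            start = None
    return runs


# Toy alignment, not taken from the exercise data.
toy = [
    "ATGACCAAGACC-TTA",
    "ATGACCAAGACCGTTA",
    "CTGACCAAGACC-TAA",
]
print(conserved_runs(toy, min_len=5))  # [(1, 'TGACCAAGACC')]
```

Applied to the real alignment with `min_len=16`, a function like this should report the stretch given above.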

Question 3

The sequences are translated using Virtual Ribosome, giving rise to the following FASTA file:

>pigeon_alpha-D-globin
MLTDSDKKLVLQVWEKVIRHPDCGAEALERLFTTYPQTKTYFPHFDLHHGSDQVRNHGKK
VLAALGNAVKSLGNLSQALSDLSDLHAYNLRVDPVNFKLLAQCFHVVLATHLGNDYTPEA
HAAFDKFLSAVCTVLAEKYR*
>pigeon_alpha-A-globin
MVLSANDKSNVKAVFGKIGGQAGDLGGEALERLFITYPQTKTYFPHFDLSHGSAQIKGHG
KKVAEALVEAANHIDDIAGALSKLSDLHAQKLRVDPVNFKLLGHCFLVVVAVHFPSLLTP
EVHASLDKFVCAVGTVLTAKYR*
>duck_alpha-D-globin
MLTAEDKKLIVQVWEKVAGHQEEFGSEALQRMFLAYPQTKTYFPHFDLHPGSEQVRGHGK
KVAAALGNAVKSLDNLSQALSELSNLHAYNLRVDPVNFKLLAQCFQVVLAAHLGKDYSPE
MHAAFDKFLSAVAAVLAEKYR*
>duck_alpha-A-globin
MVLSAADKTNVKGVFSKIGGHAEEYGAETLERMFIAYPQTKTYFPHFDLQHGSAQIKAHG
KKVAAALVEAVNHIDDIAGALSKLSDLHAQKLRVDPVNFKFLGHCFLVVVAIHHPAALTP
EVHASLDKFMCAVGAVLTAKYR*
>Goat_alpha-i-globin
MVLSAADKSNVKAAWGKVGGNAGAYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
EKVAAALTKAVGHLDDLPGTLSDLSDLHAHKLRVDPVNFKLLSHSLLVTLACHLPNDFTP
AVHASLDKFLANVSTVLTSKYR*
>Goat_alpha-ii-globin
MVLSAADKSNVKAAWGKVGSNAGAYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
EKVAAALTKAVGHLDDLPGTLSDLSDLHAHKLRVDPVNFKLLSHSLLVTLACHHPSDFTP
AVHASLDKFLANVSTVLTSKYR*
>Horse_alpha-1_globin
MVLSAADKTNVKAAWSKVGGHAGEFGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHG
KKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTP
AVHASLDKFLSSVSTVLTSKYR*
>Horse_alpha-2_globin
MVLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHG
QKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTP
AVHASLDKFLSSVSTVLTSKYR*
>Chicken_alpha-D
MLTAEDKKLIQQAWERAASHQEEFGAEALTRMFTTYPQTKTYFPHFDLSPGSDQVRGHGK
KVLGALGNAVKNVDNLSQAMAELSNLHAYNLRVDPVNFKLLSQCIQVVLAVHMGKDYTPE
VHAAFDKFLSAVSAVLAEKYR*
>Chicken_alpha-A
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYFPHFDLSHGSAQIKGHG
KKVVAALIEAANHIDDIAGTLSKLSDLHAHKLRVDPVNFKLLGQCFLVVVAIHHPAALTP
EVHASLDKFLCAVGTVLTAKYR*
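
For reference, the translation step itself is simple to reproduce. The following is a minimal sketch assuming the standard genetic code and reading frame 1; it is not how Virtual Ribosome is implemented, and the names `CODON_TABLE` and `translate` are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 1; unknown codons become 'X', trailing bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


# Start and end of the pigeon alpha-D sequence from Question 1.
print(translate("ATGCTGACCGACTCTGACAAGAAG"))  # MLTDSDKK
print(translate("TACAGATAA"))                 # YR*
```

The two outputs match the beginning and the end of the pigeon_alpha-D-globin protein listed above.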

Subsequently, they are aligned with Clustal Omega.

Observations:

  • By and large, the tree at the protein level is the same as at the DNA level (small differences in the branch lengths).
  • Now, two completely conserved regions of >5 amino acids are seen. Their sequences are TKTYFPHFDL and LRVDPVNFK.

Question 4

FASTA file:

>Sheep_U00659
ATGGCCCTGTGGACACGCCTGGTGCCCCTGCTGGCCCTGCTGGCACTCTGGGCCCCCGCC
CCGGCCCACGCCTTCGTCAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGAGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGCCGGGAGGTGGAGGGC
CCCCAGGTGGGGGCGCTGGAGCTGGCCGGAGGCCCCGGCGCGGGTGGCCTGGAGGGGCCC
CCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCGCCGGCGTCTGCTCTCTCTACCAGCTG
GAGAACTACTGTAACTAG
>Pig_AY044828
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242098
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242100
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242101
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242109
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTAGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Dog_V00179
ATGGCCCTCTGGATGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCG
CCCACCCGAGCCTTCGTTAACCAGCACCTGTGTGGCTCCCACCTGGTAGAGGCTCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCTAAGGCCCGCAGGGAGGTGGAGGAC
CTGCAGGTGAGGGACGTGGAGCTGGCCGGGGCGCCTGGCGAGGGCGGCCTGCAGCCCCTG
GCCCTGGAGGGGGCCCTGCAGAAGCGAGGCATCGTGGAGCAGTGCTGCACCAGCATCTGC
TCCCTCTACCAGCTGGAGAATTACTGCAACTAG
>OwlMonkey_J02989
ATGGCCCTGTGGATGCACCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCCGAG
CCAGCCCCGGCCTTTGTGAACCAGCACCTGTGCGGCCCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGTTTCTTCTACGCACCCAAGACCCGCCGGGAGGCGGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGTGGGGGCTCTATCACGGGCAGCCTGCCACCCTTG
GAGGGTCCCATGCAGAAGCGTGGCGTCGTGGATCAGTGCTGCACCAGCATCTGCTCCCTC
TACCAGCTGCAGAACTACTGCAACTAG
>Human_AY138590
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>GreenMonkey_X61092
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCGGTCCCGGCCTTTGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGCTTCTTCTACACGCCCAAGACCCGCCGGGAGGCAGAGGAC
CCGCAGGTGGGGCAGGTAGAGCTGGGCGGGGGCCCTGGCGCAGGCAGCCTGCAGCCCTTG
GCGCTGGAGGGGTCCCTGCAGAAGCGCGGCATCGTGGAGCAGTGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>Human_J00265
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>Chimp_X61089
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGTGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCTCGGCCTTTGTGAACCAACACCTGTGCGGCTCCCACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGTATCGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>GuineaPig_K02233
ATGGCTCTGTGGATGCATCTCCTCACCGTGCTGGCCCTGCTGGCCCTCTGGGGGCCCAAC
ACTAATCAGGCCTTTGTCAGCCGGCATCTGTGCGGCTCCAACTTAGTGGAGACATTGTAT
TCAGTGTGTCAGGATGATGGCTTCTTCTATATACCCAAGGACCGTCGGGAGCTAGAGGAC
CCACAGGTGGAGCAGACAGAACTGGGCATGGGCCTGGGGGCAGGTGGACTACAGCCCTTG
GCACTGGAGATGGCACTACAGAAGCGTGGCATTGTGGATCAGTGCTGTACTGGCACCTGC
ACACGCCACCAGCTGCAGAGCTACTGCAACTAG
>Mouse_X04725
ATGGCCCTGTTGGTGCACTTCCTACCCCTGCTGGCCCTGCTTGCCCTCTGGGAGCCCAAA
CCCACCCAGGCTTTTGTCAAACAGCATCTTTGTGGTCCCCACCTGGTAGAGGCTCTCTAC
CTGGTGTGTGGGGAGCGTGGCTTCTTCTACACACCCAAGTCCCGCCGTGAAGTGGAGGAC
CCACAAGTGGAACAACTGGAGCTGGGAGGAAGCCCCGGGGACCTTCAGACCTTGGCGTTG
GAGGTGGCCCGGCAGAAGCGTGGCATTGTGGATCAGTGCTGCACCAGCATCTGCTCCCTC
TACCAGCTGGAGAACTACTGCAACTAA
>Chicken_AY438372
ATGGCTCTCTGGATCCGATCACTGCCTCTTCTGGCTCTCCTTGTCTTTTCTGGCCCTGGA
ACCAGCTATGCAGCTGCCAACCAGCACCTCTGTGGCTCCCACTTGGTGGAGGCTCTCTAC
CTGGTGTGTGGAGAGCGTGGCTTCTTCTACTCCCCCAAAGCCCGACGGGATGTCGAGCAG
CCCCTAGTGAGCAGTCCCTTGCGTGGCGAGGCAGGAGTGCTGCCTTTCCAGCAGGAGGAA
TACGAGAAAGTCAAGCGAGGGATTGTTGAGCAATGCTGCCATAACACGTGTTCCCTCTAC
CAACTGGAGAACTACTGCAACTAG
>SeaHare_AF160192
ATGAGCAAGTTCCTCCTCCAGAGCCACTCCGCCAACGCCTGCCTGCTCACCCTTCTGCTCACGCTGGCCT
CCAACCTCGACATATCCCTGGCCAACTTCGAGCACTCGTGCAACGGCTACATGCGGCCCCACCCGCGGGG
TCTGTGCGGCGAAGACCTGCACGTCATCATTTCCAACCTGTGCAGCTCTCTGGGGGGCAACAGGAGGTTC
CTGGCCAAGTACATGGTCAAAAGAGACACGGAAAATGTGAACGACAAGTTACGAGGGATCCTGCTCAATA
AGAAAGAAGCTTTCTCCTACTTGACCAAGAGAGAGGCCTCAGGCTCCATCACATGCGAATGTTGCTTCAA
CCAGTGTCGGATATTTGAGCTGGCTCAGTACTGCCGTCTGCCAGACCATTTCTTCTCCAGAATATCCAGA
ACCGGAAGGAGCAACAGTGGACATGCGCAGTTGGAGGACAACTTTAGTTA

Question 5

1. Yes, there are many gaps whose lengths are not multiples of 3. The most obvious example is the second gap, which is 4 positions long (in all sequences but the Sea Hare, see below). The alignment algorithm is not aware that the sequences are protein-coding; it only considers the DNA.

2. Sea Hare (a marine snail) stands out — this makes sense, since it is the only invertebrate.

3. The two human sequences are 100% identical (the distance is 0), so one of them can be discarded. For the pig, the following sequences are identical:

>Pig_AY044828
>Pig_AY242098

and

>Pig_AY242100
>Pig_AY242101

(two pig sequences can therefore be discarded).
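
Identical sequences like these can also be detected before aligning by grouping records on their sequence. The sketch below is one way to do that, assuming the records are already parsed into (name, sequence) pairs; the function names and the toy records are hypothetical.

```python
from collections import defaultdict


def group_identical(records):
    """Return lists of names that share exactly the same sequence."""
    groups = defaultdict(list)
    for name, seq in records:
        groups[seq.upper()].append(name)
    return [names for names in groups.values() if len(names) > 1]


def drop_duplicates(records):
    """Keep only the first record for each distinct sequence."""
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique


# Toy records, not the real insulin sequences.
records = [
    ("pig_1", "ATGGCC"),
    ("pig_2", "ATGGCC"),
    ("pig_3", "ATGGCG"),
    ("human_1", "ATGGCA"),
    ("human_2", "atggca"),
]
print(group_identical(records))                   # [['pig_1', 'pig_2'], ['human_1', 'human_2']]
print([name for name, _ in drop_duplicates(records)])  # ['pig_1', 'pig_3', 'human_1']
```

Note that this only catches exact matches over the full length; near-identical sequences still have to be judged from the distances in the alignment.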

Question 6

The sequences are translated using Virtual Ribosome, yielding the following sequences:

>Sheep_U00659
MALWTRLVPLLALLALWAPAPAHAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEG
PQVGALELAGGPGAGGLEGPPQKRGIVEQCCAGVCSLYQLENYCN*
>Pig_AY044828
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242098
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242100
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242101
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242109
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Dog_V00179
MALWMRLLPLLALLALWAPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVED
LQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLENYCN*
>OwlMonkey_J02989
MALWMHLLPLLALLALWGPEPAPAFVNQHLCGPHLVEALYLVCGERGFFYAPKTRREAED
LQVGQVELGGGSITGSLPPLEGPMQKRGVVDQCCTSICSLYQLQNYCN*
>Human_AY138590
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>GreenMonkey_X61092
MALWMRLLPLLALLALWGPDPVPAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
PQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>Human_J00265
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>Chimp_X61089
MALWMRLLPLLVLLALWGPDPASAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>GuineaPig_K02233
MALWMHLLTVLALLALWGPNTNQAFVSRHLCGSNLVETLYSVCQDDGFFYIPKDRRELED
PQVEQTELGMGLGAGGLQPLALEMALQKRGIVDQCCTGTCTRHQLQSYCN*
>Mouse_X04725
MALLVHFLPLLALLALWEPKPTQAFVKQHLCGPHLVEALYLVCGERGFFYTPKSRREVED
PQVEQLELGGSPGDLQTLALEVARQKRGIVDQCCTSICSLYQLENYCN*
>Chicken_AY438372
MALWIRSLPLLALLVFSGPGTSYAAANQHLCGSHLVEALYLVCGERGFFYSPKARRDVEQ
PLVSSPLRGEAGVLPFQQEEYEKVKRGIVEQCCHNTCSLYQLENYCN*
>SeaHare_AF160192
MSKFLLQSHSANACLLTLLLTLASNLDISLANFEHSCNGYMRPHPRGLCGEDLHVIISNL
CSSLGGNRRFLAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCR
IFELAQYCRLPDHFFSRISRTGRSNSGHAQLEDNFS*

Subsequently, the sequences are aligned. Note that the gaps in the peptide alignment do not correspond to the gaps in the nucleotide alignment.

  1. There is a disagreement between the DNA and peptide alignment because
    1. the DNA alignment does not take codon boundaries into account, and
    2. the peptide alignment can take similarities between amino acids (conservative substitutions) into account.
  2. At the protein level, all the Pig sequences are now completely identical. Four of them can therefore be discarded.

Question 7

Yes, the alignments are different. None of the four methods solves the problem perfectly, but Clustal Omega and MAFFT are really close; they both place only one letter incorrectly, see below.

Clustal Omega: Note the two K's aligned with V's to the left of the large gaps.

MAFFT: Note the three Q's aligned with E's to the right of the large gaps.

MUSCLE and Kalign make more errors.

Question 8

  • Yes — all gaps are multiples of 3.
  • Yes — since the DNA alignment is generated using a protein alignment as a scaffold.
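
The scaffold idea can be sketched in a few lines: each residue in the aligned protein is replaced by its codon, and each protein gap by three DNA gap characters, so every DNA gap is a multiple of 3 by construction. This is an illustrative sketch only, not the implementation of the tool used in the exercise; the names `thread_codons` and `gap_run_lengths` are made up, and the function does not check that the codons actually translate to the given residues.

```python
import re


def thread_codons(aligned_protein, dna, gap="-"):
    """Build a codon-aware DNA alignment row from an aligned protein row.

    Codons left over after the last residue (e.g. a stop codon) are ignored.
    """
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(codons) < n_residues:
        raise ValueError("DNA is too short for the protein sequence")
    remaining = iter(codons)
    return "".join(gap * 3 if ch == gap else next(remaining) for ch in aligned_protein)


def gap_run_lengths(aligned, gap="-"):
    """Return the length of every consecutive run of gap characters."""
    return [len(m.group()) for m in re.finditer(re.escape(gap) + "+", aligned)]


# Toy example: two protein rows with gaps and their unaligned coding DNA.
row1 = thread_codons("MAL-W", "ATGGCCCTGTGG")
row2 = thread_codons("M--LW", "ATGCTGTGG")
print(row1)                   # ATGGCCCTG---TGG
print(row2)                   # ATG------CTGTGG
print(gap_run_lengths(row1))  # [3]
print(gap_run_lengths(row2))  # [6]
```

Running `gap_run_lengths` on a plain DNA alignment instead (as in Question 5) is a quick way to spot gaps that break the reading frame.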