ExMulAlign-Answers-English


Latest revision as of 11:15, 15 March 2024


Answers to the Multiple Alignment exercise

By: Rasmus Wernersson

Question 1

FASTA file:

>pigeon_alpha-D-globin
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA
>pigeon_alpha-A-globin
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGCCAGGCCGGTG
ACTTGGGTGGTGAAGCCCTGGAGAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTT
CGACCTGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTGAGGCT
GCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACGCCCAAAAGCTCCGTG
TGGACCCCGTCAACTTCAAACTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTTCCCCTCTCT
CCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGGCACCGTCCTTACTGCCAAG
TACCGTTAA
>duck_alpha-D-globin
ATGCTGACCGCCGAGGACAAGAAGCTCATCGTGCAGGTGTGGGAGAAGGTGGCTGGCCACCAGGAGGAAT
TCGGAAGTGAAGCTCTGCAGAGGATGTTCCTCGCCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGA
CCTGCATCCCGGCTCTGAACAGGTCCGTGGCCATGGCAAGAAAGTGGCGGCTGCCCTGGGCAATGCCGTG
AAGAGCCTGGACAACCTCAGCCAGGCCCTGTCTGAGCTCAGCAACCTGCATGCCTACAACCTGCGTGTTG
ACCCTGTCAACTTCAAGCTGCTGGCACAGTGCTTCCAGGTGGTGCTGGCCGCACACCTGGGCAAAGACTA
CAGCCCCGAGATGCATGCTGCCTTTGACAAGTTCTTGTCCGCCGTGGCTGCCGTGCTGGCTGAAAAGTAC
AGATGA
>duck_alpha-A-globin
ATGGTGCTGTCTGCGGCTGACAAGACCAACGTCAAGGGTGTCTTCTCCAAAATCGGTGGCCATGCTGAGG
AGTATGGCGCCGAGACCCTGGAGAGGATGTTCATCGCCTACCCCCAGACCAAGACCTACTTCCCCCACTT
TGACCTGCAGCACGGCTCTGCTCAGATCAAGGCCCATGGCAAGAAGGTGGCGGCTGCCCTAGTTGAAGCT
GTCAACCACATCGATGACATTGCGGGTGCTCTCTCCAAGCTCAGTGACCTCCACGCCCAAAAGCTCCGTG
TGGACCCTGTCAACTTCAAATTCCTGGGCCACTGCTTCCTGGTGGTGGTTGCCATCCACCACCCCGCTGC
CCTGACCCCAGAGGTCCACGCTTCCCTGGACAAGTTCATGTGCGCCGTGGGTGCTGTGCTGACTGCCAAG
TACCGTTAG
>Goat_alpha-i-globin
ATGGTGCTGTCTGCCGCCGACAAGTCCAATGTCAAGGCCGCCTGGGGCAAGGTTGGCGGCAACGCTGGAG
CTTATGGCGCAGAGGCTCTGGAGAGGATGTTCCTGAGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGACCTGAGCCACGGCTCGGCCCAGGTCAAGGGCCACGGCGAGAAGGTGGCCGCCGCGCTGACCAAAGCG
GTGGGCCACCTGGACGACCTGCCCGGTACTCTGTCTGATCTGAGTGACCTGCACGCCCACAAGCTGCGTG
TGGACCCGGTCAACTTTAAGCTTCTGAGCCACTCCCTGCTGGTGACCCTGGCCTGCCACCTCCCCAATGA
TTTCACCCCCGCGGTCCACGCCTCCCTGGACAAGTTCTTGGCCAACGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Goat_alpha-ii-globin
ATGGTGCTGTCTGCCGCCGACAAGTCCAATGTCAAGGCCGCCTGGGGCAAGGTTGGCAGCAACGCTGGAG
CTTATGGCGCAGAGGCTCTGGAGAGGATGTTCCTGAGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGACCTGAGCCACGGCTCGGCCCAGGTCAAGGGCCACGGCGAGAAGGTGGCCGCCGCGCTGACCAAAGCG
GTGGGCCACCTGGACGACCTGCCCGGTACTCTGTCTGATCTGAGTGACCTGCACGCCCACAAGCTGCGTG
TGGACCCGGTCAACTTTAAGCTTCTGAGCCACTCCCTGCTGGTGACCCTGGCCTGCCACCACCCCAGTGA
TTTCACCCCCGCGGTCCACGCCTCCCTGGACAAGTTCTTGGCCAACGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Horse_alpha-1_globin
ATGGTGCTGTCTGCCGCCGACAAGACCAACGTCAAGGCCGCCTGGAGTAAGGTTGGCGGCCACGCTGGCG
AGTTTGGCGCAGAGGCCCTAGAGAGGATGTTCCTGGGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGATCTGAGCCACGGCTCCGCCCAGGTCAAGGCCCACGGCAAGAAGGTGGGCGACGCGCTGACTCTCGCC
GTGGGCCACCTGGACGACCTGCCTGGCGCCCTGTCGAATCTGAGCGACCTGCACGCACACAAGCTGCGCG
TGGACCCCGTCAACTTCAAGCTTCTGAGTCATTGCCTGCTGTCCACCTTGGCCGTCCACCTCCCCAACGA
TTTCACCCCTGCCGTCCACGCCTCCCTGGACAAGTTCTTGAGCAGTGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Horse_alpha-2_globin
ATGGTGCTGTCTGCCGCCGACAAGACCAACGTCAAGGCCGCCTGGAGTAAGGTTGGCGGCCACGCTGGCG
AGTATGGCGCAGAGGCCCTAGAGAGGATGTTCCTGGGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGATCTGAGCCACGGCTCCGCCCAGGTCAAGGCCCACGGCCAGAAGGTGGGCGACGCGCTGACTCTCGCC
GTGGGCCACCTGGACGACCTGCCTGGCGCCCTGTCGAATCTGAGCGACCTGCACGCACACAAGCTGCGCG
TGGACCCCGTCAACTTCAAGCTCCTGAGTCATTGCCTGCTGTCCACCTTGGCCGTCCACCTCCCCAACGA
TTTCACCCCTGCCGTCCACGCCTCCCTGGACAAGTTCTTGAGCAGTGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Chicken_alpha-D
ATGCTGACTGCCGAGGACAAGAAGCTCATCCAGCAGGCCTGGGAGAGGGCCGCTTCCCACCAGGAGGAGT
TTGGAGCTGAGGCTCTGACTAGGATGTTCACCACCTATCCCCAGACCAAGACCTACTTCCCCCACTTCGA
CCTTTCGCCTGGCTCTGACCAGGTCCGTGGCCATGGCAAGAAGGTGTTGGGTGCCCTGGGCAACGCCGTG
AAGAACGTGGACAACCTCAGCCAGGCCATGGCTGAGCTGAGCAACCTGCATGCCTACAACCTGCGTGTTG
ACCCCGTCAATTTCAAGCTGTTGTCGCAGTGCATCCAGGTGGTGCTGGCTGTACACATGGGCAAAGACTA
CACCCCTGAAGTGCATGCTGCCTTCGACAAGTTCCTGTCTGCCGTGTCTGCTGTGCTGGCTGAGAAGTAC
AGATAA
>Chicken_alpha-A
ATGGTGCTGTCCGCTGCTGACAAGAACAACGTCAAGGGCATCTTCACCAAAATCGCCGGCCATGCTGAGG
AGTATGGCGCCGAGACCCTGGAAAGGATGTTCACCACCTACCCCCCAACCAAGACCTACTTCCCCCACTT
CGATCTGTCACACGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTAGTGGCTGCCTTGATCGAGGCT
GCCAACCACATTGATGACATCGCCGGCACCCTCTCCAAGCTCAGCGACCTCCATGCCCACAAGCTCCGCG
TGGACCCTGTCAACTTCAAACTCCTGGGCCAATGCTTCCTGGTGGTGGTGGCCATCCACCACCCTGCTGC
CCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCTTGTGCGCCGTGGGCACTGTGCTGACCGCCAAG
TACCGTTAA

NOTICE:

  • It is essential to use SHORT descriptive names. In the ClustalW format alignment, only the first 15 characters of each name are shown, so very long names make the output hard to read (see also the FASTA handout from week 2). Notice that JalView fails in a very opaque way if names are not unique within the first 15 characters: if it "thinks" two sequences are named identically, it simply appends them into one long sequence!
  • Spaces cannot be part of the names in a FASTA file. If there are spaces, only the first word after ">" counts as the name; subsequent words are treated as comments. If I had used spaces instead of underscores ("_") in the file above, the names would not have been unique ("duck" would have been used twice, etc.).
  • Be aware that in GenBank entries containing several genes (see the GenBank handout from week 2), the name of the individual gene (CDS) is found within the feature table. When you click on a CDS containing "/gene_name=XYZ" or similar, it is therefore XYZ you need to use as the name in your FASTA file, not the collective title for the entire GenBank entry (e.g. "Alpha-A and Alpha-D genes ..." or "Yeast Chromosome 2"). See also the screenshot/handout from the exercise.
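
The two naming rules above (no spaces, unique within the first 15 characters) can be checked before a file is sent to an alignment program. The following is only a minimal sketch: the function name check_fasta_names and the small example file are made up for illustration, and the 15-character width is taken from the notice above.

```python
from collections import defaultdict

def check_fasta_names(fasta_text, width=15):
    """Return (headers_with_spaces, clashes) for the '>' lines in fasta_text."""
    headers = [line[1:].strip() for line in fasta_text.splitlines()
               if line.startswith(">")]
    # A header with more than one word: only the first word counts as the name.
    with_spaces = [h for h in headers if len(h.split()) > 1]
    # Group the effective names (first word) by their first `width` characters.
    groups = defaultdict(list)
    for h in headers:
        words = h.split()
        name = words[0] if words else ""
        groups[name[:width]].append(name)
    clashes = {prefix: names for prefix, names in groups.items()
               if len(names) > 1}
    return with_spaces, clashes

# Hypothetical example: spaces used instead of underscores in two names.
example = """>duck alpha-D-globin
ATGCTGACC
>duck alpha-A-globin
ATGGTGCTG
>pigeon_alpha-D-globin
ATGCTGACC
"""
spaces, clashes = check_fasta_names(example)
print(spaces)   # ['duck alpha-D-globin', 'duck alpha-A-globin']
print(clashes)  # {'duck': ['duck', 'duck']}
```

Both "duck" headers collapse to the same name, which is exactly the situation described in the second bullet.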

When you build a "real" dataset for a research project, it is often an iterative process, where you 1) collect your data, 2) weed out outliers, 3) run an analysis, and repeat 2) and 3) until you are satisfied with the results.

Question 2

  • "*" means that the nucleotides are completely identical in a given position (perfectly conserved).
  • There is a single stretch of >10 nucleotides (23 to be precise) which is perfectly conserved. Its sequence is ACCAAGACCTACTTCCCCCACTT.
  • Concerning "guide tree":
    • 3 clusters: One for Alpha-A (birds only), one for Alpha-D (birds only), and one for Alpha 1 + Alpha 2 (Mammals).
    • The point here is that birds and mammals are not intermixed, so the sequences are placed "naturally" in a taxonomic sense.
    • Alpha-A and Alpha-D are clearly in two different clusters, which must mean that the split between them is old. Since both Alpha-A and Alpha-D exist in all three birds we included, the split must be older than the last common ancestor of these birds.
    • Alpha-1 and Alpha-2 seem to be much more closely related. Remember that a guide tree is only a raw estimate of the phylogeny, so if we want to dig deeper into the time of the split between Alpha-1 and Alpha-2, we need to perform a proper phylogenetic analysis.
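
The "*" marking and the search for long conserved stretches in the first two bullets can be reproduced from the aligned rows themselves. This is a small sketch, not the way ClustalW or MAFFT do it internally; the function name conserved_runs and the toy alignment are invented for the example, and positions are counted from 0.

```python
def conserved_runs(aligned, min_len=1):
    """Return (start, segment) for each run of fully conserved columns.

    `aligned` is a list of equally long rows; a column counts as conserved
    when all rows carry the same letter and none of them has a gap.
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all rows must have the same length")
    runs, start = [], None
    # Go one step past the end so that a run touching the last column is closed.
    for i in range(length + 1):
        column = {row[i] for row in aligned} if i < length else set()
        conserved = len(column) == 1 and "-" not in column
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                runs.append((start, aligned[0][start:i]))
            start = None
    return runs

# Toy alignment (not taken from the exercise data).
toy = ["ATGACCAAGT-A",
       "ATGACCAAGTCA",
       "CTGACCAAGTCA"]
print(conserved_runs(toy, min_len=5))  # [(1, 'TGACCAAGT')]
```

With min_len=11 on the real globin alignment, the same idea would report only stretches of more than 10 identical columns.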

Your screenshot of the 3' part of the alignment should look something like this:

Question 3

The sequences are translated using Virtual Ribosome, giving rise to the following FASTA file:

>pigeon_alpha-D-globin
MLTDSDKKLVLQVWEKVIRHPDCGAEALERLFTTYPQTKTYFPHFDLHHGSDQVRNHGKK
VLAALGNAVKSLGNLSQALSDLSDLHAYNLRVDPVNFKLLAQCFHVVLATHLGNDYTPEA
HAAFDKFLSAVCTVLAEKYR*
>pigeon_alpha-A-globin
MVLSANDKSNVKAVFGKIGGQAGDLGGEALERLFITYPQTKTYFPHFDLSHGSAQIKGHG
KKVAEALVEAANHIDDIAGALSKLSDLHAQKLRVDPVNFKLLGHCFLVVVAVHFPSLLTP
EVHASLDKFVCAVGTVLTAKYR*
>duck_alpha-D-globin
MLTAEDKKLIVQVWEKVAGHQEEFGSEALQRMFLAYPQTKTYFPHFDLHPGSEQVRGHGK
KVAAALGNAVKSLDNLSQALSELSNLHAYNLRVDPVNFKLLAQCFQVVLAAHLGKDYSPE
MHAAFDKFLSAVAAVLAEKYR*
>duck_alpha-A-globin
MVLSAADKTNVKGVFSKIGGHAEEYGAETLERMFIAYPQTKTYFPHFDLQHGSAQIKAHG
KKVAAALVEAVNHIDDIAGALSKLSDLHAQKLRVDPVNFKFLGHCFLVVVAIHHPAALTP
EVHASLDKFMCAVGAVLTAKYR*
>Goat_alpha-i-globin
MVLSAADKSNVKAAWGKVGGNAGAYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
EKVAAALTKAVGHLDDLPGTLSDLSDLHAHKLRVDPVNFKLLSHSLLVTLACHLPNDFTP
AVHASLDKFLANVSTVLTSKYR*
>Goat_alpha-ii-globin
MVLSAADKSNVKAAWGKVGSNAGAYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
EKVAAALTKAVGHLDDLPGTLSDLSDLHAHKLRVDPVNFKLLSHSLLVTLACHHPSDFTP
AVHASLDKFLANVSTVLTSKYR*
>Horse_alpha-1_globin
MVLSAADKTNVKAAWSKVGGHAGEFGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHG
KKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTP
AVHASLDKFLSSVSTVLTSKYR*
>Horse_alpha-2_globin
MVLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHG
QKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTP
AVHASLDKFLSSVSTVLTSKYR*
>Chicken_alpha-D
MLTAEDKKLIQQAWERAASHQEEFGAEALTRMFTTYPQTKTYFPHFDLSPGSDQVRGHGK
KVLGALGNAVKNVDNLSQAMAELSNLHAYNLRVDPVNFKLLSQCIQVVLAVHMGKDYTPE
VHAAFDKFLSAVSAVLAEKYR*
>Chicken_alpha-A
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYFPHFDLSHGSAQIKGHG
KKVVAALIEAANHIDDIAGTLSKLSDLHAHKLRVDPVNFKLLGQCFLVVVAIHHPAALTP
EVHASLDKFLCAVGTVLTAKYR*
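
The translation step itself is simple enough to sketch by hand. The code below is not Virtual Ribosome; it is a minimal illustration that assumes the standard genetic code, reading frame 1, and "*" for stop codons (as in the file above). The names CODON_TABLE and translate are chosen for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate a DNA string in frame 1; unknown codons become 'X'."""
    dna = "".join(dna.split()).upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    codons = (dna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

# First 24 nucleotides of pigeon_alpha-D-globin from Question 1.
print(translate("ATGCTGACCGACTCTGACAAGAAG"))  # MLTDSDKK
```

The result matches the start of the pigeon_alpha-D-globin protein sequence listed above.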

Subsequently, they are aligned with MAFFT.

Observations:

  • The tree at the protein level is by and large the same as at the DNA level (with small differences in the branch lengths).
  • Now, two completely conserved regions of >5 amino acids are seen. Their sequences are TKTYFPHFDL and LRVDPVNFK.

Question 4

FASTA file:

>Sheep_U00659
ATGGCCCTGTGGACACGCCTGGTGCCCCTGCTGGCCCTGCTGGCACTCTGGGCCCCCGCC
CCGGCCCACGCCTTCGTCAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGAGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGCCGGGAGGTGGAGGGC
CCCCAGGTGGGGGCGCTGGAGCTGGCCGGAGGCCCCGGCGCGGGTGGCCTGGAGGGGCCC
CCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCGCCGGCGTCTGCTCTCTCTACCAGCTG
GAGAACTACTGTAACTAG
>Pig_AY044828
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242098
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242100
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242101
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242109
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTAGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Dog_V00179
ATGGCCCTCTGGATGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCG
CCCACCCGAGCCTTCGTTAACCAGCACCTGTGTGGCTCCCACCTGGTAGAGGCTCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCTAAGGCCCGCAGGGAGGTGGAGGAC
CTGCAGGTGAGGGACGTGGAGCTGGCCGGGGCGCCTGGCGAGGGCGGCCTGCAGCCCCTG
GCCCTGGAGGGGGCCCTGCAGAAGCGAGGCATCGTGGAGCAGTGCTGCACCAGCATCTGC
TCCCTCTACCAGCTGGAGAATTACTGCAACTAG
>OwlMonkey_J02989
ATGGCCCTGTGGATGCACCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCCGAG
CCAGCCCCGGCCTTTGTGAACCAGCACCTGTGCGGCCCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGTTTCTTCTACGCACCCAAGACCCGCCGGGAGGCGGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGTGGGGGCTCTATCACGGGCAGCCTGCCACCCTTG
GAGGGTCCCATGCAGAAGCGTGGCGTCGTGGATCAGTGCTGCACCAGCATCTGCTCCCTC
TACCAGCTGCAGAACTACTGCAACTAG
>Human_AY138590
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>GreenMonkey_X61092
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCGGTCCCGGCCTTTGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGCTTCTTCTACACGCCCAAGACCCGCCGGGAGGCAGAGGAC
CCGCAGGTGGGGCAGGTAGAGCTGGGCGGGGGCCCTGGCGCAGGCAGCCTGCAGCCCTTG
GCGCTGGAGGGGTCCCTGCAGAAGCGCGGCATCGTGGAGCAGTGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>Human_J00265
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>Chimp_X61089
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGTGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCTCGGCCTTTGTGAACCAACACCTGTGCGGCTCCCACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGTATCGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>GuineaPig_K02233
ATGGCTCTGTGGATGCATCTCCTCACCGTGCTGGCCCTGCTGGCCCTCTGGGGGCCCAAC
ACTAATCAGGCCTTTGTCAGCCGGCATCTGTGCGGCTCCAACTTAGTGGAGACATTGTAT
TCAGTGTGTCAGGATGATGGCTTCTTCTATATACCCAAGGACCGTCGGGAGCTAGAGGAC
CCACAGGTGGAGCAGACAGAACTGGGCATGGGCCTGGGGGCAGGTGGACTACAGCCCTTG
GCACTGGAGATGGCACTACAGAAGCGTGGCATTGTGGATCAGTGCTGTACTGGCACCTGC
ACACGCCACCAGCTGCAGAGCTACTGCAACTAG
>Mouse_X04725
ATGGCCCTGTTGGTGCACTTCCTACCCCTGCTGGCCCTGCTTGCCCTCTGGGAGCCCAAA
CCCACCCAGGCTTTTGTCAAACAGCATCTTTGTGGTCCCCACCTGGTAGAGGCTCTCTAC
CTGGTGTGTGGGGAGCGTGGCTTCTTCTACACACCCAAGTCCCGCCGTGAAGTGGAGGAC
CCACAAGTGGAACAACTGGAGCTGGGAGGAAGCCCCGGGGACCTTCAGACCTTGGCGTTG
GAGGTGGCCCGGCAGAAGCGTGGCATTGTGGATCAGTGCTGCACCAGCATCTGCTCCCTC
TACCAGCTGGAGAACTACTGCAACTAA
>Chicken_AY438372
ATGGCTCTCTGGATCCGATCACTGCCTCTTCTGGCTCTCCTTGTCTTTTCTGGCCCTGGA
ACCAGCTATGCAGCTGCCAACCAGCACCTCTGTGGCTCCCACTTGGTGGAGGCTCTCTAC
CTGGTGTGTGGAGAGCGTGGCTTCTTCTACTCCCCCAAAGCCCGACGGGATGTCGAGCAG
CCCCTAGTGAGCAGTCCCTTGCGTGGCGAGGCAGGAGTGCTGCCTTTCCAGCAGGAGGAA
TACGAGAAAGTCAAGCGAGGGATTGTTGAGCAATGCTGCCATAACACGTGTTCCCTCTAC
CAACTGGAGAACTACTGCAACTAG
>SeaHare_AF160192
ATGAGCAAGTTCCTCCTCCAGAGCCACTCCGCCAACGCCTGCCTGCTCACCCTTCTGCTCACGCTGGCCT
CCAACCTCGACATATCCCTGGCCAACTTCGAGCACTCGTGCAACGGCTACATGCGGCCCCACCCGCGGGG
TCTGTGCGGCGAAGACCTGCACGTCATCATTTCCAACCTGTGCAGCTCTCTGGGGGGCAACAGGAGGTTC
CTGGCCAAGTACATGGTCAAAAGAGACACGGAAAATGTGAACGACAAGTTACGAGGGATCCTGCTCAATA
AGAAAGAAGCTTTCTCCTACTTGACCAAGAGAGAGGCCTCAGGCTCCATCACATGCGAATGTTGCTTCAA
CCAGTGTCGGATATTTGAGCTGGCTCAGTACTGCCGTCTGCCAGACCATTTCTTCTCCAGAATATCCAGA
ACCGGAAGGAGCAACAGTGGACATGCGCAGTTGGAGGACAACTTTAGTTA

Question 5

  • Yes, there are many gaps whose lengths are not multiples of 3. The most obvious example is just 1 position long (in all sequences but the Sea Hare, see below). Nor do the gaps consistently follow codon boundaries; e.g. the first gap starts after four nucleotides, not three. The alignment algorithm is not aware that the sequences are protein coding; it only considers the DNA.
Sheep_U00659    ATCGTGGAGC-AGTGCTGCGCCGGCGTCTGC--------TCTCTCTAC------------
Pig_AY044828    ATCGTGGAGC-AGTGCTGCACCAGCATCTGT--------TCCCTCTAC------------
Pig_AY242098    ATCGTGGAGC-AGTGCTGCACCAGCATCTGT--------TCCCTCTAC------------
Pig_AY242100    ATCGTGGAGC-AGTGCTGCACCAGCATCTGT--------TCCCTCTAC------------
Pig_AY242101    ATCGTGGAGC-AGTGCTGCACCAGCATCTGT--------TCCCTCTAC------------
Pig_AY242109    ATCGTAGAGC-AGTGCTGCACCAGCATCTGT--------TCCCTCTAC------------
OwlMonkey_J0298 GTCGTGGATC-AGTGCTGCACCAGCATCTGC--------TCCCTCTAC------------
Human_AY138590  ATTGTGGAAC-AATGCTGTACCAGCATCTGC--------TCCCTCTAC------------
Human_J00265    ATTGTGGAAC-AATGCTGTACCAGCATCTGC--------TCCCTCTAC------------
Chimp_X61089    ATCGTGGAAC-AATGCTGTACCAGCATCTGC--------TCCCTCTAC------------
GreenMonkey_X61 ATCGTGGAGC-AGTGCTGTACCAGCATCTGC--------TCCCTCTAC------------
Dog_V00179      ATCGTGGAGC-AGTGCTGCACCAGCATCTGC--------TCCCTCTAC------------
Mouse_X04725    ATTGTGGATC-AGTGCTGCACCAGCATCTGC--------TCCCTCTAC------------
GuineaPig_K0223 ATTGTGGATC-AGTGCTGTACTGGCACCTGC--------ACACGCCAC------------
Chicken_AY43837 ATTGTTGAGC-AATGCTGCCATAACACGTGT--------TCCCTCTAC------------
SeaHare_AF16019 ATATTTGAGCTGGCTCAGTACTGCCGTCTGCCAGACCATTTCTTCTCCAGAATATCCAGA
                .*  * ** * ... * *.  .. *.. **.         . . *. *            
  • Sea Hare (a marine snail) stands out, which makes sense since it is the only invertebrate.
  • The two human sequences are 100% identical (the distance is 0), so one of them can be discarded. For the pig, the following sequences are identical:
>Pig_AY044828
>Pig_AY242098

and

>Pig_AY242100
>Pig_AY242101

(two pig sequences can therefore be discarded).
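
The gap question in the first bullet can also be answered by counting instead of by eye. The sketch below measures the length of every run of "-" in one aligned row and reports the lengths that are not multiples of 3. The function names and the toy row are made up for illustration; the toy row is not a row from the alignment above.

```python
import re

def gap_lengths(aligned_row):
    """Lengths of all consecutive runs of '-' in one aligned sequence."""
    return [len(match.group()) for match in re.finditer(r"-+", aligned_row)]

def out_of_frame_gaps(aligned_row):
    """Gap lengths that would break the reading frame (not multiples of 3)."""
    return [n for n in gap_lengths(aligned_row) if n % 3 != 0]

# Toy row with gaps of length 1, 3 and 8.
row = "ATCGTGGAGC" + "-" + "AGTGCTGC" + "---" + "GCCGGC" + "--------" + "TCT"
print(gap_lengths(row))        # [1, 3, 8]
print(out_of_frame_gaps(row))  # [1, 8]
```

An empty list from out_of_frame_gaps for every row would mean that all gaps are whole codons long.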

Question 6

The sequences are translated using Virtual Ribosome, yielding the following sequences:

>Sheep_U00659
MALWTRLVPLLALLALWAPAPAHAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEG
PQVGALELAGGPGAGGLEGPPQKRGIVEQCCAGVCSLYQLENYCN*
>Pig_AY044828
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242098
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242100
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242101
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242109
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Dog_V00179
MALWMRLLPLLALLALWAPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVED
LQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLENYCN*
>OwlMonkey_J02989
MALWMHLLPLLALLALWGPEPAPAFVNQHLCGPHLVEALYLVCGERGFFYAPKTRREAED
LQVGQVELGGGSITGSLPPLEGPMQKRGVVDQCCTSICSLYQLQNYCN*
>Human_AY138590
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>GreenMonkey_X61092
MALWMRLLPLLALLALWGPDPVPAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
PQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>Human_J00265
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>Chimp_X61089
MALWMRLLPLLVLLALWGPDPASAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>GuineaPig_K02233
MALWMHLLTVLALLALWGPNTNQAFVSRHLCGSNLVETLYSVCQDDGFFYIPKDRRELED
PQVEQTELGMGLGAGGLQPLALEMALQKRGIVDQCCTGTCTRHQLQSYCN*
>Mouse_X04725
MALLVHFLPLLALLALWEPKPTQAFVKQHLCGPHLVEALYLVCGERGFFYTPKSRREVED
PQVEQLELGGSPGDLQTLALEVARQKRGIVDQCCTSICSLYQLENYCN*
>Chicken_AY438372
MALWIRSLPLLALLVFSGPGTSYAAANQHLCGSHLVEALYLVCGERGFFYSPKARRDVEQ
PLVSSPLRGEAGVLPFQQEEYEKVKRGIVEQCCHNTCSLYQLENYCN*
>SeaHare_AF160192
MSKFLLQSHSANACLLTLLLTLASNLDISLANFEHSCNGYMRPHPRGLCGEDLHVIISNL
CSSLGGNRRFLAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCR
IFELAQYCRLPDHFFSRISRTGRSNSGHAQLEDNFS*

Subsequently, the sequences are aligned using MAFFT.

  • At the protein level, all the Pig sequences are now completely identical. Four of them can therefore be discarded.
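
Finding identical sequences does not require an alignment; comparing the sequences as plain strings is enough. The following is a minimal sketch with a deliberately simple FASTA parser. The names parse_fasta and identical_groups and the three toy records are assumptions made for the example.

```python
from collections import defaultdict

def parse_fasta(text):
    """Return {name: sequence}; the name is the first word after '>'."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = ""
        elif line and name is not None:
            records[name] += line
    return records

def identical_groups(records):
    """Return lists of names whose sequences are exactly the same."""
    by_sequence = defaultdict(list)
    for name, seq in records.items():
        by_sequence[seq.upper()].append(name)
    return [names for names in by_sequence.values() if len(names) > 1]

# Toy records: seq_a and seq_b are identical, only wrapped differently.
toy_fasta = """>seq_a
MALW
TRLL
>seq_b
MALWTRLL
>seq_c
MALWMRLL
"""
print(identical_groups(parse_fasta(toy_fasta)))  # [['seq_a', 'seq_b']]
```

From each reported group, all but one sequence can be discarded.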

Question 7

Yes, the alignments are different. None of the three methods solves the problem perfectly, but MAFFT is really close; it only places one letter (a Q) incorrectly, see below.

Portion of the MAFFT alignment with Zappo colouring. Note the three Q's aligned with E's at position 446.

Question 8

  • Yes — all gaps are multiples of 3.
  • Yes — since the DNA alignment is generated using a protein alignment as a scaffold.
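
Why the gaps become multiples of 3 is easy to see when the scaffold idea is written out. The sketch below illustrates the general principle only and is not the tool used in the exercise: every residue in the aligned protein is replaced by its codon from the unaligned DNA, and every gap by three gap characters. The function name thread_codons and the toy input are invented for the example.

```python
def thread_codons(aligned_protein, dna):
    """Copy the gaps of an aligned protein onto its coding DNA, codon by codon."""
    residues = aligned_protein.replace("-", "")
    if len(dna) < 3 * len(residues):
        raise ValueError("DNA is too short for this protein")
    pieces, pos = [], 0
    for symbol in aligned_protein:
        if symbol == "-":
            pieces.append("---")             # one protein gap = one codon-sized gap
        else:
            pieces.append(dna[pos:pos + 3])  # next codon from the original DNA
            pos += 3
    return "".join(pieces)

# Toy example: ATG GCC CTG TGG codes for M A L W.
print(thread_codons("MA-LW", "ATGGCCCTGTGG"))  # ATGGCC---CTGTGG
```

Since gaps are only ever inserted three positions at a time, every gap in the resulting DNA alignment is a multiple of 3 and falls on a codon boundary.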