22111 - User contributions [en]

Exercise: BLAST

2024-03-15T16:44:48Z

WikiSysop: /* BLASTP search */

Exercise written by [http://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson] and modified by Henrik Nielsen.

==Introduction==

In this exercise we will be using BLAST ('''B'''asic '''L'''ocal '''A'''lignment '''S'''earch '''T'''ool) for searching sequence databases such as GenBank (DNA data) and UniProt (protein). When using BLAST for sequence searches it is of utmost importance to be able to critically evaluate the statistical significance of the results returned.

The BLAST software package is free to use (Open Source) and can be installed on any local system; it was originally written for UNIX-type operating systems. The package contains programs for performing the actual sequence searches against preexisting databases (e.g. "<tt>blastn</tt>" for DNA databases and "<tt>blastp</tt>" for protein databases), as well as a tool for creating new databases from scratch (the "<tt>formatdb</tt>" program, replaced by "<tt>makeblastdb</tt>" in the current BLAST+ package).

In this exercise we will be using the Web-interface to '''BLAST hosted by the NCBI'''. For our purpose there are several advantages to this approach:
* We don't have to mess around with a UNIX command prompt.
* NCBI offers direct access to preformatted BLAST databases of all the data that they host:
** GenBank (+ derivatives)
** Full Genome database
** Protein database (Both from translated GenBank and UniProt)

It should be noted that running BLAST locally (for example on the supercomputer cluster at DTU) offers much more fine-grained control of data and workflow (everything can be scripted/automated) than running BLAST through a web interface.
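
To give an impression of what "scripted" means here, the following is a minimal Python sketch that builds a command line for a local <tt>blastn</tt> run. It assumes the BLAST+ package is installed; the file names <tt>query.fasta</tt> and <tt>hits.tsv</tt>, the database name <tt>nt</tt> and the helper function <tt>build_blastn_command</tt> are illustrative assumptions and not part of this exercise.

```python
import shlex


def build_blastn_command(query_fasta, database, evalue=10.0, out_file="hits.tsv"):
    """Return the argument list for a local blastn run (BLAST+ assumed installed).

    The file and database names passed in are placeholders chosen by the caller.
    """
    return [
        "blastn",
        "-query", query_fasta,   # FASTA file with one or more query sequences
        "-db", database,         # name of a preformatted BLAST database
        "-evalue", str(evalue),  # Expect threshold, as on the web form
        "-outfmt", "6",          # tabular output, one line per alignment
        "-out", out_file,        # where to write the results
    ]


command = build_blastn_command("query.fasta", "nt")
print(shlex.join(command))
# To actually run the search one could pass the list to subprocess.run(command, check=True).
```

Because the command is ordinary data, it can be generated in a loop over many query files or parameter settings, which is the kind of automation the web interface does not offer.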

===Links===
* NCBI BLAST main page: http://blast.ncbi.nlm.nih.gov/
** Notice: There are links to "Nucleotide BLAST" (including "blastn") and "Protein BLAST" (including "blastp") from this page.
* NCBI [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs BLAST help pages]

 

<div style="background-color: lavender; border: solid thin grey;">
:'''IMPORTANT:''' BLAST is a computationally intensive algorithm, and in recent years we have run into problems with overloading the NCBI server when 150+ students submit jobs at the same time. We have therefore implemented a few optimizations/workarounds, which it is '''important that you remember to follow'''. In some of the sections below, you will be asked to limit your search to a certain subset of the BLAST database (e.g. only search in the "bacterial" part of the NR database). This limits the amount of data to search through and makes the search finish faster.

[[Image:BLAST_limit_search.png|center|600px|border]]

 
</div>

==Part 1: Your first BLAST search==
<div style="background-color: lavender; border: solid thin grey;">
:'''IMPORTANT:''' do '''NOT''' limit your search to "bacteria" in PART 1 (we are looking for insulin).
</div>

Below is the mRNA sequence for insulin from a South American rodent, the Degu (''Octodon degus'').

>gi|202471|gb|M57671.1|OCOINS Octodon degus insulin mRNA, complete cds
GCATTCTGAGGCATTCTCTAACAGGTTCTCGACCCTCCGCCATGGCCCCGTGGATGCATCTCCTCACCGT
GCTGGCCCTGCTGGCCCTCTGGGGACCCAACTCTGTTCAGGCCTATTCCAGCCAGCACCTGTGCGGCTCC
AACCTAGTGGAGGCACTGTACATGACATGTGGACGGAGTGGCTTCTATAGACCCCACGACCGCCGAGAGC
TGGAGGACCTCCAGGTGGAGCAGGCAGAACTGGGTCTGGAGGCAGGCGGCCTGCAGCCTTCGGCCCTGGA
GATGATTCTGCAGAAGCGCGGCATTGTGGATCAGTGCTGTAATAACATTTGCACATTTAACCAGCTGCAG
AACTACTGCAATGTCCCTTAGACACCTGCCTTGGGCCTGGCCTGCTGCTCTGCCCTGGCAACCAATAAAC
CCCTTGAATGAG

We will now use a BLASTN search at NCBI to determine whether this sequence looks like the human mRNA for insulin. There are two ways we can do this:
* search the entire database and look for human hits in the results,
* specifically search the human part of the database.
We will try both of these possibilities.

=== Search against NR ===

* Follow the "nucleotide blast" link from the main BLAST page.
* In the section "Program Selection" select the option "Somewhat similar sequences (blastn)"
* Choose "Nucleotide collection (nr/nt)" as the search database. NR is the "Non Redundant" database, which contains all non-redundant (non-identical) sequences from GenBank and the full genome databases.
* Click the BLAST button to launch the search.

After the search has completed, make yourself familiar with the BLAST output page. After a header with some information about the search, there are three main parts:
* '''Graphic Summary'''
** each hit is represented by a line showing which part of the query sequence the alignment covers. The lines are coloured according to alignment score.
* '''Descriptions'''
** a table with a one-line description of each hit with some alignment statistics.
* '''Alignments'''
** the actual alignments between the query and the database hits.
Note that you can toggle between hiding and showing each part by clicking on the part title (try it!).

The columns in the '''Descriptions''' table are:
* Description — the description line from the database
* Max score — the alignment score of the best match (local alignment) between the query and the database hit
* Total score — the sum of alignment scores for all matches (alignments) between the query and the database hit (if there is only one match per hit, these two scores are identical)
* Query cover — the percentage of the query sequence that is covered by the alignment(s)
* E value — the Expect value calculated from the Max score (''i.e.'' the number of ''unrelated'' hits with that score or better you would expect to find for random reasons)
* Ident — the percent identity in the alignment(s)
* Accession — the accession number of the database hit.

First, take a look at the best hit. Since our search sequence (the query) was taken from GenBank which is part of NR, we should find an identical sequence in the search. Make sure this is the case!

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1.1''':
:Answer the following questions about the best hit:
:* what is the identifier (Accession)?
:* what is the alignment score ("max score")?
:* what is the percent identity and query coverage?
:* what is the E-value?
:* are there any gaps in the alignment?

Then, find the best hit from human (''Homo sapiens'') that is ''not'' a synthetic construct. ('''Tip:''' you can press Ctrl-F in most browsers to search in the page).

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1.2''':
:Answer the same questions as before about the hit you found now.

=== Search against Human G+T ===

'''Note:''' In this context, G+T does not mean Gin and Tonic.

Open ''a new window/tab'' with the BLAST home page. Make a new BLASTN search with the same query sequence, this time with Database set to Human genomic + transcript (Human G+T). Remember again to select Somewhat similar sequences (blastn) under Program Selection. Consider the best hit.

'''Note:''' even though you may not have found exactly the same database entry in the two searches, the ''alignment'' should be the same. Make sure this is the case by comparing the actual alignments in the two windows where you made the searches.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1.3''':
:Answer the same questions as before about the best hit you found in this search.

===Concerning database size and E-values===

When answering the previous two questions, you may have noticed that the E-value changed, while the alignment score did not. We will now investigate this further.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1.4''':
:What are the sizes (in base pairs) of the databases we used for the two BLAST searches? ('''Tip:''' Expand the "Search summary" section near the top by clicking it).

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1.5''':
:'''Hint:''' remember, you can use Google as a calculator!
:*What is the ratio between the database sizes in the two BLAST searches?
:*What is the ratio between the E-values (for the best human hits) in the two BLAST searches?
:*What is the relationship between database size and E-value for hits with identical alignment score?
:*In conclusion: if the database size is doubled, what will happen to the E-value?
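
Once you have worked through the question, you can check your reasoning against the textbook Karlin-Altschul relation, E = m · n · 2<sup>-S'</sup>, where m is the query length, n the database length and S' the bit score. The Python sketch below simply evaluates that formula; the numbers in it are hypothetical, and the real NCBI server uses "effective" lengths, so the values it reports will not match this simplified calculation exactly.

```python
def expected_hits(bit_score, query_length, database_length):
    """Karlin-Altschul estimate of the E-value: E = m * n * 2**(-S').

    Simplified: uses raw lengths instead of the effective lengths BLAST computes.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers, chosen only to illustrate the scaling behaviour.
bit_score = 50
query_length = 400
small_database = 1e9   # residues in a smaller database
large_database = 2e9   # a database twice as large

e_small = expected_hits(bit_score, query_length, small_database)
e_large = expected_hits(bit_score, query_length, large_database)
print(f"E (small database): {e_small:.3g}")
print(f"E (large database): {e_large:.3g}")
print(f"ratio: {e_large / e_small:.1f}")
```

The point of the sketch is that the alignment score enters only through the factor 2<sup>-S'</sup>, while the database size enters as a plain multiplier.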

==Part 2: Assessing the statistical significance of BLAST hits==
<div style="background-color: lavender; border: solid thin grey;">
:'''IMPORTANT:''' limit your search to "bacteria" (taxid: 2) in ALL of this section (PART 2) to make the BLAST searches run quicker.
</div>

As discussed in the lecture, there is a risk of getting false positive results (hits to sequences that are not related to our input sequence) purely by chance. In this part of the exercise we will investigate this further by examining what happens when we submit randomly generated sequences to BLAST searches.

Rather than giving out a set of pre-generated DNA/Peptide sequences where you only have our word for their randomness, you'll be generating your own random sequences with the [http://www.bioinformatics.org/sms2/ Sequence Manipulation Suite]. We previously used d4/d20 dice to generate these sequences manually, but we have decided to let the computer do the work in order for you to save some time.
It is important to understand that these computer generated sequences are ''totally random'', just as if you were rolling a die to determine each nucleotide/amino acid in each sequence.

===Random DNA sequences and BLASTN===

*Generate three DNA sequences of length 25bp using [http://www.bioinformatics.org/sms2/random_dna.html the random DNA generator] from the [http://www.bioinformatics.org/sms2/ Sequence Manipulation Suite]. '''Note:''' three is not an option, so just generate ten sequences and copy the first three.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.1''':
:Report the three sequences in '''FASTA''' format.
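
If you are curious how such sequences can be produced without the web tool, the following Python sketch generates random DNA sequences and prints them in FASTA format. It is an alternative illustration, not the method used by the Sequence Manipulation Suite; the function name <tt>random_dna_fasta</tt> and the header names <tt>random_dna_1</tt> etc. are made up for this example.

```python
import random


def random_dna_fasta(count=3, length=25, seed=None):
    """Return `count` random DNA sequences of `length` bp as FASTA text.

    Each position is drawn uniformly from A, C, G and T, like rolling a
    four-sided die once per nucleotide.
    """
    rng = random.Random(seed)  # pass a seed to get reproducible sequences
    records = []
    for number in range(1, count + 1):
        sequence = "".join(rng.choice("ACGT") for _ in range(length))
        records.append(f">random_dna_{number}\n{sequence}")
    return "\n".join(records)


print(random_dna_fasta())
```

Each record consists of a header line starting with ">" followed by the sequence on the next line, which is the layout the BLAST input box expects.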

We will now do a BLASTN search using these three random sequences as queries. Follow the "nucleotide blast" link from the main BLAST page, and, as before, select the option "Somewhat similar sequences (blastn)" in the section "Program Selection". Choose "Nucleotide Collection (nr/nt)" as the search database.

'''VERY IMPORTANT''':
For this special situation, where we BLAST short artificial sequences, we need to turn off some of the automatic adjustments that NCBI applies when short sequences are detected. Otherwise we will not be able to see the intended results:

* Expand the "Algorithm parameters" section (see the screenshot below) in order to gain access to the fine-tuning options.
*# '''Deselect''' the "Automatically adjust parameters for short input sequences" option.
*# Set the E-value cut-off ("Expect threshold") to '''50'''

[[file:Blastn_cropped+circle.png|center|frame|'''Remember to adjust the BLAST settings''']]

* Paste in your three sequences in FASTA format and start the BLAST search.

[[file:NCBI_BALST_select_seq.png|frame|'''Browsing BLAST results''': select which of your query sequences to inspect in the drop-down box near the top of the page]]
* Inspect the results.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.2''':
:Answer the following small questions, and '''document your findings''' by pasting in examples of alignments / text snippets from the overview table:
:* Do you find any sequences that look like your input sequences (paste in a few example alignments in your report).
:* What is the typical length of the hits (the alignment length)?
:* What is the typical % identity?
:* In what range are the bit scores ("max score")?
:** ''Notice: This is conceptually the same as the "alignment score" we have already met in the pairwise alignment exercise''.
:* What is the range of the E-values?

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.3''':
:*What is the '''biological''' significance of these hits / is there any biological meaning?

===Random protein sequences and BLASTP===
Now it's time to work with a set of '''protein sequences''': Generate three peptide sequences of length 25aa using [http://www.bioinformatics.org/sms2/random_protein.html the random protein generator].

* '''Notice 1:''' The distribution of amino acids will be equal (5% prob) and this is different from true biological sequences - however this is not important for this first part of the exercise.
* '''Notice 2:''' Please recall from the lecture that the way <tt>BLASTP</tt> selects candidate sequences for full Smith-Waterman alignment is different from <tt>BLASTN</tt>. (<tt>BLASTN</tt> - a single short (11 bp +) perfect match hit is needed. <tt>BLASTP</tt> - a pair of "near match" hits of 3 aa within a 40 aa window is needed).

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.4''':
:Report the sequences in FASTA format.

Locate the "Protein BLAST" page at NCBI and choose blastp as the algorithm to use.

Paste in your sequences in FASTA format, and choose the "NR" database (this is the protein version, consisting of translated CDS'es, UniProt etc).

'''VERY IMPORTANT''': We also need to tweak the parameters this time. In the "Algorithm parameters" section, select BLOSUM62 as the alignment matrix, set the "Expect threshold" to 1000 (default: 10), and DISABLE the "Short queries" option as we did in the DNA search a moment ago; otherwise our carefully tweaked parameters will be ignored.

* Perform the BLAST search.
* Inspect the results.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.5''':
:''(Remember to '''document your answers''' in the same manner as Q2.2)''

:* What is the typical length of the alignments, and do they contain gaps?
:* What is the range of E-values?
:* Try to inspect a few of the alignments in detail ("+" marks similar amino acids) - do you find any that look plausible, if we for a moment ignore the length/E-value?
:* If we had used the default E-value cut-off of 10 would any hits have been found?

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.6''':
:* If we compare the result from BLAST'ing random DNA sequences to random Peptide sequences - which kind of search has the higher risk of returning false positives (results that appear plausible, maybe even significant, but are truly unrelated)?
:** Remember to take E-values into your consideration.

 

==Part 3: using BLAST to transfer functional information by finding homologs==
<div style="background-color: lavender; border: solid thin grey;">
:'''IMPORTANT:''' limit your search to "bacteria" (taxid: 2) in ALL of this section (PART 3) to make the BLAST searches run quicker. (The organisms we're looking for all belong to the "Bacteria" domain of life, so this restriction is OK).
</div>

===Homo-, Ortho- and Paralogs===

One of the most common ways to use BLAST as a tool, is in the situation where you have a sequence of '''unknown function''', and want to '''find out which function it has'''. Since a large amount of sequence data has been gathered during the years, chances are that an '''evolutionarily related''' sequence with known function has already been identified. In general such a related sequence is known as a "'''homolog'''".

Homo-, Ortho- and Paralogs:
* A '''Homolog''' is a general term that describes a sequence that is related by any evolutionary means.
* An '''Ortholog''' ("Ortho" = True) is a sequence that is "the same gene" in a different organism: the sequences share a single common ancestor sequence and have diverged through speciation (e.g. the Alpha-globin gene in Human and Mouse).
* A '''Paralog''' arises due to a gene duplication within a species. For example, Alpha- and Beta-globin are each other's paralogs.

[[File:Homo_Ortho_Para-log.gif|center|frame|''Image source: [http://www.thegreatgoodplace.com/tt/gwlee/126 gwLee's blog]'' ]]

Notice that in both cases it's possible to transfer information, for example information about gene family / protein domains.
We have already touched upon comparison of (potentially) evolutionarily related sequences in the pairwise alignment exercise. However, this time we do not start out with two sequences we assume are related, but we rather start out with a single sequence ("query sequence") which we will use to search the databases for homologs (we often informally speak of "BLAST hits", when discussing the sequences found).

 

===BLAST example 1===

Let's start out with a sequence that will produce some good hits in the database. The sequence below is a full-length transcript (mRNA) from a prokaryote. Let's find out what it is.

>Unknown_transcript01
CCACTTGAAACCGTTTTAATCAAAAACGAAGTTGAGAAGATTCAGTCAACTTAACGTTAATATTTGTTTC
CCAATAGGCAAATCTTTCTAACTTTGATACGTTTAAACTACCAGCTTGGACAAGTTGGTATAAAAATGAG
GAGGGAACCGAATGAAGAAACCGTTGGGGAAAATTGTCGCAAGCACCGCACTACTCATTTCTGTTGCTTT
TAGTTCATCGATCGCATCGGCTGCTGAAGAAGCAAAAGAAAAATATTTAATTGGCTTTAATGAGCAGGAA
GCTGTTAGTGAGTTTGTAGAACAAGTAGAGGCAAATGACGAGGTCGCCATTCTCTCTGAGGAAGAGGAAG
TCGAAATTGAATTGCTTCATGAATTTGAAACGATTCCTGTTTTATCCGTTGAGTTAAGCCCAGAAGATGT
GGACGCGCTTGAACTCGATCCAGCGATTTCTTATATTGAAGAGGATGCAGAAGTAACGACAATGGCGCAA
TCAGTGCCATGGGGAATTAGCCGTGTGCAAGCCCCAGCTGCCCATAACCGTGGATTGACAGGTTCTGGTG
TAAAAGTTGCTGTCCTCGATACAGGTATTTCCACTCATCCAGACTTAAATATTCGTGGTGGCGCTAGCTT
TGTACCAGGGGAACCATCCACTCAAGATGGGAATGGGCATGGCACGCATGTGGCCGGGACGATTGCTGCT
TTAAACAATTCGATTGGCGTTCTTGGCGTAGCGCCGAGCGCGGAACTATACGCTGTTAAAGTATTAGGGG
CGAGCGGTTCAGGTTCGGTCAGCTCGATTGCCCAAGGATTGGAATGGGCAGGGAACAATGGCATGCACGT
TGCTAATTTGAGTTTAGGAAGCCCTTCGCCAAGTGCCACACTTGAGCAAGCTGTTAATAGCGCGACTTCT
AGAGGGGTTCTTGTTGTAGCGGCATCTGGGAATTCAGGTGCAGGCTCAATCAGCTATCCGGCCCGTTATG
CGAACGCAATGGCAGTCGGAGCGACTGACCAAAACAACAACCGCGCCAGCTTTTCACAGTATGGCGCAGG
GCTTGACATTGTCGCACCAGGTGTAAACGTGCAGAGCACATACCCAGGTTCAACGTATGCCAGCTTAAAC
GGTACATCGATGGCTACTCCTCATGTTGCAGGTGCAGCAGCCCTTGTTAAACAAAAGAACCCATCTTGGT
CCAATGTACAAATCCGCAATCATCTAAAGAATACGGCAACGAGCTTAGGAAGCACGAACTTGTATGGAAG
CGGACTTGTCAATGCAGAAGCGGCAACACGCTAATCAATAATAATAGGAGCTGTCCCAAAAGGTCATAGA
TAAATGACCTTTTGGGGTGGCTTTTTTACATTTGGATAAAAAAGCACAAAAAAATCGCCTCATCGTTTAA
AATGAAGGTACC

====BLASTN search====
Perform a BLAST search in the NR/NT database (BLASTN) using default settings. Remember to set the Expect threshold back to the default value, 10. ('''2021 update:''' the new default is 0.05, which should work fine as well.)

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 3.1''':
:''(Once again remember to document your findings)''
:* Do we get any significant hits?
:* What kind of genes (function) do we find?

====BLASTP search====
Now let's try to do the same at the protein level.
* Find the longest ORF using [https://services.healthtech.dtu.dk/services/VirtualRibosome-2.0/ VirtualRibosome] (hint: remember to search all positive reading frames) and save or copy the sequence in FASTA format.
* BLAST the sequence (BLASTP) against the NR database.
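
To make clear what "finding the longest ORF in all positive reading frames" involves, here is a minimal Python sketch of the idea. It is not how VirtualRibosome is implemented; it assumes the standard genetic code, accepts only ATG as start codon, and the function names <tt>translate</tt> and <tt>longest_orf</tt> are made up for this illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def longest_orf(dna):
    """Return the longest protein starting with M in reading frames +1, +2, +3.

    Simplification: an ORF running off the end of the sequence without a
    stop codon is also accepted.
    """
    best = ""
    for frame in range(3):
        protein = translate(dna[frame:])
        for segment in protein.split("*"):   # pieces between stop codons
            start = segment.find("M")        # first start codon in the piece
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best


# Tiny made-up test sequence; the longest ORF is in reading frame +3.
print(longest_orf("CCATGGCCTAAATGAAACCCGGGTTTTAG"))
```

Searching only the three forward frames is appropriate here because the sequence is given as a transcript; for a DNA fragment of unknown orientation, the reverse complement would have to be searched as well.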

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 3.2''':
:(Document!)
:* Report your translated protein sequence in FASTA format.
:* Do we find any conserved protein domains? (''Click the Graphic Summary tab''). Identifying known protein domains can provide important clues to the function of an unknown protein.
:* Do we find any significant hits? (E-value?)
:* Are all the best hits the same category of enzymes?
:* From what you have seen, what is best for identifying intermediate quality hits - DNA or Protein BLAST?

 

===BLAST example 2===

In the previous section we have been cheating a bit by using a sequence that was already in the database - let's move on to the following sequence instead.

The sequence is a '''DNA fragment''' from an unknown, non-cultivable microorganism. It was cloned and sequenced directly from DNA extracted from a soil sample, and it goes by the poetic name "CLONE12". It was amplified using degenerate PCR primers that target the middle ("core cloning") of the sequence of '''a group of known enzymes'''. (I can guarantee this particular sequence is not in the BLAST databases, since I have cloned and sequenced it myself, and it has never been submitted to GenBank).

LOCUS CLONE12.DNA 609 BP DS-DNA UPDATED 06/14/98
DEFINITION UWGCG file capture
ACCESSION -
KEYWORDS -
SOURCE -
COMMENT Non-sequence data from original file:
BASE COUNT 174 A 116 C 162 G 157 T 0 OTHER
ORIGIN ?
clone12.dna Length: 609 Jun 13, 1998 - 03:39 PM Check: 6014 ..
1 AACGGGCACG GGACGCATGT AGCTGGAACA GTGGCAGCCG TAAATAATAA TGGTATCGGA
61 GTTGCCGGGG TTGCAGGAGG AAACGGCTCT ACCAATAGTG GAGCAAGGTT AATGTCCACA
121 CAAATTTTTA ATAGTGATGG GGATTATACA AATAGCGAAA CTCTTGTGTA CAGAGCCATT
181 GTTTATGGTG CAGATAACGG AGCTGTGATC TCGCAAAATA GCTGGGGTAG TCAGTCTCTG
241 ACTATTAAGG AGTTGCAGAA AGCTGCGATC GACTATTTCA TTGATTATGC AGGAATGGAC
301 GAAACAGGAG AAATACAGAC AGGCCCTATG AGGGGAGGTA TATTTATAGC TGCCGCCGGA
361 AACGATAACG TTTCCACTCC AAATATGCCT TCAGCTTATG AACGGGTTTT AGCTGTGGCC
421 TCAATGGGAC CAGATTTTAC TAAGGCAAGC TATAGCACTT TTGGAACATG GACTGATATT
481 ACTGCTCCTG GCGGAGATAT TGACAAATTT GATTTGTCAG AATACGGAGT TCTCAGCACT
541 TATGCCGATA ATTATTATGC TTATGGAGAG GGAACATCCA TGGCTTGTCC ACATGTCGCC
601 GGCGCCGCC
//

[[Image:Office-notes-line_drawing.png|30px|left]][[Image:Cogs_brain.png|right|150px]]
:'''QUESTION 3.3 (Long question - read all)''':
: ''Your task is now to find out '''what kind of enzyme''' this sequence is likely to encode, '''using the methods''' you have learned''.

'''INSTRUCTIONS''': You are free to write the combined answer to this question in a free-style essay-like fashion - just be sure to include the subquestions in your answers. In an exam situation you will need to put all the clues together yourself, reason about the tools/databases to use, and document your findings.

'''STEP 1 - cleaning up the sequence''':

The sequence is (more or less) in GenBank format and the NCBI BLAST server expects the input to be in FASTA format, or to be "raw" unformatted sequence.

* There are two solutions to this:
** Copy the sequence into a text-editor and manually create a FASTA file ("search and replace" and/or "rectangular selection" is useful for the reformatting). This is the most robust solution: it will always work. (Look at the JEdit exercise for a reminder of how to do this).
** Hope the creators of the web-server you're using were kind enough to automatically remove non-DNA letters (paste in ONLY the DNA lines) - this turns out to be the case for both NCBI BLAST and VirtualRibosome, but it ''cannot be universally relied upon''.

'''Subquestion''': convert the sequence to FASTA format (manually, in JEdit) and quote it in your report.
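
For comparison with the manual approach, the same clean-up can be sketched in a few lines of Python. This is only an illustration under the assumption that every sequence line starts with a position number (1, 61, 121, ...), as in the record above; the function name <tt>genbank_like_to_fasta</tt> and the two-line demo record are made up for the example.

```python
import re
import textwrap


def genbank_like_to_fasta(record, name):
    """Extract the sequence from a GenBank-like record and return FASTA text.

    Assumes that sequence lines begin with a position number and that all
    other lines (LOCUS, DEFINITION, '//', ...) do not.
    """
    parts = []
    for line in record.splitlines():
        tokens = line.split()
        if tokens and tokens[0].isdigit():
            # Drop the position number, join the 10-letter blocks,
            # and remove anything that is not a letter.
            parts.append(re.sub(r"[^A-Za-z]", "", "".join(tokens[1:])))
    sequence = "".join(parts).upper()
    # Wrap the sequence at 60 letters per line, as is common for FASTA files.
    return f">{name}\n" + "\n".join(textwrap.wrap(sequence, 60))


# Shortened demo record (first and last sequence lines only).
demo_record = """ORIGIN ?
1 AACGGGCACG GGACGCATGT AGCTGGAACA GTGGCAGCCG TAAATAATAA TGGTATCGGA
601 GGCGCCGCC
//"""
print(genbank_like_to_fasta(demo_record, "CLONE12_demo"))
```

A script like this is convenient for many records, but for a single sequence the text-editor approach is just as quick and makes it easier to see exactly what was removed.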

'''STEP 2 - thinking about the task''':

Consider the following before you start on solving this task:
* Based on the information given: is the sequence protein-coding?
* If it is, can you trust it will contain both a START and STOP codon?
* Do we know if the sequence is sense or anti-sense?
and think which consequences the answers to these questions should have for your choice of methods and parameters.

'''Subquestion''': Give a summary of your considerations.

'''STEP 3 - Performing the database search''':

''Significance'': We will set the threshold for significance at an E-value of 1e-10 (remember: the higher the E-value, the worse the significance).

'''Subquestion''':

Cover the following in your answer:
* What tool(s) and database(s) will be relevant to use?
* Document the results from the different BLAST searches - what works and what does not work?
* You need to copy in small snippets of the BLAST results to document what you observe.
* '''In conclusion''': What kind of enzyme is CLONE12? Gather as much evidence as possible.

 

==Part 4: BLAST'ing Genomes==
<div style="background-color: lavender; border: solid thin grey;">
:'''IMPORTANT:''' do '''NOT''' limit your search to "bacteria" here - now we are actively looking at organism specific queries.
</div>

So far we have been using BLAST to search in the big, broad databases that cover a huge set of sequences from a large range of organisms. In this final part of the exercise we will be doing some more focused searches in smaller databases by targeting specific genomes.

Typically this will be useful if you have a gene of known function from one organism (say a cell-cycle controlling gene from Yeast, ''Saccharomyces cerevisiae'') and want to find the human homolog/ortholog to this gene (genes that control cell division are often involved in cancer).

While performing the BLAST searches, you have probably already noticed that it's possible to search specifically in the Human and Mouse genomes (these databases only contain sequences from Human/Mouse). It's also possible to restrict the output from searches in the large databases (e.g. NR) to specific organisms.

A growing number of organisms have been fully sequenced, and the research teams responsible for a large scale genome project typically put up their own Web resources for accessing the data. For example the Yeast genome is principally hosted in the Saccharomyces Genome Database (SGD - www.yeastgenome.org) - it should be noted that SGD also offers BLAST as a means to search the database.



===Genome specific analysis of histones===
====SGD====
Let's do a small study of the relationship between the histones found in Yeast and in Human (evolutionary distance: ~1-1.5 billion years).

Look up the '''HTA2''' gene in SGD (http://www.yeastgenome.org - use the search box at the top of the page). Notice that a brief description about the function of the gene and its protein product is displayed (a huge amount of additional information can be found further down the page - much of it Yeast specific).

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 4.1''':
:What information is given about the relationship between this gene and the gene "HTA1"?

Browse the page and locate the link to the protein sequence. Save the sequence as a file, '''we'll need it in a moment'''.

====NCBI====


Now return to the NCBI blastp page and paste in the HTA2 protein sequence you just saved. Set Database to "Reference proteins (refseq_protein)", and enter <tt>Saccharomyces cerevisiae</tt> in the Organism field (and accept the suggestion with taxid:4932).

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 4.2''':
:''(Remember to document your answers)''
:* How many high-confidence hits do we get?
:* Do the hits make sense, from what you have read about HTA2 at the SGD webpage?
:'''Tip:''' click on the Gene links under Related Information (to the right of the alignments) to see the gene names for the protein hits.

The next step is to search the translated version of the human genome.

Do as before, still with Database set to "Reference proteins (refseq_protein)", just enter <tt>Human</tt> in the Organism field.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 4.3''':
:* How many high-confidence hits (with E-value better than 10<sup>-10</sup>, i.e. 1e-10) are found? (Approximately)
:* What are all the high-confidence hits called?




==Concluding remarks==
Today we have been using BLAST to find a number of homologous genes (and protein-products). If we want to go even deeper into the analysis of the homologs, the next logical step would be to build a dataset of the full-length versions of the sequences we have found (not just the part found by the local alignment in BLAST).

A further analysis could consist of a series of pairwise alignments (for finding out what is similar/different between pairs of sequences) or a multiple alignment which could form the basis of establishing the evolutionary relationship between the entire set of sequences.

BLAST can also be used as a way to build a dataset of sequences based on a known "seed" sequence. As we saw in the GenBank exercise, free-text searching in GenBank can be difficult, and if we for instance wanted to build a dataset of variants of the insulin gene, an easy way around this would be to BLAST the normal version of insulin against the sequence database of choice and pick the best-matching hits from there.

* Ident — the percent identity in the alignment(s)
* Accession — the accession number of the database hit.
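To make the "Ident" and "Query cover" columns more concrete, here is a minimal Python sketch that computes them for a single gapped alignment. The two aligned strings and the query length are made up, and the function name is our own. NCBI computes query cover over all alignments to a hit, so treat this as a simplified view of one alignment only.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Compute percent identity, gap count and query cover for one alignment.

    Both aligned strings must have the same length; '-' marks a gap.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    columns = len(aligned_query)
    # A column counts as an identity when both rows hold the same letter.
    identities = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    gaps = aligned_query.count("-") + aligned_subject.count("-")
    # Query positions covered = alignment columns that are not gaps in the query.
    query_residues = columns - aligned_query.count("-")
    return {
        "percent_identity": 100.0 * identities / columns,
        "gaps": gaps,
        "query_cover": 100.0 * query_residues / query_length,
    }

# Invented example: an 11-base stretch of a 20-base query aligned with one gap.
stats = alignment_stats("ACGTTGCA-ACG", "ACGATGCATACG", query_length=20)
print(stats)
```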

First, take a look at the best hit. Since our search sequence (the query) was taken from GenBank which is part of NR, we should find an identical sequence in the search. Make sure this is the case!

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1.1''':
:Answer the following questions about the best hit:
:* what is the identifier (Accession)?
:* what is the alignment score ("max score")?
:* what is the percent identity and query coverage?
:* what is the E-value?
:* are there any gaps in the alignment?

Then, find the best hit from human (''Homo sapiens'') that is ''not'' a synthetic construct. ('''Tip:''' you can press Ctrl-F in most browsers to search in the page).

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1.2''':
Answer the same questions as before about the hit you found now.

=== Search against Human G+T ===

'''Note:''' In this context, G+T does not mean Gin and Tonic.

Open ''a new window/tab'' with the BLAST home page. Make a new BLASTN search with the same query sequence, this time with Database set to Human genomic + transcript (Human G+T). Remember again to select Somewhat similar sequences (blastn) under Program Selection. Consider the best hit.

'''Note:''' even though you may not have found exactly the same database entry in the two searches, the ''alignment'' should be the same. Make sure this is the case by comparing the actual alignments in the two windows where you made the searches.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1.3''':
Answer the same questions as before about the best hit you found in this search.

===Concerning database size and E-values===

When answering the previous two questions, you may have noticed that the E-value changed, while the alignment score did not. We will now investigate this further.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1.4''':
What are the sizes (in basepairs) of the databases we used for the two BLAST searches? ('''Tip:''' Expand the "Search summary" section near the top by clicking it).

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 1.5''':
:'''Hint:''' remember, you can use Google as a calculator!
:*What is the ratio between the database sizes in the two BLAST searches?
:*What is the ratio between the E-values (for the best human hits) in the two BLAST searches?
:*What is the relationship between database size and E-value for hits with identical alignment score?
:*In conclusion: if the database size is doubled, what will happen to the E-value?
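Once you have answered the questions above from your own search results, you can compare them with the textbook relation between bit score and E-value, E = m &middot; n &middot; 2<sup>-S'</sup>, where m is the query length, n the database length and S' the bit score. The Python sketch below simply evaluates that formula. The database sizes and the bit score are invented, and real BLAST uses slightly shorter "effective" lengths, so the numbers will not match the NCBI output exactly.

```python
def expect_value(bit_score, query_length, database_length):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)

# The Degu insulin mRNA above is 432 bases long; the other numbers are made up.
small = expect_value(bit_score=60.0, query_length=432, database_length=1_000_000_000)
large = expect_value(bit_score=60.0, query_length=432, database_length=4_000_000_000)

print(small, large, large / small)
```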

==Part 2: Assessing the statistical significance of BLAST hits==
<div style="background-color: lavender; border: solid thin grey;">
:'''IMPORTANT:''' limit your search to "bacteria" (taxid: 2) in ALL of this section (PART 2) to make the BLAST searches run quicker.
</div>

As discussed in the lecture, there is a risk of getting false positive results (hits to sequences that are not related to our input sequence) purely by chance. In this part of the exercise we will investigate this further by examining what happens when we submit randomly generated sequences to BLAST searches.

Rather than giving out a set of pre-generated DNA/Peptide sequences where you only have our word for their randomness, you'll be generating your own random sequences with the [http://www.bioinformatics.org/sms2/ Sequence Manipulation Suite]. We previously used d4/d20 dice to generate these sequences manually, but we have decided to let the computer do the work in order for you to save some time.
It is important to understand that these computer generated sequences are ''totally random'', just as if you were rolling a die to determine each nucleotide/amino acid in each sequence.

===Random DNA sequences and BLASTN===

*Generate three DNA sequences of length 25bp using [http://www.bioinformatics.org/sms2/random_dna.html the random DNA generator] from the [http://www.bioinformatics.org/sms2/ Sequence Manipulation Suite]. '''Note:''' three is not an option, so just generate ten sequences and copy the first three.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.1''':
:Report the three sequences in '''FASTA''' format.
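If you are curious about what the random generator does, the following Python sketch produces the same kind of output: each position is drawn independently with equal probability for every letter, and the result is written in FASTA format. The sequence names are our own choice, and this is not the code used by the Sequence Manipulation Suite.

```python
import random

def random_sequence(length, alphabet="ACGT"):
    """Draw each position independently and uniformly from the alphabet."""
    return "".join(random.choice(alphabet) for _ in range(length))

def to_fasta(name, sequence, width=70):
    """Format one sequence as a FASTA record with a header line."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines)

# Three random 25 bp DNA sequences, as asked for above.
records = [to_fasta(f"random_dna_{i}", random_sequence(25)) for i in range(1, 4)]
print("\n".join(records))
```

Passing the 20 amino acid letters as <tt>alphabet</tt> gives random peptides in the same way.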

We will now do a BLASTN search using these three random sequences as queries. Follow the "nucleotide blast" link from the main BLAST page, and, as before, select the option "Somewhat similar sequences (blastn)" in the section "Program Selection". Choose "Nucleotide Collection (nr/nt)" as the search database.

'''VERY IMPORTANT''':
For this special situation, where we BLAST small artificial sequences, we need to turn off some of the automatic adjustments NCBI applies when short sequences are detected. Otherwise we will not be able to see the intended results:

* Extend the "Algorithm parameters" section (see the screen shot below) in order to gain access to fine-tuning the options.
*# '''Deselect''' the "Automatically adjust parameters for short input sequences" option.
*# Set the E-value cut-off ("Expect threshold") to '''50'''

[[file:Blastn_cropped+circle.png‎|center|frame|'''Remember to adjust the BLAST settings''']]

* Paste in your three sequences in FASTA format and start the BLAST search.

[[file:NCBI_BALST_select_seq.png|frame|'''Browsing BLAST results''': select which of your query sequences to inspect in the drop-down box near the top of the page]]
* Inspect the results.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.2''':
:Answer the following small questions, and '''document your findings''' by pasting in examples of alignments / text snippets from the overview table:
:* Do you find any sequences that look like your input sequences (paste in a few example alignments in your report).
:* What is the typical length of the hits (the alignment length)?
:* What is the typical % identity?
:* In what range are the bit scores ("max score")?
:** ''Notice: This is conceptually the same as the "alignment score" we have already met in the pairwise alignment exercise''.
:* What is the range of the E-values?

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.3''':
:*What is the '''biological''' significance of these hits / is there any biological meaning?

===Random protein sequences and BLASTP===
Now it's time to work with a set of '''protein sequences''': Generate three peptide sequences of length 25aa using [http://www.bioinformatics.org/sms2/random_protein.html the random protein generator].

* '''Notice 1:''' The distribution of amino acids will be equal (5% prob) and this is different from true biological sequences - however this is not important for this first part of the exercise.
* '''Notice 2:''' Please recall from the lecture that the way <tt>BLASTP</tt> selects candidate sequences for full Smith-Waterman alignment is different from <tt>BLASTN</tt>. (<tt>BLASTN</tt> - a single short (11 bp +) perfect match hit is needed. <tt>BLASTP</tt> - a pair of "near match" hits of 3 aa within a 40 aa window is needed).
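The <tt>BLASTN</tt> seeding step mentioned in Notice 2 can be sketched in a few lines of Python: index all words of length 11 in the query, then scan the database sequence for exact occurrences. The two sequences are invented, and the sketch covers only the seeding, not the extension and scoring that follow in the real program. The <tt>BLASTP</tt> "near match" seeding needs a substitution matrix and is not shown.

```python
def shared_words(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where an exact word match starts."""
    # Index every word of the query by its start position(s).
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    # Scan the subject and look each of its words up in the index.
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds

# Invented sequences that share a 14-base stretch.
query = "TTGACCGATTACGGATCCAAG"
subject = "CCCCGATTACGGATCCTTTT"
seeds = shared_words(query, subject)
print(seeds)
```

A 25 bp random query contains only 15 such words, which is one reason the short-query settings matter in the search above.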

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.4''':
:Report the sequences in FASTA format.

Locate the "Protein BLAST" page at NCBI and choose blastp as the algorithm to use.

Paste in your sequences in FASTA format, and choose the "NR" database (this is the protein version, consisting of translated CDS'es, UniProt etc).

'''VERY IMPORTANT''': We also need to tweak the parameters this time - in the "Algorithm Parameters" section select BLOSUM62 as the alignment matrix to use and set the "Expect threshold" to 1000 (default: 10) - and DISABLE the "Short queries" parameters as we did in the DNA search a moment ago - otherwise our carefully tweaked parameters will be ignored.

* Perform the BLAST search.
* Inspect the results.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.5''':
:''(Remember to '''document your answers''' in the same manner as Q2.2)''

:* What is the typical length of the alignment and do they contain gaps?
:* What is the range of E-values?
:* Try to inspect a few of the alignments in detail ("+" means similar amino acids) - do you find any that look plausible, if we for a moment ignore the length/E-value?
:* If we had used the default E-value cut-off of 10 would any hits have been found?

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 2.6''':
:* If we compare the result from BLAST'ing random DNA sequences to random Peptide sequences - which kind of search has the higher risk of returning false positives (results that appear plausible, maybe even significant, but are truly unrelated)?
:** Remember to take E-values into your consideration.

 

==Part 3: using BLAST to transfer functional information by finding homologs==
<div style="background-color: lavender; border: solid thin grey;">
:'''IMPORTANT:''' limit your search to "bacteria" (taxid: 2) in ALL of this section (PART 3) to make the BLAST searches run quicker. (The organisms we're looking for all belong to the "Bacteria" domain of life, so this restriction is OK).
</div>

===Homo-, Ortho- and Paralogs===

One of the most common ways to use BLAST as a tool, is in the situation where you have a sequence of '''unknown function''', and want to '''find out which function it has'''. Since a large amount of sequence data has been gathered during the years, chances are that an '''evolutionarily related''' sequence with known function has already been identified. In general such a related sequence is known as a "'''homolog'''".

Homo-, Ortho- and Paralogs:
* A '''Homolog''' is a general term that describes a sequence that is related by any evolutionary means.
* An '''Ortholog''' ("Ortho" = True) is a sequence that is "the same gene" in a different organism: the sequences share a single common ancestor sequence and have diverged through speciation (e.g. the Alpha-globin gene in Human and Mouse).
* A '''Paralog''' arises due to a gene duplication within a species. For example Alpha- and Beta-globin are each other's paralogs.

[[File:Homo_Ortho_Para-log.gif|center|frame|''Image source: [http://www.thegreatgoodplace.com/tt/gwlee/126 gwLee's blog]'' ]]

Notice that in both cases it's possible to transfer information, for example information about gene family / protein domains.
We have already touched upon comparison of (potentially) evolutionarily related sequences in the pairwise alignment exercise. However, this time we do not start out with two sequences we assume are related, but we rather start out with a single sequence ("query sequence") which we will use to search the databases for homologs (we often informally speak of "BLAST hits", when discussing the sequences found).

 

===BLAST example 1===

Let's start out with a sequence that will produce some good hits in the database. The sequence below is a full-length transcript (mRNA) from a prokaryote. Let's find out what it is.

>Unknown_transcript01
CCACTTGAAACCGTTTTAATCAAAAACGAAGTTGAGAAGATTCAGTCAACTTAACGTTAATATTTGTTTC
CCAATAGGCAAATCTTTCTAACTTTGATACGTTTAAACTACCAGCTTGGACAAGTTGGTATAAAAATGAG
GAGGGAACCGAATGAAGAAACCGTTGGGGAAAATTGTCGCAAGCACCGCACTACTCATTTCTGTTGCTTT
TAGTTCATCGATCGCATCGGCTGCTGAAGAAGCAAAAGAAAAATATTTAATTGGCTTTAATGAGCAGGAA
GCTGTTAGTGAGTTTGTAGAACAAGTAGAGGCAAATGACGAGGTCGCCATTCTCTCTGAGGAAGAGGAAG
TCGAAATTGAATTGCTTCATGAATTTGAAACGATTCCTGTTTTATCCGTTGAGTTAAGCCCAGAAGATGT
GGACGCGCTTGAACTCGATCCAGCGATTTCTTATATTGAAGAGGATGCAGAAGTAACGACAATGGCGCAA
TCAGTGCCATGGGGAATTAGCCGTGTGCAAGCCCCAGCTGCCCATAACCGTGGATTGACAGGTTCTGGTG
TAAAAGTTGCTGTCCTCGATACAGGTATTTCCACTCATCCAGACTTAAATATTCGTGGTGGCGCTAGCTT
TGTACCAGGGGAACCATCCACTCAAGATGGGAATGGGCATGGCACGCATGTGGCCGGGACGATTGCTGCT
TTAAACAATTCGATTGGCGTTCTTGGCGTAGCGCCGAGCGCGGAACTATACGCTGTTAAAGTATTAGGGG
CGAGCGGTTCAGGTTCGGTCAGCTCGATTGCCCAAGGATTGGAATGGGCAGGGAACAATGGCATGCACGT
TGCTAATTTGAGTTTAGGAAGCCCTTCGCCAAGTGCCACACTTGAGCAAGCTGTTAATAGCGCGACTTCT
AGAGGGGTTCTTGTTGTAGCGGCATCTGGGAATTCAGGTGCAGGCTCAATCAGCTATCCGGCCCGTTATG
CGAACGCAATGGCAGTCGGAGCGACTGACCAAAACAACAACCGCGCCAGCTTTTCACAGTATGGCGCAGG
GCTTGACATTGTCGCACCAGGTGTAAACGTGCAGAGCACATACCCAGGTTCAACGTATGCCAGCTTAAAC
GGTACATCGATGGCTACTCCTCATGTTGCAGGTGCAGCAGCCCTTGTTAAACAAAAGAACCCATCTTGGT
CCAATGTACAAATCCGCAATCATCTAAAGAATACGGCAACGAGCTTAGGAAGCACGAACTTGTATGGAAG
CGGACTTGTCAATGCAGAAGCGGCAACACGCTAATCAATAATAATAGGAGCTGTCCCAAAAGGTCATAGA
TAAATGACCTTTTGGGGTGGCTTTTTTACATTTGGATAAAAAAGCACAAAAAAATCGCCTCATCGTTTAA
AATGAAGGTACC

====BLASTN search====
Perform a BLAST search in the NR/NT database (BLASTN) using default settings. Remember to set Expect threshold back to the default value, 10. ('''2021 update:''' The new default is 0.05, that should work fine as well).

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 3.1''':
:''(Once again remember to document your findings)''
:* Do we get any significant hits?
:* What kind of genes (function) do we find?

====BLASTP search====
Now let's try to do the same at the protein level.
* Find the longest ORF using [https://services.healthtech.dtu.dk/service.php?VirtualRibosome VirtualRibosome] (hint: remember to search all positive reading frames) and save or copy the sequence in FASTA format.
* BLAST the sequence (BLASTP) against the NR database.
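To show what "searching all positive reading frames" means, here is a minimal Python sketch of an ORF finder. It assumes the standard genetic code, accepts only ATG as start codon and ignores ORFs that run off the end of the sequence. VirtualRibosome offers more options than this, so use the sketch to understand the principle rather than to replace the tool. The short test sequence is invented.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual TCAG-ordered table.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def longest_orf(dna):
    """Return the longest ATG-to-stop protein in the three forward frames."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        protein = ""
        in_orf = False
        for pos in range(frame, len(dna) - 2, 3):
            amino_acid = CODON_TABLE.get(dna[pos:pos + 3], "X")
            if not in_orf and amino_acid == "M":
                in_orf, protein = True, ""   # start a new ORF at this ATG
            if in_orf:
                if amino_acid == "*":        # stop codon closes the ORF
                    if len(protein) > len(best):
                        best = protein
                    in_orf = False
                else:
                    protein += amino_acid
    return best

# Invented toy sequence; the ORF is in the third forward frame.
print(longest_orf("CCATGAAACCGTTGGGGTAAATGTGA"))
```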

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 3.2''':
:(Document!)
:* Report your translated protein sequence in FASTA format.
:* Do we find any conserved protein domains? (''Click the Graphic Summary tab''). Identifying known protein domains can provide important clues to the function of an unknown protein.
:* Do we find any significant hits? (E-value?)
:* Are all the best hits the same category of enzymes?
:* From what you have seen, what is best for identifying intermediate quality hits - DNA or Protein BLAST?

 

===BLAST example 2===

In the previous section we have been cheating a bit by using a sequence that was already in the database - let's move on to the following sequence instead.

The sequence is a '''DNA fragment''' from an unknown non-cultivatable microorganism. It was cloned and sequenced directly from DNA extracted from a soil sample, and it goes by the poetic name "CLONE12". It was amplified using degenerate PCR primers that target the middle ("core cloning") of the sequence of '''a group of known enzymes'''. (I can guarantee this particular sequence is not in the BLAST databases, since I have cloned and sequenced it myself, and it has never been submitted to GenBank).

LOCUS CLONE12.DNA 609 BP DS-DNA UPDATED 06/14/98
DEFINITION UWGCG file capture
ACCESSION -
KEYWORDS -
SOURCE -
COMMENT Non-sequence data from original file:
BASE COUNT 174 A 116 C 162 G 157 T 0 OTHER
ORIGIN ?
clone12.dna Length: 609 Jun 13, 1998 - 03:39 PM Check: 6014 ..
1 AACGGGCACG GGACGCATGT AGCTGGAACA GTGGCAGCCG TAAATAATAA TGGTATCGGA
61 GTTGCCGGGG TTGCAGGAGG AAACGGCTCT ACCAATAGTG GAGCAAGGTT AATGTCCACA
121 CAAATTTTTA ATAGTGATGG GGATTATACA AATAGCGAAA CTCTTGTGTA CAGAGCCATT
181 GTTTATGGTG CAGATAACGG AGCTGTGATC TCGCAAAATA GCTGGGGTAG TCAGTCTCTG
241 ACTATTAAGG AGTTGCAGAA AGCTGCGATC GACTATTTCA TTGATTATGC AGGAATGGAC
301 GAAACAGGAG AAATACAGAC AGGCCCTATG AGGGGAGGTA TATTTATAGC TGCCGCCGGA
361 AACGATAACG TTTCCACTCC AAATATGCCT TCAGCTTATG AACGGGTTTT AGCTGTGGCC
421 TCAATGGGAC CAGATTTTAC TAAGGCAAGC TATAGCACTT TTGGAACATG GACTGATATT
481 ACTGCTCCTG GCGGAGATAT TGACAAATTT GATTTGTCAG AATACGGAGT TCTCAGCACT
541 TATGCCGATA ATTATTATGC TTATGGAGAG GGAACATCCA TGGCTTGTCC ACATGTCGCC
601 GGCGCCGCC
//

[[Image:Office-notes-line_drawing.png|30px|left]][[Image:Cogs_brain.png|right|150px]]
:'''QUESTION 3.3 (Long question - read all)''':
: ''Your task is now to find out '''what kind of enzyme''' this sequence is likely to encode, '''using the methods''' you have learned''.

'''INSTRUCTIONS''': You are free to write the combined answer to this question in a free-style essay-like fashion - just be sure to include the subquestions in your answers. In an exam situation you will need to put all the clues together yourself, reason about the tools/databases to use, and document your findings.

'''STEP 1 - cleaning up the sequence''':

The sequence is (more or less) in GenBank format and the NCBI BLAST server expects the input to be in FASTA format, or to be "raw" unformatted sequence.

* There are two solutions to this:
** Copy the sequence into a text-editor and manually create a FASTA file ("search and replace" and/or "rectangular selection" is useful for the reformatting). This is the most robust solution: it will always work. (Look at the JEdit exercise for a reminder of how to do this).
** Hope the creators of the web-server you're using were kind enough to automatically remove non-DNA letters (paste in ONLY the DNA lines) - this turns out to be the case for both NCBI BLAST and VirtualRibosome, but it ''cannot be universally relied upon''.

'''Subquestion''': convert the sequence to FASTA format (manually, in JEdit) and quote it in your report.
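For comparison with the manual approach, the same clean-up can be sketched in Python: keep only the lines between <tt>ORIGIN</tt> and <tt>//</tt> that start with a position number, and strip the numbers and spaces. The function name and the shortened example record are our own. You should still do the conversion by hand in JEdit for your report.

```python
def genbank_sequence_to_fasta(genbank_text, name, width=60):
    """Extract the sequence between ORIGIN and // and format it as FASTA."""
    letters = []
    in_sequence = False
    for line in genbank_text.splitlines():
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if line.startswith("//"):
            break
        if in_sequence:
            parts = line.split()
            # Sequence lines start with a position number; other lines are skipped.
            if parts and parts[0].isdigit():
                letters.append("".join(parts[1:]))
    sequence = "".join(ch for ch in "".join(letters).upper() if ch.isalpha())
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines)

# Shortened, made-up record in the same layout as CLONE12 above.
example = """ORIGIN ?
clone12.dna Length: 40 Check: 0 ..
1 AACGGGCACG GGACGCATGT AGCTGGAACA
31 GTGGCAGCCG
//"""
fasta = genbank_sequence_to_fasta(example, "CLONE12_excerpt")
print(fasta)
```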

'''STEP 2 - thinking about the task''':

Consider the following before you start on solving this task:
* Based on the information given: is the sequence protein-coding?
* If it is, can you trust it will contain both a START and STOP codon?
* Do we know if the sequence is sense or anti-sense?
and think which consequences the answers to these questions should have for your choice of methods and parameters.

'''Subquestion''': Give a summary of your considerations.
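If the strand of a fragment is unknown, it can be useful to look at the opposite strand as well. The Python sketch below computes the reverse complement of a DNA string; the input is just the first ten bases of CLONE12. Whether you need this step depends on the tool you choose in STEP 3, so take it as background rather than as part of the required answer.

```python
# Translation table that swaps each base with its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("AACGGGCACG"))
```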

'''STEP 3 - Performing the database search''':

''Significance'': We will set the criterion for significance at 1e-10 (remember: the higher the E-value, the worse the significance).

'''Subquestion''':

Cover the following in your answer:
* What tool(s) and database(s) will be relevant to use?
* Document the results from the different BLAST searches - what works and what does not work?
* You need to copy in small snippets of the BLAST results to document what you observe.
* '''In conclusion''': What kind of enzyme is CLONE12? Gather as much evidence as possible.

 

==Part 4: BLAST'ing Genomes==
<div style="background-color: lavender; border: solid thin grey;">
:'''IMPORTANT:''' do '''NOT''' limit your search to "bacteria" here - now we are actively looking at organism specific queries.
</div>

So far we have been using BLAST to search in the big, broad databases that cover a huge set of sequences from a large range of organisms. In this final part of the exercise we will be doing some more focused searches in smaller databases by targeting specific genomes.

Typically this will be useful if you have a gene of known function from one organism (say a cell-cycle controlling gene from Yeast, ''Saccharomyces cerevisiae'') and want to find the human homolog/ortholog to this gene (genes that control cell division are often involved in cancer).

When you have been performing the BLAST searches, you have probably already noticed that it's possible to search specifically in the Human and Mouse genomes (these databases only contain sequences from Human/Mouse). It's also possible to restrict the output from searches in the large databases (e.g. NR) to specific organisms.

A growing number of organisms have been fully sequenced, and the research teams responsible for a large scale genome project typically put up their own Web resources for accessing the data. For example the Yeast genome is principally hosted in the Saccharomyces Genome Database (SGD - www.yeastgenome.org) - it should be noted that SGD also offers BLAST as a means to search the database.



===Genome specific analysis of histones===
====SGD====
Let's do a small study of the relationship between the histones found in Yeast and in Human (evolutionary distance: ~1-1.5 billion years).

Look up the '''HTA2''' gene in SGD (http://www.yeastgenome.org - use the search box at the top of the page). Notice that a brief description about the function of the gene and its protein product is displayed (a huge amount of additional information can be found further down the page - much of it Yeast specific).

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 4.1''':
:What information is given about the relationship between this gene and the gene "HTA1"?

Browse the page and locate the link to the protein sequence. Save the sequence as a file, '''we'll need it in a moment'''.

====NCBI====


Now return to the NCBI blastp page. Set Database to "Reference proteins (refseq_protein)", and enter <tt>Saccharomyces cerevisiae</tt> in the Organism field (and accept the suggestion with taxid:4932).

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 4.2''':
:''(Remember to document your answers)''
:* How many high-confidence hits do we get?
:* Do the hits make sense, from what you have read about HTA2 at the SGD webpage?
:'''Tip:''' click on the Gene links under Related Information (to the right of the alignments) to see the gene names for the protein hits.

The next step is to search the translated version of the human genome.

Do as before, still with Database set to "Reference proteins (refseq_protein)", just enter <tt>Human</tt> in the Organism field.

[[Image:Office-notes-line_drawing.png|30px|left]]
:'''QUESTION 4.3''':
:* How many high-confidence hits (with E-value better than 1e-10) are found? (Approximately)
:* What are all the high-confidence hits called?




==Concluding remarks==
Today we have been using BLAST to find a number of homologous genes (and protein-products). If we want to go even deeper into the analysis of the homologs, the next logical step would be to build a dataset of the full-length versions of the sequences we have found (not just the part found by the local alignment in BLAST).

A further analysis could consist of a series of pairwise alignments (for finding out what is similar/different between pairs of sequences) or a multiple alignment which could form the basis of establishing the evolutionary relationship between the entire set of sequences.

BLAST can also be used as a way to build a dataset of sequences based on a known "seed" sequence. As we saw in the GenBank exercise, free-text searching in GenBank can be difficult, and if we for instance wanted to build a dataset of variants of the insulin gene, an easy way around this would be to BLAST the normal version of insulin against the sequence database of choice, and pick the best matching hits from there.

Bioinformatics in practice, Faroe Islands 2022

2024-03-15T12:07:31Z

WikiSysop: /* Afternoon: Phylogenetic trees */

This is the home page for week 44+45 of the "Bioinformatics in practice" course, Faroe Islands 2022. These five days are taught by [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=214126&tab=2&qt=dtupublicationquery Henrik Nielsen] and [https://globe.ku.dk/staff-list/hologenomics/?id=271131&vis=medarbejder Bent Petersen].

== Wednesday November 2 ==
=== Morning: Introduction, plain text files and taxonomy databases ===
'''Lectures:'''
* ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Intro+bioinformaticsFO22.pdf Introduction to bioinformatics and computers] - Henrik''
* ''[https://teaching.healthtech.dtu.dk/material/22111/On_evolution_and_sequences_2020.pdf Evolution & taxonomy] - Bent''

'''Software for installation:'''

* '''Java''' runtime engine is needed for running some of the software we use in the course, including jEdit (see below). Download from here: https://www.oracle.com/java/technologies/downloads/#java17 (choose java 17, not 18 or 19, and select your type of computer) '''or''' from here: https://adoptium.net/ (choose Temurin JDK 17).
** '''NOTE:''' Do NOT download java from https://java.com/ — that will give you Oracle java 8, which is NOT good enough for jEdit anymore. jEdit version 5.6 and later needs java 11 or higher which is available from the above links (and from a few other places).
** '''IMPORTANT TIP''' for Windows users: You need to enable the sub-feature named "set JAVA_HOME variable" when installing Temurin JDK.
* jEdit: http://jedit.org/
** '''NOTE:''' The jEdit developers have not signed the installation package, therefore both Windows and MacOS X will complain when you first attempt to install it, and you have to insist that it is OK to run the program. For Macs, this is a bit complicated, see the instructions in [[ExJEdit#Download_and_Install_jEdit|the exercise guide]].
** '''IMPORTANT TIP''' for Mac users: If you have an M1 or M2 (ARM) Mac, you should use "Java-based installer" (.jar file) instead of "OS X package" (.dmg file).

'''Exercises:'''
* [[ExJEdit|jEdit]] — [[ExJEdit-Answers|Answers]]
* [[ExTaxonomy|Taxonomy databases]] — [[ExTaxonomy-Answers|Answers]]

=== Afternoon: GenBank ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Lecture_GenBank_F22_HN.pdf Biological information, DNA, sequencing, and GenBank] - Henrik''

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/HandoutEx_BaseCalling_Simple.pdf‎ "Base-calling"] (PDF) / [https://teaching.healthtech.dtu.dk/material/22111/BaseCalling_on_screen_version.pdf On-screen version] (PDF)

'''Materials:'''
* [https://teaching.healthtech.dtu.dk/material/22111/GenBank+FASTA_handout_revised.pdf GenBank + FASTA format] (PDF)
* [https://teaching.healthtech.dtu.dk/material/22111/GeneStructure.pdf Eukaryotic gene structure overview] (PDF)

'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] — [[ExGenbank-new-answers|Answers]]

== Friday November 4 ==
=== Morning: Translation and UniProt ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Lecture_UniProt_2022.pdf Proteins: data and databases] - Henrik''

'''Materials:'''
* [https://teaching.healthtech.dtu.dk/material/22111/VirtualRibosome.pdf Virtual Ribosome] — software article (PDF)
* [https://teaching.healthtech.dtu.dk/material/22111/PDF/protein_handout.pdf Proteins, levels of structure] (PDF)
* [https://teaching.healthtech.dtu.dk/material/22111/GeneStructure.pdf Eukaryotic gene structure overview] (PDF)

'''Exercises:'''
* [[Exercise: Translation - Virtual Ribosome]] — [[ExTranslation-answers|Answers]]
* [[Exercise: The protein database UniProt]] — [[ExUniProt-answers|Answers]]

=== Afternoon: Protein structure, PDB & PyMOL ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/22111-ProteinStructure_2022_CBQ-reduced.pdf Protein 3D structure] - Bent''

'''Software''' for installation: [https://pymol.org/2/ PyMOL]
: '''Note:''' you will need a license file which we will provide via Zoom.

'''Exercises:'''
* [https://teaching.healthtech.dtu.dk/material/22111/PyMol_tutorial2017_v4.pdf PyMol tutorial] (PDF)
* [[Protein Structure and Visualization]] — [[Protein_Structure_and_Visualization_Answers|Answers]] (NB: The section "PyMOL magic" is ''not'' part of your curriculum, just a handy tip!)

== Monday November 7 ==
=== Morning: Pairwise alignment ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/PairwiseAlignment2.pdf Pairwise alignment] - Henrik''

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/New_handout_alignscores.pdf Alignment scores]

'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]] — [[ExPairwiseAlignment-Answers|Answers]]

'''Wrap-up:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/Pairwise_alignment_revisited.pdf Pairwise alignment revisited]''

=== Afternoon: Sequence database searching with BLAST ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/Lecture_BLAST_2020.pdf Introduction to BLAST] - Bent''

'''Exercise:''' [[Exercise: BLAST]] — [[ExBlast-Answers|Answers]]

'''Wrap-up:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/Blastn_vs_Blastp.2017.pdf BLASTN vs BLASTP]''

== Wednesday November 9 ==
=== Morning: Sequence information and logo plots ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Logos_2022.pdf Sequence information and logo plots] - Henrik''

'''Materials:''' "''Information theory primer''" ([https://teaching.healthtech.dtu.dk/material/22111/PDF/informationtheory_primer.pdf PDF])

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/Logo_exercise.pdf How to construct sequence logos] — [https://teaching.healthtech.dtu.dk/material/22111/FO2022/Ex_Logo_ans.pdf Answer]

'''Exercise:''' [[ExSeqLogos|DNA and Peptide LOGOs]] — [[ExSeqLogosAnswers|Answers]]

=== Afternoon: Profile searching with PSI-BLAST ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/PSI-BLAST_FO2022.pdf PSI-BLAST] - Bent''

'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]] — [[ExPSIBLAST_answer|Answers]]

== Friday November 11 ==
=== Morning: Multiple alignments ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/hnielsen_mulalign.pdf Multiple Alignments] - Henrik''

'''Materials:''' RevTrans (article, [https://teaching.healthtech.dtu.dk/material/22111/FO2022/RevTrans.pdf PDF])

'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]] — [[ExMulAlign-Answers-English|Answers]]

=== Afternoon: Phylogenetic trees ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Gorm_phylogeny_newbackground_rwe_version.pdf Phylogenetic trees] - Bent''

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/FO2022/handout_distance.pdf Reconstruction of distance tree]

'''Software for installation:''' [https://github.com/rambaut/figtree/releases FigTree tree-viewer]
:'''IMPORTANT NOTE''' for Windows users: Download the <tt>.zip</tt> file (FigTree.v1.4.4.zip) and unpack it. Then, go to the "lib" subfolder and double-click the <tt>.jar</tt> file. The <tt>.exe</tt> file may not work.

'''Exercise:''' [[Exercise: Phylogeny]] — [[Exercise:_Phylogeny-Answers|Answers]]

Bioinformatics in practice, Faroe Islands 2022

2024-03-15T12:05:12Z

WikiSysop: /* Afternoon: GenBank */


'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/HandoutEx_BaseCalling_Simple.pdf "Base-calling"] (PDF) / [https://teaching.healthtech.dtu.dk/material/22111/BaseCalling_on_screen_version.pdf On-screen version] (PDF)

'''Materials:'''
* [https://teaching.healthtech.dtu.dk/material/22111/GenBank+FASTA_handout_revised.pdf GenBank + FASTA format] (PDF)
* [https://teaching.healthtech.dtu.dk/material/22111/GeneStructure.pdf Eukaryotic gene structure overview] (PDF)

'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] — [[ExGenbank-new-answers|Answers]]

== Friday November 4 ==
=== Morning: Translation and UniProt ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Lecture_UniProt_2022.pdf Proteins: data and databases] - Henrik''

'''Materials:'''
* [https://teaching.healthtech.dtu.dk/material/22111/VirtualRibosome.pdf Virtual Ribosome] — software article (PDF)
* [https://teaching.healthtech.dtu.dk/material/22111/PDF/protein_handout.pdf Proteins, levels of structure] (PDF)
* [https://teaching.healthtech.dtu.dk/material/22111/GeneStructure.pdf Eukaryotic gene structure overview] (PDF)

'''Exercises:'''
* [[Exercise: Translation - Virtual Ribosome]] — [[ExTranslation-answers|Answers]]
* [[Exercise: The protein database UniProt]] — [[ExUniProt-answers|Answers]]

=== Afternoon: Protein structure, PDB & PyMOL ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/22111-ProteinStructure_2022_CBQ-reduced.pdf Protein 3D structure] - Bent''

'''Software''' for installation: [https://pymol.org/2/ PyMOL]
: '''Note:''' you will need a license file which we will provide via Zoom.

'''Exercises:'''
* [https://teaching.healthtech.dtu.dk/material/22111/PyMol_tutorial2017_v4.pdf PyMol tutorial] (PDF)
* [[Protein Structure and Visualization]] — [[Protein_Structure_and_Visualization_Answers|Answers]] (NB: The section "PyMOL magic" is ''not'' part of your curriculum, just a handy tip!)

== Monday November 7 ==
=== Morning: Pairwise alignment ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/PairwiseAlignment2.pdf Pairwise alignment] - Henrik''

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/New_handout_alignscores.pdf Alignment scores]

'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]] — [[ExPairwiseAlignment-Answers|Answers]]

'''Wrap-up:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/Pairwise_alignment_revisited.pdf Pairwise alignment revisited]''

=== Afternoon: Sequence database searching with BLAST ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/Lecture_BLAST_2020.pdf Introduction to BLAST] - Bent''

'''Exercise:''' [[Exercise: BLAST]] — [[ExBlast-Answers|Answers]]

'''Wrap-up:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/Blastn_vs_Blastp.2017.pdf BLASTN vs BLASTP]''

== Wednesday November 9 ==
=== Morning: Sequence information and logo plots ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Logos_2022.pdf Sequence information and logo plots] - Henrik''

'''Materials:''' "''Information theory primer''" ([https://teaching.healthtech.dtu.dk/material/22111/PDF/informationtheory_primer.pdf PDF])

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/Logo_exercise.pdf How to construct sequence logos] — [https://teaching.healthtech.dtu.dk/material/22111/FO2022/Ex_Logo_ans.pdf Answer]

'''Exercise:''' [[ExSeqLogos|DNA and Peptide LOGOs]] — [[ExSeqLogosAnswers|Answers]]

=== Afternoon: Profile searching with PSI-BLAST ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/PSI-BLAST_FO2022.pdf PSI-BLAST] - Bent''

'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]] — [[ExPSIBLAST_answer|Answers]]

== Friday November 11 ==
=== Morning: Multiple alignments ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/hnielsen_mulalign.pdf Multiple Alignments] - Henrik''

'''Materials:''' RevTrans (article, [https://teaching.healthtech.dtu.dk/material/22111/FO2022/RevTrans.pdf PDF])

'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]] — [[ExMulAlign-Answers-English|Answers]]

=== Afternoon: Phylogenetic trees ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Gorm_phylogeny_newbackground_rwe_version.pdf Phylogenetic trees] - Bent''

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/FO2022/handout_distance.pdf Reconstruction of distance tree]

'''Software for installation:''' [https://github.com/rambaut/figtree/releases FigTree tree-viewer]
:'''IMPORTANT NOTE''' for Windows users: Download the <tt>.zip</tt> file (FigTree.v1.4.4.zip) and unpack it. Then, go to the "lib" subfolder and double-click the <tt>.jar</tt> file. The <tt>.exe</tt> file may not work.

'''Exercise:''' [[Exercise: Phylogeny]] — [[Exercise:_Phylogeny-Answers|Answers]]

Bioinformatics in practice, Faroe Islands 2022

2024-03-15T12:03:38Z

WikiSysop:

This is the home page for week 44+45 of the "Bioinformatics in practice" course, Faroe Islands 2022. These five days are taught by [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=214126&tab=2&qt=dtupublicationquery Henrik Nielsen] and [https://globe.ku.dk/staff-list/hologenomics/?id=271131&vis=medarbejder Bent Petersen].

== Wednesday November 2 ==
=== Morning: Introduction, plain text files and taxonomy databases ===
'''Lectures:'''
* ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Intro+bioinformaticsFO22.pdf Introduction to bioinformatics and computers] - Henrik''
* ''[https://teaching.healthtech.dtu.dk/material/22111/On_evolution_and_sequences_2020.pdf Evolution & taxonomy] - Bent''

'''Software for installation:'''

* '''Java''' runtime engine is needed for running some of the software we use in the course, including jEdit (see below). Download from here: https://www.oracle.com/java/technologies/downloads/#java17 (choose java 17, not 18 or 19, and select your type of computer) '''or''' from here: https://adoptium.net/ (choose Temurin JDK 17).
** '''NOTE:''' Do NOT download java from https://java.com/ — that will give you Oracle java 8, which is NOT good enough for jEdit anymore. jEdit version 5.6 and later needs java 11 or higher which is available from the above links (and from a few other places).
** '''IMPORTANT TIP''' for Windows users: You need to enable the sub-feature named "set JAVA_HOME variable" when installing Temurin JDK.
* jEdit: http://jedit.org/
** '''NOTE:''' The jEdit developers have not signed the installation package, therefore both Windows and MacOS X will complain when you first attempt to install it, and you have to insist that it is OK to run the program. For Macs, this is a bit complicated, see the instructions in [[ExJEdit#Download_and_Install_jEdit|the exercise guide]].
** '''IMPORTANT TIP''' for Mac users: If you have an M1 or M2 (ARM) Mac, you should use "Java-based installer" (.jar file) instead of "OS X package" (.dmg file).

'''Exercises:'''
* [[ExJEdit|jEdit]] — [[ExJEdit-Answers|Answers]]
* [[ExTaxonomy|Taxonomy databases]] — [[ExTaxonomy-Answers|Answers]]

=== Afternoon: GenBank ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Lecture_GenBank_F22_HN.pdf Biological information, DNA, sequencing, and GenBank] - Henrik''

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/HandoutEx_BaseCalling_Simple.pdf "Base-calling"] (PDF) / [https://teaching.healthtech.dtu.dk/material/22111/BaseCalling_on_screen_version.pdf On-screen version] (PDF)

'''Materials:'''
* [https://teaching.healthtech.dtu.dk/material/22111/GenBank+FASTA_handout_revised.pdf GenBank + FASTA format] (PDF)
* [https://teaching.healthtech.dtu.dk/material/22111/GeneStructure.pdf Eukaryotic gene structure overview] (PDF)

'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] — [[ExGenbank-new-answers|Answers]]

== Friday November 4 ==
=== Morning: Translation and UniProt ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Lecture_UniProt_2022.pdf Proteins: data and databases] - Henrik''

'''Materials:'''
* [https://teaching.healthtech.dtu.dk/material/22111/VirtualRibosome.pdf Virtual Ribosome] — software article (PDF)
* [https://teaching.healthtech.dtu.dk/material/22111/PDF/protein_handout.pdf Proteins, levels of structure] (PDF)
* [https://teaching.healthtech.dtu.dk/material/22111/GeneStructure.pdf Eukaryotic gene structure overview] (PDF)

'''Exercises:'''
* [[Exercise: Translation - Virtual Ribosome]] — [[ExTranslation-answers|Answers]]
* [[Exercise: The protein database UniProt]] — [[ExUniProt-answers|Answers]]

=== Afternoon: Protein structure, PDB & PyMOL ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/22111-ProteinStructure_2022_CBQ-reduced.pdf Protein 3D structure] - Bent''

'''Software''' for installation: [https://pymol.org/2/ PyMOL]
: '''Note:''' you will need a license file which we will provide via Zoom.

'''Exercises:'''
* [https://teaching.healthtech.dtu.dk/material/22111/PyMol_tutorial2017_v4.pdf PyMol tutorial] (PDF)
* [[Protein Structure and Visualization]] — [[Protein_Structure_and_Visualization_Answers|Answers]] (NB: The section "PyMOL magic" is ''not'' part of your curriculum, just a handy tip!)

== Monday November 7 ==
=== Morning: Pairwise alignment ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/PairwiseAlignment2.pdf Pairwise alignment] - Henrik''

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/New_handout_alignscores.pdf Alignment scores]

'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]] — [[ExPairwiseAlignment-Answers|Answers]]

'''Wrap-up:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/Pairwise_alignment_revisited.pdf Pairwise alignment revisited]''

=== Afternoon: Sequence database searching with BLAST ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/Lecture_BLAST_2020.pdf Introduction to BLAST] - Bent''

'''Exercise:''' [[Exercise: BLAST]] — [[ExBlast-Answers|Answers]]

'''Wrap-up:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/Blastn_vs_Blastp.2017.pdf BLASTN vs BLASTP]''

== Wednesday November 9 ==
=== Morning: Sequence information and logo plots ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Logos_2022.pdf Sequence information and logo plots] - Henrik''

'''Materials:''' "''Information theory primer''" ([https://teaching.healthtech.dtu.dk/material/22111/PDF/informationtheory_primer.pdf PDF])

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/Logo_exercise.pdf How to construct sequence logos] — [https://teaching.healthtech.dtu.dk/material/22111/FO2022/Ex_Logo_ans.pdf Answer]

'''Exercise:''' [[ExSeqLogos|DNA and Peptide LOGOs]] — [[ExSeqLogosAnswers|Answers]]

=== Afternoon: Profile searching with PSI-BLAST ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/PSI-BLAST_FO2022.pdf PSI-BLAST] - Bent''

'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]] — [[ExPSIBLAST_answer|Answers]]

== Friday November 11 ==
=== Morning: Multiple alignments ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2018/hnielsen_mulalign.pdf Multiple Alignments] - Henrik''

'''Materials:''' RevTrans (article, [https://teaching.healthtech.dtu.dk/material/22111/FO2022/RevTrans.pdf PDF])

'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]] — [[ExMulAlign-Answers-English|Answers]]

=== Afternoon: Phylogenetic trees ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/22111/FO2022/Gorm_phylogeny_newbackground_rwe_version.pdf Phylogenetic trees] - Bent''

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/22111/FO2022/handout_distance.pdf Reconstruction of distance tree]

'''Software for installation:''' [https://github.com/rambaut/figtree/releases FigTree tree-viewer]
:'''IMPORTANT NOTE''' for Windows users: Download the <tt>.zip</tt> file (FigTree.v1.4.4.zip) and unpack it. Then, go to the "lib" subfolder and double-click the <tt>.jar</tt> file. The <tt>.exe</tt> file may not work.

'''Exercise:''' [[Exercise: Phylogeny]] — [[Exercise:_Phylogeny-Answers|Answers]]

22111 - Introduction to Bioinformatics

2024-03-15T11:51:03Z

WikiSysop: /* Course programme */

'''Formerly known as 36611 and 27611'''

== Practical information ==
DTU's Studies Handbook about [http://www.kurser.dtu.dk/course/22111 #22111]

This course has been '''taught in English since 2019''' (it was previously taught in Danish). It is a practically oriented, introductory 3rd semester course (previously 4th semester). All students from DTU and other universities are welcome.

For more information, please contact: Associate Professor [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=214126&tab=2&qt=dtupublicationquery Henrik Nielsen] ([mailto:henni@dtu.dk henni@dtu.dk]), Assistant Professor [https://www.dtu.dk/Person/cwis?id=142840&entity=profile Carolina Barra Quaglia] ([mailto:carolet@dtu.dk carolet@dtu.dk]), or Course Coordinator [https://www.dtu.dk/service/telefonbog/person?id=136144&cpid=246063&tab=0 Antonia Celinah Majlund Bjørstorp] ([mailto:acmb@dtu.dk acmb@dtu.dk]).

If you want to participate in the course, please sign up through the Studies Division ("studiekontoret") at DTU.
If you are not enrolled at the Technical University of Denmark, you have to sign up as a guest student (more information here: [http://www.dtu.dk/Uddannelse/Gaestestuderende.aspx General Information for guest students from other Danish Universities]).

== Course programme ==
'''Current:'''
* [[22111: Course plan autumn 2023]]

'''Previous:'''
* [[22111: Course plan autumn 2022]]
* [[22111: Course plan spring 2022]]
* [[22111: Course plan spring 2021]]
* [[22111: Course plan spring 2020]]
* [[22111: Course plan spring 2019]]
* [[22111:Kursusplan for forår 2018]] (Course Programme Spring 2018, in Danish)
* [[27611: Kursusplan for forår 2017]] (Course Programme Spring 2017, in Danish)

'''Special editions:'''
* [[Bioinformatics in practice, Faroe Islands 2022]]

Bioinformatics in practice, Faroe Islands 2022

2024-03-15T11:50:38Z

WikiSysop: Created page with "This is the home page for week 44+45 of the "Bioinformatics in practice" course, Faroe Islands 2022. These five days are taught by [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=214126&tab=2&qt=dtupublicationquery Henrik Nielsen] and [https://globe.ku.dk/staff-list/hologenomics/?id=271131&vis=medarbejder Bent Petersen]. == Wednesday November 2 == === Morning: Introduction, plain text files and taxonomy databases === '''Lectures:''' * ''[https://teaching.he..."

This is the home page for week 44+45 of the "Bioinformatics in practice" course, Faroe Islands 2022. These five days are taught by [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=214126&tab=2&qt=dtupublicationquery Henrik Nielsen] and [https://globe.ku.dk/staff-list/hologenomics/?id=271131&vis=medarbejder Bent Petersen].

== Wednesday November 2 ==
=== Morning: Introduction, plain text files and taxonomy databases ===
'''Lectures:'''
* ''[https://teaching.healthtech.dtu.dk/material/36611/FO2022/Intro+bioinformaticsFO22.pdf Introduction to bioinformatics and computers] - Henrik''
* ''[https://teaching.healthtech.dtu.dk/22111/images/c/c1/On_evolution_and_sequences_2020.pdf Evolution & taxonomy] - Bent''

'''Software for installation:'''

* '''Java''' runtime engine is needed for running some of the software we use in the course, including jEdit (see below). Download from here: https://www.oracle.com/java/technologies/downloads/#java17 (choose java 17, not 18 or 19, and select your type of computer) '''or''' from here: https://adoptium.net/ (choose Temurin JDK 17).
** '''NOTE:''' Do NOT download java from https://java.com/ — that will give you Oracle java 8, which is NOT good enough for jEdit anymore. jEdit version 5.6 and later needs java 11 or higher which is available from the above links (and from a few other places).
** '''IMPORTANT TIP''' for Windows users: You need to enable the sub-feature named "set JAVA_HOME variable" when installing Temurin JDK.
* jEdit: http://jedit.org/
** '''NOTE:''' The jEdit developers have not signed the installation package, therefore both Windows and MacOS X will complain when you first attempt to install it, and you have to insist that it is OK to run the program. For Macs, this is a bit complicated, see the instructions in [[ExJEdit#Download_and_Install_jEdit|the exercise guide]].
** '''IMPORTANT TIP''' for Mac users: If you have an M1 or M2 (ARM) Mac, you should use "Java-based installer" (.jar file) instead of "OS X package" (.dmg file).

'''Exercises:'''
* [[ExJEdit|jEdit]] — [[ExJEdit-Answers|Answers]]
* [[ExTaxonomy|Taxonomy databases]] — [[ExTaxonomy-Answers|Answers]]

=== Afternoon: GenBank ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/36611/FO2022/Lecture_GenBank_F22_HN.pdf Biological information, DNA, sequencing, and GenBank] - Henrik''

'''Handout exercise:''' [[Media:HandoutEx_BaseCalling_Simple.pdf|"Base-calling"]] (PDF) / [[Media:BaseCalling_on_screen_version.pdf|On-screen version]] (PDF)

'''Materials:'''
* [[Media:GenBank+FASTA_handout_revised.pdf|GenBank + FASTA format]] (PDF)
* [[Media:GeneStructure.pdf|Eukaryotic gene structure overview]] (PDF)

'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] — [[ExGenbank-new-answers|Answers]]

== Friday November 4 ==
=== Morning: Translation and UniProt ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/36611/FO2022/Lecture_UniProt_2022.pdf Proteins: data and databases] - Henrik''

'''Materials:'''
* [[Media:VirtualRibosome.pdf|Virtual Ribosome]] — software article (PDF)
* [http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/protein_handout.pdf Proteins, levels of structure] (PDF)
* [[Media:GeneStructure.pdf|Eukaryotic gene structure overview]] (PDF)

'''Exercises:'''
* [[Exercise: Translation - Virtual Ribosome]] — [[ExTranslation-answers|Answers]]
* [[Exercise: The protein database UniProt]] — [[ExUniProt-answers|Answers]]

=== Afternoon: Protein structure, PDB & PyMOL ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/36611/FO2022/22111-ProteinStructure_2022_CBQ-reduced.pdf Protein 3D structure] - Bent''

'''Software''' for installation: [https://pymol.org/2/ PyMOL]
: '''Note:''' you will need a license file which we will provide via Zoom.

'''Exercises:'''
* [[Media:PyMol_tutorial2017_v4.pdf|PyMol tutorial]] (PDF)
* [[Protein Structure and Visualization]] — [[Protein_Structure_and_Visualization_Answers|Answers]] (NB: The section "PyMOL magic" is ''not'' part of your curriculum, just a handy tip!)

== Monday November 7 ==
=== Morning: Pairwise alignment ===
'''Lecture:''' ''[http://teaching.healthtech.dtu.dk/material/36611/FO2018/PairwiseAlignment2.pdf Pairwise alignment] - Henrik''

'''Handout exercise:''' [[Media:New_handout_alignscores.pdf|Alignment scores]]

'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]] — [[ExPairwiseAlignment-Answers|Answers]]

'''Wrap-up:''' ''[http://teaching.healthtech.dtu.dk/material/36611/FO2018/Pairwise_alignment_revisited.pdf Pairwise alignment revisited]''

=== Afternoon: Sequence database searching with BLAST ===
'''Lecture:''' ''[http://teaching.healthtech.dtu.dk/material/36611/FO2018/Lecture_BLAST_2020.pdf Introduction to BLAST] - Bent''

'''Exercise:''' [[Exercise: BLAST]] — [[ExBlast-Answers|Answers]]

'''Wrap-up:''' ''[http://teaching.healthtech.dtu.dk/material/36611/FO2018/Blastn_vs_Blastp.2017.pdf BLASTN vs BLASTP]''

== Wednesday November 9 ==
=== Morning: Sequence information and logo plots ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/36611/FO2022/Logos_2022.pdf Sequence information and logo plots] - Henrik''

'''Materials:''' "''Information theory primer''" ([https://teaching.healthtech.dtu.dk/material/36611/PDF/informationtheory_primer.pdf PDF])

'''Handout exercise:''' [[Media:logo_exercise.pdf|How to construct sequence logos]] — [https://teaching.healthtech.dtu.dk/material/36611/FO2022/Ex_Logo_ans.pdf Answer]

'''Exercise:''' [[ExSeqLogos|DNA and Peptide LOGOs]] — [[ExSeqLogosAnswers|Answers]]

=== Afternoon: Profile searching with PSI-BLAST ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/material/36611/FO2022/PSI-BLAST_FO2022.pdf PSI-BLAST] - Bent''

'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]] — [[ExPSIBLAST_answer|Answers]]

== Friday November 11 ==
=== Morning: Multiple alignments ===
'''Lecture:''' ''[http://teaching.healthtech.dtu.dk/material/36611/FO2018/hnielsen_mulalign.pdf Multiple Alignments] - Henrik''

'''Materials:''' RevTrans (article, [https://teaching.healthtech.dtu.dk/material/36611/FO2022/RevTrans.pdf PDF])

'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]] — [[ExMulAlign-Answers-English|Answers]]

=== Afternoon: Phylogenetic trees ===
'''Lecture:''' ''[https://teaching.healthtech.dtu.dk/22111/images/5/55/Gorm_phylogeny_newbackground_rwe_version.pdf Phylogenetic trees] - Bent''

'''Handout exercise:''' [https://teaching.healthtech.dtu.dk/material/36611/FO2022/handout_distance.pdf Reconstruction of distance tree]

'''Software for installation:''' [https://github.com/rambaut/figtree/releases FigTree tree-viewer]
:'''IMPORTANT NOTE''' for Windows users: Download the <tt>.zip</tt> file (FigTree.v1.4.4.zip) and unpack it. Then, go to the "lib" subfolder and double-click the <tt>.jar</tt> file. The <tt>.exe</tt> file may not work.

'''Exercise:''' [[Exercise: Phylogeny]] — [[Exercise:_Phylogeny-Answers|Answers]]

27611: Kursusplan for forår 2017

2024-03-15T11:49:55Z

WikiSysop: Created page with "'''NB:''' This is a ''preliminary'' course plan; changes may occur! ==General information== ===Teachers / lecturers=== * [http://www.cbs.dtu.dk/staff/show-staff.php?id=535 Henrik Nielsen] - Associate Professor, course responsible. * [http://www.cbs.dtu.dk/staff/show-staff.php?id=758 Bent Petersen] - Associate Professor, course responsible. * [http://www.cbs.dtu.dk/~raz/ Rasmus Wernersson] - External Associate Professor, course responsible. * [http://www.cbs.dtu.dk/staff/show-staff.php..."

'''NB:''' This is a ''preliminary'' course plan; changes may occur!

==General information==

===Teachers / lecturers===

* [http://www.cbs.dtu.dk/staff/show-staff.php?id=535 Henrik Nielsen] - Associate Professor, course responsible.
* [http://www.cbs.dtu.dk/staff/show-staff.php?id=758 Bent Petersen] - Associate Professor, course responsible.
* [http://www.cbs.dtu.dk/~raz/ Rasmus Wernersson] - External Associate Professor, course responsible.
* [http://www.cbs.dtu.dk/staff/show-staff.php?id=1082 Paolo Marcatili] - Assistant Professor, guest lecturer. Topic: Protein structure.
* Jens Emil Vang Petersen - PhD student, University of Copenhagen, guest lecturer. Topic: Malaria vaccines.
* [http://www.cbs.dtu.dk/staff/show-staff.php?id=525 Anders Gorm Pedersen] - Professor, guest lecturer. Topic: Evolutionary trees.

===Assistants at the exercises===
* Monica Hannani - teaching assistant.
* Malene Revsbech Christiansen - teaching assistant.



===Content===
This course places strong emphasis on the practical use of bioinformatics tools. A typical lesson begins with a theoretical walk-through of the day's topic (including some smaller exercises/group work) lasting just under an hour, and the rest of the time is spent on practical exercises on the computer.

See also [http://www.kurser.dtu.dk/courses/27611/default.aspx the course database entry for 27611].

===Curriculum===
Handed-out notes and exercise material: There is no formal textbook. Compendium material will be handed out along the way, typically as PDF files on the website. Note that all exercise guides are part of the curriculum, and so are the answers to the exercises, which are posted on the website after each exercise session!

===Computers===
'''You must bring your OWN laptops''' that can connect to DTU's wireless network. The type of computer / operating system is not important: Windows, Mac and Linux will all work fine.

For the exercises in "PDB & PyMol" and "Malaria vaccine" '''you must bring a mouse'''. The mouse must have three buttons, of which the middle one must be a scroll wheel.

Software:
# A modern Internet browser (e.g. [http://www.google.com/chrome Google Chrome], [http://www.mozilla.com/ FireFox], [http://www.opera.com/ Opera], Safari for Mac or Internet Explorer / Edge for Windows). '''NB''': You ''must'' have FireFox on your system, since it is the only browser that is both cross-platform (available for Windows, Mac and Linux) and able to run Java applets, which we need in one of the exercises. Internet Explorer or Edge (built into Windows) and Safari (built into Mac) can have trouble with certain bioinformatics websites, and then it is important to be able to switch to an alternative browser that works.
# JAVA: JAVA is needed to run some of the programs we will use along the way, including jEdit (see below). Java can be downloaded for free here: http://www.java.com (if it is not already installed on your computer).
# JEdit: In the course we will use JEdit several times to look at text-based sequence files; you may want to install it before the first session (free program): http://www.jedit.org. If you run into unsolvable problems getting jEdit to run, a good alternative is [http://geany.org/ Geany].

Other software will be installed during the exercises.

===Where and when===
The course consists of a lecture followed by exercises, both on Tuesday afternoons. The lectures are held from 13:00 to 14:00 (approximately) in '''building 208, auditorium 51'''. The exercises are held afterwards in building 210, classrooms 042+048 and group rooms 066+068+070+072.

The first session is '''Tuesday, January 31'''.

===Hand-ins===
As training for the computer-based exam, each group must write a "logbook" with answers to the questions asked in the exercises (you may write in Danish even though the exercise text is in English). After the exercise, upload your answers to CampusNet ('''Course 27611 → Assignments''').

It is possible to hand in as a group. We much prefer one group hand-in to a number of identical answers. But remember to write the names of all group members in the document.

You may choose which program to use for writing the logbook, e.g. Microsoft Word, [http://www.libreoffice.org/ LibreOffice] (free), [http://www.openoffice.org/ Apache OpenOffice] (free), Pages for Mac or similar. It is an advantage if you can insert screenshots to document what you have done. Microsoft Word has a built-in screenshot tool. For Windows users we also recommend the free program [http://getgreenshot.org/ Greenshot] for taking screenshots and making minor edits to them.

But whatever you use, '''the result must be handed in as PDF'''. Both Mac and Windows 10 have built-in functions for converting any printable document to PDF. If you have an earlier version of Windows, you must install a separate program. There are several free alternatives; we recommend [http://www.primopdf.com/ PrimoPDF]. (It can be a good idea to install PrimoPDF even if you use Windows 10; it gives some more options, and the resulting files are smaller.)

Please do ''not'' copy the assignment text into your answer. The hand-in system on CampusNet has plagiarism detection, which is triggered if a copy of the assignment text appears in the answer.

NB: '''The hand-ins have no influence on your grade'''; they are meant as practice in using the system that we will also use for the exam. They are also a way for us to check how well the teaching has been understood: if many people have made one particular mistake, we may be able to explain it better at the next session.

===Exam===
The exam in 27611 is electronic, i.e. you must bring your own computer and you will ''not'' receive the exam set on paper. The exam set will be available as a PDF file on CampusNet. Hand-in also takes place on CampusNet, just like the hand-ins during the course. The answer must be handed in as PDF.

The exam is with all aids and open internet. You may bring books, articles and the like. In addition, you have access via the internet to all the materials we have used during the course. You may also search for information on Google, Wikipedia etc.; you just may not communicate with others via email, Facebook, chat or the like.

As with the hand-ins, we must ask that you do ''not'' copy the assignment text into your answer. That way you avoid having your answer automatically flagged as plagiarism.

When you hand in your exam answer on CampusNet, you get a code that must be handed in on paper to the exam invigilator. '''It is very important that the code is handed in correctly, otherwise the answer cannot be approved.''' The code changes if you upload a new version of your answer; it thus serves as a check that you have not changed your answer after leaving the exam room.

== CampusNet ==
Link to the CampusNet group for this year's course: https://cn.inside.dtu.dk/cnnet/element/534869/frontpage

== Continuous evaluation and feedback ==
We very much welcome comments, suggestions, criticism, praise etc. on the teaching and the teaching materials at any time. You can do this either by email to the teachers or by posting a message in [https://cn.inside.dtu.dk/cnnet/element/534869/messages "Frit forum" ("Open forum") in CampusNet]. You can also reply to others' posts. If there is a post you agree with, please add the comment "Enig!" ("Agreed!"), so we know that several people share the same opinion.

In addition, we plan to hold a midterm evaluation during the semester, also in CampusNet.

== Lesson plan ==

=== Tuesday 31/1: Introduction and taxonomy databases ===

:'''Lectures:'''
:* ''Introduction to the course, bioinformatics and computers'' - Henrik Nielsen.
:* ''Evolution & Taxonomy'' - Rasmus Wernersson.
:'''Curriculum:''' [http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/Chapter2_Evolution.pdf Brief Introduction to Evolutionary Theory] - Written/edited by Anders Gorm Pedersen.
:'''Slides:''' (Will be posted on CampusNet under "fildeling" (file sharing))
:'''Test of prior knowledge:''' Go to https://evaluering.dtu.dk/ , click "Test af forhåndskundskaber" (test of prior knowledge) under 27611 and fill in the form (it is anonymous). Spend at most 10 minutes on it.
:'''Exercises:'''
:# [[ExJEdit|JEdit]] - ([[ExJEdit-Answers|Answers to the exercise]])
:# [[ExTaxonomy|Taxonomy databases]] - ([[ExTaxonomy-Answers|Answers to the exercise]])

:'''ADVANCED TOPIC (not curriculum):'''
::[[File:Phone_34.gif]] [http://www.cbs.dtu.dk/dtucourse/27611spring2011/video/27611_Advanced_BinFiles.mp4 Text files at the binary level] (Video lecture, ~20 MB, MPEG-4, recorded by Rasmus Wernersson, 2010)

:'''Background material:'''
::"[[Media:ELS_bioinformatics.pdf|What is bioinformatics?]]" - review article (PDF).

=== Tuesday 7/2: GenBank ===

:'''Lecture:''' ''Biological information, DNA structure and sequencing, searching GenBank'' - Rasmus Wernersson.
:'''Curriculum:''' [[Media:DNA_SequencingTutorial.pdf|DNA sequencing tutorial]], source: IDT Tech Vault
:'''Handouts:''' [[Media:HandoutEx_BaseCalling_Simple.pdf|"Base-calling" exercise]] [PDF], [[Media:GenBank+fasta handout dk.pdf|GenBank + FASTA format]] [PDF]
:'''Slides:''' on CampusNet (under "Fildeling")

:'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] - ([[ExGenbank-new-answers|answers to the exercise]])

:'''Background material (assumed known):'''
::[[File:Phone_34.gif]] [http://www.youtube.com/watch?v=YgmoHtLGb5c mRNA splicing] (YouTube).
::[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Further material (not curriculum):'''
::[http://www.ncbi.nlm.nih.gov/books/NBK44863/ Entrez Sequences Quick Start] (NCBI)
::[[Media:GenBank_2017.pdf|The annual GenBank article, 2017]] (PDF)

=== Tuesday 14/2: Translation and UniProt ===

:'''Lectures:'''
:*''Proteins: data and databases'' - Henrik Nielsen.
:*''Bioinformatics in the real world'' - Bent Petersen.
:'''Curriculum:''' [[Media:VirtualRibosome.pdf|Virtual Ribosome]], software article (PDF).
:'''Slides:''' (will be uploaded to CampusNet)

:'''Exercises:'''
:#[[Exercise: Translation - Virtual Ribosome]] ([[ExTranslation-answers|answers]])
:#[[Exercise: The protein database UniProt]] ([[ExUniProt-answers|answers]])

:'''Background material (assumed known):'''
::[http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/protein_handout.pdf Protein, sequence and levels of structure] [PDF]
::[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).
::[[File:Phone_34.gif]] [http://www.cbs.dtu.dk/dtucourse/27611spring2011/video/AminosyrerOgProteiner_TNP2010.mov Amino acids and proteins]: short video lecture (16 min) refreshing the most important facts about amino acids and proteins (recorded by Thomas Nordahl Petersen, 2010).

:'''Further material (not curriculum):'''
::[[Media:UniProt_2017.pdf|The annual UniProt article, 2017]] (PDF)
::* Link to the Next-Generation Sequencing course: [http://www.kurser.dtu.dk/27626.aspx?menulanguage=da Next-Generation-Sequencing Analysis]

=== Tuesday 21/2: Pairwise alignment ===

:'''Lecture:''' ''Pairwise alignment'' - Henrik Nielsen.
:'''Curriculum:''' Pages 35-55 in Immunological Bioinformatics (PDF on CampusNet: Fildeling → Uddrag af lærebog).
:'''Slides:''' (On CampusNet).

:'''Handout exercise:''' [[Media:New_handout_alignscores.pdf|Alignment scores]]

:'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]], answers: [[ExPairwiseAlignment-Answers|Pairwise alignment answers]]

:'''ADVANCED TOPIC (not curriculum):'''
::[[File:Phone_34.gif]] Video clip (Rasmus Wernersson, 2008): [http://www.cbs.dtu.dk/dtucourse/27611spring2011/video/PodCast_DynamiskProgrammering_1024x768.mp4 Detailed walkthrough of dynamic programming]. - Mac/Windows (QuickTime/MPEG-4).

:'''Extra material:'''
::[[File:Phone_34.gif]] '''Recorded lecture:''' (can be used as a reminder of today's curriculum; it is in English) [http://breeze.cbs.dtu.dk/p31243548/ Pairwise alignments + BLAST], Anders Gorm Pedersen 2010
::login: <tt>viewer@cbs.dtu.dk</tt> - pass: <tt>jeglurer</tt>.
::Starts with pairwise alignment (also covers dynamic programming) and uses the same handout exercise that you worked on yourselves at our lecture.

=== Tuesday 28/2: Protein structure, PDB & PyMOL ===

:'''Remember to bring a mouse to the exercise on this day'''. The mouse must have three buttons, the middle one being a scroll wheel.
:'''Lecture:''' ''Protein 3D structure'' - Paolo Marcatili (''NB: the lecture will be given in English'')
:'''Curriculum:''' [http://en.wikipedia.org/w/index.php?title=Protein_structure&oldid=107127668 Protein Structure (Wikipedia - frozen version)] - Link to the "live" version [http://en.wikipedia.org/wiki/Protein_structure here].
:'''Bonus video lecture:''' [[File:Phone_34.gif]] [http://breeze.cbs.dtu.dk/p91129536/ Online video lecture] ('''2010'''), Paolo Marcatili
:'''Slides:''' On CampusNet
:'''Link to advanced course:'''
::* [[Course27617|Course 27617 - Protein structure: models, analyses and calculations]]

:'''Other relevant courses:'''

::* [http://www.kurser.dtu.dk/26422.aspx?menulanguage=da Course 26422 - Biomolecular chemistry]
::* [http://www.kurser.dtu.dk/26426.aspx?menulanguage=da Course 26426 - Introduction to medicinal chemistry]


:'''Exercises:'''
:# PyMol tutorial ([https://cn.inside.dtu.dk/cnnet/filesharing/download/d192a082-7586-4bcd-8ee3-1aa174355d21 PDF on CampusNet]) - exercise #1 - thorough walkthrough of basic use of PyMOL. (No answers for this exercise; none are needed.)
:# [[ExPyMol|Visualization of protein structures in PyMOL]] - exercise #2 - the PDB database + visualization in PyMOL - '''Answers''' to Exercise 2: [[Protein_Structure_and_Visualization_Answers|Answers]] (NB: the section "PyMOL magic" is ''not'' curriculum, just a tip in case you need to use PyMOL later)


=== Tuesday 7/3: Database searching with BLAST ===

:'''Lecture:''' ''Introduction to BLAST'' - Rasmus Wernersson.
:'''Curriculum:''' sections 3.2.5 → 3.3 (i.e. pages 47-52) in Immunological Bioinformatics (PDF on CampusNet: Fildeling → Uddrag af lærebog).
:'''Slides:''' On CampusNet.
:'''Exercise:''' [[Exercise: BLAST]] - '''Answers''' to the exercise: [[ExBlast-Answers|BLAST answers]]


:'''Extra material:'''
::[[File:Phone_34.gif]] '''Recorded lecture:''' (can be used as a reminder of today's curriculum; it is in English) [http://breeze.cbs.dtu.dk/p31243548/ Pairwise alignments + BLAST], Anders Gorm Pedersen 2010
::login: <tt>viewer@cbs.dtu.dk</tt> - pass: <tt>jeglurer</tt>.
::The BLAST part starts about 1:05 into the recording.

::[[File:Phone_34.gif]] '''Videos about BLAST from NCBI:''' (video introduction to NCBI's web interface and E-values/Expect values) [http://www.youtube.com/playlist?list=PLH-TjWpFfWrtjzMCIvUe-YbrlIeFQlKMq NCBI's YouTube channel]

=== Tuesday 14/3: Malaria vaccine ===

:'''Lecture:''' ''Malaria and vaccines (the title may change)'' - Jens Emil Vang Petersen.
:'''Slides:''' On CampusNet
:'''Background material (should be read before the exercise):'''

:* [http://www.cdc.gov/dpdx/malaria/ Malaria — Causal Agents / Life Cycle]

:'''Exercise:''' [[Opsamlende computerøvelse: Udvikling af malariavaccine|Summary computer exercise: Development of a malaria vaccine]] ([[ExMalaria-answers|answers]])


------
<div align="center">
[[Image:Emblem-important_tiny.png]] '''Remember the midterm evaluation:''' Go to CampusNet -> Evaluering. [[Image:Emblem-important_tiny.png]]
</div>
------

=== Tuesday 21/3: Multiple Alignments ===
:'''Lecture:''' ''Multiple Alignments'' - Henrik Nielsen
:'''Curriculum:''' RevTrans (article, [http://www.cbs.dtu.dk/dtucourse/27611spring2010/handouts/RevTrans.pdf PDF])
:'''Handout:''' Locating CDS names in GenBank ([http://www.cbs.dtu.dk/dtucourse/27611spring2010/handouts/MultiGeneScreenshot.pdf PDF])
:'''Slides:''' On CampusNet.

:'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]], answers: [[ExMulAlign-Answers-English|Multiple Alignment answers]]

:'''Extra material:'''
::[[File:Phone_34.gif]] '''Recorded lecture:''' [http://breeze.cbs.dtu.dk/p20292453/ Multiple Alignments], Anders Gorm Pedersen 2010
::login: <tt>viewer@cbs.dtu.dk</tt> - pass: <tt>jeglurer</tt>

=== Tuesday 28/3: Phylogenetic trees ===

:'''Lecture:''' ''Phylogenetic trees'' - Anders Gorm Pedersen.
:'''Curriculum:''' "''Introduction to Treebuilding''" (PDF on CampusNet). [http://evolution.berkeley.edu/evolibrary/article/phylogenetics_01 Evolutionary trees] (except the section "How to reconstruct an evolutionary tree"), [http://www.cbs.dtu.dk/courses/27615/pdf/understanding_evo_trees.pdf Understanding Evolutionary Trees].
:'''Handouts:''' handout exercise [http://www.cbs.dtu.dk/dtucourse/27611spring2008/Ex09_Phylo/handout_distance.pdf Reconstruction of a distance tree]
:'''Slides:''' On CampusNet.

:'''Link to advanced course:'''
::* [http://www.kurser.dtu.dk/27615.aspx?menulanguage=da 27615 Molecular evolution]

:'''Software to install:''' [http://tree.bio.ed.ac.uk/software/figtree/ FigTree tree-viewer]


:'''Exercise:''' [[Exercise: Phylogeny]], answers: [[Exercise:_Phylogeny-Answers|Phylogenetic trees]]

=== Tuesday 4/4: Sequence information and LOGO plots ===

:'''Lecture:''' ''Sequence information and LOGO plots'' - Rasmus Wernersson.
:'''Curriculum:'''
:# Pages 68-80 in Immunological Bioinformatics (PDF on CampusNet: Fildeling → Uddrag af lærebog).
:# Pages 1-8 of "''Information theory primer''" ([http://www.cbs.dtu.dk/courses/27040/teachingmaterials/informationtheory_primer.pdf PDF])
:#* (If you need to refresh your knowledge, also read the appendix on logarithms, especially the part on log2.)
:'''Supplementary curriculum:''' [[Media:Logo_handout_new.pdf|Construction of Logo plots]]
:'''Handouts for the lecture:'''
:* [http://www.cbs.dtu.dk/dtucourse/27611spring2010/exercises/ExWeightmat/Ex_Logo.pdf How to construct sequence logos]. [http://www.cbs.dtu.dk/dtucourse/27611spring2010/exercises/ExWeightmat/Ex_Logo_ans.pdf Answers]

:'''Slides:''' On CampusNet.

:'''Exercise:''' [[ExSeqLogos|DNA and Peptide LOGOs]] - '''answers:''' [[ExSeqLogosAnswers|Answers to the LOGO plot exercise]]



=== Tuesday 18/4: Weight matrices and other prediction methods ===

:'''Lecture:''' ''Introduction to prediction methods and weight matrices'' - Henrik Nielsen
:'''Curriculum:''' ''Same as last week''.
:'''Slides:''' On CampusNet.
:'''Handouts for the lecture:'''
:*[http://www.cbs.dtu.dk/courses/27625.algo/presentations/PSSM/Estimationofpseudocounts.pdf How to estimate pseudo frequencies] [http://www.cbs.dtu.dk/dtucourse/27611spring2010/exercises/ExWeightmat/Estimationofpseudocounts_answer_2010.pdf Answers]

:'''Exercise:''' [[Exercise: Construction of sequence logos and weight matrices]], answers: [[ExLogo+Matrix-answers|Answers to exercise]]

=== Tuesday 25/4: Profile searching with PSI-BLAST ===

:'''Lecture:''' ''PSI-BLAST'' - Rasmus Wernersson.
:'''Curriculum:''' ''Same as last week''.
:'''Slides:''' On CampusNet.
:'''Handouts for the lecture:''' [http://www.cbs.dtu.dk/dtucourse/27611spring2010/exercises/ExPsiBlast/Psi_blast_ex.pdf Alignment using sequence profiles]

:'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]], answers to the exercise: [[ExPSIBLAST_answer|PSI-BLAST answers]]

=== Tuesday 2/5: Bioinformatics in practice + exercise: old exam set ===

:'''Lecture:''' ''Bioinformatics in the pharmaceutical industry'' - Rasmus Wernersson, [http://intomics.com Intomics A/S]

:'''Old exam set:'''
:* [[Media:27611-Sommereksamen2014-endelig.pdf|SUMMER EXAM 2014]], [[Media:27611-sommereksamen2014-svar2016.pdf|Updated answers 2016 (PDF)]]

=== Question hour ===

:'''Question hour:''' date and time will be agreed with the participants, so that it does not clash with other exams

== Exam ==

=== Tuesday 23/5 ===

:'''SUMMER EXAM 2017:''' Go to CampusNet → Opgaver → Sommereksamen 2017
:* NB: access does not open until 9:00 on 23/5 2017.

=== Computer checklist ===

Check here whether your computer has all the software needed for the exam: [[Tjekliste til computere 27611|Computer checklist for 27611]]

=== Link collection ===

Overview of the websites we have used in the course: [[Linksamling for 27611|Link collection for 27611]]

=== Questions and answers ===

Questions that have been asked by email, and the teachers' answers to them: [[FAQ for 27611]]

22111:Kursusplan for forår 2018

2024-03-15T11:49:25Z

WikiSysop: Created page.

==General information==

===Teachers / lecturers===

* [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=214126&tab=2&qt=dtupublicationquery Henrik Nielsen], associate professor, course responsible.
* [http://www.dtu.dk/service/telefonbog/person?id=21811&cpid=214076&tab=1 Bent Petersen], associate professor, course responsible.
* [http://www.cbs.dtu.dk/~raz/ Rasmus Wernersson], external associate professor, course responsible.
* [http://www.dtu.dk/service/telefonbog/person?id=34983&cpid=214024&tab=2&qt=dtupublicationquery Paolo Marcatili], associate professor, guest lecturer. Topic: Protein structure.
* Jens Emil Vang Petersen, PhD student, University of Copenhagen, guest lecturer. Topic: Malaria vaccines.
* [http://www.dtu.dk/service/telefonbog/person?id=5118&cpid=214070&tab=2&qt=dtupublicationquery Anders Gorm Pedersen], professor, guest lecturer. Topic: Evolutionary trees.

===Teaching assistants===
* [https://www.inside.dtu.dk/da/dtuinside/generelt/telefonbog/person?id=98224&tab=0 Trine Zachariasen], teaching assistant.
* [https://www.inside.dtu.dk/da/dtuinside/generelt/telefonbog/person?id=98246&tab=0 David Lokjær Faurdal], teaching assistant.
* [https://www.inside.dtu.dk/da/dtuinside/generelt/telefonbog/person?id=88421&tab=0 Malene Revsbech Christiansen], teaching assistant, substitute for Trine on 20/3.



===Content===
This course places strong emphasis on the practical use of bioinformatics tools. A typical lesson will begin with a theoretical introduction to the topic of the day (including some smaller exercises/group work) lasting a little under an hour, and the rest of the time will be spent on practical computer exercises.

See also [http://www.kurser.dtu.dk/courses/36611/default.aspx the course database entry for 36611].

===Curriculum===
Handed-out notes and exercise material: there is no formal textbook. Compendium material will be handed out along the way, typically as PDF files on the website. Note that all exercise guides are part of the curriculum, and so are the answers to the exercises, which are posted on the website after each exercise session!

===Computers===
'''You must bring your OWN laptop computers''' that can connect to DTU's wireless network. The type of computer / operating system is not important; Windows, Mac and Linux will all work fine.

For the exercises in "PDB & PyMOL" and "Malaria vaccine" '''you must bring a mouse'''. The mouse must have three buttons, the middle one being a scroll wheel.

Software:
# A modern web browser (e.g. [http://www.google.com/chrome Google Chrome], [http://www.mozilla.com/ Firefox], [http://www.opera.com/ Opera], Safari for Mac, or Internet Explorer / Edge for Windows). '''NB:''' It is important to have more than one browser installed: Internet Explorer or Edge (built into Windows) and Safari (built into Mac) can have trouble with certain bioinformatics websites, and then it is important to be able to switch to an alternative browser that works.
# Java: Java is required to run some of the programs we will use along the way, including jEdit (see below). If it is not already installed on your computer, Java can be downloaded for free here: http://www.java.com
# jEdit: In the course we will use jEdit several times to look at text-based sequence files, so it is a good idea to install it before the first course session (free program): http://www.jedit.org. If you run into unsolvable problems getting jEdit to run, a good alternative is [http://geany.org/ Geany].

Other software will be installed during the exercises.

===Where and when===
The course consists of a lecture followed by exercises, both on Tuesday afternoons. The lectures are held from 13:00 to 14:00 (approximately) in '''building 208, auditorium 51'''. The exercises are held afterwards in building 210, classrooms 042+048 and group rooms 066+068+070+072.

The first teaching session is '''Tuesday 30 January'''.

===Hand-ins===
As training for the computer-based exam, each group must write a "logbook" with answers to the questions posed in the exercises (you may write in Danish even though the text of the exercises is in English). After the exercise, you must upload your answers to CampusNet ('''Kursus 36611 → Opgaver''').

It is possible to hand in as a group. We would ''much'' rather have one group hand-in than a number of identical answers. But remember to write the names of all group members in the document.

You decide for yourselves which program you use to write the logbook, e.g. Microsoft Word, [http://www.libreoffice.org/ LibreOffice] (free), [http://www.openoffice.org/ Apache OpenOffice] (free), Pages for Mac or similar. It is an advantage if you can insert screenshots to document what you have done. Microsoft Word has a built-in screenshot tool. For Windows users, we also recommend the free program [http://getgreenshot.org/ Greenshot] for taking screenshots and making minor edits to them.

But whatever you use, '''the result must be handed in as PDF'''. Both Mac and Windows 10 have built-in functions for converting any printable document to PDF. If you have an earlier version of Windows, you must install a separate program. There are several free alternatives; we recommend [http://www.primopdf.com/ PrimoPDF]. (It can be a good idea to install PrimoPDF even if you use Windows 10, as it offers some more options and the resulting files are smaller.)

Please do ''not'' copy the exercise text into your answer. The hand-in system on CampusNet has a plagiarism detection system, which is triggered if a copy of the exercise text appears in the answer.

NB: '''The hand-ins have no influence on your grade'''; they are meant as practice in using the system we will also use for the exam. They are also a way for us to check how well the teaching has been understood: if there is a particular mistake that a great many have made, we may be able to explain it better at the next session.

===Exam===
The exam in 36611 is electronic, i.e. you must bring your own computer and you will ''not'' receive the exam set on paper. The exam set will be available as a PDF file on CampusNet. Hand-in also takes place on CampusNet, just like the exercise hand-ins during the course. Answers must be handed in as PDF.

The exam is with all aids and open internet. You may bring books, articles and the like. In addition, you have access via the internet to all the materials we have used during the course. You may also search for information on Google, Wikipedia, etc.; you just may not communicate with others via email, Facebook, chat or similar.

As with the exercise hand-ins, we must ask you ''not'' to copy the exam text into your answer. That way you avoid your answer being automatically flagged as plagiarism.

When you hand in your exam answer on CampusNet, you receive a code which must be handed in on paper to the exam invigilator. '''It is very important that the code is handed in correctly; otherwise the answer cannot be approved.''' The code changes if you upload a new version of your answer; it thus serves as a check that you have not changed your answer after leaving the exam room.

== DTU Inside ==
Link to the DTU Inside group for this year's course: https://cn.inside.dtu.dk/cnnet/element/563137

== Continuous evaluation and feedback ==
We very much welcome comments, suggestions, criticism, praise, etc. about the teaching and the teaching materials at any time. You can do this either by mail to the teachers or by posting a message in "Frit forum" on CampusNet. You can also reply to other people's posts. If there is a post you agree with, please add the comment "Agreed!", so that we know that several people share the same view.

In addition, we plan to hold a midterm evaluation during the semester, also in CampusNet.

== Lesson plan ==

=== Tuesday 30/1: Introduction and taxonomy databases ===

:'''Lectures:'''
:* ''Introduction to the course, bioinformatics and computers'' - Henrik Nielsen.
:* ''Evolution & Taxonomy'' - Rasmus Wernersson.
:'''Curriculum:''' [http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/Chapter2_Evolution.pdf Brief Introduction to Evolutionary Theory], written/edited by Anders Gorm Pedersen.
:'''Slides:''' (Will be uploaded to CampusNet under "Fildeling" (file sharing))
:'''Test of prior knowledge:''' Go to https://evaluering.dtu.dk/ , click "Test af forhåndskundskaber" under 36611 and fill in the form (it is anonymous). Spend at most 10 minutes on it.
:'''Exercises:'''
:# [[ExJEdit|JEdit]] - ([[ExJEdit-Answers|Answers to the exercise]])
:# [[ExTaxonomy|Taxonomy databases]] - ([[ExTaxonomy-Answers|Answers to the exercise]])

:'''ADVANCED TOPIC (not curriculum):'''
::[[File:Phone_34.gif]] [http://www.cbs.dtu.dk/dtucourse/27611spring2011/video/27611_Advanced_BinFiles.mp4 Text files at the binary level] (Video lecture, ~20 MB, MPEG-4, recorded by Rasmus Wernersson, 2010)

:'''Background material:'''
::"[[Media:ELS_bioinformatics.pdf|What is bioinformatics?]]", review article (PDF).

=== Tuesday 6/2: GenBank ===

:'''Lecture:''' ''Biological information, DNA structure and sequencing, searching GenBank'' - Henrik Nielsen.
:'''Curriculum:''' [[Media:DNA_SequencingTutorial.pdf|DNA sequencing tutorial]], source: IDT Tech Vault
:'''Handouts:''' [[Media:HandoutEx_BaseCalling_Simple.pdf|"Base-calling" exercise]] [PDF], [[Media:GenBank+fasta handout dk.pdf|GenBank + FASTA format]] [PDF]
:'''Slides:''' on CampusNet (under "Fildeling")

:'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] - ([[ExGenbank-new-answers|answers to the exercise]])

:'''Background material (assumed known):'''
::[[File:Phone_34.gif]] [http://www.youtube.com/watch?v=YgmoHtLGb5c mRNA splicing] (YouTube).
::[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Further material (not curriculum):'''
::[http://www.ncbi.nlm.nih.gov/books/NBK44863/ Entrez Sequences Quick Start] (NCBI)


=== Tuesday 13/2: Translation and UniProt ===

:'''Lectures:'''
:*''Proteins: data and databases'' - Henrik Nielsen.
:*''Bioinformatics in the real world'' - Bent Petersen.
:'''Curriculum:''' [[Media:VirtualRibosome.pdf|Virtual Ribosome]], software article (PDF).
:'''Slides:''' (will be uploaded to CampusNet)

:'''Exercises:'''
:#[[Exercise: Translation - Virtual Ribosome]] ([[ExTranslation-answers|answers]])
:#[[Exercise: The protein database UniProt]] ([[ExUniProt-answers|answers]])

:'''Background material (assumed known):'''
::[http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/protein_handout.pdf Protein, sequence and levels of structure] [PDF]
::[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).
::[[File:Phone_34.gif]] [http://www.cbs.dtu.dk/dtucourse/27611spring2011/video/AminosyrerOgProteiner_TNP2010.mov Amino acids and proteins]: short video lecture (16 min) refreshing the most important facts about amino acids and proteins (recorded by Thomas Nordahl Petersen, 2010).


:'''Link to the Next-Generation Sequencing course:''' [http://www.kurser.dtu.dk/36626.aspx?menulanguage=da 36626 Next-Generation-Sequencing Analysis]

=== Tuesday 20/2: Pairwise alignment ===

:'''Lecture:''' ''Pairwise alignment'' - Henrik Nielsen/Rasmus Wernersson.
:'''Curriculum:''' Pages 35-55 in Immunological Bioinformatics (PDF on CampusNet: Fildeling → Uddrag af lærebog).
:'''Slides:''' (On CampusNet).

:'''Handout exercise:''' [[Media:New_handout_alignscores.pdf|Alignment scores]]

:'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]], answers: [[ExPairwiseAlignment-Answers|Pairwise alignment answers]]

:'''Extra material:'''
::[[File:Phone_34.gif]] Video clip (Rasmus Wernersson, 2008): [http://www.cbs.dtu.dk/dtucourse/27611spring2011/video/PodCast_DynamiskProgrammering_1024x768.mp4 Detailed walkthrough of dynamic programming]. - Mac/Windows (QuickTime/MPEG-4).
::[[File:Phone_34.gif]] '''Recorded lecture:''' (can be used as a reminder of today's curriculum; it is in English) [http://breeze.cbs.dtu.dk/p31243548/ Pairwise alignments + BLAST], Anders Gorm Pedersen 2010
::login: <tt>viewer@cbs.dtu.dk</tt> - pass: <tt>jeglurer</tt>.
::Starts with pairwise alignment (also covers dynamic programming) and uses the same handout exercise that you worked on yourselves at our lecture.

=== Tuesday 27/2: Protein structure, PDB & PyMOL ===

:'''Remember to bring a mouse to the exercise on this day'''. The mouse must have three buttons, the middle one being a scroll wheel.
:'''Lecture:''' ''Protein 3D structure'' - Paolo Marcatili (''NB: the lecture will be given in English'')
:'''Curriculum:''' [http://en.wikipedia.org/w/index.php?title=Protein_structure&oldid=107127668 Protein Structure (Wikipedia - frozen version)] - Link to the "live" version [http://en.wikipedia.org/wiki/Protein_structure here].
:'''Bonus video lecture:''' [[File:Phone_34.gif]] [http://breeze.cbs.dtu.dk/p91129536/ Online video lecture] ('''2010'''), Paolo Marcatili
:'''Slides:''' On CampusNet
:'''Link to advanced course:'''
::* [http://teaching.bioinformatics.dtu.dk/36617/index.php/36617_-_Protein_Structure_and_Computational_Biology 36617 Protein Structure and Computational Biology]

:'''Other relevant courses:'''

::* [http://www.kurser.dtu.dk/26422.aspx?menulanguage=da Course 26422 - Biomolecular chemistry]
::* [http://www.kurser.dtu.dk/26426.aspx?menulanguage=da Course 26426 - Introduction to medicinal chemistry]


:'''Exercises:'''
:# [[Media:PyMol_tutorial2017_v4.pdf|PyMol tutorial]] (PDF) - exercise #1 - thorough walkthrough of basic use of PyMOL. (No answers for this exercise; none are needed.)
:# [[ExPyMol|Visualization of protein structures in PyMOL]] - exercise #2 - the PDB database + visualization in PyMOL - '''Answers''' to Exercise 2: [[Protein_Structure_and_Visualization_Answers|Answers]] (NB: the section "PyMOL magic" is ''not'' curriculum, just a tip in case you need to use PyMOL later)


=== Tuesday 6/3: Database searching with BLAST ===

:'''Lecture:''' ''Introduction to BLAST'' - Rasmus Wernersson.
:'''Curriculum:''' sections 3.2.5 → 3.3 (i.e. pages 47-52) in Immunological Bioinformatics (PDF on CampusNet: Fildeling → Uddrag af lærebog).
:'''Slides:''' On CampusNet.
:'''Exercise:''' [[Exercise: BLAST]] - '''Answers''' to the exercise: [[ExBlast-Answers|BLAST answers]]

:'''Extra material:'''
::[[File:Phone_34.gif]] '''Recorded lecture:''' (can be used as a reminder of today's curriculum; it is in English) [http://breeze.cbs.dtu.dk/p31243548/ Pairwise alignments + BLAST], Anders Gorm Pedersen 2010
::login: <tt>viewer@cbs.dtu.dk</tt> - pass: <tt>jeglurer</tt>.
::The BLAST part starts about 1:05 into the recording.

::[[File:Phone_34.gif]] '''Videos about BLAST from NCBI:''' (video introduction to NCBI's web interface and E-values/Expect values) [http://www.youtube.com/playlist?list=PLH-TjWpFfWrtjzMCIvUe-YbrlIeFQlKMq NCBI's YouTube channel]

=== Tuesday 13/3: Malaria vaccine ===

:'''Lecture:''' ''Malaria and vaccines (the title may change)'' - Jens Emil Vang Petersen.
:'''Slides:''' On CampusNet
:'''Background material (should be read before the exercise):'''

:* [http://www.cdc.gov/dpdx/malaria/ Malaria — Causal Agents / Life Cycle]

:'''Exercise:''' [[Opsamlende computerøvelse: Udvikling af malariavaccine|Summary computer exercise: Development of a malaria vaccine]] ([[ExMalaria-answers|answers]])




=== Tuesday 20/3: Multiple Alignments ===
:'''Lecture:''' ''Multiple Alignments'' - Henrik Nielsen
:'''Curriculum:''' RevTrans (article, [http://www.cbs.dtu.dk/dtucourse/27611spring2010/handouts/RevTrans.pdf PDF])
:'''Handout:''' Locating CDS names in GenBank ([http://www.cbs.dtu.dk/dtucourse/27611spring2010/handouts/MultiGeneScreenshot.pdf PDF])
:'''Slides:''' On CampusNet.
:[[Image:Emblem-important_tiny.png]] '''Remember the midterm evaluation:''' Go to https://evaluering.dtu.dk/ and click "Midtvejsevaluering F18" under 36611 [[Image:Emblem-important_tiny.png]]
:'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]], answers: [[ExMulAlign-Answers-English|Multiple Alignment answers]]

:'''Extra material:'''
::[[File:Phone_34.gif]] '''Recorded lecture:''' [http://breeze.cbs.dtu.dk/p20292453/ Multiple Alignments], Anders Gorm Pedersen 2010
::login: <tt>viewer@cbs.dtu.dk</tt> - pass: <tt>jeglurer</tt>

------
<div align="center">
[[Image:Easter-egg-free-to-use-cliparts.png|25px]] '''Easter break''' [[Image:Easter-egg-free-to-use-cliparts.png|25px]]
</div>
------

=== Tuesday 3/4: Phylogenetic trees ===

:'''Lecture:''' ''Phylogenetic trees'' - Anders Gorm Pedersen.
:'''Curriculum:''' "''Introduction to Treebuilding''" (PDF on CampusNet). [http://evolution.berkeley.edu/evolibrary/article/phylogenetics_01 Evolutionary trees] (except the section "How to reconstruct an evolutionary tree"), [http://www.cbs.dtu.dk/courses/27615/pdf/understanding_evo_trees.pdf Understanding Evolutionary Trees].
:'''Handouts:''' handout exercise [http://www.cbs.dtu.dk/dtucourse/27611spring2008/Ex09_Phylo/handout_distance.pdf Reconstruction of a distance tree]
:'''Slides:''' On CampusNet.

:'''Link to advanced course:'''
::* [http://teaching.bioinformatics.dtu.dk/36615/index.php/36615_-_Computational_Molecular_Evolution 36615 Computational Molecular Evolution]

:'''Software to install:''' [http://tree.bio.ed.ac.uk/software/figtree/ FigTree tree-viewer]


:'''Exercise:''' [[Exercise: Phylogeny]], answers: [[Exercise:_Phylogeny-Answers|Phylogenetic trees]]

=== Tuesday 10/4: Sequence information and LOGO plots ===

:'''Lecture:''' ''Sequence information and LOGO plots'' - Rasmus Wernersson.
:'''Curriculum:'''
:# Pages 68-80 in Immunological Bioinformatics (PDF on CampusNet: Fildeling → Uddrag af lærebog).
:# Pages 1-8 of "''Information theory primer''" ([http://www.cbs.dtu.dk/courses/27040/teachingmaterials/informationtheory_primer.pdf PDF])
:#* (If you need to refresh your knowledge, also read the appendix on logarithms, especially the part on log2.)
:'''Supplementary curriculum:''' [[Media:Logo_handout_new.pdf|Construction of Logo plots]]
:'''Handouts for the lecture:'''
:* [http://www.cbs.dtu.dk/dtucourse/27611spring2010/exercises/ExWeightmat/Ex_Logo.pdf How to construct sequence logos]

:'''Slides:''' On CampusNet.

:'''Exercise:''' [[ExSeqLogos|DNA and Peptide LOGOs]] - '''answers:''' [[ExSeqLogosAnswers|Answers to the LOGO plot exercise]]



=== Tuesday 17/4: Weight matrices and other prediction methods ===

:'''Lecture:''' ''Introduction to prediction methods and weight matrices'' - Henrik Nielsen
:'''Curriculum:''' ''Same as last week''.
:'''Slides:''' On CampusNet.
:'''Handouts for the lecture:'''
:*[http://www.cbs.dtu.dk/courses/27625.algo/presentations/PSSM/Estimationofpseudocounts.pdf How to estimate pseudo frequencies] [http://www.cbs.dtu.dk/dtucourse/27611spring2010/exercises/ExWeightmat/Estimationofpseudocounts_answer_2010.pdf Answers]

:'''Exercise:''' [[Exercise: Construction of sequence logos and weight matrices]], answers: [[ExLogo+Matrix-answers|Answers to exercise]]

=== Tuesday 24/4: Profile searching with PSI-BLAST ===

:'''Lecture:''' ''PSI-BLAST'' - Henrik Nielsen.
:'''Curriculum:''' ''Same as last week''.
:'''Slides:''' On CampusNet.


:'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]], answers to the exercise: [[ExPSIBLAST_answer|PSI-BLAST answers]]

=== Tuesday 1/5: Bioinformatics in practice + exercise: old exam set ===


'''The lecture on this day is cancelled. We will instead meet at 13:00 in the exercise rooms!'''

:'''Old exam set:'''
:* [[Media:27611-Sommereksamen2014-endelig.pdf|SUMMER EXAM 2014]], [[Media:27611-sommereksamen2014-svar2018.pdf|Updated answers 2018 (PDF)]]

=== Question hour ===

:'''Question hour:''' Wednesday 23/5, 11-12, in Aud. 51.

== Exam ==

=== Thursday May 24, 2018 ===

:'''SUMMER EXAM 2018:''' Go to CampusNet → Assignments → Summer exam 2018
:* NB: access will not open until 9:00 on May 24, 2018.

=== Checklist for computers ===

Check here whether your computer has all the software needed for the exam: [[Tjekliste til computere 36611|Checklist for computers 36611]]

=== Link collection ===

A combined overview of the websites we have used in the course: [[Linksamling for 36611|Link collection for 36611]]

=== Questions and answers ===

Questions that have been asked by email, and the teachers' answers to them: [[FAQ for 36611]]

22111:Course plan spring 2019

2024-03-15T11:48:57Z

WikiSysop: Created page with "== General information == === Teachers === * [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=214126&tab=2&qt=dtupublicationquery Henrik Nielsen] — Associate professor, course responsible. * [https://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson] — External associate professor, course responsible. * [http://www.dtu.dk/service/telefonbog/person?id=34983&cpid=214024&tab=2&qt=dtupublicationqu..."

== General information ==

=== Teachers ===

* [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=214126&tab=2&qt=dtupublicationquery Henrik Nielsen] — Associate professor, course responsible.
* [https://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson] — External associate professor, course responsible.
* [http://www.dtu.dk/service/telefonbog/person?id=34983&cpid=214024&tab=2&qt=dtupublicationquery Paolo Marcatili] — Associate professor, guest lecturer. Topic: Protein structure.

* Rasmus Weisel Jensen — PhD student (University of Copenhagen), guest lecturer. Topic: Malaria vaccines.

=== Teaching assistants ===
* [https://www.inside.dtu.dk/da/dtuinside/generelt/telefonbog/person?id=98224&tab=0 Trine Zachariasen] — Teaching assistant.
* [https://www.inside.dtu.dk/da/dtuinside/generelt/telefonbog/person?id=98246&tab=0 David Lokjær Faurdal] — Teaching assistant.
* [https://www.inside.dtu.dk/da/dtuinside/generelt/telefonbog/person?id=97559&tab=0 Clara Christina Ekebjærg] — Replacement for Trine on Tuesday Feb 5.

=== Course content ===
In this course, a large emphasis is placed on the practical usage of bioinformatics databases and tools. A typical lecture will present the theoretical aspects of the topics of the day — sometimes including a small group exercise using pen and paper — and last about an hour. The rest of the time will be spent on practical computer exercises, where the teachers and teaching assistants will be ready to help.

See also [http://kurser.dtu.dk/course/36611 the course base about 36611].

=== Curriculum ===
There is no formal textbook. The curriculum consists of the exercise guides, supplemented with various papers and chapters which will be made available on this homepage or on DTU Inside. Please note that ''all'' exercise guides are mandatory curriculum — including the ''answers'' to the exercises which will be made available on this homepage after each exercise.

=== Computers ===
'''You must bring your own laptop''' to the exercises, and it must be able to connect to DTU's wireless network. The type of computer / operating system is not important; Windows, Mac or Linux will all work fine. An iPad or an Android tablet, on the other hand, will not be good enough.

In some of the exercises ("PDB/PyMOL", "Malaria vaccine", and "Old exam questions"), you will work with the molecular visualization program PyMOL. This is rather difficult to control by a touchpad, so please remember to '''bring a mouse'''. The mouse should have two buttons plus a scroll-wheel.

Software:
# Most importantly: an updated '''internet browser''' (e.g. [http://www.google.com/chrome Google Chrome], [http://www.mozilla.com/ Firefox], [http://www.opera.com/ Opera], Safari for Mac or Edge for Windows). '''NB:''' You must have more than one browser installed; Safari for Mac or Edge for Windows may have glitches with some bioinformatics websites, and in those cases it is important to be able to switch to an alternative browser.
# '''Java''' runtime engine is needed for running some of the software we use in the course, including jEdit (see below). If it is not already installed on your computer, you can download it for free from http://www.java.com.
# A plain text editor for working with, e.g., sequence files. We recommend '''jEdit''', which you can download for free from http://www.jedit.org. If you experience unsolvable problems installing or running jEdit, there are alternatives, e.g. [http://geany.org/ Geany].
Other software will be installed during the exercises.

=== Where and when ===
Lectures plus subsequent exercises will take place every Tuesday afternoon during the semester, starting '''Tuesday Feb 5'''. Lectures will be from 13:00 to approx. 14:00 in building 208, auditorium 51, and the exercises will then take place in building 210, rooms 142+148 and group rooms 102+104+106+108+110.

=== Hand-ins ===
As preparation for the computer-based exam, each participant or group must write a "'''logbook'''" with answers to the questions posed in the exercise guides. After the exercise, you should upload the logbook to DTU Inside (Course 36611 → Assignments).

'''NB:''' It is possible to hand in as a group. We would ''much'' rather receive a group hand-in than a number of identical logbooks.

You decide which software you prefer for writing the logbook — e.g. Microsoft Word, [http://www.libreoffice.org/ LibreOffice] (free), [http://www.openoffice.org/ Apache OpenOffice] (free), Pages for Mac or similar. You should be able to insert '''screenshots''' in the logbooks for documentation purposes. Microsoft Word has a built-in screenshot tool. For Windows users, however, we recommend the free program [http://getgreenshot.org/ Greenshot] which can not only take screenshots and copy them to the clipboard, but also make simple edits and annotations in the screenshots.

Regardless of your choice of writing software, the result '''must be handed in as a PDF file'''. MacOS X and Windows 10 have built-in functions for converting any printable file to PDF. Users of earlier versions of Windows must install a separate program. Several free alternatives exist, e.g. [http://www.primopdf.com/ PrimoPDF]. (Installing PrimoPDF can be a good idea even for Windows 10 users: it provides some extra options, and the resulting files take up less space.)

'''Please do ''not'' copy the questions''' from the exercise guide to your logbook. The hand-in module on DTU Inside has a system for plagiarism detection, which will raise an alarm if significant portions of your hand-in are identical to documents found on the internet — and that includes the exercise guides.

'''NB:''' ''The hand-ins do not affect your grade'' — they are mainly meant as preparation for the exam. They are also a way for us to check how well the teaching has been understood; if we see that many participants have made the same mistake, we will try to explain the issue better at the next lecture.

=== Exam ===
The 36611 exam is electronic; i.e. you must bring your own computer, and you will ''not'' get a paper copy of the questions. The questions will be made available as a PDF file on DTU Inside. Hand-in also takes place on DTU Inside, and the procedure is the same as in the exercises. The only accepted hand-in format is PDF.

All aids are allowed at the exam; you can bring any books, papers or notes. You will have '''open access to the internet''' which includes all the materials and websites we have used during the course. You are also allowed to search for information on Google, Wikipedia, etc., but you are ''not'' allowed to communicate with others through e-mail, Facebook, chat, or file sharing websites. The internet traffic will be logged during the exam to ensure that these restrictions are kept.

Just like in the weekly hand-ins, we kindly ask you: ''Please don't copy the questions in your answer document'' — that might result in the answer being flagged as plagiarism.

When you hand in the exam assignment in DTU Inside, you get a '''hand-in code''' that must be written on the hand-in envelope. '''It is ''very'' important that the code is handed in correctly''', otherwise your answer cannot be assessed. The code serves as a control that the hand-in has not been modified since you left the examination room.

=== DTU Inside ===
Link to this year's DTU Inside group: https://cn.inside.dtu.dk/cnnet/element/590477/

=== Evaluation and feedback ===
We will be very happy to receive comments, suggestions, criticisms, or praise at any time during the semester. You can send these by email to the teachers or write them on "Forum" in the DTU Inside group. If somebody writes a message on the Forum, you can comment on it. If you see a message you agree with, please comment "Agree!" so that we can see that it is not just one person's opinion.

In addition, we will conduct a mid-term evaluation in DTU Inside.

== Lecture & exercise plan ==

Note: This is a ''preliminary'' plan, changes may occur!

=== Tuesday Feb 5 — Introduction & taxonomy ===
:'''Lectures:'''
:* ''Introduction to the course, bioinformatics, and computers'' — Henrik Nielsen.
:* ''Evolution and taxonomy'' — Rasmus Wernersson.
:'''Slides:''' will be made available on DTU Inside File sharing.
:'''Curriculum:''' [http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/Chapter2_Evolution.pdf Brief Introduction to Evolutionary Theory] — Written by Anders Gorm Pedersen.
:'''Test of prior knowledge:''' Go to https://evaluering.dtu.dk/, click "Test of prior knowledge" under 36611, and fill out the form (it's anonymous). Spend max. 10 minutes on it.
:'''Exercises:'''
:# [[Plain text files and jEdit]] — ([[ExJEdit-Answers|Answers]])
:# [[Taxonomy databases]] — ([[ExTaxonomy-Answers|Answers]])
:'''Extra material'''
::"[[Media:ELS_bioinformatics.pdf|Bioinformatics]]" — Encyclopedia entry from 2009.

=== Tuesday Feb 12 — GenBank ===
:'''Lecture:''' ''DNA as Biological Information'' — Rasmus Wernersson
:'''Curriculum:''' [[Media:DNA_SequencingTutorial.pdf|DNA sequencing tutorial]] — source: IDT Tech Vault
:'''Handout''' for the lecture: [[Media:HandoutEx_BaseCalling_Simple.pdf‎|"Base-calling" exercise]] [PDF]
:'''Slides:''' on DTU Inside File sharing.

:'''Test of prior knowledge:''' Go to https://evaluering.dtu.dk/, click "Test of prior knowledge" under 36611, and fill out the form (it's anonymous). Spend max. 10 minutes on it.
:'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] — ([[ExGenbank-new-answers|Answers]])
:'''Reference material''' for the exercise: [[Media:GenBank+fasta handout dk.pdf|GenBank + FASTA format]] [PDF]

:'''Background material''' (assumed known):
::[[File:Phone_34.gif‎]] [http://www.youtube.com/watch?v=YgmoHtLGb5c mRNA splicing] (YouTube).
::[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Extra material:'''
::[http://www.ncbi.nlm.nih.gov/books/NBK44863/ Entrez Sequences Quick Start] (NCBI)
::[https://academic.oup.com/nar/article/47/D1/D94/5144964 "GenBank"] — article from the annual database issue of Nucleic Acids Research, 2019.

=== Tuesday Feb 19 — Translation & UniProt ===
:'''Lecture:''' ''Protein databases'' — Henrik Nielsen
:'''Curriculum:''' [[Media:VirtualRibosome.pdf|Virtual Ribosome]] — software article (PDF).
:'''Slides:''' on DTU Inside File sharing.

:'''Exercises:'''
:#[[Exercise: Translation - Virtual Ribosome]] — ([[ExTranslation-answers|Answers]])
:#[[Exercise: The protein database UniProt]] — ([[ExUniProt-answers|Answers]])

:'''Background material''' (assumed known):
::[http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/protein_handout.pdf Levels of protein structure] [PDF]
::[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Extra material:'''
::[https://academic.oup.com/nar/article/47/D1/D506/5160987 "UniProt: a worldwide hub of protein knowledge"] — article from the annual database issue of Nucleic Acids Research, 2019.

=== Tuesday Feb 26 — Pairwise alignment ===
:'''Lecture:''' ''Pairwise alignment'' — Henrik Nielsen.
:'''Curriculum:''' Pages 35-55 in Immunological Bioinformatics (PDF: on DTU Inside File sharing → Textbook excerpt).
:'''Handout''' for the lecture: [[Media:New_handout_alignscores.pdf|Alignment scores]]
:'''Slides:''' on DTU Inside File sharing.
:'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]] — ([[ExPairwiseAlignment-AnswersEng|Answers]])
:'''Extra material:'''
::[[File:Phone_34.gif‎]] '''Recorded lecture:''' (may be used as a reminder) [http://breeze.cbs.dtu.dk/p31243548/ Pairwise alignments + BLAST], Anders Gorm Pedersen 2010
::login: <tt>viewer@cbs.dtu.dk</tt> — pass: <tt>jeglurer</tt>. NB: requires Adobe Flash.

=== Tuesday Mar 5 — Protein structure, PDB & PyMOL ===
:'''Remember to bring a mouse for this day's exercise.''' The mouse should have two buttons and a scroll wheel.
:'''Lecture:''' ''Protein 3D structure'' — Paolo Marcatili
:'''Curriculum:''' [http://en.wikipedia.org/wiki/Protein_structure Protein Structure (Wikipedia)]
:'''Slides:''' on DTU Inside File sharing.

:'''Link to advanced course:'''
::* [http://teaching.bioinformatics.dtu.dk/36617/index.php/36617_-_Protein_Structure_and_Computational_Biology 36617 Protein Structure and Computational Biology]

:'''Software''' for installation: [https://pymol.org/2/ PyMOL]
:'''Exercises:'''
:#[[Media:PyMol_tutorial2017_v4.pdf|PyMol tutorial]] (PDF) — basic usage of PyMOL.
:#[[ExPyMol|The PDB database and visualization in PyMOL]] — ([[Protein_Structure_and_Visualization_Answers|Answers]] — NB: the last section labelled "PyMOL magic" is NOT curriculum, just a tip!)

:'''Extra material:'''
::[https://academic.oup.com/nar/article/47/D1/D520/5144142 "Protein Data Bank: the single global archive for 3D macromolecular structure data"] — article from the annual database issue of Nucleic Acids Research, 2019.

=== Tuesday Mar 12 — BLAST ===
:'''Lecture:''' ''Introduction to BLAST'' — Rasmus Wernersson.
:'''Curriculum:''' section 3.2.5 → 3.3 (i.e. pages 47-52) in Immunological Bioinformatics (PDF: on DTU Inside File sharing → Textbook excerpt).
:'''Slides:''' on DTU Inside File sharing.
:'''[[Exercise: BLAST]]''' — ([[ExBlast-Answers|Answers]])
:'''Extra material:'''
::[[File:Phone_34.gif‎]] '''Recorded lecture:''' (may be used as a reminder) [http://breeze.cbs.dtu.dk/p31243548/ Pairwise alignments + BLAST], Anders Gorm Pedersen 2010
::login: <tt>viewer@cbs.dtu.dk</tt> — pass: <tt>jeglurer</tt>. NB: requires Adobe Flash.
::[[File:Phone_34.gif‎]] '''Videos about BLAST from NCBI:''' (Video introduction to NCBI's web interface and Expect Values) [http://www.youtube.com/playlist?list=PLH-TjWpFfWrtjzMCIvUe-YbrlIeFQlKMq NCBI's YouTube channel]

=== Tuesday Mar 19 — Case: Malaria vaccine ===
:'''Lecture:''' ''Malaria and vaccines'' — Rasmus Weisel Jensen (KU).
:'''Curriculum:''' [http://www.cdc.gov/dpdx/malaria/ Malaria — Causal Agents / Life Cycle]
:'''Slides:''' on DTU Inside File sharing.
:'''Exercise:''' [[Exercise:Malaria Vaccine|Malaria vaccine]] — ([[Answers:Malaria Vaccine|Answers]])

=== Tuesday Mar 26 — Sequence information & logo-plots ===
:'''Lecture:''' ''Sequence information & logo-plots'' — Rasmus Wernersson
:'''Curriculum:'''
:# Pages 68-80 in Immunological Bioinformatics (PDF: on DTU Inside File sharing → Textbook excerpt).
:# Pages 1-8 of "''Information theory primer''" ([http://www.cbs.dtu.dk/courses/27040/teachingmaterials/informationtheory_primer.pdf PDF])
:#* Read also the appendix on logarithms (especially Log2) if needed!
:'''Slides:''' on DTU Inside File sharing.
:'''Handout''' for the lecture: [[Media:logo_exercise.pdf|How to construct sequence logos]] (PDF)
:[[Image:Emblem-important_tiny.png‎]] '''Mid-term evaluation:''' Go to https://evaluering.dtu.dk/ and click "Mid-term evaluation" under 36611 [[Image:Emblem-important_tiny.png‎]]
:'''Exercise:''' [[ExSeqLogos|DNA and Peptide Logos]] — ([[ExSeqLogosAnswers|Answers]])

=== Tuesday Apr 2 — Weight matrices and other prediction methods ===
:'''Lecture:''' ''Introduction to prediction methods, especially Weight Matrices'' — Henrik Nielsen
:'''Curriculum:''' Same as last week!
:'''Slides:''' on DTU Inside File sharing.
:'''Handouts''' for the lecture: [http://www.cbs.dtu.dk/courses/27625.algo/presentations/PSSM/Estimationofpseudocounts.pdf How to estimate pseudo frequencies]
:'''[[Exercise: Construction of sequence logos and weight matrices]]''' - ([[ExLogo+Matrix-answers|Answers]])

=== Tuesday Apr 9 — PSI-BLAST ===
:'''Lecture:''' ''PSI-BLAST'' — Rasmus Wernersson
:'''Curriculum:'''
:'''Slides:''' on DTU Inside File sharing.
:'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]] — ([[ExPSIBLAST_answer|Answers]])

------
<div align="center">
[[Image:Easter-egg-free-to-use-cliparts.png|25px]] '''Easter holidays''' [[Image:Easter-egg-free-to-use-cliparts.png|25px]]
</div>
------

=== Tuesday Apr 23 — Multiple alignments ===
:'''Lecture:''' ''Multiple alignment'' — Henrik Nielsen
:'''Curriculum:''' RevTrans ([https://www.ncbi.nlm.nih.gov/pmc/articles/PMC169015/ article])
:'''Slides:''' on DTU Inside File sharing.
:'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]] — ([[ExMulAlign-Answers-English|Answers]])
:'''Extra material:'''
::[[File:Phone_34.gif‎]] '''Recorded lecture:''' [http://breeze.cbs.dtu.dk/p20292453/ Multiple Alignments], Anders Gorm Pedersen 2010
::login: <tt>viewer@cbs.dtu.dk</tt> - pass: <tt>jeglurer</tt> NB: Requires Flash

=== Tuesday Apr 30 — Phylogenetic trees ===
:'''Lecture:''' ''Phylogenetic Reconstruction: Distance Matrix Methods'' — Rasmus Wernersson
:'''Curriculum:'''
:# ''Introduction to Tree Building'', PDF on Inside File sharing → Slides etc → Lecture12
:# ''[http://evolution.berkeley.edu/evolibrary/article/phylogenetics_01 Evolutionary trees]'' (minus the section "How to reconstruct an evolutionary tree")
:# ''Understanding Evolutionary Trees'', [http://www.cbs.dtu.dk/courses/27615/pdf/understanding_evo_trees.pdf PDF].
:'''Slides:''' on DTU Inside File sharing.
:'''Handout''' for lecture: [http://www.cbs.dtu.dk/dtucourse/27611spring2008/Ex09_Phylo/handout_distance.pdf Reconstructing a distance tree]
:'''Software''' for installation: [http://tree.bio.ed.ac.uk/software/figtree/ FigTree tree-viewer]
:'''[[Exercise: Phylogeny]]''' — ([[Exercise:_Phylogeny-Answers|Answers]])
:'''Link to advanced course:'''
::* [http://teaching.bioinformatics.dtu.dk/36615/index.php/36615_-_Computational_Molecular_Evolution 36615 Computational Molecular Evolution]

=== Tuesday May 7 — Bioinformatics in practice + old exam questions ===
:'''Lecture:''' "Real life case": ''Bioinformatics and Systems Biology in precision medicine'' - Rasmus Wernersson
:'''Curriculum:''' (None - lean back and enjoy)
:'''Slides:''' on DTU Inside File sharing.
:'''Exercise:''' We train on the old exam set from '''2017''' - available on the file share.

== Exam ==

=== Friday May 24 ===
'''Summer exam 2019:''' In the DTU Inside group for 36611, go to Assignments → Summer exam 2019.

The assignment will be accessible from '''15:00''' on Friday May 24.

=== Checklist for computers ===
Check here whether your computer has all the software needed for the exam: [[Checklist for computers]]

=== Link collection ===
A quick overview of the websites we have used in the course: [[Link collection]]

22111:Course plan spring 2020

2024-03-15T11:47:52Z

WikiSysop: Created page with "== General information == === Teachers === * [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=257116&tab=2&qt=dtupublicationquery Henrik Nielsen] — Associate professor, course responsible. * [https://www.dtu.dk/service/telefonbog/person?id=18103&tab=2&qt=dtupublicationquery Rasmus Wernersson] — External associate professor, course responsible. * [http://www.dtu.dk/service/telefonbog/person?id=34983&tab=2&qt=dtupublicationquery Paolo Marcatili] &md..."

== General information ==

=== Teachers ===

* [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=257116&tab=2&qt=dtupublicationquery Henrik Nielsen] — Associate professor, course responsible.
* [https://www.dtu.dk/service/telefonbog/person?id=18103&tab=2&qt=dtupublicationquery Rasmus Wernersson] — External associate professor, course responsible.
* [http://www.dtu.dk/service/telefonbog/person?id=34983&tab=2&qt=dtupublicationquery Paolo Marcatili] — Associate professor, guest lecturer. Topic: Protein structure.
* [http://www.dtu.dk/service/telefonbog/person?id=5118&tab=2&qt=dtupublicationquery Anders Gorm Pedersen] — Professor, guest lecturer. Topic: Phylogenetic trees.
* Rasmus Weisel Jensen — PhD student (University of Copenhagen), guest lecturer. Topic: Malaria vaccines.

=== Teaching assistants ===
* [https://www.dtu.dk/service/telefonbog/person?id=149708 Amelie Fritz] — PhD student
* [https://www.dtu.dk/service/telefonbog/person?id=140521 Anna-Lisa Schaap-Johansen] — PhD student
* [https://www.dtu.dk/service/telefonbog/person?id=88268 Camilla Koldbæk Lemvigh] — PhD student
* [https://www.dtu.dk/service/telefonbog/person?id=88645 Helle Rus Povlsen] — PhD student
* [https://www.inside.dtu.dk/en/dtuinside/generelt/telefonbog/person?id=59520 Jesper Vang] — PhD student

=== Course content ===
In this course, a large emphasis is placed on the practical usage of bioinformatics databases and tools. A typical lecture will present the theoretical aspects of the topics of the day — sometimes including a small group exercise using pen and paper — and last about an hour. The rest of the time will be spent on practical computer exercises, where the teachers and teaching assistants will be ready to help.

See also [http://kurser.dtu.dk/course/22111 the course base about 22111].

=== Curriculum ===
There is no formal textbook. The curriculum consists of the exercise guides, supplemented with various papers and chapters which will be made available on this homepage or on DTU Inside. Please note that ''all'' exercise guides are mandatory curriculum — including the ''answers'' to the exercises which will be made available on this homepage after each exercise.

=== Computers ===
'''You must bring your own laptop''' to the exercises, and it must be able to connect to DTU's wireless network. The type of computer / operating system is not important; Windows, Mac or Linux will all work fine. An iPad or an Android tablet, on the other hand, will not be good enough. A Chromebook will also not be enough (unless you have succeeded in installing a Linux distribution on it, but in that case we assume you know what you're doing).

In some of the exercises ("PDB/PyMOL", "Malaria vaccine", and "Old exam questions"), you will work with the molecular visualization program PyMOL. This is rather difficult to control by a touchpad, so please remember to '''bring a mouse'''. The mouse should have two buttons plus a scroll-wheel.

Software:
# Most importantly: an updated '''internet browser''' (e.g. [http://www.google.com/chrome Google Chrome], [http://www.mozilla.com/ Firefox], [http://www.opera.com/ Opera], Safari for Mac or Edge for Windows). '''NB:''' You must have more than one browser installed; Safari for Mac or Edge for Windows may have glitches with some bioinformatics websites, and in those cases it is important to be able to switch to an alternative browser.
# '''Java''' runtime engine is needed for running some of the software we use in the course, including jEdit (see below). If it is not already installed on your computer, there are two options for downloading it:
## Oracle Java can be downloaded from http://www.java.com. '''Note:''' it is not truly free anymore; if you want to use it for any commercial activities, you need to pay a license fee.
## A truly free alternative is AdoptOpenJDK which can be downloaded from https://adoptopenjdk.net/. '''Windows users:''' Choose "OpenJDK 8" and "HotSpot", and remember to ''enable all subfeatures'' when installing, otherwise it will not work with the software we are going to use. '''Mac users:''' we currently don't recommend this solution for Mac, since jEdit (see below) for Mac does not play nice with AdoptOpenJDK.
# A plain text editor for working with, e.g., sequence files. We recommend '''jEdit''', which you can download for free from http://www.jedit.org. If you experience unsolvable problems installing or running jEdit, there are alternatives, e.g. [http://geany.org/ Geany].
Other software will be installed during the exercises.

=== Where and when ===
Lectures plus subsequent exercises will take place every Tuesday afternoon during the semester, starting '''Tuesday Feb 4'''. Lectures will be from 13:00 to approx. 14:00 in '''building 208, auditorium 53''', and the exercises will then take place in '''building 210, rooms 142+148+118'''.

=== Hand-ins ===
As preparation for the computer-based exam, each participant or group must write a "'''logbook'''" with answers to the questions posed in the exercise guides. After the exercise, you should upload the logbook to DTU Inside (Course 22111 → Assignments).

'''NB:''' It is possible to hand in as a group. We would ''much'' rather receive a group hand-in than a number of identical logbooks.

You decide which software you prefer for writing the logbook — e.g. Microsoft Word, [http://www.libreoffice.org/ LibreOffice] (free), [http://www.openoffice.org/ Apache OpenOffice] (free), Pages for Mac, [https://docs.google.com/ Google Docs] or similar. You should be able to insert '''screenshots''' in the logbooks for documentation purposes. Microsoft Word has a built-in screenshot tool. For Windows users, however, we recommend the free program [http://getgreenshot.org/ Greenshot] which can not only take screenshots and copy them to the clipboard, but also make simple edits and annotations in the screenshots.

Regardless of your choice of writing software, the result '''must be handed in as a PDF file'''. MacOS X and Windows 10 have built-in functions for converting any printable file to PDF. Users of earlier versions of Windows must install a separate program. Several free alternatives exist, e.g. [http://www.primopdf.com/ PrimoPDF]. (Installing PrimoPDF can be a good idea even for Windows 10 users: it provides some extra options, and the resulting files take up less space.)

'''Please do ''not'' copy the questions''' from the exercise guide to your logbook. The hand-in module on DTU Inside has a system for plagiarism detection, which will raise an alarm if significant portions of your hand-in are identical to documents found on the internet — and that includes the exercise guides.

'''NB:''' ''The hand-ins do not affect your grade'' — they are mainly meant as preparation for the exam. They are also a way for us to check how well the teaching has been understood; if we see that many participants have made the same mistake, we will try to explain the issue better at the next lecture.

=== Exam ===
The 22111 exam is electronic; i.e. you must bring your own computer, and you will ''not'' get a paper copy of the questions. The questions will be made available as a PDF file on DTU Inside. Hand-in also takes place on DTU Inside, and the procedure is the same as in the exercises. The only accepted hand-in format is PDF.

All aids are allowed at the exam; you can bring any books, papers or notes. You will have '''open access to the internet''' which includes all the materials and websites we have used during the course. You are also allowed to search for information on Google, Wikipedia, etc., but you are ''not'' allowed to communicate with others through e-mail, Facebook, chat, or file sharing websites. The internet traffic will be logged during the exam to ensure that these restrictions are kept.

Just like in the weekly hand-ins, we kindly ask you: ''Please don't copy the questions in your answer document'' — that might result in the answer being flagged as plagiarism.

When you hand in the exam assignment in DTU Inside, you get a '''hand-in code''' that must be written on the hand-in envelope. '''It is ''very'' important that the code is handed in correctly''', otherwise your answer cannot be assessed. The code serves as a control that the hand-in has not been modified since you left the examination room.

=== DTU Inside ===
Link to this year's DTU Inside group: https://cn.inside.dtu.dk/cnnet/element/612495/

=== Evaluation and feedback ===
We will be very happy to receive comments, suggestions, criticisms, or praise at any time during the semester. You can send these by email to the teachers or write them on "Forum" in the DTU Inside group. If somebody writes a message on the Forum, you can comment on it. If you see a message you agree with, please comment "Agree!" so that we can see that it is not just one person's opinion.

In addition, we will conduct a mid-term evaluation in DTU Inside.

== Lecture & exercise plan ==

Note: This is a ''preliminary'' plan, changes may occur!

=== Tuesday Feb 4 — Introduction & taxonomy ===
:'''Lectures:'''
:* ''Introduction to the course, bioinformatics, and computers'' — Henrik Nielsen.
:* ''Evolution and taxonomy'' — Rasmus Wernersson.
:'''Slides:''' will be made available on DTU Inside File sharing.
:'''Curriculum:''' [http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/Chapter2_Evolution.pdf Brief Introduction to Evolutionary Theory] — Written by Anders Gorm Pedersen.
:'''Test of prior knowledge:''' Go to https://evaluering.dtu.dk/, click "Test of prior knowledge" under 22111, and fill out the form (it's anonymous). Spend max. 10 minutes on it.
:'''Exercises:'''
:# [[Plain text files and jEdit]] — ([[ExJEdit-Answers|Answers]])
:# [[Taxonomy databases]] — ([[ExTaxonomy-Answers|Answers]])
:'''Extra material'''
::"[[Media:ELS_bioinformatics.pdf|Bioinformatics]]" — Encyclopedia entry from 2009.

=== Tuesday Feb 11 — GenBank ===
:'''Lecture:''' ''DNA as Biological Information'' — Rasmus Wernersson
:'''Curriculum:''' [[Media:DNA_SequencingTutorial.pdf|DNA sequencing tutorial]] — source: IDT Tech Vault
:'''Handout''' for the lecture: [[Media:HandoutEx_BaseCalling_Simple.pdf|"Base-calling" exercise (for printing)]] [PDF] - [[Media:BaseCalling_on_screen_version.pdf|"Base-calling" exercise (version for on-screen viewing)]] [PDF] (NEW, Feb 2020).
:'''Slides:''' on DTU Inside File sharing.


:'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] — ([[ExGenbank-new-answers|Answers]])
:'''Reference material''' for the exercise: [[Media:GenBank+FASTA_handout_revised.pdf|GenBank + FASTA format]] [PDF] (UPDATED, Feb 11th 2020)

:'''Background material''' (assumed known):
::[[File:Phone_34.gif‎]] [http://www.youtube.com/watch?v=YgmoHtLGb5c mRNA splicing] (YouTube).
::[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Extra material:'''
::[http://www.ncbi.nlm.nih.gov/books/NBK44863/ Entrez Sequences Quick Start] (NCBI)
::[https://academic.oup.com/nar/article/47/D1/D94/5144964 "GenBank"] — article from the annual database issue of Nucleic Acids Research, 2019.
::[https://academic.oup.com/nar/article/48/D1/D84/5608994 "GenBank"] — article from the annual database issue of Nucleic Acids Research, 2020.

=== Tuesday Feb 18 — Translation & UniProt ===
:'''Lecture:''' ''Protein databases'' — Henrik Nielsen
:'''Curriculum:''' [[Media:VirtualRibosome.pdf|Virtual Ribosome]] — software article (PDF).
:'''Slides:''' on DTU Inside File sharing.

:'''Exercises:'''
:#[[Exercise: Translation - Virtual Ribosome]] — ([[ExTranslation-answers|Answers]])
:#[[Exercise: The protein database UniProt]] — ([[ExUniProt-answers|Answers]])

:'''Background material''' (assumed known):
::[http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/protein_handout.pdf Levels of protein structure] [PDF]
::[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Extra material:'''
::[https://academic.oup.com/nar/article/47/D1/D506/5160987 "UniProt: a worldwide hub of protein knowledge"] — article from the annual database issue of Nucleic Acids Research, 2019.

=== Tuesday Feb 25 — Pairwise alignment ===
:'''Lecture:''' ''Pairwise alignment'' — Henrik Nielsen.
:'''Curriculum:''' Pages 35-55 in Immunological Bioinformatics (PDF: on DTU Inside File sharing → Textbook excerpt).
:'''Handout''' for the lecture: [[Media:New_handout_alignscores.pdf|Alignment scores]]
:'''Slides:''' on DTU Inside File sharing.
:'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]] — ([[ExPairwiseAlignment-AnswersEng|Answers]])


=== Tuesday Mar 3 — Protein structure, PDB & PyMOL ===
:'''Remember to bring a mouse for this day's exercise.''' The mouse should have two buttons and a scroll wheel.
:'''Lecture:''' ''Protein 3D structure'' — Paolo Marcatili
:'''Curriculum:''' [http://en.wikipedia.org/wiki/Protein_structure Protein Structure (Wikipedia)]
:'''Slides:''' on DTU Inside File sharing.

:'''Link to advanced course:'''
::* [http://teaching.healthtech.dtu.dk/36617/index.php/22117_-_Protein_Structure_and_Computational_Biology 22117 Protein Structure and Computational Biology]

:'''Software''' for installation: [https://pymol.org/2/ PyMOL]
::'''Note:''' you will need the license file found at the file sharing under this week's folder.
:'''Exercises:'''
:#[[Media:PyMol_tutorial2017_v4.pdf|PyMol tutorial]] (PDF) — basic usage of PyMOL.
:#[[Protein Structure and Visualization]] — ([[Protein_Structure_and_Visualization_Answers|Answers]] — NB: the last section labelled "PyMOL magic" is NOT curriculum, just a tip!)

:'''Extra material:'''
:*[https://academic.oup.com/nar/article/47/D1/D520/5144142 "Protein Data Bank: the single global archive for 3D macromolecular structure data"] — article from the annual database issue of Nucleic Acids Research, 2019.
:*[[PyMOL]] — some tips and tricks.
:*[http://www.cbs.dtu.dk/~blicher/Courses/PyMOL_structure_navigation.pdf PyMOL basics — a small example] (optional extra exercise)

=== Tuesday Mar 10 — BLAST ===
:'''Lecture:''' ''Introduction to BLAST'' — Rasmus Wernersson.
:'''Curriculum:''' section 3.2.5 → 3.3 (i.e. pages 47-52) in Immunological Bioinformatics (PDF: on DTU Inside File sharing → Textbook excerpt).
:'''Slides:''' on DTU Inside File sharing.
:'''[[Exercise: BLAST]]''' — ([[ExBlast-Answers|Answers]])
:'''Extra material:'''

::[[File:Phone_34.gif‎]] '''Videos about BLAST from NCBI:''' (Video introduction to NCBI's web interface and Expect Values) [http://www.youtube.com/playlist?list=PLH-TjWpFfWrtjzMCIvUe-YbrlIeFQlKMq NCBI's YouTube channel]

=== Tuesday Mar 17 — Case: Malaria vaccine ===
:'''Lecture:''' ''Malaria and vaccines'' — Rasmus Weisel Jensen (KU).
:'''Curriculum:''' [http://www.cdc.gov/dpdx/malaria/ Malaria — Causal Agents / Life Cycle]
:'''Slides:''' on DTU Inside File sharing.
:'''Exercise:''' [[Exercise:Malaria Vaccine|Malaria vaccine]] — ([[Answers:Malaria Vaccine|Answers]])

=== Tuesday Mar 24 — Sequence information & logo-plots ===
:'''Lecture:''' ''Sequence information & logo-plots'' — Rasmus Wernersson
:'''Curriculum:'''
:# Pages 68-80 in Immunological Bioinformatics (PDF: on DTU Inside File sharing → Textbook excerpt).
:# Pages 1-8 of "''Information theory primer''" ([http://www.cbs.dtu.dk/courses/27040/teachingmaterials/informationtheory_primer.pdf PDF])
:#* Read also the appendix on logarithms (especially Log2) if needed!
:'''Slides:''' on DTU Inside File sharing.
:'''Handout''' for the lecture: [[Media:logo_exercise.pdf|How to construct sequence logos]] (PDF)
:[[Image:Emblem-important_tiny.png‎]] '''Mid-term evaluation:''' Go to https://evaluering.dtu.dk/ and click "Mid-term evaluation" under 22111 [[Image:Emblem-important_tiny.png‎]]
:'''Exercise:''' [[ExSeqLogos|DNA and Peptide Logos]] — ([[ExSeqLogosAnswers|Answers]])

=== Tuesday Mar 31 — Weight matrices and other prediction methods ===
:'''Lecture:''' ''Introduction to prediction methods, especially Weight Matrices'' — Henrik Nielsen
:'''Curriculum:''' Same as last week!
:'''Slides:''' on DTU Inside File sharing.
:'''Handouts''' for the lecture: [[Media:Estimationofpseudocounts_new+examples.pdf|How to estimate pseudo frequencies]] ([[Media:Estimationofpseudocounts_answer_brief.pdf|Answer]]) 
:'''[[Exercise: Construction of sequence logos and weight matrices]]''' - ([[ExLogo+Matrix-answers|Answers]])
:'''Link to advanced course: '''
:: [http://teaching.healthtech.dtu.dk/22125/ 22125: Algorithms in bioinformatics]

------
<div align="center">
[[Image:Easter-egg-free-to-use-cliparts.png|25px]] '''Easter holidays''' [[Image:Easter-egg-free-to-use-cliparts.png|25px]]
</div>
------

=== Tuesday Apr 14 — PSI-BLAST ===
:'''Lecture:''' ''PSI-BLAST'' — Rasmus Wernersson
:'''Curriculum:'''
:'''Slides:''' on DTU Inside File sharing.
:'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]] — ([[ExPSIBLAST_answer|Answers]])

=== Tuesday Apr 21 — Multiple alignments ===
:'''Lecture:''' ''Multiple alignment'' — Henrik Nielsen
:'''Curriculum:''' RevTrans ([https://www.ncbi.nlm.nih.gov/pmc/articles/PMC169015/ article])
:'''Slides:''' on DTU Inside File sharing.
:'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]] — ([[ExMulAlign-Answers-English|Answers]])


=== Tuesday Apr 28 — Phylogenetic trees ===
:'''Lecture:''' ''Phylogenetic Reconstruction: Distance Matrix Methods'' — Anders Gorm Pedersen
:'''Curriculum:'''
:# ''Introduction to Tree Building'', PDF on Inside File sharing → Slides etc → Lecture12
:# ''[http://evolution.berkeley.edu/evolibrary/article/phylogenetics_01 Evolutionary trees]'' (minus the section "How to reconstruct an evolutionary tree")
:# ''Understanding Evolutionary Trees'', [http://www.cbs.dtu.dk/courses/27615/pdf/understanding_evo_trees.pdf PDF].
:'''Slides:''' on DTU Inside File sharing.
:'''Handout''' for lecture: [http://www.cbs.dtu.dk/dtucourse/27611spring2008/Ex09_Phylo/handout_distance.pdf Reconstructing a distance tree] ([[Media:handout_distance_answers.pdf|Answer]])
:'''Software''' for installation: [http://tree.bio.ed.ac.uk/software/figtree/ FigTree tree-viewer]
:'''[[Exercise: Phylogeny]]''' — ([[Exercise:_Phylogeny-Answers|Answers]])
:'''Link to advanced course:'''
::* [http://teaching.healthtech.dtu.dk/22115/ 22115 Computational Molecular Evolution]

=== Tuesday May 5 — Bioinformatics in practice + old exam questions ===
:'''Lecture:''' "Real life case": ''Bioinformatics and Systems Biology in precision medicine'' - Rasmus Wernersson
:'''Curriculum:''' (None - lean back and enjoy)
:'''Slides:''' on DTU Inside File sharing.
:'''Exercise:''' We train on the old exam set from '''2017''' - available on the file share.

== Exam ==

=== Tuesday May 26 ===
'''Summer exam 2020:''' Go to http://onlineeksamen.dtu.dk/ and find 22111.


The assignment will be accessible from '''9:00''' on Tuesday May 26.

=== Checklist for computers ===
Check here whether your computer has all the software needed for the exam: [[Checklist for computers]]

=== Link collection ===
A quick overview of the websites we have used in the course: [[Link collection]]

=== FAQ ===
Questions we have received and answered: [[FAQ]]

22111:Course plan spring 2021

2024-03-15T11:47:06Z

WikiSysop: Created page with "== General information == === Where and when === Lectures plus subsequent exercises will take place every Tuesday afternoon during the semester, starting '''Tuesday Feb 2 at 13:00'''. '''In 2021, the course will be held online during the entire semester.'''  For online lectures and exercises, we use Mi..."

== General information ==

=== Where and when ===
Lectures plus subsequent exercises will take place every Tuesday afternoon during the semester, starting '''Tuesday Feb 2 at 13:00'''.

'''In 2021, the course will be held online during the entire semester.'''


For online lectures and exercises, we use Microsoft Teams.  The code for joining this year's team is:
tvii8l9
Please see the '''[[MS Teams instructions]]!'''


=== Teachers ===

* [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=257116&tab=2&qt=dtupublicationquery Henrik Nielsen] — Associate professor, course responsible.
* [https://www.dtu.dk/service/telefonbog/person?id=18103&tab=2&qt=dtupublicationquery Rasmus Wernersson] — External associate professor, course responsible.
* [http://www.dtu.dk/service/telefonbog/person?id=34983&tab=2&qt=dtupublicationquery Paolo Marcatili] — Associate professor, guest lecturer. Topic: Protein structure.
* [http://www.dtu.dk/service/telefonbog/person?id=5118&tab=2&qt=dtupublicationquery Anders Gorm Pedersen] — Professor, guest lecturer. Topic: Phylogenetic trees.


=== Teaching assistants ===
* [https://www.dtu.dk/service/telefonbog/person?id=149708 Amelie Fritz] — PhD student
* [https://www.dtu.dk/service/telefonbog/person?id=140521 Anna-Lisa Schaap-Johansen] — PhD student
* [https://www.dtu.dk/service/telefonbog/Person?id=158424 Klara Marie Andersen] — PhD student
* [https://www.dtu.dk/service/telefonbog/Person?id=97554 Marianne Helenius] — PhD student
* [https://www.inside.dtu.dk/en/dtuinside/generelt/telefonbog/person?id=144521 Morgane Chauvet] — MSc student


=== Course content ===
In this course, a large emphasis is placed on the practical usage of bioinformatics databases and tools. A typical lecture will present the theoretical aspects of the topics of the day — sometimes including a small group exercise using pen and paper — and last about an hour. The rest of the time will be spent on practical computer exercises, where the teachers and teaching assistants will be ready to help.

See also [http://kurser.dtu.dk/course/22111 the course base about 22111].

=== Curriculum ===
There is no formal textbook. The curriculum consists of the exercise guides, supplemented with various papers and chapters which will be made available on this homepage or on DTU Learn. Please note that ''all'' exercise guides are mandatory curriculum — including the ''answers'' to the exercises which will be made available on DTU Learn after each exercise.

=== Computers ===
====Hardware====
'''You must bring your own laptop''' to the exercises, and it must be able to connect to DTU's wireless network. The type of computer / operating system is not important; Windows, Mac or Linux will all work fine. An iPad or an Android tablet, on the other hand, will not be good enough. A Chromebook will also not be enough (unless you have succeeded in installing a Linux distribution on it, but in that case we assume you know what you're doing).

In some of the exercises ("PDB/PyMOL", "Malaria vaccine", and "Old exam questions"), you will work with the molecular visualization program PyMOL. This is rather difficult to control with a touchpad, so please remember to '''bring a mouse'''. The mouse should have two buttons plus a scroll-wheel.

====Software====
# Most importantly: an updated '''internet browser''' (e.g. [http://www.google.com/chrome Google Chrome], [http://www.mozilla.com/ Firefox], [http://www.opera.com/ Opera], [https://www.microsoft.com/edge Edge] for Windows or Mac OS, or Safari for Mac only). '''NB:''' You must have more than one browser installed; Safari for Mac or Edge for Windows may have glitches with some bioinformatics websites, and in those cases it is important to be able to switch to an alternative browser.
# '''Java''' runtime engine is needed for running some of the software we use in the course, including jEdit (see below). Download from here: https://adoptopenjdk.net/ - choose OpenJDK 11 and HotSpot.
#* '''NOTE:''' Oracle Java 8 from https://java.com/en/ is NOT good enough for jEdit anymore. jEdit version 5.6 and later needs Java 11, which is available from the above link (and from a few other places). AdoptOpenJDK is open source and free, even for commercial use, which is no longer the case for Oracle Java.
#* '''IMPORTANT TIP''' for Windows users: You need to enable the sub-feature named "set JAVA_HOME variable" when installing Java 11.
# A plain text editor for working with, e.g., sequence files. We recommend '''jEdit''', which you can download for free from http://www.jedit.org. If you experience unsolvable problems installing or running jEdit, there are alternatives, e.g. [http://geany.org/ Geany].
#* '''NOTE:''' The jEdit developers have not signed the installation package, so both Windows and macOS will complain when you first attempt to install it, and you have to insist that it is OK to run the program. On Macs, this is a bit complicated; see the instructions in [[ExJEdit#Download_and_Install_jEdit|the exercise guide]].
Other software will be installed during the exercises.

=== Hand-ins ===
As preparation for the computer-based exam, each participant or group must write a "'''logbook'''" with answers to the questions posed in the exercise guides. After the exercise, you should upload the logbook to DTU Learn.


You decide which software you prefer for writing the logbook — e.g. Microsoft Word, [http://www.libreoffice.org/ LibreOffice] (free), [http://www.openoffice.org/ Apache OpenOffice] (free), Pages for Mac, [https://docs.google.com/ Google Docs] or similar. You should be able to insert '''screenshots''' in the logbooks for documentation purposes. Microsoft Word has a built-in screenshot tool. Both Windows 10 and Mac OS also have dedicated screenshot tools.


Regardless of your choice of writing software, the result '''must be handed in as a PDF file'''. LibreOffice and Google Docs can make PDFs directly. macOS and Windows 10 have built-in functions for converting any printable file to PDF. Users of earlier versions of Windows must install a separate program. Several free alternatives exist, e.g. [http://www.primopdf.com/ PrimoPDF]. (Installing PrimoPDF can be a good idea even for Windows 10 users: it provides some extra options, and the resulting files take up less space.)

'''Please do ''not'' copy the questions''' from the exercise guide to your logbook. The hand-in module on DTU Learn has a system for plagiarism detection, which will raise an alarm if significant portions of your hand-in are identical to documents found on the internet — and that includes the exercise guides.

'''NB:''' ''The hand-ins do not affect your grade'' — they are mainly meant as a preparation for the exam. They are also a means for us to check the understanding of the teaching; if we can see that many participants have made the same mistake, we will try to explain the issue better at the next lecture.

=== Exam ===
The 22111 exam is electronic; i.e. you must bring your own computer, and you will ''not'' get a paper copy of the questions. The questions will be made available as a PDF file on the DTU online exam system.  The only accepted hand-in format is PDF.

All aids are allowed at the exam; you can bring any books, papers or notes. You will have '''open access to the internet''', which includes all the materials and websites we have used during the course. You are also allowed to search for information on Google, Wikipedia, etc., but you are ''not'' allowed to communicate with others through e-mail, Facebook, chat, or file sharing websites. The internet traffic will be logged during the exam to ensure that these restrictions are observed.

Just like in the weekly hand-ins, we kindly ask you: ''Please don't copy the questions in your answer document'' — that might result in the answer being flagged as plagiarism.


=== DTU Learn & Inside ===
Link to this year's DTU Learn page: https://learn.inside.dtu.dk/d2l/home/60286

Link to this year's DTU Inside group: https://cn.inside.dtu.dk/cnnet/element/633911/

=== Evaluation and feedback ===
We will be very happy to receive comments, suggestions, criticisms, or praise at any time during the semester. You can:
* send them by email to the teachers, or
* write them under "General feedback" in "Discussion" in the DTU Inside group (found in the Course content menu), or
* write them in the "General" channel in MS Teams.
If somebody writes a message in the "Discussion" or "General" channel, you can comment on it. If you see a message you agree with, please comment "Agree!" so that we can see that it is not just one person's opinion.

In addition, we will conduct a mid-term evaluation in [https://evaluering.dtu.dk/ DTU evaluation].

== Lecture & exercise plan ==

Note: This is a ''preliminary'' plan; changes may occur!

=== Tuesday Feb 2 — Introduction & taxonomy ===
:'''Lectures:'''
:* ''Introduction to the course, bioinformatics, and computers'' — Henrik Nielsen.
:* ''Evolution and taxonomy'' — Rasmus Wernersson.
:'''Slides:''' will be made available on DTU Learn.
:'''Curriculum:''' [http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/Chapter2_Evolution.pdf Brief Introduction to Evolutionary Theory] — Written by Anders Gorm Pedersen.
:'''Test of prior knowledge:''' Go to https://evaluering.dtu.dk/, click "Test of prior knowledge" under 22111, and fill out the form (it's anonymous). Spend max. 10 minutes on it.
:'''Exercises:'''
:# [[Plain text files and jEdit]] 
:# [[Taxonomy databases]] 
:'''Extra material:'''
:*"[[Media:ELS_bioinformatics.pdf|Bioinformatics]]" — Encyclopedia entry from 2009.
:*"[https://academic.oup.com/nar/article/49/D1/D10/5937080 Database resources of the National Center for Biotechnology Information]" — article from the annual database issue of Nucleic Acids Research, 2021.

=== Tuesday Feb 9 — GenBank ===
:'''Lecture:''' ''DNA as Biological Information'' — Rasmus Wernersson
:'''Curriculum:''' [[Media:DNA_SequencingTutorial.pdf|DNA sequencing tutorial]] — source: IDT Tech Vault
:'''Handout''' for the lecture: [[Media:HandoutEx_BaseCalling_Simple.pdf‎|"Base-calling" exercise (for printing)]] [PDF] / [[Media:BaseCalling_on_screen_version.pdf|"Base-calling" exercise (version for on-screen viewing)]] [PDF].
:'''Slides:''' on DTU Learn.


:'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] — ([[ExGenbank-new-answers|Answers]])
:'''Reference material''' for the exercise: [[Media:GenBank+FASTA_handout_revised.pdf|GenBank + FASTA format]] [PDF]

:'''Background material''' (supposedly known):
:*[[File:Phone_34.gif‎]] [http://www.youtube.com/watch?v=YgmoHtLGb5c mRNA splicing] (YouTube).
:*[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Extra material:'''
:*[http://www.ncbi.nlm.nih.gov/books/NBK44863/ Entrez Sequences Quick Start] (NCBI)
:*[https://academic.oup.com/nar/article/49/D1/D92/5983623 "GenBank"] — article from the annual database issue of Nucleic Acids Research, 2021.

=== Tuesday Feb 16 — Translation & UniProt ===
:'''Lecture:''' ''Protein databases'' — Henrik Nielsen
:'''Curriculum:''' [[Media:VirtualRibosome.pdf|Virtual Ribosome]] — software article (PDF).
:'''Slides:''' on DTU Learn.

:'''Exercises:'''
:#[[Exercise: Translation - Virtual Ribosome]] — ([[ExTranslation-answers|Answers]])
:#[[Exercise: The protein database UniProt]] — ([[ExUniProt-answers|Answers]])
:'''Background material''' (supposedly known):
:*[http://www.cbs.dtu.dk/dtucourse/27611spring2011/PDF/protein_handout.pdf Levels of protein structure] [PDF]
:*[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Extra material:'''
:*[https://academic.oup.com/nar/article/49/D1/D480/6006196 "UniProt: the universal protein knowledgebase in 2021"] — article from the annual database issue of Nucleic Acids Research, 2021.
:*[[Media:uniprotkb_quickguide.pdf|"A Quick Guide to UniProtKB"]] — nice printable overview.

=== Tuesday Feb 23 — Pairwise alignment ===
:'''Lecture:''' ''Pairwise alignment'' — Henrik Nielsen.
:'''Curriculum:''' Pages 35-55 in Immunological Bioinformatics (PDF: on DTU Learn → General information and files → Textbook excerpt).
:'''Handout''' for the lecture: [[Media:New_handout_alignscores.pdf|Alignment scores]]
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]] 


=== Tuesday Mar 2 — Protein structure, PDB & PyMOL ===
:'''Remember to bring a mouse for this day's exercise.''' The mouse should have two buttons and a scroll wheel.
:'''Lecture:''' ''Protein 3D structure'' — Paolo Marcatili
:'''Curriculum:''' [http://en.wikipedia.org/wiki/Protein_structure Protein Structure (Wikipedia)]
:'''Slides:''' on DTU Learn.

:'''Link to advanced course:'''
::* [http://teaching.healthtech.dtu.dk/36617/index.php/22117_-_Protein_Structure_and_Computational_Biology 22117 Protein Structure and Computational Biology]

:'''Software''' for installation: [https://pymol.org/2/ PyMOL]
::'''Note:''' you will need the license file found on DTU Learn under this week's topic.
:'''Exercises:'''
:#[[Media:PyMol_tutorial2017_v4.pdf|PyMol tutorial]] (PDF) — basic usage of PyMOL.
:#[[Protein Structure and Visualization]] — ([[Protein_Structure_and_Visualization_Answers|Answers]]; NB: the last section labelled "PyMOL magic" is NOT curriculum, just a tip!)

:'''Extra material:'''
:*[https://academic.oup.com/nar/article/49/D1/D437/5992282 "RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences"] — article from the annual database issue of Nucleic Acids Research, 2021.
:*[[PyMOL]] — some tips and tricks.
:*[http://www.cbs.dtu.dk/~blicher/Courses/PyMOL_structure_navigation.pdf PyMOL basics — a small example] (optional extra exercise)

=== Tuesday Mar 9 — BLAST ===
:'''Lecture:''' ''Introduction to BLAST'' — Rasmus Wernersson.
:'''Curriculum:''' section 3.2.5 → 3.3 (i.e. pages 47-52) in Immunological Bioinformatics (PDF: on DTU Learn).
:'''Slides:''' on DTU Learn.
:'''[[Exercise: BLAST]]''' 
:'''Extra material:'''

::[[File:Phone_34.gif‎]] '''Videos about BLAST from NCBI:''' (Video introduction to NCBI's web interface and Expect Values) [http://www.youtube.com/playlist?list=PLH-TjWpFfWrtjzMCIvUe-YbrlIeFQlKMq NCBI's YouTube channel]

=== Tuesday Mar 16 — Case: Malaria vaccine ===
:'''Lecture:''' ''Malaria and vaccines'' — [https://cmp.ku.dk/staff/?pure=en/persons/226923 Thomas Lavstsen], Associate Professor, University of Copenhagen
:'''Curriculum:''' [http://www.cdc.gov/dpdx/malaria/ Malaria — Causal Agents / Life Cycle]
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[Exercise:Malaria Vaccine|Malaria vaccine]] 

=== Tuesday Mar 23 — Sequence information & logo-plots ===
:'''Lecture:''' ''Sequence information & logo-plots'' — Henrik Nielsen
:'''Curriculum:'''
:# Pages 68-80 in Immunological Bioinformatics (PDF: on DTU Learn).
:# Pages 1-8 of "''Information theory primer''" ([http://www.cbs.dtu.dk/courses/27040/teachingmaterials/informationtheory_primer.pdf PDF])
:#* Read also the appendix on logarithms (especially Log2) if needed!
:'''Slides:''' on DTU Learn.
:'''Handout''' for the lecture: [[Media:logo_exercise.pdf|How to construct sequence logos]] (PDF)
:[[Image:Emblem-important_tiny.png‎]] '''Mid-term evaluation:''' Go to https://evaluering.dtu.dk/ and click "Mid-term evaluation" under 22111 [[Image:Emblem-important_tiny.png‎]]
:'''Exercise:''' [[ExSeqLogos|DNA and Peptide Logos]] 

------
<div align="center">
[[Image:Easter-egg-free-to-use-cliparts.png|25px]] '''Easter holidays''' [[Image:Easter-egg-free-to-use-cliparts.png|25px]]
</div>
------

=== Tuesday Apr 6 — Weight matrices and other prediction methods ===
:'''Lecture:''' ''Introduction to prediction methods, especially Weight Matrices'' — Henrik Nielsen
:'''Curriculum:''' Same as last week!
:'''Slides:''' on DTU Learn.
:'''Handouts''' for the lecture: [[Media:Estimationofpseudocounts_new+examples.pdf|How to estimate pseudo frequencies]]  
:'''Exercise:''' [[Exercise: Construction of sequence logos and weight matrices|Construction of weight matrices]] 
:'''Link to advanced course: '''
:: [http://teaching.healthtech.dtu.dk/22125/ 22125: Algorithms in bioinformatics]

=== Tuesday Apr 13 — PSI-BLAST ===
:'''Lecture:''' ''PSI-BLAST'' — Rasmus Wernersson
:'''Curriculum:'''
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]] 

=== Tuesday Apr 20 — Multiple alignments ===
:'''Lecture:''' ''Multiple alignment'' — Henrik Nielsen
:'''Curriculum:''' RevTrans ([https://www.ncbi.nlm.nih.gov/pmc/articles/PMC169015/ article])
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]] 


=== Tuesday Apr 27 — Phylogenetic trees ===
:'''Lecture:''' ''Phylogenetic Reconstruction: Distance Matrix Methods'' — Anders Gorm Pedersen
:'''Curriculum:'''
:# ''Introduction to Tree Building'', PDF on Learn 
:# ''[http://evolution.berkeley.edu/evolibrary/article/phylogenetics_01 Evolutionary trees]'' (minus the section "How to reconstruct an evolutionary tree")
:# ''Understanding Evolutionary Trees'', [http://www.cbs.dtu.dk/courses/27615/pdf/understanding_evo_trees.pdf PDF].
:'''Slides:''' on DTU Learn.
:'''Handout''' for lecture: [http://www.cbs.dtu.dk/dtucourse/27611spring2008/Ex09_Phylo/handout_distance.pdf Reconstructing a distance tree] 
:'''Software''' for installation: [http://tree.bio.ed.ac.uk/software/figtree/ FigTree tree-viewer]
::'''IMPORTANT NOTE''' for Windows users: Download the <tt>.zip</tt> file and unpack it. Then, go to the "lib" subfolder and double-click the <tt>.jar</tt> file. The <tt>.exe</tt> file will not work.
:'''[[Exercise: Phylogeny]]''' 
:'''Link to advanced course:'''
::* [http://teaching.healthtech.dtu.dk/22115/ 22115 Computational Molecular Evolution]

=== Tuesday May 4 — Bioinformatics in practice + old exam questions ===
:'''Lecture:''' "Real life case": ''Bioinformatics and Systems Biology in precision medicine'' - Rasmus Wernersson
:'''Curriculum:''' (None - lean back and enjoy)
:'''Slides:''' on DTU Learn.
:'''Exercise:''' We train on the old exam set from '''2020''' - available on DTU Learn.

== Exam ==

=== Wednesday May 26 ===
'''Summer exam 2021:''' Go to https://eksamen.dtu.dk/ and find 22111.

[[Media:Vejledning_til_studerende_DE_Digital_Eksamen_DK-UK.pdf|Here is a guide]] to the Digital Exam interface (in Danish and English).

The assignment will be accessible from '''XX:XX''' on Wednesday May 26.

=== Checklist for computers ===
Check here whether your computer has all the software needed for the exam: [[Checklist for computers]]

=== Link collection ===
A quick overview of the websites we have used in the course: [[Link collection]]

=== FAQ ===
Questions we have received and answered: [[FAQ]]

22111:Course plan spring 2022

2024-03-15T11:45:17Z

WikiSysop: Created page with "== General information == === Where and when === Lectures plus subsequent exercises will take place every Tuesday afternoon during the semester, starting '''Tuesday Feb 1 at 13:00'''. Lectures will be from 13:00 to approx. 14 in '''building 306, auditorium 32''', and the exercises will then take place in '''building 210, rooms 142+148 and 112+118'''. === Teachers === * [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=257116&tab=2&qt=dtupublicationquery He..."

== General information ==

=== Where and when ===
Lectures plus subsequent exercises will take place every Tuesday afternoon during the semester, starting '''Tuesday Feb 1 at 13:00'''.

Lectures will run from 13:00 to approx. 14:00 in '''building 306, auditorium 32''', and the exercises will then take place in '''building 210, rooms 142+148 and 112+118'''.

=== Teachers ===

* [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=257116&tab=2&qt=dtupublicationquery Henrik Nielsen] — Associate professor, course responsible.
* [https://www.dtu.dk/service/telefonbog/person?id=18103&tab=2&qt=dtupublicationquery Rasmus Wernersson] — External associate professor.
* [http://www.dtu.dk/service/telefonbog/person?id=34983&tab=2&qt=dtupublicationquery Paolo Marcatili] — Associate professor, guest lecturer. Topic: Protein structure.
* [http://www.dtu.dk/service/telefonbog/person?id=5118&tab=2&qt=dtupublicationquery Anders Gorm Pedersen] — Professor, guest lecturer. Topic: Phylogenetic trees.


=== Teaching assistants ===

* [https://www.dtu.dk/service/telefonbog/person?id=167092 Joshua Daniel Rubin] — PhD student
* [https://www.dtu.dk/service/telefonbog/Person?id=98057 Julie Zimmermann] — PhD student
* [https://www.dtu.dk/service/telefonbog/person?id=141480 Magnus Haraldson Høie] — PhD student
* [https://www.dtu.dk/service/telefonbog/Person?id=165043 Nicola Alexandra Vogel] — PhD student

=== Course content ===
In this course, a large emphasis is placed on the practical usage of bioinformatics databases and tools. A typical lecture will present the theoretical aspects of the topics of the day — sometimes including a small group exercise using pen and paper — and last about an hour. The rest of the time will be spent on practical computer exercises, where the teachers and teaching assistants will be ready to help.

See also [http://kurser.dtu.dk/course/22111 the course base about 22111].

=== Curriculum ===
There is no formal textbook. The curriculum consists of the exercise guides, supplemented with various papers and chapters which will be made available on this homepage or on DTU Learn. Please note that ''all'' exercise guides are mandatory curriculum — including the ''answers'' to the exercises which will be made available on DTU Learn after each exercise.

=== Computers ===
====Hardware====
'''You must bring your own laptop''' to the exercises, and it must be able to connect to DTU's wireless network. The type of computer / operating system is not important; Windows, Mac or Linux will all work fine. An iPad or an Android tablet, on the other hand, will not be good enough. A Chromebook will also not be enough (unless you have succeeded in installing a Linux distribution on it, but in that case we assume you know what you're doing).

In some of the exercises ("PDB/PyMOL", "Malaria vaccine", and "Old exam questions"), you will work with the molecular visualization program PyMOL. This is rather difficult to control with a touchpad, so please remember to '''bring a mouse'''. The mouse should have two buttons plus a scroll-wheel.

====Software====
# Most importantly: an updated '''internet browser''' (e.g. [http://www.google.com/chrome Google Chrome], [http://www.mozilla.com/ Firefox], [http://www.opera.com/ Opera], [https://www.microsoft.com/edge Edge] for Windows or Mac OS, or Safari for Mac only). '''NB:''' You must have more than one browser installed; Safari for Mac or Edge for Windows may have glitches with some bioinformatics websites, and in those cases it is important to be able to switch to an alternative browser.
# '''Java''' runtime engine is needed for running some of the software we use in the course, including jEdit (see below). Download from here: https://adoptium.net/ - choose Temurin JDK 17.
#* '''NOTE:''' Oracle Java 8 from https://java.com/en/ is NOT good enough for jEdit anymore. jEdit version 5.6 and later needs Java 11 or higher, which is available from the above link (and from a few other places). Temurin JDK is open source and free, even for commercial use, which is no longer the case for Oracle Java.
#* '''IMPORTANT TIP''' for Windows users: You need to enable the sub-feature named "set JAVA_HOME variable" when installing Temurin JDK.
# A plain text editor for working with, e.g., sequence files. We recommend '''jEdit''', which you can download for free from http://www.jedit.org. If you experience unsolvable problems installing or running jEdit, there are alternatives, e.g. [http://geany.org/ Geany].
#* '''NOTE:''' The jEdit developers have not signed the installation package, so both Windows and macOS will complain when you first attempt to install it, and you have to insist that it is OK to run the program. On Macs, this is a bit complicated; see the instructions in [[ExJEdit#Download_and_Install_jEdit|the exercise guide]].
Other software will be installed during the exercises.

=== Hand-ins ===
As preparation for the computer-based exam, each participant or group must write a "'''logbook'''" with answers to the questions posed in the exercise guides. After the exercise, you should upload the logbook to DTU Learn.


You decide which software you prefer for writing the logbook — e.g. Microsoft Word, [http://www.libreoffice.org/ LibreOffice] (free), [http://www.openoffice.org/ Apache OpenOffice] (free), Pages for Mac, [https://docs.google.com/ Google Docs] or similar. You should be able to insert '''screenshots''' in the logbooks for documentation purposes. Microsoft Word has a built-in screenshot tool. Both Windows 10 and Mac OS also have dedicated screenshot tools.


Regardless of your choice of writing software, the result '''must be handed in as a PDF file'''. LibreOffice and Google Docs can make PDFs directly. macOS and Windows 10 have built-in functions for converting any printable file to PDF. Users of earlier versions of Windows must install a separate program. Several free alternatives exist, e.g. [http://www.primopdf.com/ PrimoPDF]. (It can be a good idea for Windows 10 users to install PrimoPDF as well, since it provides some extra options and the resulting files take up less space.)

'''Please do ''not'' copy the questions''' from the exercise guide to your logbook. The hand-in module on DTU Learn has a system for plagiarism detection, which will raise an alarm if significant portions of your hand-in are identical to documents found on the internet — and that includes the exercise guides.

'''NB:''' ''The hand-ins do not affect your grade''; they are mainly meant as preparation for the exam. They also let us check how well the teaching has been understood: if we can see that many participants have made the same mistake, we will try to explain the issue better at the next lecture.

=== Exam ===
The 22111 exam is electronic; i.e. you must bring your own computer, and you will ''not'' get a paper copy of the questions. The questions will be made available as a PDF file on the DTU online exam system. The only accepted hand-in format is PDF.

All aids are allowed at the exam; you can bring any books, papers or notes. You will have '''open access to the internet''', which includes all the materials and websites we have used during the course. You are also allowed to search for information on Google, Wikipedia, etc., but you are ''not'' allowed to communicate with others through e-mail, Facebook, chat, or file sharing websites. The internet traffic will be logged during the exam to ensure that these restrictions are respected.

Just like in the weekly hand-ins, we kindly ask you: ''Please don't copy the questions in your answer document'' — that might result in the answer being flagged as plagiarism.


=== DTU Learn & Inside ===
Link to this year's DTU Learn page: https://learn.inside.dtu.dk/d2l/home/102976

Link to this year's DTU Inside group: https://cn.inside.dtu.dk/cnnet/element/653095/frontpage

=== Evaluation and feedback ===
We will be very happy to receive comments, suggestions, criticisms, or praise at any time during the semester. You can:
* send them by email to the teachers, or
* write them under "General feedback" in "Discussion" in the DTU Inside group (found in the Course content menu)
If somebody writes a message in "Discussion", you can comment on it. If you see a message you agree with, please comment "Agree!" so that we can see that it is not just one person's opinion.

In addition, we will conduct a mid-term evaluation in [https://evaluering.dtu.dk/ DTU evaluation].

== Lecture & exercise plan ==

Note: This is a ''preliminary'' plan; changes may occur!

=== Tuesday Feb 1 — Introduction & taxonomy ===
:'''Lectures:'''
:* ''Introduction to the course, bioinformatics, and computers'' — Henrik Nielsen.
:* ''Evolution and taxonomy'' — Rasmus Wernersson.
:'''Slides:''' will be made available on DTU Learn.
:'''Curriculum:''' [https://teaching.healthtech.dtu.dk/material/36611/PDF/Chapter2_Evolution.pdf Brief Introduction to Evolutionary Theory] — Written by Anders Gorm Pedersen.
:'''Test of prior knowledge:''' Go to https://evaluering.dtu.dk/, click "Test of prior knowledge" under 22111, and fill out the form (it's anonymous). Spend max. 10 minutes on it.
:'''Exercises:'''
:# [[Plain text files and jEdit]] 
:# [[Taxonomy databases]] 
:'''Extra material'''
:*"[[Media:ELS_bioinformatics.pdf|Bioinformatics]]" — Encyclopedia entry from 2009.
:*"[https://academic.oup.com/nar/article/50/D1/D20/6447242 Database resources of the national center for biotechnology information]" — article from the annual database issue of Nucleic Acids Research, 2022

=== Tuesday Feb 8 — GenBank ===
:'''Lecture:''' ''DNA as Biological Information'' — Rasmus Wernersson
:'''Curriculum:''' [[Media:DNA_SequencingTutorial.pdf|DNA sequencing tutorial]] — source: IDT Tech Vault
:'''Handout''' for the lecture: [[Media:HandoutEx_BaseCalling_Simple.pdf‎|"Base-calling" exercise (for printing)]] [PDF] / [[Media:BaseCalling_on_screen_version.pdf|"Base-calling" exercise (version for on-screen viewing)]] [PDF].
:'''Slides:''' on DTU Learn.


:'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] 
:'''Reference material''' for the exercise: [[Media:GenBank+FASTA_handout_revised.pdf|GenBank + FASTA format]] [PDF]

:'''Background material''' (supposedly known):
:*[[File:Phone_34.gif‎]] [http://www.youtube.com/watch?v=YgmoHtLGb5c mRNA splicing] (YouTube).
:*[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Extra material:'''
:*[http://www.ncbi.nlm.nih.gov/books/NBK44863/ Entrez Sequences Quick Start] (NCBI)
:*[https://academic.oup.com/nar/article/50/D1/D161/6447240 "GenBank"] — article from the annual database issue of Nucleic Acids Research, 2022.

=== Tuesday Feb 15 — Translation & UniProt ===
:'''Lecture:''' ''Protein databases'' — Henrik Nielsen
:'''Curriculum:''' [[Media:VirtualRibosome.pdf|Virtual Ribosome]] — software article (PDF).
:'''Slides:''' on DTU Learn.

:'''Exercises:'''
:#[[Exercise: Translation - Virtual Ribosome]] 
:#[[Exercise: The protein database UniProt]] 
:'''Background material''' (supposedly known):
:*[https://teaching.healthtech.dtu.dk/material/36611/PDF/protein_handout.pdf Levels of protein structure] [PDF]
:*[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Extra material:'''
:*[https://academic.oup.com/nar/article/49/D1/D480/6006196 "UniProt: the universal protein knowledgebase in 2021"] — article from the annual database issue of Nucleic Acids Research, 2021.
:*[[Media:uniprotkb_quickguide.pdf|"A Quick Guide to UniProtKB"]] — nice printable overview.

=== Tuesday Feb 22 — Pairwise alignment ===
:'''Lecture:''' ''Pairwise alignment'' — Henrik Nielsen.
:'''Curriculum:''' Pages 35-55 in Immunological Bioinformatics (PDF: on DTU Learn → General information and files → Textbook excerpt).
:'''Handout''' for the lecture: [[Media:New_handout_alignscores.pdf|Alignment scores]]
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]] 

=== Tuesday Mar 1 — Protein structure, PDB & PyMOL ===
:'''Remember to bring a mouse for this day's exercise.''' The mouse should have two buttons and a scroll wheel.
:'''Lecture:''' ''Protein 3D structure'' — Paolo Marcatili
:'''Curriculum:''' [http://en.wikipedia.org/wiki/Protein_structure Protein Structure (Wikipedia)]
:'''Slides:''' on DTU Learn.

:'''Link to advanced course:'''
::* [http://teaching.healthtech.dtu.dk/36617/index.php/22117_-_Protein_Structure_and_Computational_Biology 22117 Protein Structure and Computational Biology]

:'''Software''' for installation: [https://pymol.org/2/ PyMOL]
::'''Note:''' you will need the license file found at DTU Learn under this week's topic.
:'''Exercises:'''
:#[[Media:PyMol_tutorial2017_v4.pdf|PyMol tutorial]] (PDF) — basic usage of PyMOL.
:#[[Protein Structure and Visualization]] 

:'''Extra material:'''
:*[https://academic.oup.com/nar/article/49/D1/D437/5992282 "RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences"] — article from the annual database issue of Nucleic Acids Research, 2021.
:*[[PyMOL]] — some tips and tricks.
:*[https://teaching.healthtech.dtu.dk/material/36611/PDF/PyMOL_structure_navigation.pdf PyMOL basics — a small example] (optional extra exercise)

=== Tuesday Mar 8 — Case: Malaria vaccine ===
:'''Lecture:''' ''Malaria and vaccines'' — [https://cmp.ku.dk/staff/?pure=en/persons/226923 Thomas Lavstsen], Associate Professor, University of Copenhagen
:'''Curriculum:''' [http://www.cdc.gov/dpdx/malaria/ Malaria — Causal Agents / Life Cycle]
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[Exercise:Malaria Vaccine|Malaria vaccine]] 

=== Tuesday Mar 15 — BLAST ===
:'''Lecture:''' ''Introduction to BLAST'' — Rasmus Wernersson.
:'''Curriculum:''' section 3.2.5 → 3.3 (i.e. pages 47-52) in Immunological Bioinformatics (PDF: on DTU Learn).
:'''Slides:''' on DTU Learn.
:'''[[Exercise: BLAST]]''' 
:'''Extra material:'''
::[[File:Phone_34.gif‎]] '''Videos about BLAST from NCBI:''' (Video introduction to NCBI's web interface and Expect Values) [http://www.youtube.com/playlist?list=PLH-TjWpFfWrtjzMCIvUe-YbrlIeFQlKMq NCBI's YouTube channel]

=== Tuesday Mar 22 — Sequence information & logo-plots ===
:'''Lecture:''' ''Sequence information & logo-plots'' — Rasmus Wernersson
:'''Curriculum:'''
:# Pages 68-80 in Immunological Bioinformatics (PDF: on DTU Learn).
:# Pages 1-8 of "''Information theory primer''" ([https://teaching.healthtech.dtu.dk/material/36611/PDF/informationtheory_primer.pdf PDF])
:#* Read also the appendix on logarithms (especially log2) if needed!
:'''Slides:''' on DTU Learn.
:'''Handout''' for the lecture: [[Media:logo_exercise.pdf|How to construct sequence logos]] (PDF)
:[[Image:Emblem-important_tiny.png‎]] '''Mid-term evaluation:''' Go to https://evaluering.dtu.dk/ and click "Mid-term evaluation" under 22111 [[Image:Emblem-important_tiny.png‎]]
:'''Exercise:''' [[ExSeqLogos|DNA and Peptide Logos]] 

=== Tuesday Mar 29 — Weight matrices and other prediction methods ===
:'''Lecture:''' ''Introduction to prediction methods, especially Weight Matrices'' — Henrik Nielsen
:'''Curriculum:''' Same as last week!
:'''Slides:''' on DTU Learn.
:'''Handouts''' for the lecture: [[Media:Estimationofpseudocounts_new+examples.pdf|How to estimate pseudo frequencies]] 
:[[Image:Emblem-important_tiny.png‎]] '''Mid-term evaluation:''' Go to https://evaluering.dtu.dk/ and click "Mid-term evaluation" under 22111 [[Image:Emblem-important_tiny.png‎]]
:'''Exercise:''' [[Exercise: Construction of sequence logos and weight matrices|Construction of weight matrices]] 
:'''Link to advanced course: '''
:: [http://teaching.healthtech.dtu.dk/22125/ 22125: Algorithms in bioinformatics]

=== Tuesday Apr 5 — PSI-BLAST ===
:'''Lecture:''' ''PSI-BLAST'' — Rasmus Wernersson
:'''Curriculum:'''
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]] 

------
<div align="center">
[[Image:Easter-egg-free-to-use-cliparts.png|25px]] '''Easter holidays''' [[Image:Easter-egg-free-to-use-cliparts.png|25px]]
</div>
------

=== Tuesday Apr 19 — Multiple alignments ===
:'''Lecture:''' ''Multiple alignment'' — Henrik Nielsen
:'''Curriculum:''' RevTrans ([https://www.ncbi.nlm.nih.gov/pmc/articles/PMC169015/ article])
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]] 

=== Tuesday Apr 26 — Phylogenetic trees ===
:'''Lecture:''' ''Phylogenetic Reconstruction: Distance Matrix Methods'' — Anders Gorm Pedersen
:'''Extra lecture:''' ''Bioinformatics and Systems Biology in precision medicine'' — Rasmus Wernersson
:'''Curriculum:'''
:# ''Introduction to Tree Building'', PDF on Learn 
:# ''[http://evolution.berkeley.edu/evolibrary/article/phylogenetics_01 Evolutionary trees]'' (minus the section "How to reconstruct an evolutionary tree")
:# ''Understanding Evolutionary Trees'', [https://teaching.healthtech.dtu.dk/material/36611/PDF/understanding_evo_trees.pdf PDF].
:'''Slides:''' on DTU Learn.
:'''Handout''' for lecture: [https://teaching.healthtech.dtu.dk/material/36611/PDF/handout_distance.pdf Reconstructing a distance tree] 
:'''Software''' for installation: [https://github.com/rambaut/figtree/releases FigTree tree-viewer]
::'''IMPORTANT NOTE''' for Windows users: Download the <tt>.zip</tt> file (FigTree.v1.4.4.zip) and unpack it. Then, go to the "lib" subfolder and double-click the <tt>.jar</tt> file. The <tt>.exe</tt> file may not work.
:'''TEST''' of the internal webserver we are going to use during the exercise: Please go to https://services.healthtech.dtu.dk/service.php?TreeHugger and click "View example alignment files". Then, copy either the "Sample DNA alignment" or the "Sample peptide dataset" and paste it in the TreeHugger input field. Click Submit query when instructed by the lecturer.
:'''[[Exercise: Phylogeny]]''' 
:'''Link to advanced course:'''
::* [http://teaching.healthtech.dtu.dk/22115/ 22115 Computational Molecular Evolution]

=== Tuesday May 3 — Bioinformatics in practice + old exam questions ===
:'''Lecture:''' "Real life case": ''AI, phage discovery and rainforest genomics'' — Bent Petersen, KU.
:'''Curriculum:''' (None - lean back and enjoy)
:'''Slides:''' on DTU Learn.
:'''Exercise:''' We train on the old exam set from '''2020''' - available on DTU Learn.

== Exam ==

=== Friday May 20 ===
'''Summer exam 2022:''' Go to https://eksamen.dtu.dk/ and find 22111.

[[Media:Vejledning-til-digital-eksamen-DE-DKENG_0322.pdf|Here is a guide]] to the Digital Exam interface (in Danish and English).

The assignment will be accessible from '''09:00''' on Friday May 20.

=== Checklist for computers ===
Check here whether your computer has all the software needed for the exam: [[Checklist for computers]]

=== Link collection ===
A quick overview of the websites we have used in the course: [[Link collection]]

=== FAQ ===
Questions we have received and answered: [[FAQ]]

22111:Course plan autumn 2022

2024-03-15T11:41:11Z

WikiSysop: Created page

== General information ==

=== Where and when ===
Lectures plus subsequent exercises will take place every Tuesday afternoon during the semester, starting '''Tuesday Aug 30 at 13:00'''.

Lectures will be from 13:00 to approx. 14:00 in '''building 306, auditorium 32''', and the exercises will then take place in '''building 210, rooms 112+118, 142+148, and 162'''.

=== Teachers ===

* [http://www.dtu.dk/service/telefonbog/person?id=25617&cpid=257116&tab=2&qt=dtupublicationquery Henrik Nielsen] — Associate professor, course responsible.
* [https://www.dtu.dk/Person/cwis?id=142840&entity=profile Carolina Barra Quaglia] — Assistant professor, course responsible.
* [https://www.dtu.dk/service/telefonbog/person?id=18103&tab=2&qt=dtupublicationquery Rasmus Wernersson] — External associate professor.
* [http://www.dtu.dk/service/telefonbog/person?id=5118&tab=2&qt=dtupublicationquery Anders Gorm Pedersen] — Professor, guest lecturer. Topic: Phylogenetic trees.

=== Teaching assistants ===

* [https://www.dtu.dk/person/louis-maximilian-kraft?id=183582 Louis Kraft] — PhD student
* [https://www.dtu.dk/service/telefonbog/person?id=174793 Niels Rasmus Lorenzen] — PhD student
* [https://www.dtu.dk/service/telefonbog/Person?id=143970 Yuchen Li] — PhD student

=== Course content ===
In this course, a large emphasis is placed on the practical usage of bioinformatics databases and tools. A typical lecture will present the theoretical aspects of the topics of the day — sometimes including a small group exercise using pen and paper — and last about an hour. The rest of the time will be spent on practical computer exercises, where the teachers and teaching assistants will be ready to help.

See also [http://kurser.dtu.dk/course/22111 the course base about 22111].

=== Curriculum ===
There is no formal textbook. The curriculum consists of the exercise guides, supplemented with various papers and chapters which will be made available on this homepage or on DTU Learn. Please note that ''all'' exercise guides are mandatory curriculum — including the ''answers'' to the exercises which will be made available on DTU Learn after each exercise.

=== Computers ===
====Hardware====
'''You must bring your own laptop''' to the exercises, and it must be able to connect to DTU's wireless network. The type of computer / operating system is not important; Windows, Mac or Linux will all work fine. An iPad or an Android tablet, on the other hand, will not be good enough. A Chromebook will also not be enough (unless you have succeeded in installing a Linux distribution on it, but in that case we assume you know what you're doing).

In some of the exercises ("PDB/PyMOL", "Malaria vaccine", and "Old exam questions"), you will work with the molecular visualization program PyMOL. This is rather difficult to control with a touchpad, so please remember to '''bring a mouse'''. The mouse should have two buttons plus a scroll wheel.

====Software====
# Most importantly: an updated '''internet browser''' (e.g. [http://www.google.com/chrome Google Chrome], [http://www.mozilla.com/ Firefox], [http://www.opera.com/ Opera], [https://www.microsoft.com/edge Edge] for Windows or Mac OS, or Safari for Mac only). '''NB:''' You must have more than one browser installed; Safari for Mac or Edge for Windows may have glitches with some bioinformatics websites, and in those cases it is important to be able to switch to an alternative browser.
# A '''Java''' runtime engine is needed for running some of the software we use in the course, including jEdit (see below). Download it from here: https://www.oracle.com/java/technologies/downloads/#jdk17 (choose Java 17, not 18 or 19, and select your type of computer) '''or''' from here: https://adoptium.net/ (choose Temurin JDK 17).
#* '''NOTE:''' Do NOT download Java from https://java.com/ since that will give you Oracle Java 8, which is NOT good enough for jEdit anymore. jEdit version 5.6 and later needs Java 11 or higher, which is available from the above links (and from a few other places).
#* '''IMPORTANT TIP''' for Windows users: You need to enable the sub-feature named "set JAVA_HOME variable" when installing Temurin JDK.
# A plain text editor for working with, e.g., sequence files. We recommend '''jEdit''', which you can download for free from http://www.jedit.org. If you experience unsolvable problems installing or running jEdit, there are alternatives, e.g. [http://geany.org/ Geany].
#* '''NOTE:''' The jEdit developers have not signed the installation package, so both Windows and macOS will complain when you first attempt to install it, and you have to insist that it is OK to run the program. On a Mac this is a bit complicated; see the instructions in [[ExJEdit#Download_and_Install_jEdit|the exercise guide]].
Other software will be installed during the exercises.

=== Hand-ins ===
As preparation for the computer-based exam, each participant or group must write a "'''logbook'''" with answers to the questions posed in the exercise guides. After the exercise, you should upload the logbook to DTU Learn.


You decide which software you prefer for writing the logbook — e.g. Microsoft Word, [http://www.libreoffice.org/ LibreOffice] (free), [http://www.openoffice.org/ Apache OpenOffice] (free), Pages for Mac, [https://docs.google.com/ Google Docs] or similar. You should be able to insert '''screenshots''' in the logbooks for documentation purposes. Microsoft Word has a built-in screenshot tool. Both Windows 10 and Mac OS also have dedicated screenshot tools.


Regardless of your choice of writing software, the result '''must be handed in as a PDF file'''. LibreOffice and Google Docs can make PDFs directly. macOS and Windows 10 have built-in functions for converting any printable file to PDF. Users of earlier versions of Windows must install a separate program. Several free alternatives exist, e.g. [http://www.primopdf.com/ PrimoPDF]. (It can be a good idea for Windows 10 users to install PrimoPDF as well, since it provides some extra options and the resulting files take up less space.)

'''Please do ''not'' copy the questions''' from the exercise guide to your logbook. The hand-in module on DTU Learn has a system for plagiarism detection, which will raise an alarm if significant portions of your hand-in are identical to documents found on the internet — and that includes the exercise guides.

'''NB:''' ''The hand-ins do not affect your grade''; they are mainly meant as preparation for the exam. They also let us check how well the teaching has been understood: if we can see that many participants have made the same mistake, we will try to explain the issue better at the next lecture.

=== Exam ===
The 22111 exam is electronic; i.e. you must bring your own computer, and you will ''not'' get a paper copy of the questions. The questions will be made available as a PDF file on the DTU online exam system. The only accepted hand-in format is PDF.

All aids are allowed at the exam; you can bring any books, papers or notes. You will have '''open access to the internet''', which includes all the materials and websites we have used during the course. You are also allowed to search for information on Google, Wikipedia, etc., but you are ''not'' allowed to communicate with others through e-mail, Facebook, chat, or file sharing websites. The internet traffic will be logged during the exam to ensure that these restrictions are respected.

Just like in the weekly hand-ins, we kindly ask you: ''Please don't copy the questions in your answer document'' — that might result in the answer being flagged as plagiarism.


=== DTU Learn & Inside ===
Link to this year's DTU Learn page: https://learn.inside.dtu.dk/d2l/home/125911

Link to this year's DTU Inside group: https://cn.inside.dtu.dk/cnnet/element/663441

=== Evaluation and feedback ===
We will be very happy to receive comments, suggestions, criticisms, or praise at any time during the semester. You can:
* send them by email to the teachers, or
* write them under "General feedback" in "Discussion" in the DTU Inside group (found in the Course content menu)
If somebody writes a message in "Discussion", you can comment on it. If you see a message you agree with, please comment "Agree!" so that we can see that it is not just one person's opinion.

In addition, we will conduct a mid-term evaluation in [https://evaluering.dtu.dk/ DTU evaluation].

== Lecture & exercise plan ==

Note: This is a ''preliminary'' plan; changes may occur!

=== Tuesday Aug 30 — Introduction & taxonomy ===
:'''Lectures:'''
:* ''Introduction to the course, bioinformatics, and computers'' — Henrik Nielsen.
:* ''Evolution and taxonomy'' — Rasmus Wernersson.
:'''Slides:''' will be made available on DTU Learn.
:'''Curriculum:''' [https://teaching.healthtech.dtu.dk/material/36611/PDF/Chapter2_Evolution.pdf Brief Introduction to Evolutionary Theory] — Written by Anders Gorm Pedersen.
:'''Test of prior knowledge:''' Go to https://evaluering.dtu.dk/, click "Test of prior knowledge" under 22111, and fill out the form (it's anonymous). Spend max. 10 minutes on it.
:'''Exercises:'''
:# [[Plain text files and jEdit]] 
:# [[Taxonomy databases]] 
:'''Extra material'''
:*"[[Media:ELS_bioinformatics.pdf|Bioinformatics]]" — Encyclopedia entry from 2009.
:*"[https://academic.oup.com/nar/article/50/D1/D20/6447242 Database resources of the national center for biotechnology information]" — article from the annual database issue of Nucleic Acids Research, 2022

=== Tuesday Sep 6 — GenBank ===
:'''Lecture:''' ''DNA as Biological Information'' — Carolina Barra Quaglia
:'''Curriculum:''' [[Media:DNA_SequencingTutorial.pdf|DNA sequencing tutorial]] — source: IDT Tech Vault
:'''Handout''' for the lecture: [[Media:HandoutEx_BaseCalling_Simple.pdf‎|"Base-calling" exercise (for printing)]] [PDF] / [[Media:BaseCalling_on_screen_version.pdf|"Base-calling" exercise (version for on-screen viewing)]] [PDF].
:'''Slides:''' on DTU Learn.


:'''Exercise:''' [[ExGenbank-new|Using the GenBank database]] 
:'''Reference material''' for the exercise: [[Media:GenBank+FASTA_handout_revised.pdf|GenBank + FASTA format]] [PDF]

:'''Background material''' (supposedly known):
:*[[File:Phone_34.gif‎]] [http://www.youtube.com/watch?v=YgmoHtLGb5c mRNA splicing] (YouTube).
:*[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Extra material:'''
:*[http://www.ncbi.nlm.nih.gov/books/NBK44863/ Entrez Sequences Quick Start] (NCBI)
:*[https://academic.oup.com/nar/article/50/D1/D161/6447240 "GenBank"] — article from the annual database issue of Nucleic Acids Research, 2022.

=== Tuesday Sep 13 — Translation & UniProt ===
:'''Lecture:''' ''Protein databases'' — Henrik Nielsen
:'''Curriculum:''' [[Media:VirtualRibosome.pdf|Virtual Ribosome]] — software article (PDF).
:'''Slides:''' on DTU Learn.

:'''Exercises:'''
:#[[Exercise: Translation - Virtual Ribosome]] 
:#[[Exercise: The protein database UniProt]] 
:'''Background material''' (supposedly known):
:*[https://teaching.healthtech.dtu.dk/material/36611/PDF/protein_handout.pdf Levels of protein structure] [PDF]
:*[[Media:GeneStructure.pdf|Overview of eukaryotic gene structure]] (PDF).

:'''Extra material:'''
:*[https://academic.oup.com/nar/article/49/D1/D480/6006196 "UniProt: the universal protein knowledgebase in 2021"] — article from the annual database issue of Nucleic Acids Research, 2021.
:*[[Media:uniprotkb_quickguide.pdf|"A Quick Guide to UniProtKB"]] — nice printable overview.

=== Tuesday Sep 20 — Pairwise alignment ===
:'''Lecture:''' ''Pairwise alignment'' — Henrik Nielsen.
:'''Curriculum:''' Pages 35-55 in Immunological Bioinformatics (PDF: on DTU Learn → General information and files → Textbook excerpt).
:'''Handout''' for the lecture: [[Media:New_handout_alignscores.pdf|Alignment scores]]
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[ExPairwiseAlignment|Pairwise alignment]] 

=== Tuesday Sep 27 — Protein structure, PDB & PyMOL ===
:'''Remember to bring a mouse for this day's exercise.''' The mouse should have two buttons and a scroll wheel.
:'''Lecture:''' ''Protein 3D structure'' — Carolina Barra Quaglia
:'''Curriculum:''' [http://en.wikipedia.org/wiki/Protein_structure Protein Structure (Wikipedia)]
:'''Slides:''' on DTU Learn.

:'''Link to advanced course:'''
::* [http://teaching.healthtech.dtu.dk/36617/index.php/22117_-_Protein_Structure_and_Computational_Biology 22117 Protein Structure and Computational Biology]

:'''Software''' for installation: [https://pymol.org/2/ PyMOL]
::'''Note:''' you will need the license file found at DTU Learn under this week's topic.
:'''Exercises:'''
:#[[Media:PyMol_tutorial2017_v4.pdf|PyMol tutorial]] (PDF) — basic usage of PyMOL.
:#[[Protein Structure and Visualization]] 

:'''Extra material:'''
:*[https://academic.oup.com/nar/article/49/D1/D437/5992282 "RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences"] — article from the annual database issue of Nucleic Acids Research, 2021.
:*[[PyMOL]] — some tips and tricks.
:*[https://teaching.healthtech.dtu.dk/material/36611/PDF/PyMOL_structure_navigation.pdf PyMOL basics — a small example] (optional extra exercise)

=== Tuesday Oct 4 — Case: Malaria vaccine ===
:'''Lecture:''' ''Malaria and vaccines'' — [https://cmp.ku.dk/staff/?pure=en/persons/226923 Thomas Lavstsen], Associate Professor, University of Copenhagen
:'''Curriculum:''' [http://www.cdc.gov/dpdx/malaria/ Malaria — Causal Agents / Life Cycle]
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[Exercise:Malaria Vaccine|Malaria vaccine]] 

=== Tuesday Oct 11 — BLAST ===
:'''Lecture:''' ''Introduction to BLAST'' — Rasmus Wernersson.
:'''Curriculum:''' section 3.2.5 → 3.3 (i.e. pages 47-52) in Immunological Bioinformatics (PDF: on DTU Learn).
:'''Slides:''' on DTU Learn.
:[[Image:Emblem-important_tiny.png‎]] '''Mid-term evaluation:''' Go to https://evaluering.dtu.dk/ and click "Mid-term evaluation" under 22111 [[Image:Emblem-important_tiny.png‎]]
:'''[[Exercise: BLAST]]''' 
:'''Extra material:'''
::[[File:Phone_34.gif‎]] '''Videos about BLAST from NCBI:''' (Video introduction to NCBI's web interface and Expect Values) [http://www.youtube.com/playlist?list=PLH-TjWpFfWrtjzMCIvUe-YbrlIeFQlKMq NCBI's YouTube channel]

------
<div align="center">
'''Autumn holiday'''
</div>
------

=== Tuesday Oct 25 — Sequence information & logo-plots ===
:'''Lecture:''' ''Sequence information & logo-plots'' — Rasmus Wernersson
:'''Curriculum:'''
:# Pages 68-80 in Immunological Bioinformatics (PDF: on DTU Learn).
:# Pages 1-8 of "''Information theory primer''" ([https://teaching.healthtech.dtu.dk/material/36611/PDF/informationtheory_primer.pdf PDF])
:#* Read also the appendix on logarithms (especially log2) if needed!
:'''Slides:''' on DTU Learn.
:'''Handout''' for the lecture: [[Media:logo_exercise.pdf|How to construct sequence logos]] (PDF)
:'''Exercise:''' [[ExSeqLogos|DNA and Peptide Logos]] 

=== Tuesday Nov 1 — Weight matrices and other prediction methods ===
:'''Lecture:''' ''Introduction to prediction methods, especially Weight Matrices'' — Henrik Nielsen
:'''Curriculum:''' Same as last week!
:'''Slides:''' on DTU Learn.
:'''Handouts''' for the lecture: [[Media:Estimationofpseudocounts_new+examples.pdf|How to estimate pseudo frequencies]] 
:'''Exercise:''' [[Exercise: Construction of sequence logos and weight matrices|Construction of weight matrices]] 
:'''Link to advanced course: '''
:: [http://teaching.healthtech.dtu.dk/22125/ 22125: Algorithms in bioinformatics]

=== Tuesday Nov 8 — PSI-BLAST ===
:'''Lecture:''' ''PSI-BLAST'' — Carolina Barra Quaglia
:'''Curriculum:'''
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[ExPSIBLAST|PSI-BLAST]] 

=== Tuesday Nov 15 — Multiple alignments ===
:'''Lecture:''' ''Multiple alignment'' — Henrik Nielsen
:'''Curriculum:''' RevTrans ([https://www.ncbi.nlm.nih.gov/pmc/articles/PMC169015/ article])
:'''Slides:''' on DTU Learn.
:'''Exercise:''' [[Exercise: Multiple Alignments (English version)|Multiple Alignments]] 

=== Tuesday Nov 22 — Phylogenetic trees ===
:'''Lecture:''' ''Phylogenetic Reconstruction: Distance Matrix Methods'' — Anders Gorm Pedersen

:'''Curriculum:'''
:# ''Introduction to Tree Building'', PDF on Learn 
:# ''[http://evolution.berkeley.edu/evolibrary/article/phylogenetics_01 Evolutionary trees]'' (minus the section "How to reconstruct an evolutionary tree")
:# ''Understanding Evolutionary Trees'', [https://teaching.healthtech.dtu.dk/material/36611/PDF/understanding_evo_trees.pdf PDF].
:'''Slides:''' on DTU Learn.
:'''Handout''' for lecture: [https://teaching.healthtech.dtu.dk/material/36611/PDF/handout_distance.pdf Reconstructing a distance tree] 
:'''Software''' for installation: [https://github.com/rambaut/figtree/releases FigTree tree-viewer]
::'''IMPORTANT NOTE''' for Windows users: Download the <tt>.zip</tt> file (FigTree.v1.4.4.zip) and unpack it. Then, go to the "lib" subfolder and double-click the <tt>.jar</tt> file. The <tt>.exe</tt> file may not work.

:'''[[Exercise: Phylogeny]]''' 
:'''Link to advanced course:'''
::* [http://teaching.healthtech.dtu.dk/22115/ 22115 Computational Molecular Evolution]

=== Tuesday Nov 29 — Bioinformatics in practice + old exam questions ===
:'''Lecture:''' ''Bioinformatics and Systems Biology in precision medicine'' — Rasmus Wernersson 
:'''Curriculum:''' (None - lean back and enjoy)
:'''Slides:''' on DTU Learn.
:'''Exercise:''' We train on the old exam set from '''spring 2022''', available on DTU Learn. Note that there is no hand-in. The answers will become available at 17:00 on Tuesday Nov 29.

== Exam ==

=== Thursday Dec 15 ===
'''Winter exam 2022:''' Go to https://eksamen.dtu.dk/ and find 22111.

[[Media:Vejledning-til-digital-eksamen-DE-DKENG_0322.pdf|Here is a guide]] to the Digital Exam interface (in Danish and English).

The assignment will be accessible from '''15:00''' on Thursday Dec 15.

=== Checklist for computers ===
Check here whether your computer has all the software needed for the exam: [[Checklist for computers]]

=== Link collection ===
A quick overview of the websites we have used in the course: [[Link collection]]

=== FAQ ===
Questions we have received and answered: [[FAQ]]

Exercise: Phylogeny - Answers (Seaview version)

2024-03-15T11:38:17Z

WikiSysop: /* Step 6 */

== Step 1 ==
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.

==Step 2==
This is the text file with the pairwise distances. HTLV is clearly the most divergent sequence: all of its distances to the other sequences are above 0.7 (0.74 to 0.75), whereas no distance between any other pair exceeds about 0.40.
<pre>
#distances order: d(1,2),...,d(1,n) <new line> d(2,3),...,d(2,n) <new line>...
20
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058
0.0682095 0.0621194 0.386114 0.120585 0.118149
0.0657734 0.389769 0.126675 0.123021
0.394641 0.116931 0.115713
0.388551 0.388551
0.0146163
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP

#pairwise distances
HIV1B5,HTLV: 0.750305
HIV1H2,HTLV: 0.751523
HIV1MN,HTLV: 0.75
HIV1N5,HTLV: 0.752741
HIV1ND,HTLV: 0.752741
HIV1OY,HTLV: 0.752741
HIV1PV,HTLV: 0.750305
HIV1U4,HTLV: 0.750305
HIV1Z2,HTLV: 0.752741
HIV2CA,HTLV: 0.749086
HIV2D1,HTLV: 0.741778
HIV2G1,HTLV: 0.747868
HIV2KR,HTLV: 0.749086
HIV2RO,HTLV: 0.744214
HIV2SB,HTLV: 0.750305
HIV2ST,HTLV: 0.747868
HTLV,SIVCZ: 0.747868
HTLV,Smanga_S4: 0.747868
HTLV,Smanga_SP: 0.74665
HIV1B5,HIV1H2: 0.0158343
HIV1B5,HIV1MN: 0.0414634
HIV1B5,HIV1N5: 0.0304507
HIV1B5,HIV1ND: 0.043849
HIV1B5,HIV1OY: 0.0341048
HIV1B5,HIV1PV: 0.0170524
HIV1B5,HIV1U4: 0.0803898
HIV1B5,HIV1Z2: 0.045067
HIV1B5,HIV2CA: 0.399513
HIV1B5,HIV2D1: 0.399513
HIV1B5,HIV2G1: 0.389769
HIV1B5,HIV2KR: 0.393423
HIV1B5,HIV2RO: 0.394641
HIV1B5,HIV2SB: 0.389769
HIV1B5,HIV2ST: 0.394641
HIV1B5,SIVCZ: 0.130329
HIV1B5,Smanga_S4: 0.389769
HIV1B5,Smanga_SP: 0.389769
HIV1H2,HIV1MN: 0.0402439
HIV1H2,HIV1N5: 0.0292326
HIV1H2,HIV1ND: 0.0414129
HIV1H2,HIV1OY: 0.0328867
HIV1H2,HIV1PV: 0.00974421
HIV1H2,HIV1U4: 0.0803898
HIV1H2,HIV1Z2: 0.0426309
HIV1H2,HIV2CA: 0.399513
HIV1H2,HIV2D1: 0.401949
HIV1H2,HIV2G1: 0.392205
HIV1H2,HIV2KR: 0.393423
HIV1H2,HIV2RO: 0.394641
HIV1H2,HIV2SB: 0.389769
HIV1H2,HIV2ST: 0.394641
HIV1H2,SIVCZ: 0.129111
HIV1H2,Smanga_S4: 0.388551
HIV1H2,Smanga_SP: 0.388551
HIV1MN,HIV1N5: 0.0365854
HIV1MN,HIV1ND: 0.0512195
HIV1MN,HIV1OY: 0.0365854
HIV1MN,HIV1PV: 0.0439024
HIV1MN,HIV1U4: 0.0865854
HIV1MN,HIV1Z2: 0.054878
HIV1MN,HIV2CA: 0.4
HIV1MN,HIV2D1: 0.40122
HIV1MN,HIV2G1: 0.396341
HIV1MN,HIV2KR: 0.392683
HIV1MN,HIV2RO: 0.395122
HIV1MN,HIV2SB: 0.392683
HIV1MN,HIV2ST: 0.397561
HIV1MN,SIVCZ: 0.130488
HIV1MN,Smanga_S4: 0.392683
HIV1MN,Smanga_SP: 0.392683
HIV1N5,HIV1ND: 0.0341048
HIV1N5,HIV1OY: 0.0304507
HIV1N5,HIV1PV: 0.0316687
HIV1N5,HIV1U4: 0.0791717
HIV1N5,HIV1Z2: 0.0389769
HIV1N5,HIV2CA: 0.397077
HIV1N5,HIV2D1: 0.399513
HIV1N5,HIV2G1: 0.389769
HIV1N5,HIV2KR: 0.390987
HIV1N5,HIV2RO: 0.392205
HIV1N5,HIV2SB: 0.389769
HIV1N5,HIV2ST: 0.392205
HIV1N5,SIVCZ: 0.127893
HIV1N5,Smanga_S4: 0.387333
HIV1N5,Smanga_SP: 0.387333
HIV1ND,HIV1OY: 0.043849
HIV1ND,HIV1PV: 0.043849
HIV1ND,HIV1U4: 0.0767357
HIV1ND,HIV1Z2: 0.0219245
HIV1ND,HIV2CA: 0.390987
HIV1ND,HIV2D1: 0.394641
HIV1ND,HIV2G1: 0.386114
HIV1ND,HIV2KR: 0.386114
HIV1ND,HIV2RO: 0.388551
HIV1ND,HIV2SB: 0.387333
HIV1ND,HIV2ST: 0.389769
HIV1ND,SIVCZ: 0.125457
HIV1ND,Smanga_S4: 0.386114
HIV1ND,Smanga_SP: 0.386114
HIV1OY,HIV1PV: 0.0365408
HIV1OY,HIV1U4: 0.0767357
HIV1OY,HIV1Z2: 0.047503
HIV1OY,HIV2CA: 0.394641
HIV1OY,HIV2D1: 0.397077
HIV1OY,HIV2G1: 0.388551
HIV1OY,HIV2KR: 0.388551
HIV1OY,HIV2RO: 0.389769
HIV1OY,HIV2SB: 0.386114
HIV1OY,HIV2ST: 0.390987
HIV1OY,SIVCZ: 0.131547
HIV1OY,Smanga_S4: 0.388551
HIV1OY,Smanga_SP: 0.388551
HIV1PV,HIV1U4: 0.0828258
HIV1PV,HIV1Z2: 0.045067
HIV1PV,HIV2CA: 0.401949
HIV1PV,HIV2D1: 0.404385
HIV1PV,HIV2G1: 0.394641
HIV1PV,HIV2KR: 0.394641
HIV1PV,HIV2RO: 0.397077
HIV1PV,HIV2SB: 0.390987
HIV1PV,HIV2ST: 0.393423
HIV1PV,SIVCZ: 0.130329
HIV1PV,Smanga_S4: 0.388551
HIV1PV,Smanga_SP: 0.388551
HIV1U4,HIV1Z2: 0.0767357
HIV1U4,HIV2CA: 0.398295
HIV1U4,HIV2D1: 0.403167
HIV1U4,HIV2G1: 0.392205
HIV1U4,HIV2KR: 0.395859
HIV1U4,HIV2RO: 0.394641
HIV1U4,HIV2SB: 0.394641
HIV1U4,HIV2ST: 0.397077
HIV1U4,SIVCZ: 0.137637
HIV1U4,Smanga_S4: 0.400731
HIV1U4,Smanga_SP: 0.399513
HIV1Z2,HIV2CA: 0.393423
HIV1Z2,HIV2D1: 0.397077
HIV1Z2,HIV2G1: 0.387333
HIV1Z2,HIV2KR: 0.388551
HIV1Z2,HIV2RO: 0.389769
HIV1Z2,HIV2SB: 0.389769
HIV1Z2,HIV2ST: 0.389769
HIV1Z2,SIVCZ: 0.125457
HIV1Z2,Smanga_S4: 0.388551
HIV1Z2,Smanga_SP: 0.388551
HIV2CA,HIV2D1: 0.0816078
HIV2CA,HIV2G1: 0.0694275
HIV2CA,HIV2KR: 0.0645554
HIV2CA,HIV2RO: 0.0511571
HIV2CA,HIV2SB: 0.0682095
HIV2CA,HIV2ST: 0.0657734
HIV2CA,SIVCZ: 0.392205
HIV2CA,Smanga_S4: 0.125457
HIV2CA,Smanga_SP: 0.120585
HIV2D1,HIV2G1: 0.0511571
HIV2D1,HIV2KR: 0.0840438
HIV2D1,HIV2RO: 0.088916
HIV2D1,HIV2SB: 0.09257
HIV2D1,HIV2ST: 0.0864799
HIV2D1,SIVCZ: 0.397077
HIV2D1,Smanga_S4: 0.131547
HIV2D1,Smanga_SP: 0.129111
HIV2G1,HIV2KR: 0.0779537
HIV2G1,HIV2RO: 0.0730816
HIV2G1,HIV2SB: 0.0791717
HIV2G1,HIV2ST: 0.0767357
HIV2G1,SIVCZ: 0.394641
HIV2G1,Smanga_S4: 0.127893
HIV2G1,Smanga_SP: 0.121803
HIV2KR,HIV2RO: 0.0645554
HIV2KR,HIV2SB: 0.0633374
HIV2KR,HIV2ST: 0.0572473
HIV2KR,SIVCZ: 0.392205
HIV2KR,Smanga_S4: 0.118149
HIV2KR,Smanga_SP: 0.112058
HIV2RO,HIV2SB: 0.0682095
HIV2RO,HIV2ST: 0.0621194
HIV2RO,SIVCZ: 0.386114
HIV2RO,Smanga_S4: 0.120585
HIV2RO,Smanga_SP: 0.118149
HIV2SB,HIV2ST: 0.0657734
HIV2SB,SIVCZ: 0.389769
HIV2SB,Smanga_S4: 0.126675
HIV2SB,Smanga_SP: 0.123021
HIV2ST,SIVCZ: 0.394641
HIV2ST,Smanga_S4: 0.116931
HIV2ST,Smanga_SP: 0.115713
SIVCZ,Smanga_S4: 0.388551
SIVCZ,Smanga_SP: 0.388551
Smanga_S4,Smanga_SP: 0.0146163
</pre>
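
The observation about HTLV can also be checked programmatically. The following is a minimal sketch, assuming the upper-triangle layout shown above (a comment line, the number of sequences, the rows of distances, then the labels); the function names are illustrative and not part of Seaview. To keep the example short it uses only the first four sequences, with values copied from the table.

```python
def parse_upper_triangle(text):
    """Parse the upper-triangle layout into {(name_a, name_b): distance}."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]
    n = int(lines[0])                       # number of sequences
    rows = [[float(x) for x in ln.split()] for ln in lines[1:n]]
    names = lines[n].split()                # labels follow the n-1 distance rows
    dist = {}
    for i, row in enumerate(rows):
        # Row i holds d(i, i+1), d(i, i+2), ..., d(i, n-1)
        for offset, value in enumerate(row, start=1):
            dist[(names[i], names[i + offset])] = value
    return names, dist


# Four-sequence excerpt of the table above (values copied from Step 2).
EXAMPLE = """\
#distances order: d(1,2),...,d(1,n) <new line> d(2,3),...,d(2,n) <new line>...
4
0.750305 0.751523 0.75
0.0158343 0.0414634
0.0402439
HTLV HIV1B5 HIV1H2 HIV1MN
"""

names, dist = parse_upper_triangle(EXAMPLE)
htlv = [d for pair, d in dist.items() if "HTLV" in pair]
others = [d for pair, d in dist.items() if "HTLV" not in pair]
print("smallest HTLV distance:", min(htlv))
print("largest non-HTLV distance:", max(others))
```

Run on the full 20-sequence file, the same comparison should show every HTLV distance above 0.7 and every other distance at or below roughly 0.40.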

==Step 3==
Here is a picture of the NJ tree:

[[File:Pol21-NJ_tree.png]]

The longest branch is the one leading to HTLV, which agrees well with the observation in Step 2.

==Step 4==
Here is an unrooted tree:

[[File:Pol21-NJ_tree.unrooted.png]]

==Step 5==
Here is a rearranged (swapped) tree:

[[File:Pol21-NJ_tree.swapped.png]]

==Step 6==
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).
* Further answers to "The Phylogeny of HIV" can be found [https://teaching.healthtech.dtu.dk/material/22111/files/binfintro/hiv_origin.html here].
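
The sister-group relationships are read from the tree, but they are also visible in the raw distances from Step 2. The sketch below uses a handful of values copied from that table and simply asks which candidate is closest to one representative of each group. The helper name <tt>closest</tt> is illustrative, and "smallest distance" is only a rough proxy: in general, the nearest sequence by distance is not guaranteed to be the sister group in the tree.

```python
# A few pairwise distances copied from the Step 2 table.
DIST = {
    ("HIV1B5", "SIVCZ"): 0.130329,
    ("HIV1B5", "Smanga_S4"): 0.389769,
    ("HIV1B5", "HIV2CA"): 0.399513,
    ("HIV2CA", "SIVCZ"): 0.392205,
    ("HIV2CA", "Smanga_S4"): 0.125457,
}


def closest(query, candidates, dist):
    """Return the candidate with the smallest distance to `query`."""
    def d(a, b):
        # The table stores each pair only once, so try both orders.
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]
    return min(candidates, key=lambda c: d(query, c))


print(closest("HIV1B5", ["SIVCZ", "Smanga_S4", "HIV2CA"], DIST))
print(closest("HIV2CA", ["SIVCZ", "Smanga_S4", "HIV1B5"], DIST))
```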

==Step 7==
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the "best of both worlds": it takes amino acid similarities into account when aligning, while the aligned DNA still retains the nucleotide differences that do not change the encoded amino acids (synonymous differences). The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.

Here is the tree made ignoring gap positions:

[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]

And here is the tree made taking gap positions into account:

[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]

There is one difference in topology between the two trees: in the tree made without the gap positions, Rice groups with Fruit fly within the animal subtree, while in the other tree, Rice groups with the two other plants. Since Rice is a plant, the tree that takes gap positions into account is the more correct one. '''Note:''' This is not always the case!
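
To see why the gap setting can change the distances (and hence the tree), here is a small sketch comparing the two treatments on a toy alignment. The sequences and function names are made up for illustration, and the sketch assumes that ignoring gap sites means removing every column with a gap in any sequence before distances are computed, while the alternative only skips gapped positions within each pair.

```python
def p_distance(a, b):
    """Fraction of differing sites, skipping columns where either sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def drop_gap_columns(alignment):
    """Remove every column that contains a gap in at least one sequence."""
    length = len(next(iter(alignment.values())))
    keep = [i for i in range(length)
            if all(seq[i] != "-" for seq in alignment.values())]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


# Hypothetical toy alignment: seqC has a two-column gap.
ALN = {
    "seqA": "ATGCCGTA",
    "seqB": "ATGACGTT",
    "seqC": "AT--CGTA",
}

with_gap_sites = p_distance(ALN["seqA"], ALN["seqB"])
stripped = drop_gap_columns(ALN)
without_gap_sites = p_distance(stripped["seqA"], stripped["seqB"])
print(with_gap_sites, without_gap_sites)
```

The gap in seqC removes a column where seqA and seqB differ, so the distance between seqA and seqB changes even though neither of them contains a gap.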

==Step 8==
On the whole, the structure of this tree is as we would expect, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group ''Tetrapoda''.

[[file:salmon_frog.png|center|frame]]

There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy's "Common Tree" function (see illustration below).

First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group ''Euarchontoglires''.

Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group ''Opisthokonta''.

[[file:L18_Common_Taxonomy_Tree.png|center|frame]]

A phylogeny based on a single gene often differs from the real phylogeny of the species. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This phenomenon tends to disappear as more sequence data is included in the analysis (the law of large numbers).
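
The effect of adding more data can be illustrated with a rough calculation. The sketch below assumes, as a simplification that is not part of the exercise, that sites evolve independently, so that an observed proportion of differing sites behaves like a binomial proportion. Under that assumption the standard error of the estimated distance shrinks with the square root of the number of sites.

```python
import math


def p_distance_standard_error(p, n_sites):
    """Binomial standard error of an observed proportion of differing sites."""
    return math.sqrt(p * (1.0 - p) / n_sites)


# Same underlying distance (0.1), increasing alignment length.
for n_sites in (100, 1_000, 10_000):
    se = p_distance_standard_error(0.1, n_sites)
    print(n_sites, round(se, 4))
```

With 100 sites the uncertainty (about 0.03) is of the same order as many of the differences between closely related species, which is one way a single short gene can end up grouping the wrong taxa.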

==Step 9==
# 53 results. Search string:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)</tt>
# 8 and 26 results, respectively. Search strings:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)</tt> and  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)</tt> Under the Download tab in UniProt, select "Download all", "FASTA (canonical)" and "Uncompressed".
# Then use a plain text editor to combine them. Combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_34.fasta.txt Ribosomal_proteins_34.fasta.txt]
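
Instead of a text editor, the two downloads can also be combined with a few lines of code. This is a minimal sketch; the function names and the placeholder records are hypothetical and only stand in for the real UniProt downloads.

```python
def combine_fasta(*texts):
    """Concatenate FASTA texts, making sure each part ends with a newline."""
    parts = []
    for text in texts:
        text = text.strip()
        if text:
            parts.append(text + "\n")
    return "".join(parts)


def count_records(fasta_text):
    """Count FASTA records by counting header lines."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


# Placeholder records standing in for the two downloaded files.
mito = ">mito_example_1\nMKVLA\n>mito_example_2\nMRVLS"
cyto = ">cyto_example_1\nMSHRK\n"

combined = combine_fasta(mito, cyto)
print(count_records(combined))
```

With the real files, the combined record count should equal 8 + 26 = 34, which is a quick check that no header was lost at the join between the two files.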

==Step 10==
Open the FASTA file with the 34 ribosomal protein sequences in Seaview, make sure Alignment options is set to "clustalo", and align all sequences. Then make an NJ tree (with Ignore all gap sites unchecked) and change the view to "circular". Here is the result:

[[File:Ribosomal_proteins_34-NJ_tree.unrooted.png]]

And [[Media:Ribosomal_proteins_34-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.

== Step 11 ==
Here is the rerooted tree made by Seaview:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-Seaview.png]]

==Step 12==
Here is the rerooted tree made by iTOL:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]

Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which placement is more correct.
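
The effect of the root position can be shown with a small calculation. The sketch below uses made-up branch lengths (not taken from the actual tree) and a hypothetical helper, <tt>root_to_tip</tt>: two subtrees are joined by one branch, and the root is placed at a chosen fraction along that branch.

```python
def root_to_tip(depths_left, depths_right, branch_length, fraction):
    """Root-to-tip distances when the root sits `fraction` of the way along
    the selected branch, measured from the left subtree."""
    left = [d + fraction * branch_length for d in depths_left]
    right = [d + (1.0 - fraction) * branch_length for d in depths_right]
    return left, right


# Hypothetical depths: the right (e.g. mitochondrial) subtree is deeper.
cyto_depths = [0.5, 0.5]
mito_depths = [1.0, 1.0]

# Root at the midpoint of the selected branch.
print(root_to_tip(cyto_depths, mito_depths, 1.0, 0.5))
# Root shifted so that both groups of tips end up equally far from the root.
print(root_to_tip(cyto_depths, mito_depths, 1.0, 0.75))
```

In the first case the deeper subtree has its tips further from the root; in the second the tips line up. The tree topology and all branch lengths are identical in both cases, only the split of the rooted branch differs.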

==Step 13==
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and green circles marking the most recent common ancestor of human and mouse:

[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]

==Step 14==
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
# There are two differences: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondria, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both respects, the cytoplasmic tree is more correct.
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).
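
The reasoning in the last point can be written out as a short calculation. Because the blue-to-green interval covers the same time span in both subtrees, the unknown time cancels and the ratio of the horizontal distances equals the ratio of the substitution rates. The distances below are placeholders chosen for illustration, not measurements from the tree.

```python
def relative_rate(distance_a, distance_b):
    """Ratio of substitution rates for two lineages spanning the same time interval.

    rate = distance / time, and the time is identical for both, so it cancels.
    """
    return distance_a / distance_b


# Placeholder blue-to-green distances (substitutions per site).
mito_distance = 0.30
cyto_distance = 0.15

print(relative_rate(mito_distance, cyto_distance))
```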

Exercise: Phylogeny - Answers (Seaview version)

2024-03-15T11:37:54Z

WikiSysop: /* Step 9 */

== Step 1 ==
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.

==Step 2==
This is the text file with the pairwise distances. HTLV is clearly the most divergent sequence: all of its distances to the other sequences are above 0.7 (0.74 to 0.75), whereas no distance between any other pair exceeds about 0.40.
<pre>
#distances order: d(1,2),...,d(1,n) <new line> d(2,3),...,d(2,n) <new line>...
20
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058
0.0682095 0.0621194 0.386114 0.120585 0.118149
0.0657734 0.389769 0.126675 0.123021
0.394641 0.116931 0.115713
0.388551 0.388551
0.0146163
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP

#pairwise distances
HIV1B5,HTLV: 0.750305
HIV1H2,HTLV: 0.751523
HIV1MN,HTLV: 0.75
HIV1N5,HTLV: 0.752741
HIV1ND,HTLV: 0.752741
HIV1OY,HTLV: 0.752741
HIV1PV,HTLV: 0.750305
HIV1U4,HTLV: 0.750305
HIV1Z2,HTLV: 0.752741
HIV2CA,HTLV: 0.749086
HIV2D1,HTLV: 0.741778
HIV2G1,HTLV: 0.747868
HIV2KR,HTLV: 0.749086
HIV2RO,HTLV: 0.744214
HIV2SB,HTLV: 0.750305
HIV2ST,HTLV: 0.747868
HTLV,SIVCZ: 0.747868
HTLV,Smanga_S4: 0.747868
HTLV,Smanga_SP: 0.74665
HIV1B5,HIV1H2: 0.0158343
HIV1B5,HIV1MN: 0.0414634
HIV1B5,HIV1N5: 0.0304507
HIV1B5,HIV1ND: 0.043849
HIV1B5,HIV1OY: 0.0341048
HIV1B5,HIV1PV: 0.0170524
HIV1B5,HIV1U4: 0.0803898
HIV1B5,HIV1Z2: 0.045067
HIV1B5,HIV2CA: 0.399513
HIV1B5,HIV2D1: 0.399513
HIV1B5,HIV2G1: 0.389769
HIV1B5,HIV2KR: 0.393423
HIV1B5,HIV2RO: 0.394641
HIV1B5,HIV2SB: 0.389769
HIV1B5,HIV2ST: 0.394641
HIV1B5,SIVCZ: 0.130329
HIV1B5,Smanga_S4: 0.389769
HIV1B5,Smanga_SP: 0.389769
HIV1H2,HIV1MN: 0.0402439
HIV1H2,HIV1N5: 0.0292326
HIV1H2,HIV1ND: 0.0414129
HIV1H2,HIV1OY: 0.0328867
HIV1H2,HIV1PV: 0.00974421
HIV1H2,HIV1U4: 0.0803898
HIV1H2,HIV1Z2: 0.0426309
HIV1H2,HIV2CA: 0.399513
HIV1H2,HIV2D1: 0.401949
HIV1H2,HIV2G1: 0.392205
HIV1H2,HIV2KR: 0.393423
HIV1H2,HIV2RO: 0.394641
HIV1H2,HIV2SB: 0.389769
HIV1H2,HIV2ST: 0.394641
HIV1H2,SIVCZ: 0.129111
HIV1H2,Smanga_S4: 0.388551
HIV1H2,Smanga_SP: 0.388551
HIV1MN,HIV1N5: 0.0365854
HIV1MN,HIV1ND: 0.0512195
HIV1MN,HIV1OY: 0.0365854
HIV1MN,HIV1PV: 0.0439024
HIV1MN,HIV1U4: 0.0865854
HIV1MN,HIV1Z2: 0.054878
HIV1MN,HIV2CA: 0.4
HIV1MN,HIV2D1: 0.40122
HIV1MN,HIV2G1: 0.396341
HIV1MN,HIV2KR: 0.392683
HIV1MN,HIV2RO: 0.395122
HIV1MN,HIV2SB: 0.392683
HIV1MN,HIV2ST: 0.397561
HIV1MN,SIVCZ: 0.130488
HIV1MN,Smanga_S4: 0.392683
HIV1MN,Smanga_SP: 0.392683
HIV1N5,HIV1ND: 0.0341048
HIV1N5,HIV1OY: 0.0304507
HIV1N5,HIV1PV: 0.0316687
HIV1N5,HIV1U4: 0.0791717
HIV1N5,HIV1Z2: 0.0389769
HIV1N5,HIV2CA: 0.397077
HIV1N5,HIV2D1: 0.399513
HIV1N5,HIV2G1: 0.389769
HIV1N5,HIV2KR: 0.390987
HIV1N5,HIV2RO: 0.392205
HIV1N5,HIV2SB: 0.389769
HIV1N5,HIV2ST: 0.392205
HIV1N5,SIVCZ: 0.127893
HIV1N5,Smanga_S4: 0.387333
HIV1N5,Smanga_SP: 0.387333
HIV1ND,HIV1OY: 0.043849
HIV1ND,HIV1PV: 0.043849
HIV1ND,HIV1U4: 0.0767357
HIV1ND,HIV1Z2: 0.0219245
HIV1ND,HIV2CA: 0.390987
HIV1ND,HIV2D1: 0.394641
HIV1ND,HIV2G1: 0.386114
HIV1ND,HIV2KR: 0.386114
HIV1ND,HIV2RO: 0.388551
HIV1ND,HIV2SB: 0.387333
HIV1ND,HIV2ST: 0.389769
HIV1ND,SIVCZ: 0.125457
HIV1ND,Smanga_S4: 0.386114
HIV1ND,Smanga_SP: 0.386114
HIV1OY,HIV1PV: 0.0365408
HIV1OY,HIV1U4: 0.0767357
HIV1OY,HIV1Z2: 0.047503
HIV1OY,HIV2CA: 0.394641
HIV1OY,HIV2D1: 0.397077
HIV1OY,HIV2G1: 0.388551
HIV1OY,HIV2KR: 0.388551
HIV1OY,HIV2RO: 0.389769
HIV1OY,HIV2SB: 0.386114
HIV1OY,HIV2ST: 0.390987
HIV1OY,SIVCZ: 0.131547
HIV1OY,Smanga_S4: 0.388551
HIV1OY,Smanga_SP: 0.388551
HIV1PV,HIV1U4: 0.0828258
HIV1PV,HIV1Z2: 0.045067
HIV1PV,HIV2CA: 0.401949
HIV1PV,HIV2D1: 0.404385
HIV1PV,HIV2G1: 0.394641
HIV1PV,HIV2KR: 0.394641
HIV1PV,HIV2RO: 0.397077
HIV1PV,HIV2SB: 0.390987
HIV1PV,HIV2ST: 0.393423
HIV1PV,SIVCZ: 0.130329
HIV1PV,Smanga_S4: 0.388551
HIV1PV,Smanga_SP: 0.388551
HIV1U4,HIV1Z2: 0.0767357
HIV1U4,HIV2CA: 0.398295
HIV1U4,HIV2D1: 0.403167
HIV1U4,HIV2G1: 0.392205
HIV1U4,HIV2KR: 0.395859
HIV1U4,HIV2RO: 0.394641
HIV1U4,HIV2SB: 0.394641
HIV1U4,HIV2ST: 0.397077
HIV1U4,SIVCZ: 0.137637
HIV1U4,Smanga_S4: 0.400731
HIV1U4,Smanga_SP: 0.399513
HIV1Z2,HIV2CA: 0.393423
HIV1Z2,HIV2D1: 0.397077
HIV1Z2,HIV2G1: 0.387333
HIV1Z2,HIV2KR: 0.388551
HIV1Z2,HIV2RO: 0.389769
HIV1Z2,HIV2SB: 0.389769
HIV1Z2,HIV2ST: 0.389769
HIV1Z2,SIVCZ: 0.125457
HIV1Z2,Smanga_S4: 0.388551
HIV1Z2,Smanga_SP: 0.388551
HIV2CA,HIV2D1: 0.0816078
HIV2CA,HIV2G1: 0.0694275
HIV2CA,HIV2KR: 0.0645554
HIV2CA,HIV2RO: 0.0511571
HIV2CA,HIV2SB: 0.0682095
HIV2CA,HIV2ST: 0.0657734
HIV2CA,SIVCZ: 0.392205
HIV2CA,Smanga_S4: 0.125457
HIV2CA,Smanga_SP: 0.120585
HIV2D1,HIV2G1: 0.0511571
HIV2D1,HIV2KR: 0.0840438
HIV2D1,HIV2RO: 0.088916
HIV2D1,HIV2SB: 0.09257
HIV2D1,HIV2ST: 0.0864799
HIV2D1,SIVCZ: 0.397077
HIV2D1,Smanga_S4: 0.131547
HIV2D1,Smanga_SP: 0.129111
HIV2G1,HIV2KR: 0.0779537
HIV2G1,HIV2RO: 0.0730816
HIV2G1,HIV2SB: 0.0791717
HIV2G1,HIV2ST: 0.0767357
HIV2G1,SIVCZ: 0.394641
HIV2G1,Smanga_S4: 0.127893
HIV2G1,Smanga_SP: 0.121803
HIV2KR,HIV2RO: 0.0645554
HIV2KR,HIV2SB: 0.0633374
HIV2KR,HIV2ST: 0.0572473
HIV2KR,SIVCZ: 0.392205
HIV2KR,Smanga_S4: 0.118149
HIV2KR,Smanga_SP: 0.112058
HIV2RO,HIV2SB: 0.0682095
HIV2RO,HIV2ST: 0.0621194
HIV2RO,SIVCZ: 0.386114
HIV2RO,Smanga_S4: 0.120585
HIV2RO,Smanga_SP: 0.118149
HIV2SB,HIV2ST: 0.0657734
HIV2SB,SIVCZ: 0.389769
HIV2SB,Smanga_S4: 0.126675
HIV2SB,Smanga_SP: 0.123021
HIV2ST,SIVCZ: 0.394641
HIV2ST,Smanga_S4: 0.116931
HIV2ST,Smanga_SP: 0.115713
SIVCZ,Smanga_S4: 0.388551
SIVCZ,Smanga_SP: 0.388551
Smanga_S4,Smanga_SP: 0.0146163
</pre>

==Step 3==
Here is a picture of the NJ tree:

[[File:Pol21-NJ_tree.png]]

The longest branch is the one leading to HTLV, which agrees well with the observation in Step 2.

==Step 4==
Here is an unrooted tree:

[[File:Pol21-NJ_tree.unrooted.png]]

==Step 5==
Here is a rearranged (swapped) tree:

[[File:Pol21-NJ_tree.swapped.png]]

==Step 6==
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).
* Further answers to "The Phylogeny of HIV" can be found [https://teaching.healthtech.dtu.dk/material/36611/files/binfintro/hiv_origin.html here].

==Step 7==
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the "best of both worlds": it takes amino acid similarities into account when aligning, while the aligned DNA still retains the nucleotide differences that do not change the encoded amino acids (synonymous differences). The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.

Here is the tree made ignoring gap positions:

[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]

And here is the tree made taking gap positions into account:

[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]

There is one difference in topology between the two trees: in the tree made without the gap positions, Rice groups with Fruit fly within the animal subtree, while in the other tree, Rice groups with the two other plants. Since Rice is a plant, the tree that takes gap positions into account is the more correct one. '''Note:''' This is not always the case!

==Step 8==
On the whole, the structure of this tree is as we would expect, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group ''Tetrapoda''.

[[file:salmon_frog.png|center|frame]]

There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy's "Common Tree" function (see illustration below).

First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group ''Euarchontoglires''.

Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group ''Opisthokonta''.

[[file:L18_Common_Taxonomy_Tree.png|center|frame]]

A phylogeny based on a single gene often differs from the real phylogeny of the species. There are several reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This phenomenon tends to disappear as more sequence data is included in the analysis (the law of large numbers).

==Step 9==
# 53 results. Search string:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)</tt>
# 8 and 26 results, respectively. Search strings:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)</tt> and  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)</tt> Under the Download tab in UniProt, select "Download all", "FASTA (canonical)" and "Uncompressed".
# Then use a plain text editor to combine them. Combined FASTA file is here: [https://teaching.healthtech.dtu.dk/material/22111/Ribosomal_proteins_34.fasta.txt Ribosomal_proteins_34.fasta.txt]

==Step 10==
Open the FASTA file with the 34 ribosomal protein sequences in Seaview, make sure Alignment options is set to "clustalo", and align all sequences. Then make an NJ tree (with Ignore all gap sites unchecked) and change the view to "circular". Here is the result:

[[File:Ribosomal_proteins_34-NJ_tree.unrooted.png]]

And [[Media:Ribosomal_proteins_34-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.

== Step 11 ==
Here is the rerooted tree made by Seaview:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-Seaview.png]]

==Step 12==
Here is the rerooted tree made by iTOL:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]

Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which placement is more correct.

==Step 13==
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and green circles marking the most recent common ancestor of human and mouse:

[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]

==Step 14==
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
# There are two differences: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondria, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both respects, the cytoplasmic tree is more correct.
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).

File:Ribosomal proteins 34-NJ tree.annotated-iTOL.png

2024-03-15T11:36:47Z

WikiSysop:

File:Ribosomal proteins 34-NJ tree.rerooted-iTOL.png

2024-03-15T11:36:21Z

WikiSysop:

File:Ribosomal proteins 34-NJ tree.rerooted-Seaview.png

2024-03-15T11:36:02Z

WikiSysop:

File:Ribosomal proteins 34-NJ tree.unrooted.png

2024-03-15T11:35:39Z

WikiSysop:

File:L18 Common Taxonomy Tree.png

2024-03-15T11:34:57Z

WikiSysop:

File:L18 CDS-NJ tree.revtrans.wgaps.png

2024-03-15T11:34:31Z

WikiSysop:

File:L18 CDS-NJ tree.revtrans.nogaps.png

2024-03-15T11:34:11Z

WikiSysop:

File:Pol21-NJ tree.swapped.png

2024-03-15T11:33:42Z

WikiSysop:

File:Pol21-NJ tree.unrooted.png

2024-03-15T11:33:18Z

WikiSysop:

File:Pol21-NJ tree.png

2024-03-15T11:32:56Z

WikiSysop:

Exercise: Phylogeny - Answers (Seaview version)

2024-03-15T11:32:17Z

WikiSysop: /* Step 2 */

== Step 1 ==
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.

==Step 2==
This is the text file with the pairwise distances. HTLV is clearly the most divergent sequence: all of its distances to the other sequences are above 0.7 (0.74 to 0.75), whereas no distance between any other pair exceeds about 0.40.
<pre>
#distances order: d(1,2),...,d(1,n) <new line> d(2,3),...,d(2,n) <new line>...
20
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058
0.0682095 0.0621194 0.386114 0.120585 0.118149
0.0657734 0.389769 0.126675 0.123021
0.394641 0.116931 0.115713
0.388551 0.388551
0.0146163
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP

#pairwise distances
HIV1B5,HTLV: 0.750305
HIV1H2,HTLV: 0.751523
HIV1MN,HTLV: 0.75
HIV1N5,HTLV: 0.752741
HIV1ND,HTLV: 0.752741
HIV1OY,HTLV: 0.752741
HIV1PV,HTLV: 0.750305
HIV1U4,HTLV: 0.750305
HIV1Z2,HTLV: 0.752741
HIV2CA,HTLV: 0.749086
HIV2D1,HTLV: 0.741778
HIV2G1,HTLV: 0.747868
HIV2KR,HTLV: 0.749086
HIV2RO,HTLV: 0.744214
HIV2SB,HTLV: 0.750305
HIV2ST,HTLV: 0.747868
HTLV,SIVCZ: 0.747868
HTLV,Smanga_S4: 0.747868
HTLV,Smanga_SP: 0.74665
HIV1B5,HIV1H2: 0.0158343
HIV1B5,HIV1MN: 0.0414634
HIV1B5,HIV1N5: 0.0304507
HIV1B5,HIV1ND: 0.043849
HIV1B5,HIV1OY: 0.0341048
HIV1B5,HIV1PV: 0.0170524
HIV1B5,HIV1U4: 0.0803898
HIV1B5,HIV1Z2: 0.045067
HIV1B5,HIV2CA: 0.399513
HIV1B5,HIV2D1: 0.399513
HIV1B5,HIV2G1: 0.389769
HIV1B5,HIV2KR: 0.393423
HIV1B5,HIV2RO: 0.394641
HIV1B5,HIV2SB: 0.389769
HIV1B5,HIV2ST: 0.394641
HIV1B5,SIVCZ: 0.130329
HIV1B5,Smanga_S4: 0.389769
HIV1B5,Smanga_SP: 0.389769
HIV1H2,HIV1MN: 0.0402439
HIV1H2,HIV1N5: 0.0292326
HIV1H2,HIV1ND: 0.0414129
HIV1H2,HIV1OY: 0.0328867
HIV1H2,HIV1PV: 0.00974421
HIV1H2,HIV1U4: 0.0803898
HIV1H2,HIV1Z2: 0.0426309
HIV1H2,HIV2CA: 0.399513
HIV1H2,HIV2D1: 0.401949
HIV1H2,HIV2G1: 0.392205
HIV1H2,HIV2KR: 0.393423
HIV1H2,HIV2RO: 0.394641
HIV1H2,HIV2SB: 0.389769
HIV1H2,HIV2ST: 0.394641
HIV1H2,SIVCZ: 0.129111
HIV1H2,Smanga_S4: 0.388551
HIV1H2,Smanga_SP: 0.388551
HIV1MN,HIV1N5: 0.0365854
HIV1MN,HIV1ND: 0.0512195
HIV1MN,HIV1OY: 0.0365854
HIV1MN,HIV1PV: 0.0439024
HIV1MN,HIV1U4: 0.0865854
HIV1MN,HIV1Z2: 0.054878
HIV1MN,HIV2CA: 0.4
HIV1MN,HIV2D1: 0.40122
HIV1MN,HIV2G1: 0.396341
HIV1MN,HIV2KR: 0.392683
HIV1MN,HIV2RO: 0.395122
HIV1MN,HIV2SB: 0.392683
HIV1MN,HIV2ST: 0.397561
HIV1MN,SIVCZ: 0.130488
HIV1MN,Smanga_S4: 0.392683
HIV1MN,Smanga_SP: 0.392683
HIV1N5,HIV1ND: 0.0341048
HIV1N5,HIV1OY: 0.0304507
HIV1N5,HIV1PV: 0.0316687
HIV1N5,HIV1U4: 0.0791717
HIV1N5,HIV1Z2: 0.0389769
HIV1N5,HIV2CA: 0.397077
HIV1N5,HIV2D1: 0.399513
HIV1N5,HIV2G1: 0.389769
HIV1N5,HIV2KR: 0.390987
HIV1N5,HIV2RO: 0.392205
HIV1N5,HIV2SB: 0.389769
HIV1N5,HIV2ST: 0.392205
HIV1N5,SIVCZ: 0.127893
HIV1N5,Smanga_S4: 0.387333
HIV1N5,Smanga_SP: 0.387333
HIV1ND,HIV1OY: 0.043849
HIV1ND,HIV1PV: 0.043849
HIV1ND,HIV1U4: 0.0767357
HIV1ND,HIV1Z2: 0.0219245
HIV1ND,HIV2CA: 0.390987
HIV1ND,HIV2D1: 0.394641
HIV1ND,HIV2G1: 0.386114
HIV1ND,HIV2KR: 0.386114
HIV1ND,HIV2RO: 0.388551
HIV1ND,HIV2SB: 0.387333
HIV1ND,HIV2ST: 0.389769
HIV1ND,SIVCZ: 0.125457
HIV1ND,Smanga_S4: 0.386114
HIV1ND,Smanga_SP: 0.386114
HIV1OY,HIV1PV: 0.0365408
HIV1OY,HIV1U4: 0.0767357
HIV1OY,HIV1Z2: 0.047503
HIV1OY,HIV2CA: 0.394641
HIV1OY,HIV2D1: 0.397077
HIV1OY,HIV2G1: 0.388551
HIV1OY,HIV2KR: 0.388551
HIV1OY,HIV2RO: 0.389769
HIV1OY,HIV2SB: 0.386114
HIV1OY,HIV2ST: 0.390987
HIV1OY,SIVCZ: 0.131547
HIV1OY,Smanga_S4: 0.388551
HIV1OY,Smanga_SP: 0.388551
HIV1PV,HIV1U4: 0.0828258
HIV1PV,HIV1Z2: 0.045067
HIV1PV,HIV2CA: 0.401949
HIV1PV,HIV2D1: 0.404385
HIV1PV,HIV2G1: 0.394641
HIV1PV,HIV2KR: 0.394641
HIV1PV,HIV2RO: 0.397077
HIV1PV,HIV2SB: 0.390987
HIV1PV,HIV2ST: 0.393423
HIV1PV,SIVCZ: 0.130329
HIV1PV,Smanga_S4: 0.388551
HIV1PV,Smanga_SP: 0.388551
HIV1U4,HIV1Z2: 0.0767357
HIV1U4,HIV2CA: 0.398295
HIV1U4,HIV2D1: 0.403167
HIV1U4,HIV2G1: 0.392205
HIV1U4,HIV2KR: 0.395859
HIV1U4,HIV2RO: 0.394641
HIV1U4,HIV2SB: 0.394641
HIV1U4,HIV2ST: 0.397077
HIV1U4,SIVCZ: 0.137637
HIV1U4,Smanga_S4: 0.400731
HIV1U4,Smanga_SP: 0.399513
HIV1Z2,HIV2CA: 0.393423
HIV1Z2,HIV2D1: 0.397077
HIV1Z2,HIV2G1: 0.387333
HIV1Z2,HIV2KR: 0.388551
HIV1Z2,HIV2RO: 0.389769
HIV1Z2,HIV2SB: 0.389769
HIV1Z2,HIV2ST: 0.389769
HIV1Z2,SIVCZ: 0.125457
HIV1Z2,Smanga_S4: 0.388551
HIV1Z2,Smanga_SP: 0.388551
HIV2CA,HIV2D1: 0.0816078
HIV2CA,HIV2G1: 0.0694275
HIV2CA,HIV2KR: 0.0645554
HIV2CA,HIV2RO: 0.0511571
HIV2CA,HIV2SB: 0.0682095
HIV2CA,HIV2ST: 0.0657734
HIV2CA,SIVCZ: 0.392205
HIV2CA,Smanga_S4: 0.125457
HIV2CA,Smanga_SP: 0.120585
HIV2D1,HIV2G1: 0.0511571
HIV2D1,HIV2KR: 0.0840438
HIV2D1,HIV2RO: 0.088916
HIV2D1,HIV2SB: 0.09257
HIV2D1,HIV2ST: 0.0864799
HIV2D1,SIVCZ: 0.397077
HIV2D1,Smanga_S4: 0.131547
HIV2D1,Smanga_SP: 0.129111
HIV2G1,HIV2KR: 0.0779537
HIV2G1,HIV2RO: 0.0730816
HIV2G1,HIV2SB: 0.0791717
HIV2G1,HIV2ST: 0.0767357
HIV2G1,SIVCZ: 0.394641
HIV2G1,Smanga_S4: 0.127893
HIV2G1,Smanga_SP: 0.121803
HIV2KR,HIV2RO: 0.0645554
HIV2KR,HIV2SB: 0.0633374
HIV2KR,HIV2ST: 0.0572473
HIV2KR,SIVCZ: 0.392205
HIV2KR,Smanga_S4: 0.118149
HIV2KR,Smanga_SP: 0.112058
HIV2RO,HIV2SB: 0.0682095
HIV2RO,HIV2ST: 0.0621194
HIV2RO,SIVCZ: 0.386114
HIV2RO,Smanga_S4: 0.120585
HIV2RO,Smanga_SP: 0.118149
HIV2SB,HIV2ST: 0.0657734
HIV2SB,SIVCZ: 0.389769
HIV2SB,Smanga_S4: 0.126675
HIV2SB,Smanga_SP: 0.123021
HIV2ST,SIVCZ: 0.394641
HIV2ST,Smanga_S4: 0.116931
HIV2ST,Smanga_SP: 0.115713
SIVCZ,Smanga_S4: 0.388551
SIVCZ,Smanga_SP: 0.388551
Smanga_S4,Smanga_SP: 0.0146163
</pre>
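
The observation about HTLV can be checked directly from the "#pairwise distances" part of the file. The following is a minimal Python sketch and not part of the original exercise; the function name <tt>parse_pairs</tt> is an illustrative choice, and only a handful of lines from the listing above are included.

```python
def parse_pairs(text):
    """Parse lines of the form 'A,B: 0.123' into a dict keyed by frozenset({A, B})."""
    dist = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        names, value = line.split(":")
        a, b = names.split(",")
        dist[frozenset((a.strip(), b.strip()))] = float(value)
    return dist

# A few lines copied from the listing above (the full file has 190 pairs).
sample = """
#pairwise distances
HIV1B5,HTLV: 0.750305
HIV2D1,HTLV: 0.741778
HTLV,SIVCZ: 0.747868
HTLV,Smanga_SP: 0.74665
HIV1B5,HIV1H2: 0.0158343
HIV1B5,HIV2CA: 0.399513
HIV1B5,SIVCZ: 0.130329
"""

dist = parse_pairs(sample)
htlv = [d for pair, d in dist.items() if "HTLV" in pair]
other = [d for pair, d in dist.items() if "HTLV" not in pair]
print("smallest HTLV distance:", min(htlv))
print("largest non-HTLV distance:", max(other))
```

Running the same function on the complete file should show the same pattern: every distance involving HTLV lies above 0.7, well above every other distance in the file.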

==Step 3==
Here is a picture of the NJ tree:

[[File:Pol21-NJ_tree.png]]

The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.

==Step 4==
Here is an unrooted tree:

[[File:Pol21-NJ_tree.unrooted.png]]

==Step 5==
Here is a rearranged (swapped) tree:

[[File:Pol21-NJ_tree.swapped.png]]

==Step 6==
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).
* Further answers to "The Phylogeny of HIV" can be found [https://teaching.healthtech.dtu.dk/material/36611/files/binfintro/hiv_origin.html here].

==Step 7==
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the "best of both worlds": it takes amino acid similarities into account when aligning, while the aligned DNA still retains the silent (synonymous) differences that do not change the protein. The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.

Here is the tree made ignoring gap positions:

[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]

And here is the tree made taking gap positions into account:

[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]

There is one difference in the tree topology between the two trees: In the one made without the gap positions, Rice is together with Fruit fly within the animal subtree, while in the other tree, Rice is together with the two other plants. Since Rice is a plant, the tree taking gap positions into account is the most correct one. '''Note:''' This is not always the case!
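
To see why the gap setting can change a tree at all, it helps to look at how it changes the underlying distances. The sketch below is a toy illustration in Python, assuming that "Ignore all gap sites" removes every alignment column that contains a gap in any sequence, while leaving it unchecked only skips a column for the pairs in which a gap occurs. The three sequences are hypothetical and are not taken from the L18 data.

```python
def p_distance(a, b, skip_columns=()):
    """Fraction of differing sites, skipping listed columns and columns gapped in a or b."""
    diffs = compared = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in skip_columns or x == "-" or y == "-":
            continue
        compared += 1
        diffs += x != y
    return diffs / compared

# Hypothetical toy alignment (not the L18 data).
aln = {
    "s1": "ATGCCGTA",
    "s2": "ATGACGTA",
    "s3": "AT-ACG-A",
}

# Columns with a gap in at least one sequence.
length = len(aln["s1"])
gap_cols = {i for i in range(length) if any(seq[i] == "-" for seq in aln.values())}

with_gaps = p_distance(aln["s1"], aln["s2"])          # all 8 columns are usable for this pair
no_gaps = p_distance(aln["s1"], aln["s2"], gap_cols)  # columns 2 and 6 removed for every pair
print(with_gaps, no_gaps)
```

Even though s1 and s2 contain no gaps themselves, their distance changes when the gapped columns of s3 are removed, because the single difference is then divided by 6 sites instead of 8. Small shifts like this across many pairs can be enough to move a taxon such as Rice in an NJ tree.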

==Step 8==
On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group ''Tetrapoda''.

[[file:salmon_frog.png|center|frame]]

There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy's "Common Tree" function (see illustration below).

First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group ''Euarchontoglires''.

Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group ''Opisthokonta''.

[[file:L18_Common_Taxonomy_Tree.png|center|frame]]

It is often seen that a phylogeny based on a single gene differs from the real phylogeny of the species. There are a number of reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This phenomenon tends to disappear as more sequence data is included in the analysis (the law of large numbers).

==Step 9==
# 53 results. Search string:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)</tt>
# 8 and 26 results, respectively. Search strings:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)</tt> and  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)</tt> Under the Download tab in UniProt, select "Download all", "FASTA (canonical)" and "Uncompressed".
# Then use a plain text editor to combine them. Combined FASTA file is here: [[Media:Ribosomal_proteins_34.fasta.txt]]
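
Combining the two downloads can also be scripted instead of done in a text editor. The following Python sketch is one possible way to do it; the function names and any file names passed to them are hypothetical and not part of the exercise.

```python
from pathlib import Path

def combine_fasta(inputs, output):
    """Concatenate FASTA files, making sure each one ends with a newline."""
    with open(output, "w") as out:
        for name in inputs:
            text = Path(name).read_text()
            # A missing final newline would glue the next header onto a sequence line.
            out.write(text if text.endswith("\n") else text + "\n")

def count_records(path):
    """Count FASTA records by counting header lines starting with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

With the two downloaded files saved under names of your own choosing, <tt>combine_fasta([first, second], combined)</tt> followed by <tt>count_records(combined)</tt> should report 8 + 26 = 34 records.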

==Step 10==
Open the FASTA file with the 34 ribosomal protein sequences in Seaview, make sure Alignment options is set to "clustalo", and align all sequences. Then make an NJ tree (with Ignore all gap sites unchecked) and change the view to "circular". Here is the result:

[[File:Ribosomal_proteins_34-NJ_tree.unrooted.png]]

And [[Media:Ribosomal_proteins_34-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.

== Step 11 ==
Here is the rerooted tree made by Seaview:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-Seaview.png]]

==Step 12==
Here is the rerooted tree made by iTOL:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]

Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which method is most correct.
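
The effect of the root position can be illustrated with a little arithmetic. In the Python sketch below, all numbers are hypothetical and only chosen to mimic the situation described above: moving the root along the selected branch changes the root-to-tip distances on each side, but the total path length between tips on opposite sides stays the same.

```python
def root_to_tip(branch_length, fraction, depth_a, depth_b):
    """Place the root at `fraction` of the way along the selected branch.

    depth_a and depth_b are the distances from each end of that branch
    to a tip on the corresponding side.
    """
    a = fraction * branch_length + depth_a
    b = (1 - fraction) * branch_length + depth_b
    return a, b

# Hypothetical numbers: selected branch of length 1.0,
# tips on side A lie 0.2 beyond the branch, tips on side B lie 0.6 beyond it.
mid_a, mid_b = root_to_tip(1.0, 0.5, 0.2, 0.6)  # root at the midpoint of the branch
bal_a, bal_b = root_to_tip(1.0, 0.7, 0.2, 0.6)  # root shifted so both sides are equally deep

print("branch midpoint:", mid_a, mid_b)
print("balanced root:  ", bal_a, bal_b)
print("tip-to-tip path:", mid_a + mid_b, bal_a + bal_b)
```

With the root at the branch midpoint, one group of tips ends up further from the root than the other; with the shifted root, the tips line up. The tree itself, meaning its topology and branch lengths, is identical in both cases.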

==Step 13==
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and the green circles marking the most recent common ancestor of human and mouse:

[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]

==Step 14==
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
# There are two differences: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondria, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both aspects, the cytoplasmic tree is more correct.
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).

Exercise: Phylogeny - Answers (Seaview version)

2024-03-15T11:31:55Z

WikiSysop: /* Step 2 */

== Step 1 ==
[https://teaching.healthtech.dtu.dk/material/22111/Pol21.aligned.pdf Here] is a PDF with the aligned sequences.

==Step 2==
[[Media:Pol21.dist.txt|Here]] is the text file with the pairwise distances. It is clear that the sequence HTLV shows larger distances than all the other sequences, with all distances being above 0.7.
<pre>
#distances order: d(1,2),...,d(1,n) <new line> d(2,3),...,d(2,n) <new line>...
20
0.750305 0.751523 0.75 0.752741 0.752741 0.752741 0.750305 0.750305 0.752741 0.749086 0.741778 0.747868 0.749086 0.744214 0.750305 0.747868 0.747868 0.747868 0.74665
0.0158343 0.0414634 0.0304507 0.043849 0.0341048 0.0170524 0.0803898 0.045067 0.399513 0.399513 0.389769 0.393423 0.394641 0.389769 0.394641 0.130329 0.389769 0.389769
0.0402439 0.0292326 0.0414129 0.0328867 0.00974421 0.0803898 0.0426309 0.399513 0.401949 0.392205 0.393423 0.394641 0.389769 0.394641 0.129111 0.388551 0.388551
0.0365854 0.0512195 0.0365854 0.0439024 0.0865854 0.054878 0.4 0.40122 0.396341 0.392683 0.395122 0.392683 0.397561 0.130488 0.392683 0.392683
0.0341048 0.0304507 0.0316687 0.0791717 0.0389769 0.397077 0.399513 0.389769 0.390987 0.392205 0.389769 0.392205 0.127893 0.387333 0.387333
0.043849 0.043849 0.0767357 0.0219245 0.390987 0.394641 0.386114 0.386114 0.388551 0.387333 0.389769 0.125457 0.386114 0.386114
0.0365408 0.0767357 0.047503 0.394641 0.397077 0.388551 0.388551 0.389769 0.386114 0.390987 0.131547 0.388551 0.388551
0.0828258 0.045067 0.401949 0.404385 0.394641 0.394641 0.397077 0.390987 0.393423 0.130329 0.388551 0.388551
0.0767357 0.398295 0.403167 0.392205 0.395859 0.394641 0.394641 0.397077 0.137637 0.400731 0.399513
0.393423 0.397077 0.387333 0.388551 0.389769 0.389769 0.389769 0.125457 0.388551 0.388551
0.0816078 0.0694275 0.0645554 0.0511571 0.0682095 0.0657734 0.392205 0.125457 0.120585
0.0511571 0.0840438 0.088916 0.09257 0.0864799 0.397077 0.131547 0.129111
0.0779537 0.0730816 0.0791717 0.0767357 0.394641 0.127893 0.121803
0.0645554 0.0633374 0.0572473 0.392205 0.118149 0.112058
0.0682095 0.0621194 0.386114 0.120585 0.118149
0.0657734 0.389769 0.126675 0.123021
0.394641 0.116931 0.115713
0.388551 0.388551
0.0146163
HTLV HIV1B5 HIV1H2 HIV1MN HIV1N5 HIV1ND HIV1OY HIV1PV HIV1U4 HIV1Z2 HIV2CA HIV2D1 HIV2G1 HIV2KR HIV2RO HIV2SB HIV2ST SIVCZ Smanga_S4 Smanga_SP

#pairwise distances
HIV1B5,HTLV: 0.750305
HIV1H2,HTLV: 0.751523
HIV1MN,HTLV: 0.75
HIV1N5,HTLV: 0.752741
HIV1ND,HTLV: 0.752741
HIV1OY,HTLV: 0.752741
HIV1PV,HTLV: 0.750305
HIV1U4,HTLV: 0.750305
HIV1Z2,HTLV: 0.752741
HIV2CA,HTLV: 0.749086
HIV2D1,HTLV: 0.741778
HIV2G1,HTLV: 0.747868
HIV2KR,HTLV: 0.749086
HIV2RO,HTLV: 0.744214
HIV2SB,HTLV: 0.750305
HIV2ST,HTLV: 0.747868
HTLV,SIVCZ: 0.747868
HTLV,Smanga_S4: 0.747868
HTLV,Smanga_SP: 0.74665
HIV1B5,HIV1H2: 0.0158343
HIV1B5,HIV1MN: 0.0414634
HIV1B5,HIV1N5: 0.0304507
HIV1B5,HIV1ND: 0.043849
HIV1B5,HIV1OY: 0.0341048
HIV1B5,HIV1PV: 0.0170524
HIV1B5,HIV1U4: 0.0803898
HIV1B5,HIV1Z2: 0.045067
HIV1B5,HIV2CA: 0.399513
HIV1B5,HIV2D1: 0.399513
HIV1B5,HIV2G1: 0.389769
HIV1B5,HIV2KR: 0.393423
HIV1B5,HIV2RO: 0.394641
HIV1B5,HIV2SB: 0.389769
HIV1B5,HIV2ST: 0.394641
HIV1B5,SIVCZ: 0.130329
HIV1B5,Smanga_S4: 0.389769
HIV1B5,Smanga_SP: 0.389769
HIV1H2,HIV1MN: 0.0402439
HIV1H2,HIV1N5: 0.0292326
HIV1H2,HIV1ND: 0.0414129
HIV1H2,HIV1OY: 0.0328867
HIV1H2,HIV1PV: 0.00974421
HIV1H2,HIV1U4: 0.0803898
HIV1H2,HIV1Z2: 0.0426309
HIV1H2,HIV2CA: 0.399513
HIV1H2,HIV2D1: 0.401949
HIV1H2,HIV2G1: 0.392205
HIV1H2,HIV2KR: 0.393423
HIV1H2,HIV2RO: 0.394641
HIV1H2,HIV2SB: 0.389769
HIV1H2,HIV2ST: 0.394641
HIV1H2,SIVCZ: 0.129111
HIV1H2,Smanga_S4: 0.388551
HIV1H2,Smanga_SP: 0.388551
HIV1MN,HIV1N5: 0.0365854
HIV1MN,HIV1ND: 0.0512195
HIV1MN,HIV1OY: 0.0365854
HIV1MN,HIV1PV: 0.0439024
HIV1MN,HIV1U4: 0.0865854
HIV1MN,HIV1Z2: 0.054878
HIV1MN,HIV2CA: 0.4
HIV1MN,HIV2D1: 0.40122
HIV1MN,HIV2G1: 0.396341
HIV1MN,HIV2KR: 0.392683
HIV1MN,HIV2RO: 0.395122
HIV1MN,HIV2SB: 0.392683
HIV1MN,HIV2ST: 0.397561
HIV1MN,SIVCZ: 0.130488
HIV1MN,Smanga_S4: 0.392683
HIV1MN,Smanga_SP: 0.392683
HIV1N5,HIV1ND: 0.0341048
HIV1N5,HIV1OY: 0.0304507
HIV1N5,HIV1PV: 0.0316687
HIV1N5,HIV1U4: 0.0791717
HIV1N5,HIV1Z2: 0.0389769
HIV1N5,HIV2CA: 0.397077
HIV1N5,HIV2D1: 0.399513
HIV1N5,HIV2G1: 0.389769
HIV1N5,HIV2KR: 0.390987
HIV1N5,HIV2RO: 0.392205
HIV1N5,HIV2SB: 0.389769
HIV1N5,HIV2ST: 0.392205
HIV1N5,SIVCZ: 0.127893
HIV1N5,Smanga_S4: 0.387333
HIV1N5,Smanga_SP: 0.387333
HIV1ND,HIV1OY: 0.043849
HIV1ND,HIV1PV: 0.043849
HIV1ND,HIV1U4: 0.0767357
HIV1ND,HIV1Z2: 0.0219245
HIV1ND,HIV2CA: 0.390987
HIV1ND,HIV2D1: 0.394641
HIV1ND,HIV2G1: 0.386114
HIV1ND,HIV2KR: 0.386114
HIV1ND,HIV2RO: 0.388551
HIV1ND,HIV2SB: 0.387333
HIV1ND,HIV2ST: 0.389769
HIV1ND,SIVCZ: 0.125457
HIV1ND,Smanga_S4: 0.386114
HIV1ND,Smanga_SP: 0.386114
HIV1OY,HIV1PV: 0.0365408
HIV1OY,HIV1U4: 0.0767357
HIV1OY,HIV1Z2: 0.047503
HIV1OY,HIV2CA: 0.394641
HIV1OY,HIV2D1: 0.397077
HIV1OY,HIV2G1: 0.388551
HIV1OY,HIV2KR: 0.388551
HIV1OY,HIV2RO: 0.389769
HIV1OY,HIV2SB: 0.386114
HIV1OY,HIV2ST: 0.390987
HIV1OY,SIVCZ: 0.131547
HIV1OY,Smanga_S4: 0.388551
HIV1OY,Smanga_SP: 0.388551
HIV1PV,HIV1U4: 0.0828258
HIV1PV,HIV1Z2: 0.045067
HIV1PV,HIV2CA: 0.401949
HIV1PV,HIV2D1: 0.404385
HIV1PV,HIV2G1: 0.394641
HIV1PV,HIV2KR: 0.394641
HIV1PV,HIV2RO: 0.397077
HIV1PV,HIV2SB: 0.390987
HIV1PV,HIV2ST: 0.393423
HIV1PV,SIVCZ: 0.130329
HIV1PV,Smanga_S4: 0.388551
HIV1PV,Smanga_SP: 0.388551
HIV1U4,HIV1Z2: 0.0767357
HIV1U4,HIV2CA: 0.398295
HIV1U4,HIV2D1: 0.403167
HIV1U4,HIV2G1: 0.392205
HIV1U4,HIV2KR: 0.395859
HIV1U4,HIV2RO: 0.394641
HIV1U4,HIV2SB: 0.394641
HIV1U4,HIV2ST: 0.397077
HIV1U4,SIVCZ: 0.137637
HIV1U4,Smanga_S4: 0.400731
HIV1U4,Smanga_SP: 0.399513
HIV1Z2,HIV2CA: 0.393423
HIV1Z2,HIV2D1: 0.397077
HIV1Z2,HIV2G1: 0.387333
HIV1Z2,HIV2KR: 0.388551
HIV1Z2,HIV2RO: 0.389769
HIV1Z2,HIV2SB: 0.389769
HIV1Z2,HIV2ST: 0.389769
HIV1Z2,SIVCZ: 0.125457
HIV1Z2,Smanga_S4: 0.388551
HIV1Z2,Smanga_SP: 0.388551
HIV2CA,HIV2D1: 0.0816078
HIV2CA,HIV2G1: 0.0694275
HIV2CA,HIV2KR: 0.0645554
HIV2CA,HIV2RO: 0.0511571
HIV2CA,HIV2SB: 0.0682095
HIV2CA,HIV2ST: 0.0657734
HIV2CA,SIVCZ: 0.392205
HIV2CA,Smanga_S4: 0.125457
HIV2CA,Smanga_SP: 0.120585
HIV2D1,HIV2G1: 0.0511571
HIV2D1,HIV2KR: 0.0840438
HIV2D1,HIV2RO: 0.088916
HIV2D1,HIV2SB: 0.09257
HIV2D1,HIV2ST: 0.0864799
HIV2D1,SIVCZ: 0.397077
HIV2D1,Smanga_S4: 0.131547
HIV2D1,Smanga_SP: 0.129111
HIV2G1,HIV2KR: 0.0779537
HIV2G1,HIV2RO: 0.0730816
HIV2G1,HIV2SB: 0.0791717
HIV2G1,HIV2ST: 0.0767357
HIV2G1,SIVCZ: 0.394641
HIV2G1,Smanga_S4: 0.127893
HIV2G1,Smanga_SP: 0.121803
HIV2KR,HIV2RO: 0.0645554
HIV2KR,HIV2SB: 0.0633374
HIV2KR,HIV2ST: 0.0572473
HIV2KR,SIVCZ: 0.392205
HIV2KR,Smanga_S4: 0.118149
HIV2KR,Smanga_SP: 0.112058
HIV2RO,HIV2SB: 0.0682095
HIV2RO,HIV2ST: 0.0621194
HIV2RO,SIVCZ: 0.386114
HIV2RO,Smanga_S4: 0.120585
HIV2RO,Smanga_SP: 0.118149
HIV2SB,HIV2ST: 0.0657734
HIV2SB,SIVCZ: 0.389769
HIV2SB,Smanga_S4: 0.126675
HIV2SB,Smanga_SP: 0.123021
HIV2ST,SIVCZ: 0.394641
HIV2ST,Smanga_S4: 0.116931
HIV2ST,Smanga_SP: 0.115713
SIVCZ,Smanga_S4: 0.388551
SIVCZ,Smanga_SP: 0.388551
Smanga_S4,Smanga_SP: 0.0146163
</pre>
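
The first part of the file stores the same distances in a compact upper-triangular layout, as described by its header comment: row 1 holds d(1,2)...d(1,n), row 2 holds d(2,3)...d(2,n), and so on. The Python sketch below shows one way to expand that layout into a symmetric lookup table. It uses a 4-sequence subset with values taken from the listing above; the function name is an illustrative choice. Note that in the real file the line with the sequence names comes after the matrix rows.

```python
def read_upper_triangle(rows, names):
    """Build a symmetric distance lookup from rows d(1,2)..d(1,n), d(2,3)..d(2,n), ..."""
    n = len(names)
    dist = {}
    for i, row in enumerate(rows):
        values = [float(v) for v in row.split()]
        assert len(values) == n - 1 - i, "row length does not match the triangle"
        for offset, d in enumerate(values):
            j = i + 1 + offset
            dist[names[i], names[j]] = d
            dist[names[j], names[i]] = d  # distances are symmetric
    return dist

# Four-sequence subset; the values are taken from the listing above.
names = ["HTLV", "HIV1B5", "HIV1H2", "SIVCZ"]
rows = [
    "0.750305 0.751523 0.747868",  # HTLV vs HIV1B5, HIV1H2, SIVCZ
    "0.0158343 0.130329",          # HIV1B5 vs HIV1H2, SIVCZ
    "0.129111",                    # HIV1H2 vs SIVCZ
]

dist = read_upper_triangle(rows, names)
print(dist["SIVCZ", "HIV1B5"])
```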

==Step 3==
Here is a picture of the NJ tree:

[[File:Pol21-NJ_tree.png]]

The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.

==Step 4==
Here is an unrooted tree:

[[File:Pol21-NJ_tree.unrooted.png]]

==Step 5==
Here is a rearranged (swapped) tree:

[[File:Pol21-NJ_tree.swapped.png]]

==Step 6==
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).
* Further answers to "The Phylogeny of HIV" can be found [https://teaching.healthtech.dtu.dk/material/36611/files/binfintro/hiv_origin.html here].
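
The sister-group relationships read from the tree are consistent with the raw distances from Step 2. The Python sketch below compares a few averages; the numbers are copied from the Step 2 listing, but only a subset of the sequences is used, so this is an illustration and not a replacement for the tree.

```python
from statistics import mean

# Distances copied from the Step 2 listing (a subset, for illustration).
hiv1_to_sivcz = [0.130329, 0.129111, 0.130488, 0.127893]      # HIV1B5, HIV1H2, HIV1MN, HIV1N5
hiv1_to_hiv2ca = [0.399513, 0.399513, 0.4, 0.397077]          # the same four HIV1 sequences
hiv2_to_smanga_s4 = [0.125457, 0.131547, 0.127893, 0.118149]  # HIV2CA, HIV2D1, HIV2G1, HIV2KR
hiv2_to_sivcz = [0.392205, 0.397077, 0.394641, 0.392205]      # the same four HIV2 sequences

print("HIV1 to SIVCZ:    ", round(mean(hiv1_to_sivcz), 3))
print("HIV1 to HIV2CA:   ", round(mean(hiv1_to_hiv2ca), 3))
print("HIV2 to Smanga_S4:", round(mean(hiv2_to_smanga_s4), 3))
print("HIV2 to SIVCZ:    ", round(mean(hiv2_to_sivcz), 3))
```

In this subset, the HIV1 sequences are about three times closer to SIVCZ than to HIV2, and the HIV2 sequences are about three times closer to Smanga than to SIVCZ.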

==Step 7==
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the "best of both worlds": it takes amino acid similarities into account when aligning, while the aligned DNA still retains the silent (synonymous) differences that do not change the protein. The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.

Here is the tree made ignoring gap positions:

[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]

And here is the tree made taking gap positions into account:

[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]

There is one difference in the tree topology between the two trees: In the one made without the gap positions, Rice is together with Fruit fly within the animal subtree, while in the other tree, Rice is together with the two other plants. Since Rice is a plant, the tree taking gap positions into account is the most correct one. '''Note:''' This is not always the case!

==Step 8==
On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group ''Tetrapoda''.

[[file:salmon_frog.png|center|frame]]

There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy's "Common Tree" function (see illustration below).

First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group ''Euarchontoglires''.

Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group ''Opisthokonta''.

[[file:L18_Common_Taxonomy_Tree.png|center|frame]]

It is often seen that a phylogeny based on a single gene differs from the real phylogeny of the species. There are a number of reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This phenomenon tends to disappear as more sequence data is included in the analysis (the law of large numbers).

==Step 9==
# 53 results. Search string:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)</tt>
# 8 and 26 results, respectively. Search strings:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)</tt> and  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)</tt> Under the Download tab in UniProt, select "Download all", "FASTA (canonical)" and "Uncompressed".
# Then use a plain text editor to combine them. Combined FASTA file is here: [[Media:Ribosomal_proteins_34.fasta.txt]]
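
The search strings above can also be turned into download URLs programmatically. The Python sketch below only builds the URLs and does not contact any server. It assumes the UniProt REST search endpoint at <tt>rest.uniprot.org/uniprotkb/search</tt> with <tt>query</tt> and <tt>format</tt> parameters; check the current UniProt API documentation before relying on it. The function and variable names are illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint (see the note above); not taken from the exercise text.
BASE = "https://rest.uniprot.org/uniprotkb/search"

# The part of the query shared by all searches in this step.
common = ('(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) '
          'AND (fragment:false) AND (reviewed:true)')

def search_url(location_term):
    """Return a FASTA search URL restricted to one subcellular location term."""
    query = f"{common} AND (cc_scl_term:{location_term})"
    return BASE + "?" + urlencode({"query": query, "format": "fasta"})

url_0173 = search_url("SL-0173")
url_0086 = search_url("SL-0086")
print(url_0173)
print(url_0086)
```

Using <tt>urlencode</tt> takes care of escaping the quotes, colons and spaces in the query, which is easy to get wrong when editing a URL by hand.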

==Step 10==
Open the FASTA file with the 34 ribosomal protein sequences in Seaview, make sure Alignment options is set to "clustalo", and align all sequences. Then make an NJ tree (with Ignore all gap sites unchecked) and change the view to "circular". Here is the result:

[[File:Ribosomal_proteins_34-NJ_tree.unrooted.png]]

And [[Media:Ribosomal_proteins_34-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.

== Step 11 ==
Here is the rerooted tree made by Seaview:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-Seaview.png]]

==Step 12==
Here is the rerooted tree made by iTOL:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]

Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which method is most correct.

==Step 13==
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and the green circles marking the most recent common ancestor of human and mouse:

[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]

==Step 14==
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
# There are two differences: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondria, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both aspects, the cytoplasmic tree is more correct.
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).

Exercise: Phylogeny - Answers (Seaview version)

2024-03-15T11:30:52Z

WikiSysop: /* Step 1 */

Exercise: Phylogeny - Answers (Seaview version)

2024-03-15T11:27:55Z

WikiSysop: Created page with "== Step 1 == Here is a PDF with the aligned sequences. ==Step 2== Here is the text file with the pairwise distances. It is clear that the sequence HTLV shows larger distances than all the other sequences, with all distances being above 0.7. ==Step3== Here is a picture of the NJ tree: File:Pol21-NJ_tree.png The longest branch is the one leading to HTLV, which is in good agreement with the observation in the prev..."

== Step 1 ==
[[Media:Pol21.aligned.pdf|Here]] is a PDF with the aligned sequences.

==Step 2==
[[Media:Pol21.dist.txt|Here]] is the text file with the pairwise distances. It is clear that the sequence HTLV shows larger distances than all the other sequences, with all distances being above 0.7.

==Step 3==
Here is a picture of the NJ tree:

[[File:Pol21-NJ_tree.png]]

The longest branch is the one leading to HTLV, which is in good agreement with the observation in the previous question.

==Step 4==
Here is an unrooted tree:

[[File:Pol21-NJ_tree.unrooted.png]]

==Step 5==
Here is a rearranged (swapped) tree:

[[File:Pol21-NJ_tree.swapped.png]]

==Step 6==
* The sister group to the HIV1 sequences is SIVCZ (Chimpanzee SIV).
* The sister group to the HIV2 sequences is Smanga (Sooty Mangabey SIV).
* Further answers to "The Phylogeny of HIV" can be found [https://teaching.healthtech.dtu.dk/material/36611/files/binfintro/hiv_origin.html here].

==Step 7==
There are several correct ways of doing this, since you can choose between several alignment methods. It could be argued that RevTrans is the most correct option, since we have coding DNA, and RevTrans gives us the "best of both worlds": it takes amino acid similarities into account when aligning, while the aligned DNA still retains the silent (synonymous) differences that do not change the protein. The trees below have been constructed using RevTrans. However, aligning the DNA directly with Clustal Omega in Seaview produces almost identical results and leads to the same conclusion.

Here is the tree made ignoring gap positions:

[[File:L18_CDS-NJ_tree.revtrans.nogaps.png]]

And here is the tree made taking gap positions into account:

[[File:L18_CDS-NJ_tree.revtrans.wgaps.png]]

There is one difference in the tree topology between the two trees: In the one made without the gap positions, Rice is together with Fruit fly within the animal subtree, while in the other tree, Rice is together with the two other plants. Since Rice is a plant, the tree taking gap positions into account is the most correct one. '''Note:''' This is not always the case!

==Step 8==
On the whole, the structure of this tree is exactly as we would expect it, based on the known phylogeny. However, the placement of salmon and frog together in a monophyletic group is not correct. The correct species phylogeny would have salmon branching out before frog, which would branch out before the group of mammals (see illustration below). Mammals and frogs belong together in the group ''Tetrapoda''.

[[file:salmon_frog.png|center|frame]]

There are two additional errors, which are not as easy to detect but can be seen if all the taxa are compared using NCBI Taxonomy's "Common Tree" function (see illustration below).

First, the group of Human+Macaque is placed as a sister group to Pig+Whale, which is not correct. Human+Macaque should have been a sister group to Rat+Mouse, since primates and rodents belong together in the group ''Euarchontoglires''.

Second, yeast is placed further from the animals than the plants are — that is also not correct. Yeast (and indeed all Fungi) actually belong together with the animals in the group ''Opisthokonta''.

[[file:L18_Common_Taxonomy_Tree.png|center|frame]]

It is often seen that a phylogeny based on a single gene differs from the real phylogeny of the species. There are a number of reasons why this happens, but an important one is simply the stochastic nature of mutations: occasionally a gene will be most similar to the gene from a non-sister species, for entirely random reasons. This phenomenon tends to disappear as more sequence data is included in the analysis (the law of large numbers).

==Step 9==
# 53 results. Search string:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true)</tt>
# 8 and 26 results, respectively. Search strings:  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0173)</tt> and  <tt>(protein_name:"ribosomal protein l3") AND (taxonomy_id:2759) AND (fragment:false) AND (reviewed:true) AND (cc_scl_term:SL-0086)</tt> Under the Download tab in UniProt, select "Download all", "FASTA (canonical)" and "Uncompressed".
# Then use a plain text editor to combine them. Combined FASTA file is here: [[Media:Ribosomal_proteins_34.fasta.txt]]

==Step 10==
Open the FASTA file with the 34 ribosomal protein sequences in Seaview, make sure Alignment options is set to "clustalo", and align all sequences. Then make an NJ tree (with Ignore all gap sites unchecked) and change the view to "circular". Here is the result:

[[File:Ribosomal_proteins_34-NJ_tree.unrooted.png]]

And [[Media:Ribosomal_proteins_34-NJ_tree.unrooted.newick.txt|here]] is the unrooted Newick tree file.

== Step 11 ==
Here is the rerooted tree made by Seaview:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-Seaview.png]]

==Step 12==
Here is the rerooted tree made by iTOL:

[[File:Ribosomal_proteins_34-NJ_tree.rerooted-iTOL.png]]

Yes, there is a difference: The tree from iTOL has the mitochondrial tips further to the right, while the tree from Seaview has the mitochondrial tips approximately aligned with the cytoplasmic ones. Note that when you select a branch for rerooting, the exact placement of the root on that branch is arbitrary. iTOL chooses the midpoint of the selected branch, while Seaview chooses a point that is closer to the midpoint of the entire tree. Without external information, it is not possible to say which method is most correct.

==Step 13==
Here is the annotated tree, with blue circles marking the most recent common ancestor of human and yeast, and the green circles marking the most recent common ancestor of human and mouse:

[[File:Ribosomal_proteins_34-NJ_tree.annotated-iTOL.png]]

==Step 14==
# The mitochondrial proteins are more closely related to each other than to their respective cytoplasmic counterparts. This could indicate that mitochondria have appeared only once in evolution.
# There are two differences: In the mitochondria, Bovine (cow) is the sister group to Human, while in the cytoplasmic proteins, Mouse+Rat comprise the sister group to Human+Macaque. Also, in the mitochondria, Yeast branches out before Arabidopsis on the way to Human, while in the cytoplasmic proteins, the plants including Arabidopsis branch out (slightly) before the fungi including Yeast. In both aspects, the cytoplasmic tree is more correct.
# There are more mutations per time unit in the mitochondrial part of the tree. This is evident from the fact that the horizontal distance between the blue and the green circle is larger in the mitochondrial subtree (by approximately a factor of 2). Note that the two blue circles represent the same time point in evolutionary history, as do the two green circles. Note also that the branch lengths are proportional to the number of substitutions (accepted mutations).

Exercise: Multiple Alignments Answers (Seaview version)

2024-03-15T11:26:27Z

WikiSysop: /* Question 1 */

By: [http://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson]

==Question 1==
FASTA file:

>pigeon_alpha-D-globin
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA
>pigeon_alpha-A-globin
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGCCAGGCCGGTG
ACTTGGGTGGTGAAGCCCTGGAGAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTT
CGACCTGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTGAGGCT
GCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACGCCCAAAAGCTCCGTG
TGGACCCCGTCAACTTCAAACTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTTCCCCTCTCT
CCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGGCACCGTCCTTACTGCCAAG
TACCGTTAA
>duck_alpha-D-globin
ATGCTGACCGCCGAGGACAAGAAGCTCATCGTGCAGGTGTGGGAGAAGGTGGCTGGCCACCAGGAGGAAT
TCGGAAGTGAAGCTCTGCAGAGGATGTTCCTCGCCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGA
CCTGCATCCCGGCTCTGAACAGGTCCGTGGCCATGGCAAGAAAGTGGCGGCTGCCCTGGGCAATGCCGTG
AAGAGCCTGGACAACCTCAGCCAGGCCCTGTCTGAGCTCAGCAACCTGCATGCCTACAACCTGCGTGTTG
ACCCTGTCAACTTCAAGCTGCTGGCACAGTGCTTCCAGGTGGTGCTGGCCGCACACCTGGGCAAAGACTA
CAGCCCCGAGATGCATGCTGCCTTTGACAAGTTCTTGTCCGCCGTGGCTGCCGTGCTGGCTGAAAAGTAC
AGATGA
>duck_alpha-A-globin
ATGGTGCTGTCTGCGGCTGACAAGACCAACGTCAAGGGTGTCTTCTCCAAAATCGGTGGCCATGCTGAGG
AGTATGGCGCCGAGACCCTGGAGAGGATGTTCATCGCCTACCCCCAGACCAAGACCTACTTCCCCCACTT
TGACCTGCAGCACGGCTCTGCTCAGATCAAGGCCCATGGCAAGAAGGTGGCGGCTGCCCTAGTTGAAGCT
GTCAACCACATCGATGACATTGCGGGTGCTCTCTCCAAGCTCAGTGACCTCCACGCCCAAAAGCTCCGTG
TGGACCCTGTCAACTTCAAATTCCTGGGCCACTGCTTCCTGGTGGTGGTTGCCATCCACCACCCCGCTGC
CCTGACCCCAGAGGTCCACGCTTCCCTGGACAAGTTCATGTGCGCCGTGGGTGCTGTGCTGACTGCCAAG
TACCGTTAG
>Goat_alpha-i-globin
ATGGTGCTGTCTGCCGCCGACAAGTCCAATGTCAAGGCCGCCTGGGGCAAGGTTGGCGGCAACGCTGGAG
CTTATGGCGCAGAGGCTCTGGAGAGGATGTTCCTGAGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGACCTGAGCCACGGCTCGGCCCAGGTCAAGGGCCACGGCGAGAAGGTGGCCGCCGCGCTGACCAAAGCG
GTGGGCCACCTGGACGACCTGCCCGGTACTCTGTCTGATCTGAGTGACCTGCACGCCCACAAGCTGCGTG
TGGACCCGGTCAACTTTAAGCTTCTGAGCCACTCCCTGCTGGTGACCCTGGCCTGCCACCTCCCCAATGA
TTTCACCCCCGCGGTCCACGCCTCCCTGGACAAGTTCTTGGCCAACGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Goat_alpha-ii-globin
ATGGTGCTGTCTGCCGCCGACAAGTCCAATGTCAAGGCCGCCTGGGGCAAGGTTGGCAGCAACGCTGGAG
CTTATGGCGCAGAGGCTCTGGAGAGGATGTTCCTGAGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGACCTGAGCCACGGCTCGGCCCAGGTCAAGGGCCACGGCGAGAAGGTGGCCGCCGCGCTGACCAAAGCG
GTGGGCCACCTGGACGACCTGCCCGGTACTCTGTCTGATCTGAGTGACCTGCACGCCCACAAGCTGCGTG
TGGACCCGGTCAACTTTAAGCTTCTGAGCCACTCCCTGCTGGTGACCCTGGCCTGCCACCACCCCAGTGA
TTTCACCCCCGCGGTCCACGCCTCCCTGGACAAGTTCTTGGCCAACGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Horse_alpha-1_globin
ATGGTGCTGTCTGCCGCCGACAAGACCAACGTCAAGGCCGCCTGGAGTAAGGTTGGCGGCCACGCTGGCG
AGTTTGGCGCAGAGGCCCTAGAGAGGATGTTCCTGGGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGATCTGAGCCACGGCTCCGCCCAGGTCAAGGCCCACGGCAAGAAGGTGGGCGACGCGCTGACTCTCGCC
GTGGGCCACCTGGACGACCTGCCTGGCGCCCTGTCGAATCTGAGCGACCTGCACGCACACAAGCTGCGCG
TGGACCCCGTCAACTTCAAGCTTCTGAGTCATTGCCTGCTGTCCACCTTGGCCGTCCACCTCCCCAACGA
TTTCACCCCTGCCGTCCACGCCTCCCTGGACAAGTTCTTGAGCAGTGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Horse_alpha-2_globin
ATGGTGCTGTCTGCCGCCGACAAGACCAACGTCAAGGCCGCCTGGAGTAAGGTTGGCGGCCACGCTGGCG
AGTATGGCGCAGAGGCCCTAGAGAGGATGTTCCTGGGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGATCTGAGCCACGGCTCCGCCCAGGTCAAGGCCCACGGCCAGAAGGTGGGCGACGCGCTGACTCTCGCC
GTGGGCCACCTGGACGACCTGCCTGGCGCCCTGTCGAATCTGAGCGACCTGCACGCACACAAGCTGCGCG
TGGACCCCGTCAACTTCAAGCTCCTGAGTCATTGCCTGCTGTCCACCTTGGCCGTCCACCTCCCCAACGA
TTTCACCCCTGCCGTCCACGCCTCCCTGGACAAGTTCTTGAGCAGTGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Chicken_alpha-D
ATGCTGACTGCCGAGGACAAGAAGCTCATCCAGCAGGCCTGGGAGAGGGCCGCTTCCCACCAGGAGGAGT
TTGGAGCTGAGGCTCTGACTAGGATGTTCACCACCTATCCCCAGACCAAGACCTACTTCCCCCACTTCGA
CCTTTCGCCTGGCTCTGACCAGGTCCGTGGCCATGGCAAGAAGGTGTTGGGTGCCCTGGGCAACGCCGTG
AAGAACGTGGACAACCTCAGCCAGGCCATGGCTGAGCTGAGCAACCTGCATGCCTACAACCTGCGTGTTG
ACCCCGTCAATTTCAAGCTGTTGTCGCAGTGCATCCAGGTGGTGCTGGCTGTACACATGGGCAAAGACTA
CACCCCTGAAGTGCATGCTGCCTTCGACAAGTTCCTGTCTGCCGTGTCTGCTGTGCTGGCTGAGAAGTAC
AGATAA
>Chicken_alpha-A
ATGGTGCTGTCCGCTGCTGACAAGAACAACGTCAAGGGCATCTTCACCAAAATCGCCGGCCATGCTGAGG
AGTATGGCGCCGAGACCCTGGAAAGGATGTTCACCACCTACCCCCCAACCAAGACCTACTTCCCCCACTT
CGATCTGTCACACGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTAGTGGCTGCCTTGATCGAGGCT
GCCAACCACATTGATGACATCGCCGGCACCCTCTCCAAGCTCAGCGACCTCCATGCCCACAAGCTCCGCG
TGGACCCTGTCAACTTCAAACTCCTGGGCCAATGCTTCCTGGTGGTGGTGGCCATCCACCACCCTGCTGC
CCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCTTGTGCGCCGTGGGCACTGTGCTGACCGCCAAG
TACCGTTAA

'''NOTICE''':
* It is essential to use SHORT descriptive names. In the ClustalW format alignment, only the first 15 characters of the names are shown, so if you have very long names the output can be hard to read (see also [https://teaching.healthtech.dtu.dk/material/22111/GenBank+FASTA_handout_revised.pdf the FASTA handout from week 2]).
* Spaces cannot be part of the names in a FASTA file. If there are spaces, only the first word after "<tt>></tt>" counts as the name, and any subsequent words are treated as comments. If I had used spaces instead of underscores ("<tt>_</tt>") in the file above, the names would not have been unique ("duck" would have been used twice, etc.).
* Be aware that in GenBank entries containing several genes (see [https://teaching.healthtech.dtu.dk/material/22111/GenBank+FASTA_handout_revised.pdf the GenBank handout from week 2]), the name of the individual gene (CDS) is found within the feature table. When you click on a CDS containing "<tt>/gene_name=XYZ</tt>" or similar, it is therefore XYZ you need to use as name in your FASTA file, not the collective title for the entire GenBank entry (e.g. "<tt>Alpha-A and Alpha-D genes ...</tt>" or "<tt>Yeast Chromosome 2</tt>"). See also [https://teaching.healthtech.dtu.dk/material/22111/MultiGeneScreenshot-en.pdf the screenshot/handout from the exercise].
When you build a "real" dataset for a research project, it is often an iterative process, where you 1) collect your data, 2) weed out outliers, 3) run an analysis, and repeat 2) and 3) until you are satisfied with the results.
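The naming rules in the notice above can be checked with a few lines of Python. This is only a minimal sketch: the helper names <tt>fasta_names</tt> and <tt>duplicated</tt> are made up for this illustration, and the two-record inputs are shortened toy versions of the file above.

```python
def fasta_names(text):
    """Return the name of every FASTA record: the first word after '>'."""
    names = []
    for line in text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def duplicated(names):
    """Return the names that occur more than once, in order of first repeat."""
    seen, repeats = set(), []
    for name in names:
        if name in seen and name not in repeats:
            repeats.append(name)
        seen.add(name)
    return repeats


# Toy inputs (sequences shortened to a single codon for the illustration).
with_underscores = ">duck_alpha-D-globin\nATG\n>duck_alpha-A-globin\nATG\n"
with_spaces = ">duck alpha-D-globin\nATG\n>duck alpha-A-globin\nATG\n"

print(duplicated(fasta_names(with_underscores)))        # no repeated names
print(duplicated(fasta_names(with_spaces)))             # "duck" is repeated
print([n[:15] for n in fasta_names(with_underscores)])  # names cut to 15 characters
```

Running such a check before the alignment step catches name clashes early, and the last line shows what remains of each name when only 15 characters are displayed.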

==Question 2==
===QUESTION 2a===
Your screenshot of the 3' part of the alignment should look something like this: [[Image:Seaview-Q2-aligned.png|Seaview showing aligned sequences]]
===QUESTION 2b===
#Your tree should look like this: [[Image:Seaview-Q2-tree.png|Seaview tree]]
#There are three clusters: one for Alpha-A (birds only), one for Alpha-D (birds only), and one for Alpha-1 + Alpha-2 (mammals). The point here is that birds and mammals are not intermixed, so the sequences are placed "naturally" in a taxonomic sense.
#Alpha-A and Alpha-D are clearly in two different clusters, which must mean that the split between them is old. Since both Alpha-A and Alpha-D exist in all three birds we included, the split must be older than the last common ancestor of the birds.
#Alpha-1 and Alpha-2 seem to be much more closely related to each other than Alpha-A and Alpha-D are.
===QUESTION 2c===
There is a single stretch of >15 nucleotides (23 to be precise) which is perfectly conserved. Its sequence is <tt>ACCAAGACCTACTTCCCCCACTT</tt>.
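Finding perfectly conserved stretches can also be done programmatically instead of by eye. The following is a sketch under simple assumptions: the alignment is given as a list of equally long strings with "<tt>-</tt>" as the gap character, and the function name <tt>conserved_runs</tt> and the three-row toy alignment are invented for the illustration (they are not the globin data).

```python
def conserved_runs(alignment, min_len):
    """Return (start, run) for each maximal stretch of at least min_len
    columns in which all rows carry the same non-gap character."""
    length = len(alignment[0])
    assert all(len(row) == length for row in alignment), "rows must be aligned"
    runs, start = [], None
    # Iterate one position past the end so a run reaching the last column is closed.
    for i in range(length + 1):
        conserved = (
            i < length
            and alignment[0][i] != "-"
            and all(row[i] == alignment[0][i] for row in alignment)
        )
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                runs.append((start, alignment[0][start:i]))
            start = None
    return runs


# Toy alignment, not the real data.
toy = [
    "ATGACCAAGTTC",
    "ATGACCAAGTAC",
    "CTGACCAAG-AC",
]
print(conserved_runs(toy, 5))  # [(1, 'TGACCAAG')]
```

The start positions are 0-based column indices in the alignment. Applied to the real nucleotide alignment with <tt>min_len=16</tt>, a function like this would be expected to report the single 23-nucleotide stretch given above.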

==Question 3==
The sequences are translated using Virtual Ribosome, giving rise to the following FASTA file:

>pigeon_alpha-D-globin
MLTDSDKKLVLQVWEKVIRHPDCGAEALERLFTTYPQTKTYFPHFDLHHGSDQVRNHGKK
VLAALGNAVKSLGNLSQALSDLSDLHAYNLRVDPVNFKLLAQCFHVVLATHLGNDYTPEA
HAAFDKFLSAVCTVLAEKYR*
>pigeon_alpha-A-globin
MVLSANDKSNVKAVFGKIGGQAGDLGGEALERLFITYPQTKTYFPHFDLSHGSAQIKGHG
KKVAEALVEAANHIDDIAGALSKLSDLHAQKLRVDPVNFKLLGHCFLVVVAVHFPSLLTP
EVHASLDKFVCAVGTVLTAKYR*
>duck_alpha-D-globin
MLTAEDKKLIVQVWEKVAGHQEEFGSEALQRMFLAYPQTKTYFPHFDLHPGSEQVRGHGK
KVAAALGNAVKSLDNLSQALSELSNLHAYNLRVDPVNFKLLAQCFQVVLAAHLGKDYSPE
MHAAFDKFLSAVAAVLAEKYR*
>duck_alpha-A-globin
MVLSAADKTNVKGVFSKIGGHAEEYGAETLERMFIAYPQTKTYFPHFDLQHGSAQIKAHG
KKVAAALVEAVNHIDDIAGALSKLSDLHAQKLRVDPVNFKFLGHCFLVVVAIHHPAALTP
EVHASLDKFMCAVGAVLTAKYR*
>Goat_alpha-i-globin
MVLSAADKSNVKAAWGKVGGNAGAYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
EKVAAALTKAVGHLDDLPGTLSDLSDLHAHKLRVDPVNFKLLSHSLLVTLACHLPNDFTP
AVHASLDKFLANVSTVLTSKYR*
>Goat_alpha-ii-globin
MVLSAADKSNVKAAWGKVGSNAGAYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
EKVAAALTKAVGHLDDLPGTLSDLSDLHAHKLRVDPVNFKLLSHSLLVTLACHHPSDFTP
AVHASLDKFLANVSTVLTSKYR*
>Horse_alpha-1_globin
MVLSAADKTNVKAAWSKVGGHAGEFGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHG
KKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTP
AVHASLDKFLSSVSTVLTSKYR*
>Horse_alpha-2_globin
MVLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHG
QKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTP
AVHASLDKFLSSVSTVLTSKYR*
>Chicken_alpha-D
MLTAEDKKLIQQAWERAASHQEEFGAEALTRMFTTYPQTKTYFPHFDLSPGSDQVRGHGK
KVLGALGNAVKNVDNLSQAMAELSNLHAYNLRVDPVNFKLLSQCIQVVLAVHMGKDYTPE
VHAAFDKFLSAVSAVLAEKYR*
>Chicken_alpha-A
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYFPHFDLSHGSAQIKGHG
KKVVAALIEAANHIDDIAGTLSKLSDLHAHKLRVDPVNFKLLGQCFLVVVAIHHPAALTP
EVHASLDKFLCAVGTVLTAKYR*

Subsequently, they are aligned with Clustal Omega.

Observations:
* The tree at the protein level is by and large the same as the tree at the DNA level (with small differences in the branch lengths).
* Now, two completely conserved regions of >5 amino acids are seen. Their sequences are <tt>TKTYFPHFDL</tt> and <tt>LRVDPVNFK</tt>.
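The translation step can be illustrated with a small Python function. This is not how Virtual Ribosome is implemented; it is a minimal sketch that assumes the standard genetic code and reading frame 1, and the names <tt>CODON_TABLE</tt> and <tt>translate</tt> are chosen for this example only.

```python
# Standard genetic code, written in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1; '*' marks a stop codon and
    'X' marks any codon that contains a non-ACGT character."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


# First eight and last four codons of pigeon_alpha-D-globin from Question 1.
print(translate("ATGCTGACCGACTCTGACAAGAAG"))  # MLTDSDKK
print(translate("AAGTACAGATAA"))              # KYR*
```

The two printed fragments agree with the start and the end of the <tt>pigeon_alpha-D-globin</tt> protein sequence listed above.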

==Question 4==
FASTA file:

>Sheep_U00659
ATGGCCCTGTGGACACGCCTGGTGCCCCTGCTGGCCCTGCTGGCACTCTGGGCCCCCGCC
CCGGCCCACGCCTTCGTCAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGAGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGCCGGGAGGTGGAGGGC
CCCCAGGTGGGGGCGCTGGAGCTGGCCGGAGGCCCCGGCGCGGGTGGCCTGGAGGGGCCC
CCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCGCCGGCGTCTGCTCTCTCTACCAGCTG
GAGAACTACTGTAACTAG
>Pig_AY044828
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242098
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242100
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242101
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242109
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTAGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Dog_V00179
ATGGCCCTCTGGATGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCG
CCCACCCGAGCCTTCGTTAACCAGCACCTGTGTGGCTCCCACCTGGTAGAGGCTCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCTAAGGCCCGCAGGGAGGTGGAGGAC
CTGCAGGTGAGGGACGTGGAGCTGGCCGGGGCGCCTGGCGAGGGCGGCCTGCAGCCCCTG
GCCCTGGAGGGGGCCCTGCAGAAGCGAGGCATCGTGGAGCAGTGCTGCACCAGCATCTGC
TCCCTCTACCAGCTGGAGAATTACTGCAACTAG
>OwlMonkey_J02989
ATGGCCCTGTGGATGCACCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCCGAG
CCAGCCCCGGCCTTTGTGAACCAGCACCTGTGCGGCCCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGTTTCTTCTACGCACCCAAGACCCGCCGGGAGGCGGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGTGGGGGCTCTATCACGGGCAGCCTGCCACCCTTG
GAGGGTCCCATGCAGAAGCGTGGCGTCGTGGATCAGTGCTGCACCAGCATCTGCTCCCTC
TACCAGCTGCAGAACTACTGCAACTAG
>Human_AY138590
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>GreenMonkey_X61092
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCGGTCCCGGCCTTTGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGCTTCTTCTACACGCCCAAGACCCGCCGGGAGGCAGAGGAC
CCGCAGGTGGGGCAGGTAGAGCTGGGCGGGGGCCCTGGCGCAGGCAGCCTGCAGCCCTTG
GCGCTGGAGGGGTCCCTGCAGAAGCGCGGCATCGTGGAGCAGTGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>Human_J00265
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>Chimp_X61089
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGTGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCTCGGCCTTTGTGAACCAACACCTGTGCGGCTCCCACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGTATCGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>GuineaPig_K02233
ATGGCTCTGTGGATGCATCTCCTCACCGTGCTGGCCCTGCTGGCCCTCTGGGGGCCCAAC
ACTAATCAGGCCTTTGTCAGCCGGCATCTGTGCGGCTCCAACTTAGTGGAGACATTGTAT
TCAGTGTGTCAGGATGATGGCTTCTTCTATATACCCAAGGACCGTCGGGAGCTAGAGGAC
CCACAGGTGGAGCAGACAGAACTGGGCATGGGCCTGGGGGCAGGTGGACTACAGCCCTTG
GCACTGGAGATGGCACTACAGAAGCGTGGCATTGTGGATCAGTGCTGTACTGGCACCTGC
ACACGCCACCAGCTGCAGAGCTACTGCAACTAG
>Mouse_X04725
ATGGCCCTGTTGGTGCACTTCCTACCCCTGCTGGCCCTGCTTGCCCTCTGGGAGCCCAAA
CCCACCCAGGCTTTTGTCAAACAGCATCTTTGTGGTCCCCACCTGGTAGAGGCTCTCTAC
CTGGTGTGTGGGGAGCGTGGCTTCTTCTACACACCCAAGTCCCGCCGTGAAGTGGAGGAC
CCACAAGTGGAACAACTGGAGCTGGGAGGAAGCCCCGGGGACCTTCAGACCTTGGCGTTG
GAGGTGGCCCGGCAGAAGCGTGGCATTGTGGATCAGTGCTGCACCAGCATCTGCTCCCTC
TACCAGCTGGAGAACTACTGCAACTAA
>Chicken_AY438372
ATGGCTCTCTGGATCCGATCACTGCCTCTTCTGGCTCTCCTTGTCTTTTCTGGCCCTGGA
ACCAGCTATGCAGCTGCCAACCAGCACCTCTGTGGCTCCCACTTGGTGGAGGCTCTCTAC
CTGGTGTGTGGAGAGCGTGGCTTCTTCTACTCCCCCAAAGCCCGACGGGATGTCGAGCAG
CCCCTAGTGAGCAGTCCCTTGCGTGGCGAGGCAGGAGTGCTGCCTTTCCAGCAGGAGGAA
TACGAGAAAGTCAAGCGAGGGATTGTTGAGCAATGCTGCCATAACACGTGTTCCCTCTAC
CAACTGGAGAACTACTGCAACTAG
>SeaHare_AF160192
ATGAGCAAGTTCCTCCTCCAGAGCCACTCCGCCAACGCCTGCCTGCTCACCCTTCTGCTCACGCTGGCCT
CCAACCTCGACATATCCCTGGCCAACTTCGAGCACTCGTGCAACGGCTACATGCGGCCCCACCCGCGGGG
TCTGTGCGGCGAAGACCTGCACGTCATCATTTCCAACCTGTGCAGCTCTCTGGGGGGCAACAGGAGGTTC
CTGGCCAAGTACATGGTCAAAAGAGACACGGAAAATGTGAACGACAAGTTACGAGGGATCCTGCTCAATA
AGAAAGAAGCTTTCTCCTACTTGACCAAGAGAGAGGCCTCAGGCTCCATCACATGCGAATGTTGCTTCAA
CCAGTGTCGGATATTTGAGCTGGCTCAGTACTGCCGTCTGCCAGACCATTTCTTCTCCAGAATATCCAGA
ACCGGAAGGAGCAACAGTGGACATGCGCAGTTGGAGGACAACTTTAGTTA

==Question 5==
1. Yes, there are many gaps which are not multiples of 3 positions. The most obvious example is the second gap, which is 4 positions long (in all sequences but the Sea Hare, see below). The alignment algorithm is ''not'' aware that the sequences are protein coding; it only considers the DNA.

[[Image:Seaview-Q5.png]]

2. Sea Hare (a marine snail) stands out — this makes sense, since it is the only invertebrate.

3. It can be seen that the two human sequences are 100% identical (the distance is 0) — one of them can therefore be discarded — and for the pig, the following sequences are identical:

>Pig_AY044828
>Pig_AY242098

and

>Pig_AY242100
>Pig_AY242101

(two pig sequences can therefore be discarded).
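The check for gaps that are not multiples of 3 (point 1 above) can be sketched in Python as follows. The function names and the aligned row are hypothetical; the row is a made-up toy example, not a line from the actual Seaview alignment.

```python
import re


def gap_lengths(aligned_seq):
    """Return the length of every run of '-' in one row of an alignment."""
    return [len(m.group()) for m in re.finditer(r"-+", aligned_seq)]


def frame_breaking_gaps(aligned_seq):
    """Return the gap lengths that are not multiples of 3."""
    return [n for n in gap_lengths(aligned_seq) if n % 3 != 0]


# Toy aligned row with one gap of 4 positions and one gap of 3 positions.
row = "ATGGCC----CTGTGG---ACG"
print(gap_lengths(row))          # [4, 3]
print(frame_breaking_gaps(row))  # [4]
```

A gap whose length is not a multiple of 3 shifts the reading frame of everything downstream of it, which is why such gaps are biologically implausible inside a coding region.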

==Question 6==
The sequences are translated using Virtual Ribosome, yielding the following sequences:

>Sheep_U00659
MALWTRLVPLLALLALWAPAPAHAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEG
PQVGALELAGGPGAGGLEGPPQKRGIVEQCCAGVCSLYQLENYCN*
>Pig_AY044828
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242098
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242100
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242101
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242109
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Dog_V00179
MALWMRLLPLLALLALWAPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVED
LQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLENYCN*
>OwlMonkey_J02989
MALWMHLLPLLALLALWGPEPAPAFVNQHLCGPHLVEALYLVCGERGFFYAPKTRREAED
LQVGQVELGGGSITGSLPPLEGPMQKRGVVDQCCTSICSLYQLQNYCN*
>Human_AY138590
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>GreenMonkey_X61092
MALWMRLLPLLALLALWGPDPVPAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
PQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>Human_J00265
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>Chimp_X61089
MALWMRLLPLLVLLALWGPDPASAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>GuineaPig_K02233
MALWMHLLTVLALLALWGPNTNQAFVSRHLCGSNLVETLYSVCQDDGFFYIPKDRRELED
PQVEQTELGMGLGAGGLQPLALEMALQKRGIVDQCCTGTCTRHQLQSYCN*
>Mouse_X04725
MALLVHFLPLLALLALWEPKPTQAFVKQHLCGPHLVEALYLVCGERGFFYTPKSRREVED
PQVEQLELGGSPGDLQTLALEVARQKRGIVDQCCTSICSLYQLENYCN*
>Chicken_AY438372
MALWIRSLPLLALLVFSGPGTSYAAANQHLCGSHLVEALYLVCGERGFFYSPKARRDVEQ
PLVSSPLRGEAGVLPFQQEEYEKVKRGIVEQCCHNTCSLYQLENYCN*
>SeaHare_AF160192
MSKFLLQSHSANACLLTLLLTLASNLDISLANFEHSCNGYMRPHPRGLCGEDLHVIISNL
CSSLGGNRRFLAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCR
IFELAQYCRLPDHFFSRISRTGRSNSGHAQLEDNFS*

Subsequently, the sequences are aligned. Note that the gaps in the peptide alignment do not correspond to the gaps in the nucleotide alignment.
# There is a disagreement between the DNA and peptide alignment because
## the DNA alignment does not take codon boundaries into account, and
## the peptide alignment can take similarities between amino acids (conservative substitutions) into account.
# At the protein level, all the Pig sequences are now completely identical. Four of them can therefore be discarded.
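Identical sequences can also be found without inspecting the distances by eye. The sketch below groups records by their exact sequence; the function name <tt>identical_groups</tt> and the record names are hypothetical. The toy DNA fragments use the codon difference GTG versus GTA (both code for valine), which is the kind of synonymous difference that makes sequences differ at the DNA level but not at the protein level.

```python
from collections import defaultdict


def identical_groups(records):
    """Group record names by exact sequence and return the groups
    that contain more than one name."""
    groups = defaultdict(list)
    for name, seq in records.items():
        groups[seq.upper()].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Toy fragments: pig_c differs from the others by one synonymous change.
dna = {"pig_a": "ATCGTGGAG", "pig_b": "ATCGTGGAG", "pig_c": "ATCGTAGAG"}
protein = {"pig_a": "IVE", "pig_b": "IVE", "pig_c": "IVE"}

print(identical_groups(dna))      # [['pig_a', 'pig_b']]
print(identical_groups(protein))  # [['pig_a', 'pig_b', 'pig_c']]
```

From each group, all members but one can be discarded before building the final alignment.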

==Question 7==

Yes, the alignments are different. None of the four methods solves the problem perfectly, but Clustal Omega and MAFFT are really close; they both place only one letter incorrectly, see below.

[[Image:EPB4.1_human.clustalo.fasta.png]]

Clustal Omega: Note the two K's aligned with V's to the left of the large gaps.

[[Image:EPB4.1_human.mafft.fasta.png]]

MAFFT: Note the three Q's aligned with E's to the right of the large gaps.

MUSCLE and Kalign make more errors.

==Question 8==
* Yes — all gaps are multiples of 3.
* Yes — since the DNA alignment is generated using a protein alignment as a scaffold.
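The idea of using a protein alignment as a scaffold for the DNA alignment can be sketched as follows. This is a simplified illustration, not the actual implementation of any tool used in the exercise; the function name <tt>thread_dna</tt> and the toy input are assumptions made for the example. Each residue in the aligned protein is replaced by its codon, and each gap character by three gap characters.

```python
def thread_dna(aligned_protein, dna):
    """Thread an unaligned coding sequence onto its aligned protein:
    every residue becomes its codon, every '-' becomes '---'."""
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for ch in aligned_protein if ch != "-")
    if residues != len(codons):
        raise ValueError("number of residues and number of codons differ")
    next_codon = iter(codons)
    return "".join("---" if ch == "-" else next(next_codon) for ch in aligned_protein)


# Toy example: the protein MALW aligned with a single gap after the A.
print(thread_dna("MA-LW", "ATGGCCCTGTGG"))  # ATGGCC---CTGTGG
```

Because every gap in the protein alignment becomes exactly three gap positions, all gaps in the resulting DNA alignment are necessarily multiples of 3 and fall on codon boundaries.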

File:EPB4.1 human.mafft.fasta.png

2024-03-15T11:25:07Z

WikiSysop:

File:EPB4.1 human.clustalo.fasta.png

2024-03-15T11:24:44Z

WikiSysop:

File:Seaview-Q5.png

2024-03-15T11:24:18Z

WikiSysop:

File:Seaview-Q2-tree.png

2024-03-15T11:23:49Z

WikiSysop:

File:Seaview-Q2-aligned.png

2024-03-15T11:23:19Z

WikiSysop:

Exercise: Multiple Alignments Answers (Seaview version)

2024-03-15T11:22:20Z

WikiSysop: Created page with " By: [http://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson] ==Question 1== FASTA file: >pigeon_alpha-D-globin ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC CTGTCAACTTCAAG..."

By: [http://www.dtu.dk/service/telefonbog/person?id=18103&cpid=214039&tab=2&qt=dtupublicationquery Rasmus Wernersson]

==Question 1==
FASTA file:

>pigeon_alpha-D-globin
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA
>pigeon_alpha-A-globin
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGCCAGGCCGGTG
ACTTGGGTGGTGAAGCCCTGGAGAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTT
CGACCTGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTGAGGCT
GCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACGCCCAAAAGCTCCGTG
TGGACCCCGTCAACTTCAAACTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTTCCCCTCTCT
CCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGGCACCGTCCTTACTGCCAAG
TACCGTTAA
>duck_alpha-D-globin
ATGCTGACCGCCGAGGACAAGAAGCTCATCGTGCAGGTGTGGGAGAAGGTGGCTGGCCACCAGGAGGAAT
TCGGAAGTGAAGCTCTGCAGAGGATGTTCCTCGCCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGA
CCTGCATCCCGGCTCTGAACAGGTCCGTGGCCATGGCAAGAAAGTGGCGGCTGCCCTGGGCAATGCCGTG
AAGAGCCTGGACAACCTCAGCCAGGCCCTGTCTGAGCTCAGCAACCTGCATGCCTACAACCTGCGTGTTG
ACCCTGTCAACTTCAAGCTGCTGGCACAGTGCTTCCAGGTGGTGCTGGCCGCACACCTGGGCAAAGACTA
CAGCCCCGAGATGCATGCTGCCTTTGACAAGTTCTTGTCCGCCGTGGCTGCCGTGCTGGCTGAAAAGTAC
AGATGA
>duck_alpha-A-globin
ATGGTGCTGTCTGCGGCTGACAAGACCAACGTCAAGGGTGTCTTCTCCAAAATCGGTGGCCATGCTGAGG
AGTATGGCGCCGAGACCCTGGAGAGGATGTTCATCGCCTACCCCCAGACCAAGACCTACTTCCCCCACTT
TGACCTGCAGCACGGCTCTGCTCAGATCAAGGCCCATGGCAAGAAGGTGGCGGCTGCCCTAGTTGAAGCT
GTCAACCACATCGATGACATTGCGGGTGCTCTCTCCAAGCTCAGTGACCTCCACGCCCAAAAGCTCCGTG
TGGACCCTGTCAACTTCAAATTCCTGGGCCACTGCTTCCTGGTGGTGGTTGCCATCCACCACCCCGCTGC
CCTGACCCCAGAGGTCCACGCTTCCCTGGACAAGTTCATGTGCGCCGTGGGTGCTGTGCTGACTGCCAAG
TACCGTTAG
>Goat_alpha-i-globin
ATGGTGCTGTCTGCCGCCGACAAGTCCAATGTCAAGGCCGCCTGGGGCAAGGTTGGCGGCAACGCTGGAG
CTTATGGCGCAGAGGCTCTGGAGAGGATGTTCCTGAGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGACCTGAGCCACGGCTCGGCCCAGGTCAAGGGCCACGGCGAGAAGGTGGCCGCCGCGCTGACCAAAGCG
GTGGGCCACCTGGACGACCTGCCCGGTACTCTGTCTGATCTGAGTGACCTGCACGCCCACAAGCTGCGTG
TGGACCCGGTCAACTTTAAGCTTCTGAGCCACTCCCTGCTGGTGACCCTGGCCTGCCACCTCCCCAATGA
TTTCACCCCCGCGGTCCACGCCTCCCTGGACAAGTTCTTGGCCAACGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Goat_alpha-ii-globin
ATGGTGCTGTCTGCCGCCGACAAGTCCAATGTCAAGGCCGCCTGGGGCAAGGTTGGCAGCAACGCTGGAG
CTTATGGCGCAGAGGCTCTGGAGAGGATGTTCCTGAGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGACCTGAGCCACGGCTCGGCCCAGGTCAAGGGCCACGGCGAGAAGGTGGCCGCCGCGCTGACCAAAGCG
GTGGGCCACCTGGACGACCTGCCCGGTACTCTGTCTGATCTGAGTGACCTGCACGCCCACAAGCTGCGTG
TGGACCCGGTCAACTTTAAGCTTCTGAGCCACTCCCTGCTGGTGACCCTGGCCTGCCACCACCCCAGTGA
TTTCACCCCCGCGGTCCACGCCTCCCTGGACAAGTTCTTGGCCAACGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Horse_alpha-1_globin
ATGGTGCTGTCTGCCGCCGACAAGACCAACGTCAAGGCCGCCTGGAGTAAGGTTGGCGGCCACGCTGGCG
AGTTTGGCGCAGAGGCCCTAGAGAGGATGTTCCTGGGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGATCTGAGCCACGGCTCCGCCCAGGTCAAGGCCCACGGCAAGAAGGTGGGCGACGCGCTGACTCTCGCC
GTGGGCCACCTGGACGACCTGCCTGGCGCCCTGTCGAATCTGAGCGACCTGCACGCACACAAGCTGCGCG
TGGACCCCGTCAACTTCAAGCTTCTGAGTCATTGCCTGCTGTCCACCTTGGCCGTCCACCTCCCCAACGA
TTTCACCCCTGCCGTCCACGCCTCCCTGGACAAGTTCTTGAGCAGTGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Horse_alpha-2_globin
ATGGTGCTGTCTGCCGCCGACAAGACCAACGTCAAGGCCGCCTGGAGTAAGGTTGGCGGCCACGCTGGCG
AGTATGGCGCAGAGGCCCTAGAGAGGATGTTCCTGGGCTTCCCCACCACCAAGACCTACTTCCCCCACTT
CGATCTGAGCCACGGCTCCGCCCAGGTCAAGGCCCACGGCCAGAAGGTGGGCGACGCGCTGACTCTCGCC
GTGGGCCACCTGGACGACCTGCCTGGCGCCCTGTCGAATCTGAGCGACCTGCACGCACACAAGCTGCGCG
TGGACCCCGTCAACTTCAAGCTCCTGAGTCATTGCCTGCTGTCCACCTTGGCCGTCCACCTCCCCAACGA
TTTCACCCCTGCCGTCCACGCCTCCCTGGACAAGTTCTTGAGCAGTGTGAGCACCGTGCTGACCTCCAAA
TACCGTTAA
>Chicken_alpha-D
ATGCTGACTGCCGAGGACAAGAAGCTCATCCAGCAGGCCTGGGAGAGGGCCGCTTCCCACCAGGAGGAGT
TTGGAGCTGAGGCTCTGACTAGGATGTTCACCACCTATCCCCAGACCAAGACCTACTTCCCCCACTTCGA
CCTTTCGCCTGGCTCTGACCAGGTCCGTGGCCATGGCAAGAAGGTGTTGGGTGCCCTGGGCAACGCCGTG
AAGAACGTGGACAACCTCAGCCAGGCCATGGCTGAGCTGAGCAACCTGCATGCCTACAACCTGCGTGTTG
ACCCCGTCAATTTCAAGCTGTTGTCGCAGTGCATCCAGGTGGTGCTGGCTGTACACATGGGCAAAGACTA
CACCCCTGAAGTGCATGCTGCCTTCGACAAGTTCCTGTCTGCCGTGTCTGCTGTGCTGGCTGAGAAGTAC
AGATAA
>Chicken_alpha-A
ATGGTGCTGTCCGCTGCTGACAAGAACAACGTCAAGGGCATCTTCACCAAAATCGCCGGCCATGCTGAGG
AGTATGGCGCCGAGACCCTGGAAAGGATGTTCACCACCTACCCCCCAACCAAGACCTACTTCCCCCACTT
CGATCTGTCACACGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTAGTGGCTGCCTTGATCGAGGCT
GCCAACCACATTGATGACATCGCCGGCACCCTCTCCAAGCTCAGCGACCTCCATGCCCACAAGCTCCGCG
TGGACCCTGTCAACTTCAAACTCCTGGGCCAATGCTTCCTGGTGGTGGTGGCCATCCACCACCCTGCTGC
CCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCTTGTGCGCCGTGGGCACTGTGCTGACCGCCAAG
TACCGTTAA

'''NOTICE''':
* It is essential to use SHORT descriptive names. In the ClustalW format alignment, only the first 15 characters of the names are shown, so if you have very long names the output can be hard to read (see also [[Media:GenBank+FASTA_handout_revised.pdf|the FASTA handout from week 2]]).
* Spaces cannot be part of the names in a FASTA file. If there are spaces, only the first word after "<tt>></tt>" counts as the name, and any subsequent words are treated as comments. If I had used spaces instead of underscores ("<tt>_</tt>") in the file above, the names would not have been unique ("duck" would have been used twice, etc.).
* Be aware that in GenBank entries containing several genes (see [[Media:GenBank+FASTA_handout_revised.pdf|the GenBank handout from week 2]]), the name of the individual gene (CDS) is found within the feature table. When you click on a CDS containing "<tt>/gene_name=XYZ</tt>" or similar, it is therefore XYZ you need to use as name in your FASTA file, not the collective title for the entire GenBank entry (e.g. "<tt>Alpha-A and Alpha-D genes ...</tt>" or "<tt>Yeast Chromosome 2</tt>"). See also [[Media:MultiGeneScreenshot-en.pdf| the screenshot/handout from the exercise]].
When you build a "real" dataset for a research project, it is often an iterative process, where you 1) collect your data, 2) weed out outliers, 3) run an analysis, and repeat 2) and 3) until you are satisfied with the results.

==Question 2==
===QUESTION 2a===
Your screenshot of the 3' part of the alignment should look something like this: [[Image:Seaview-Q2-aligned.png|Seaview showing aligned sequences]]
===QUESTION 2b===
#Your tree should look like this: [[Image:Seaview-Q2-tree.png|Seaview tree]]
#There are three clusters: one for Alpha-A (birds only), one for Alpha-D (birds only), and one for Alpha-1 + Alpha-2 (mammals). The point here is that birds and mammals are not intermixed, so the sequences are placed "naturally" in a taxonomic sense.
#Alpha-A and Alpha-D are clearly in two different clusters, which must mean that the split between them is old. Since both Alpha-A and Alpha-D exist in all three birds we included, the split must be older than the last common ancestor of the birds.
#Alpha-1 and Alpha-2 seem to be much more closely related to each other than Alpha-A and Alpha-D are.
===QUESTION 2c===
There is a single stretch of >15 nucleotides (23 to be precise) which is perfectly conserved. Its sequence is <tt>ACCAAGACCTACTTCCCCCACTT</tt>.

==Question 3==
The sequences are translated using Virtual Ribosome, giving rise to the following FASTA file:

>pigeon_alpha-D-globin
MLTDSDKKLVLQVWEKVIRHPDCGAEALERLFTTYPQTKTYFPHFDLHHGSDQVRNHGKK
VLAALGNAVKSLGNLSQALSDLSDLHAYNLRVDPVNFKLLAQCFHVVLATHLGNDYTPEA
HAAFDKFLSAVCTVLAEKYR*
>pigeon_alpha-A-globin
MVLSANDKSNVKAVFGKIGGQAGDLGGEALERLFITYPQTKTYFPHFDLSHGSAQIKGHG
KKVAEALVEAANHIDDIAGALSKLSDLHAQKLRVDPVNFKLLGHCFLVVVAVHFPSLLTP
EVHASLDKFVCAVGTVLTAKYR*
>duck_alpha-D-globin
MLTAEDKKLIVQVWEKVAGHQEEFGSEALQRMFLAYPQTKTYFPHFDLHPGSEQVRGHGK
KVAAALGNAVKSLDNLSQALSELSNLHAYNLRVDPVNFKLLAQCFQVVLAAHLGKDYSPE
MHAAFDKFLSAVAAVLAEKYR*
>duck_alpha-A-globin
MVLSAADKTNVKGVFSKIGGHAEEYGAETLERMFIAYPQTKTYFPHFDLQHGSAQIKAHG
KKVAAALVEAVNHIDDIAGALSKLSDLHAQKLRVDPVNFKFLGHCFLVVVAIHHPAALTP
EVHASLDKFMCAVGAVLTAKYR*
>Goat_alpha-i-globin
MVLSAADKSNVKAAWGKVGGNAGAYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
EKVAAALTKAVGHLDDLPGTLSDLSDLHAHKLRVDPVNFKLLSHSLLVTLACHLPNDFTP
AVHASLDKFLANVSTVLTSKYR*
>Goat_alpha-ii-globin
MVLSAADKSNVKAAWGKVGSNAGAYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
EKVAAALTKAVGHLDDLPGTLSDLSDLHAHKLRVDPVNFKLLSHSLLVTLACHHPSDFTP
AVHASLDKFLANVSTVLTSKYR*
>Horse_alpha-1_globin
MVLSAADKTNVKAAWSKVGGHAGEFGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHG
KKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTP
AVHASLDKFLSSVSTVLTSKYR*
>Horse_alpha-2_globin
MVLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHG
QKVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTP
AVHASLDKFLSSVSTVLTSKYR*
>Chicken_alpha-D
MLTAEDKKLIQQAWERAASHQEEFGAEALTRMFTTYPQTKTYFPHFDLSPGSDQVRGHGK
KVLGALGNAVKNVDNLSQAMAELSNLHAYNLRVDPVNFKLLSQCIQVVLAVHMGKDYTPE
VHAAFDKFLSAVSAVLAEKYR*
>Chicken_alpha-A
MVLSAADKNNVKGIFTKIAGHAEEYGAETLERMFTTYPPTKTYFPHFDLSHGSAQIKGHG
KKVVAALIEAANHIDDIAGTLSKLSDLHAHKLRVDPVNFKLLGQCFLVVVAIHHPAALTP
EVHASLDKFLCAVGTVLTAKYR*

Subsequently, they are aligned with Clustal Omega.

Observations:
* The tree at the protein level is by and large the same as the tree at the DNA level (with small differences in the branch lengths).
* Now, two completely conserved regions of >5 amino acids are seen. Their sequences are <tt>TKTYFPHFDL</tt> and <tt>LRVDPVNFK</tt>.

==Question 4==
FASTA file:

>Sheep_U00659
ATGGCCCTGTGGACACGCCTGGTGCCCCTGCTGGCCCTGCTGGCACTCTGGGCCCCCGCC
CCGGCCCACGCCTTCGTCAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGAGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGCCGGGAGGTGGAGGGC
CCCCAGGTGGGGGCGCTGGAGCTGGCCGGAGGCCCCGGCGCGGGTGGCCTGGAGGGGCCC
CCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCGCCGGCGTCTGCTCTCTCTACCAGCTG
GAGAACTACTGTAACTAG
>Pig_AY044828
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242098
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242100
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242101
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTGGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Pig_AY242109
ATGGCCCTGTGGACGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCGCTCTGGGCGCCCGCC
CCGGCCCAGGCCTTCGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAGGCGCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCCAAGGCCCGTCGGGAGGCGGAGAAC
CCTCAGGCAGGTGCCGTGGAGCTGGGCGGAGGCCTGGGCGGCCTGCAGGCCCTGGCGCTG
GAGGGGCCCCCGCAGAAGCGTGGCATCGTAGAGCAGTGCTGCACCAGCATCTGTTCCCTC
TACCAGCTGGAGAACTACTGCAACTAG
>Dog_V00179
ATGGCCCTCTGGATGCGCCTCCTGCCCCTGCTGGCCCTGCTGGCCCTCTGGGCGCCCGCG
CCCACCCGAGCCTTCGTTAACCAGCACCTGTGTGGCTCCCACCTGGTAGAGGCTCTGTAC
CTGGTGTGCGGGGAGCGCGGCTTCTTCTACACGCCTAAGGCCCGCAGGGAGGTGGAGGAC
CTGCAGGTGAGGGACGTGGAGCTGGCCGGGGCGCCTGGCGAGGGCGGCCTGCAGCCCCTG
GCCCTGGAGGGGGCCCTGCAGAAGCGAGGCATCGTGGAGCAGTGCTGCACCAGCATCTGC
TCCCTCTACCAGCTGGAGAATTACTGCAACTAG
>OwlMonkey_J02989
ATGGCCCTGTGGATGCACCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCCGAG
CCAGCCCCGGCCTTTGTGAACCAGCACCTGTGCGGCCCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGTTTCTTCTACGCACCCAAGACCCGCCGGGAGGCGGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGTGGGGGCTCTATCACGGGCAGCCTGCCACCCTTG
GAGGGTCCCATGCAGAAGCGTGGCGTCGTGGATCAGTGCTGCACCAGCATCTGCTCCCTC
TACCAGCTGCAGAACTACTGCAACTAG
>Human_AY138590
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>GreenMonkey_X61092
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCGGTCCCGGCCTTTGTGAACCAGCACCTGTGCGGCTCCCACCTGGTGGAAGCCCTCTAC
CTGGTGTGCGGGGAGCGAGGCTTCTTCTACACGCCCAAGACCCGCCGGGAGGCAGAGGAC
CCGCAGGTGGGGCAGGTAGAGCTGGGCGGGGGCCCTGGCGCAGGCAGCCTGCAGCCCTTG
GCGCTGGAGGGGTCCCTGCAGAAGCGCGGCATCGTGGAGCAGTGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>Human_J00265
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGCGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>Chimp_X61089
ATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGGTGCTGCTGGCCCTCTGGGGACCTGAC
CCAGCCTCGGCCTTTGTGAACCAACACCTGTGCGGCTCCCACCTGGTGGAAGCTCTCTAC
CTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGAC
CTGCAGGTGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTG
GCCCTGGAGGGGTCCCTGCAGAAGCGTGGTATCGTGGAACAATGCTGTACCAGCATCTGC
TCCCTCTACCAGCTGGAGAACTACTGCAACTAG
>GuineaPig_K02233
ATGGCTCTGTGGATGCATCTCCTCACCGTGCTGGCCCTGCTGGCCCTCTGGGGGCCCAAC
ACTAATCAGGCCTTTGTCAGCCGGCATCTGTGCGGCTCCAACTTAGTGGAGACATTGTAT
TCAGTGTGTCAGGATGATGGCTTCTTCTATATACCCAAGGACCGTCGGGAGCTAGAGGAC
CCACAGGTGGAGCAGACAGAACTGGGCATGGGCCTGGGGGCAGGTGGACTACAGCCCTTG
GCACTGGAGATGGCACTACAGAAGCGTGGCATTGTGGATCAGTGCTGTACTGGCACCTGC
ACACGCCACCAGCTGCAGAGCTACTGCAACTAG
>Mouse_X04725
ATGGCCCTGTTGGTGCACTTCCTACCCCTGCTGGCCCTGCTTGCCCTCTGGGAGCCCAAA
CCCACCCAGGCTTTTGTCAAACAGCATCTTTGTGGTCCCCACCTGGTAGAGGCTCTCTAC
CTGGTGTGTGGGGAGCGTGGCTTCTTCTACACACCCAAGTCCCGCCGTGAAGTGGAGGAC
CCACAAGTGGAACAACTGGAGCTGGGAGGAAGCCCCGGGGACCTTCAGACCTTGGCGTTG
GAGGTGGCCCGGCAGAAGCGTGGCATTGTGGATCAGTGCTGCACCAGCATCTGCTCCCTC
TACCAGCTGGAGAACTACTGCAACTAA
>Chicken_AY438372
ATGGCTCTCTGGATCCGATCACTGCCTCTTCTGGCTCTCCTTGTCTTTTCTGGCCCTGGA
ACCAGCTATGCAGCTGCCAACCAGCACCTCTGTGGCTCCCACTTGGTGGAGGCTCTCTAC
CTGGTGTGTGGAGAGCGTGGCTTCTTCTACTCCCCCAAAGCCCGACGGGATGTCGAGCAG
CCCCTAGTGAGCAGTCCCTTGCGTGGCGAGGCAGGAGTGCTGCCTTTCCAGCAGGAGGAA
TACGAGAAAGTCAAGCGAGGGATTGTTGAGCAATGCTGCCATAACACGTGTTCCCTCTAC
CAACTGGAGAACTACTGCAACTAG
>SeaHare_AF160192
ATGAGCAAGTTCCTCCTCCAGAGCCACTCCGCCAACGCCTGCCTGCTCACCCTTCTGCTCACGCTGGCCT
CCAACCTCGACATATCCCTGGCCAACTTCGAGCACTCGTGCAACGGCTACATGCGGCCCCACCCGCGGGG
TCTGTGCGGCGAAGACCTGCACGTCATCATTTCCAACCTGTGCAGCTCTCTGGGGGGCAACAGGAGGTTC
CTGGCCAAGTACATGGTCAAAAGAGACACGGAAAATGTGAACGACAAGTTACGAGGGATCCTGCTCAATA
AGAAAGAAGCTTTCTCCTACTTGACCAAGAGAGAGGCCTCAGGCTCCATCACATGCGAATGTTGCTTCAA
CCAGTGTCGGATATTTGAGCTGGCTCAGTACTGCCGTCTGCCAGACCATTTCTTCTCCAGAATATCCAGA
ACCGGAAGGAGCAACAGTGGACATGCGCAGTTGGAGGACAACTTTAGTTA

==Question 5==
1. Yes, there are many gaps which are not multiples of 3 positions. The most obvious example is the second gap, which is 4 positions long (in all sequences but the Sea Hare, see below). The alignment algorithm is ''not'' aware that the sequences are protein coding, it only considers the DNA.

[[Image:Seaview-Q5.png]]

2. Sea Hare (a marine snail) stands out — this makes sense, since it is the only invertebrate.

3. It can be seen that the two human sequences are 100% identical (the distance is 0) — one of them can therefore be discarded — and for the pig, the following sequences are identical:

>Pig_AY044828
>Pig_AY242098

and

>Pig_AY242100
>Pig_AY242101

(two pig sequences can therefore be discarded).
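Spotting identical sequences can also be done programmatically. The following is a minimal Python sketch, not part of the original exercise; the function name <tt>unique_sequences</tt> and the toy records are made up for illustration.

```python
def unique_sequences(records):
    """Keep the first record for each distinct sequence; report what was dropped."""
    seen = {}                 # sequence -> name of the first record carrying it
    kept, dropped = [], []
    for name, seq in records:
        key = seq.upper()
        if key in seen:
            dropped.append((name, seen[key]))   # (duplicate, identical to)
        else:
            seen[key] = name
            kept.append((name, seq))
    return kept, dropped


# Toy records (hypothetical short sequences, not the real insulin genes)
records = [
    ("Pig_A", "ATGGCCCTG"),
    ("Pig_B", "ATGGCCCTG"),
    ("Pig_C", "ATGGCGCTG"),
    ("Pig_D", "ATGGCGCTG"),
]
kept, dropped = unique_sequences(records)
print([name for name, _ in kept])
print(dropped)
```

Each dropped record is reported together with the record it duplicates, so it is easy to check that nothing was discarded by mistake.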

==Question 6==
The sequences are translated using Virtual Ribosome, yielding the following sequences:

>Sheep_U00659
MALWTRLVPLLALLALWAPAPAHAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVEG
PQVGALELAGGPGAGGLEGPPQKRGIVEQCCAGVCSLYQLENYCN*
>Pig_AY044828
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242098
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242100
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242101
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Pig_AY242109
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAEN
PQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN*
>Dog_V00179
MALWMRLLPLLALLALWAPAPTRAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREVED
LQVRDVELAGAPGEGGLQPLALEGALQKRGIVEQCCTSICSLYQLENYCN*
>OwlMonkey_J02989
MALWMHLLPLLALLALWGPEPAPAFVNQHLCGPHLVEALYLVCGERGFFYAPKTRREAED
LQVGQVELGGGSITGSLPPLEGPMQKRGVVDQCCTSICSLYQLQNYCN*
>Human_AY138590
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>GreenMonkey_X61092
MALWMRLLPLLALLALWGPDPVPAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
PQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>Human_J00265
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>Chimp_X61089
MALWMRLLPLLVLLALWGPDPASAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN*
>GuineaPig_K02233
MALWMHLLTVLALLALWGPNTNQAFVSRHLCGSNLVETLYSVCQDDGFFYIPKDRRELED
PQVEQTELGMGLGAGGLQPLALEMALQKRGIVDQCCTGTCTRHQLQSYCN*
>Mouse_X04725
MALLVHFLPLLALLALWEPKPTQAFVKQHLCGPHLVEALYLVCGERGFFYTPKSRREVED
PQVEQLELGGSPGDLQTLALEVARQKRGIVDQCCTSICSLYQLENYCN*
>Chicken_AY438372
MALWIRSLPLLALLVFSGPGTSYAAANQHLCGSHLVEALYLVCGERGFFYSPKARRDVEQ
PLVSSPLRGEAGVLPFQQEEYEKVKRGIVEQCCHNTCSLYQLENYCN*
>SeaHare_AF160192
MSKFLLQSHSANACLLTLLLTLASNLDISLANFEHSCNGYMRPHPRGLCGEDLHVIISNL
CSSLGGNRRFLAKYMVKRDTENVNDKLRGILLNKKEAFSYLTKREASGSITCECCFNQCR
IFELAQYCRLPDHFFSRISRTGRSNSGHAQLEDNFS*

Subsequently, the sequences are aligned. Note that the gaps in the peptide alignment do not correspond to the gaps in the nucleotide alignment.
# There is a disagreement between the DNA and peptide alignment because
## the DNA alignment does not take codon boundaries into account, and
## the peptide alignment can take similarities between amino acids (conservative substitutions) into account.
# At the protein level, all the Pig sequences are now completely identical. Four of them can therefore be discarded.

==Question 7==

Yes, the alignments are different. None of the four methods solves the problem perfectly, but Clustal Omega and MAFFT are really close; they both place only one letter incorrectly, see below.

[[Image:EPB4.1_human.clustalo.fasta.png]]

Clustal Omega: Note the two K's aligned with V's to the left of the large gaps.

[[Image:EPB4.1_human.mafft.fasta.png]]

MAFFT: Note the three Q's aligned with E's to the right of the large gaps.

MUSCLE and Kalign make more errors.

==Question 8==
* Yes — all gaps are multiples of 3.
* Yes — since the DNA alignment is generated using a protein alignment as a scaffold.
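The idea of using a protein alignment as a scaffold can be sketched in a few lines of Python. This is an illustration of the principle only, not the actual program used in the exercise; the function name <tt>thread_codons</tt> and the toy sequences are assumptions.

```python
import re


def thread_codons(aligned_protein, dna, gap="-"):
    """Build one row of a codon-aware DNA alignment from an aligned protein row.

    Every residue is replaced by its codon and every protein gap by three
    DNA gap characters. Trailing DNA (e.g. a stop codon) is ignored.
    """
    residues = aligned_protein.replace(gap, "")
    if len(dna) < 3 * len(residues):
        raise ValueError("DNA is too short for the protein")
    out, pos = [], 0
    for symbol in aligned_protein:
        if symbol == gap:
            out.append(gap * 3)
        else:
            out.append(dna[pos:pos + 3])
            pos += 3
    return "".join(out)


# Toy example: two short peptides aligned with one gap
row1 = thread_codons("MAL-W", "ATGGCCCTGTGG")
row2 = thread_codons("MALLW", "ATGGCCCTGTTGTGG")
print(row1)
print(row2)

# All gap runs are multiples of 3 by construction
gap_runs = [len(run) for run in re.findall(r"-+", row1)]
print(gap_runs)
```

Because gaps are only ever inserted three positions at a time, the reading frame is preserved in every sequence.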

ExGeany-Answers

2024-03-15T11:19:37Z

WikiSysop: Created page with "=Answers to the exercise in Plain text files and Geany= Answers by: Rasmus Wernersson and Henrik Nielsen == Question 1:== The file sizes are: 453 bytes: alpha_globin_OldMac.fsa 453 bytes: alpha_globin_Unix.fsa 461 bytes: alpha_globin_Windows.fsa The important thing to notice here is that DOS/Windows newlines actually consists of two bytes (CR + LF), whereas UNIX and the old Mac standard only use one byte. The 8 byte difference corresponds to the 8 lines of te..."

=Answers to the exercise in Plain text files and Geany=
Answers by: Rasmus Wernersson and Henrik Nielsen

== Question 1:==
The file sizes are:

453 bytes: alpha_globin_OldMac.fsa
453 bytes: alpha_globin_Unix.fsa
461 bytes: alpha_globin_Windows.fsa

The important thing to notice here is that DOS/Windows newlines actually
consist of two bytes (CR + LF), whereas UNIX and the old Mac standard only use
one byte.

The 8 byte difference corresponds to the 8 lines of text within the file:

001 >pigeon_alpha-globin-D
002 ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
003 GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
004 GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
005 AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
006 CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
007 CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
008 TAA
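The arithmetic can be checked with a small Python sketch. It assumes that every line, including the last one, ends with a newline, and it uses placeholder letters for the six 70-letter sequence lines (only the line lengths matter here).

```python
# Line lengths as in the listing above: header, six 70-letter lines, final line.
# "N" * 70 is a placeholder for a real sequence line of the same length.
lines = [">pigeon_alpha-globin-D"] + ["N" * 70] * 6 + ["TAA"]

newline = {"Unix (LF)": "\n", "old Mac (CR)": "\r", "Windows (CR+LF)": "\r\n"}

sizes = {}
for label, nl in newline.items():
    data = "".join(line + nl for line in lines).encode("ascii")
    sizes[label] = len(data)
    print(f"{len(data)} bytes: {label}")

# One extra byte per line for the two-byte Windows newline
print(sizes["Windows (CR+LF)"] - sizes["Unix (LF)"], "bytes difference")
```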

== Question 2:==
Yes - inspecting the files in the associated programs (e.g. Word and Firefox)
reveals the ''textual'' contents to be the same.

The file sizes differ dramatically:

29184 bytes: alpha_globin.doc
667 bytes: alpha_globin.html
855 bytes: alpha_globin.rtf

== Question 3:==
The <tt>alpha_globin.doc</tt> file cannot be opened, because it is not a text file. In other words, not every byte in the file can be interpreted as a character.

The HTML and RTF files also contain some extra information, but unlike the DOC file, the extra information is text based.

Contents of the HTML file:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type">
<title></title>
</head>
<body>
< PRE>
>pigeon_alpha-globin-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA
TAA
< /PRE>
</body>
</html>

In this case (cleanly formatted HTML) it's easy to locate the original DNA
sequence.

To some degree it's possible to figure out what's going on in the RTF file -
the codes are basically about formatting:

Snippet from the file:
\f0\b\fs24 \cf0 >pigeon_alpha-globin-D\

\f1\b0 ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTG\
GAGCCGAGGCCCTGGAGAGGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTT\
GCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAG\
AGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACC\
CTGTCAACTTCAAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACAC\
CCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGA\

The Word file contains a HUGE amount of additional information in BINARY
form, which is why Geany refuses to open it. Opening other non-text files such as a JPG image
or an MP3 sound file will also fail in Geany.
Certain text editors are less critical with regard to
the files they open, but when the file is binary, the results will look very strange.
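A rough way to test whether a file is plain text is sketched below in Python. This is a simple heuristic chosen for illustration (no NUL bytes, and the bytes decode as UTF-8); it is an assumption and not necessarily how Geany itself decides. The function name <tt>looks_like_text</tt> and the sample byte strings are made up.

```python
def looks_like_text(data: bytes) -> bool:
    """Heuristic: plain text has no NUL bytes and decodes cleanly as UTF-8."""
    if b"\x00" in data:
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical samples: a tiny FASTA record and a few bytes of binary junk
fasta = b">pigeon_alpha-globin-D\nATGCTGACC\n"
binary = b"\x00\x00\x00D\x01\x00\x00\x0c>pigeon\xfe\xff"

print(looks_like_text(fasta))
print(looks_like_text(binary))
```

For a real file, the bytes can be read with <tt>open(filename, "rb").read()</tt> and passed to the function.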

Here is a snippet of the alpha_globin.doc file as displayed by the Unix editor vim:
^@^@^@D^A^@^@^L^@^@^@P^A^@^@^M^@^@^@\^A^@^@^N^@^@^@h^A^@^@^O^@^@^@p^A^@^@^P^@^@^@
x^A^@^@^S^@^@^@<80>^A^@^@^Q^@^@^@<88>^A^@^@^B^@^@^@^P'^@^@^^^@^@^@^X^@^@^@>pigeon
_alpha-globin-D^@^@^^^@^@^@^D^@^@^@^@^@^@^@^^^@^@^@^T^@^@^@Rasmus Wernersson^@^@
^@^^^@^@^@^D^@^@^@^@^@^@^@^^^@^@^@^H^@^@^@Normal^@^@^^^@^@^@^T^@^@^@Rasmus Werner
sson^@^@^@^^^@^@^@^D^@^@^@1^@^@^@^^^@^@^@^X^@^@^@Microsoft Word 11.5.0^@^@^@@^@^@
^@^@FÃ#^@^@^@^@@^@^@^@^@âÄò<91><81>É^A@^@^@^@^@(<88>^V<92><81>É^A^C^@^@^@^A^@^@^@
^C^@^@^@^@^@^@^@^C^@^@^@^@^@^@^@^C^@^@^@^@^@^@^@G^@^@^@82^@^@þÿÿÿPICT20^@^@^@^@^C
I^BR^@^Q^Bÿ^L^@ÿþ^@^@^A,^@^@^A,^@^@^@^@^@^@^M´ ¯^@^@^@^@^@¡^Aò^@^DMSWD^@^^^@^A
^@^@^@^@^@^M´ ¯^@,^@^N÷@^KCourier New^@^C÷@^@^M^@%^@.^@^D^@^@^@^@^@(^AK^Aw^A>

Interestingly, it is actually possible to get a glimpse of a few text strings within
the mess of symbols, including the sequence name and the name (Rasmus Wernersson) of the
person who created the file.
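Readable fragments like these can be pulled out of a binary file automatically, in the spirit of the Unix <tt>strings</tt> tool. The Python sketch below is only an illustration; the function name <tt>printable_runs</tt>, the minimum length of 6 and the sample bytes are assumptions.

```python
import re


def printable_runs(data: bytes, min_len: int = 6):
    """Return runs of at least min_len printable ASCII characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [match.decode("ascii") for match in re.findall(pattern, data)]


# Hypothetical sample imitating the kind of content shown in the snippet above
sample = (b"\x00\x00\x1e\x00\x00\x00\x18\x00\x00\x00>pigeon_alpha-globin-D"
          b"\x00\x00\x1e\x00\x00\x00\x14\x00\x00\x00Rasmus Wernersson\x00\x00")

for text in printable_runs(sample):
    print(text)
```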

==Question 4:==
Cleaned up sequence:

AACGGGCACGGGACGCATGTAGCTGGAACAGTGGCAGCCGTAAATAATAATGGTATCGGA
GTTGCCGGGGTTGCAGGAGGAAACGGCTCTACCAATAGTGGAGCAAGGTTAATGTCCACA
CAAATTTTTAATAGTGATGGGGATTATACAAATAGCGAAACTCTTGTGTACAGAGCCATT
GTTTATGGTGCAGATAACGGAGCTGTGATCTCGCAAAATAGCTGGGGTAGTCAGTCTCTG
ACTATTAAGGAGTTGCAGAAAGCTGCGATCGACTATTTCATTGATTATGCAGGAATGGAC
GAAACAGGAGAAATACAGACAGGCCCTATGAGGGGAGGTATATTTATAGCTGCCGCCGGA
AACGATAACGTTTCCACTCCAAATATGCCTTCAGCTTATGAACGGGTTTTAGCTGTGGCC
TCAATGGGACCAGATTTTACTAAGGCAAGCTATAGCACTTTTGGAACATGGACTGATATT
ACTGCTCCTGGCGGAGATATTGACAAATTTGATTTGTCAGAATACGGAGTTCTCAGCACT
TATGCCGATAATTATTATGCTTATGGAGAGGGAACATCCATGGCTTGTCCACATGTCGCC
GGCGCCGCC
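The same kind of clean-up can be scripted. The Python sketch below assumes that the raw data contains only the sequence plus "noise" such as position numbers, spaces and line breaks (no header text); the messy input shown is hypothetical and only imitates such a format.

```python
import re
import textwrap


def clean_sequence(raw: str, width: int = 60) -> str:
    """Drop everything that is not a letter, upper-case the rest, and re-wrap."""
    letters = re.sub(r"[^A-Za-z]", "", raw).upper()
    return "\n".join(textwrap.wrap(letters, width))


# Hypothetical messy input: position numbers, blocks of ten, lower case
raw = """
  1 aacgggcacg ggacgcatgt agctggaaca
 31 gtggcagccg taaataataa tggtatcgga
"""
print(clean_sequence(raw))
```

Note that this removes ''all'' non-letters, so a FASTA header line must be added afterwards rather than passed through the function.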

ExBlast-Answers2

2024-03-15T11:14:51Z

WikiSysop: Created page with "Answers to the BLAST exercise, by Henrik Nielsen. Values for database sizes etc. retrieved March 7, 2020 ==Part 1: Your first BLAST search== ===QUESTION 1.1=== * ''what is the identifier (Accession)?'' :OL351605 or M57671 (Note that the latter was also part of the sequence name for your query sequence!) * ''what is the alignment score ("Max score in bits")?'' :The max score is 780 bits (Raw score is 864) * ''what is the percent identity and query coverage?'' :100% * '..."

Answers to the BLAST exercise, by Henrik Nielsen. Values for database sizes etc. retrieved March 7, 2020

==Part 1: Your first BLAST search==

===QUESTION 1.1===
* ''what is the identifier (Accession)?''
:OL351605 or M57671 (Note that the latter was also part of the sequence name for your query sequence!)
* ''what is the alignment score ("Max score in bits")?''
:The max score is 780 bits (Raw score is 864)
* ''what is the percent identity and query coverage?''
:100%
* ''what is the E-value?''
:0.0 (actually, a number so small that it is rounded off to 0.0)
* ''are there any gaps in the alignment?''
:No, of course not, since the sequences are identical

===QUESTION 1.2===
* ''what is the identifier (Accession)?''
:NM_001185098 or NM_001185097 (or a handful more), they have the same score and are therefore equally good
* ''what is the alignment score ("max score")?''
:205
* ''what is the percent identity and query coverage?''
:identity: 74.49% and query coverage: 76%
* ''what is the E-value?''
:9.77E-48 (meaning 9.77×10<sup>-48</sup>)
* ''are there any gaps in the alignment?''
:Yes, there are five gaps in the query sequence and two gaps in the database sequence, totaling 15 positions.

===QUESTION 1.3===
* ''what is the identifier (Accession)?''
:NM_001185098 or NM_001185097 or NM_000207, they have the same score and are therefore equally good. Note that these are among the equally good hits found in the previous question.
* ''what is the alignment score ("max score")?''
:205
* ''what is the percent identity and query coverage?''
:identity: 74.49% and query coverage: 76%
* ''what is the E-value?''
:8.16E-51 (meaning 8.16×10<sup>-51</sup>)
* ''are there any gaps in the alignment?''
:Yes, there are exactly the same gaps as in the previous question.

===QUESTION 1.4===
''What are the sizes (in basepairs) of the databases we used for the two BLAST searches?''

nt: 1,347,152,378,063 letters (= basepairs), RefSeq_rna: 1,096,131,797 letters (= basepairs).

===QUESTION 1.5===

*''What is the ratio between the database sizes in the two BLAST searches?''
:1347152378063 / 1096131797 = 1229
*''What is the ratio between the E-values (for the best human hits) in the two BLAST searches? ''
:9.77E-48 / 8.16E-51 = 1197
:Note: since the E-values have only three significant digits, you cannot expect to get the exact same result.
:Also note, you can google "9.77E-48 / 8.16E-51" directly and the answer will show up in the results.
*''What is the relationship between database size and E-value for hits with identical alignment score?''
:The E-value is directly proportional to the database size.
:Note: Conceptually this is easy to understand - getting an alignment with the given score (205 bits) is more SIGNIFICANT in the smaller database. In a larger database there is a larger chance of randomly picking up matches.
*''In conclusion: if the database size is doubled, what will happen to the E-value?''
:Each time the database size doubles, the E-value doubles as well.
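The two ratios can be recomputed with a few lines of Python, using the numbers from Questions 1.2 to 1.4. The function <tt>expected_hits</tt> is a simplified sketch of the relationship E = m × n × 2<sup>-S'</sup> (query length m, database length n, bit score S'); it ignores the length corrections that BLAST applies, and the query length of 100 is an arbitrary illustrative value.

```python
nt_size = 1_347_152_378_063       # letters in nt (Question 1.4)
refseq_rna_size = 1_096_131_797   # letters in RefSeq_rna (Question 1.4)
e_nt = 9.77e-48                   # best human hit against nt (Question 1.2)
e_refseq = 8.16e-51               # same hit against RefSeq_rna (Question 1.3)

db_ratio = nt_size / refseq_rna_size
e_ratio = e_nt / e_refseq
print(f"database size ratio: {db_ratio:.0f}")
print(f"E-value ratio:       {e_ratio:.0f}")


def expected_hits(query_len, db_len, bit_score):
    """Simplified estimate E = m * n * 2**(-S'), without length corrections."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Doubling the database size doubles the E-value for the same bit score
doubling = (expected_hits(100, 2 * refseq_rna_size, 205)
            / expected_hits(100, refseq_rna_size, 205))
print(f"effect of doubling the database: x{doubling:.1f}")
```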

==Part 2: Assessing the statistical significance of BLAST hits==

===QUESTION 2.1===
Report the sequence in '''FASTA''' format:

>random_d_sequence
TTCTGAAAGGTCCTCTCGATACTCG

(of course your particular sequences will not be identical to these)

===QUESTION 2.2===
*''Do you find any sequences that look like your input sequence (paste in a few example alignments in your report).''
:There will typically be several 100% identity hits, ''e.g.'':

****Alignment**** 1
Title: gi|2440392781|emb|OX421481.1| Eilema caniola genome assembly, chromosome: 20
Accession: OX421481
Length: 22119023
Max Score: 44.0
Bits: 40.9604
Identities: 22
Align_length: 22
Gaps: 0
%Ident: 100.00 %
Query Cover: 88 %
E value: 1.88e+00
TTCTGAAAGGTCCTCTCGATAC
||||||||||||||||||||||
TTCTGAAAGGTCCTCTCGATAC

*''What is the typical length of the hits (the alignment length)?'':
:Typically around 17-22 base pairs.

*''What is the typical % identity?'':
:90% - 100%

*''In what range are the bit-scores ("max score")?'':
:typically 30-40 bits.

*''What is the range of the E-values?'':
:1.88e+00 - 2.29e+01
:usually varying from 1 to 50 (occasionally, you might find hits as "good" as 0.1).
:'''Note''': we chose to use an E-value threshold of 50.0. The default is 0.05.

===QUESTION 2.3===
''What is the biological significance of the hits you found / is there any biological meaning?'':

This makes absolutely NO biological sense(!) The hits are real enough as such; they represent sequences that actually are in the database. But we know that our query sequences are completely random and therefore have no evolutionary relationship with the hits. The only reason we found our hits is that the database is so vast that, for purely stochastic reasons, we happen upon sequences that are similar.

The E-values tell us precisely this: As described in the BLAST lecture, the alignment score will follow an extreme value distribution for those sequences that are not related to our query sequences, and the E-value is ''the expected number'' of spurious (unrelated) hits with the given alignment score or better, given the database size.

'''Note:''' Don't be confused by the difference between alignment score and bit score; the bit score is simply the alignment (raw) score rescaled with constants that depend on the scoring system, which gives a result expressible in bits and comparable between searches.
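The conversion from raw score to bit score can be sketched in Python as S' = (λ·S − ln K) / ln 2. The values of λ and K are not given in this exercise; the ones used below are assumptions (values commonly reported by NCBI BLAST for gapped blastn with match/mismatch scores 2/−3, and for gapped blastp with BLOSUM62 and gap costs 11/1). With these assumed values, the results come out very close to the "Bits" values printed in the example alignments of Questions 2.2 and 2.5.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed statistical parameters (not stated in the exercise)
BLASTN_LAMBDA, BLASTN_K = 0.625, 0.41     # assumption: blastn, scores 2/-3, gapped
BLASTP_LAMBDA, BLASTP_K = 0.267, 0.041    # assumption: BLOSUM62, gap costs 11/1

dna_bits = bit_score(44, BLASTN_LAMBDA, BLASTN_K)       # "Max Score: 44.0"
protein_bits = bit_score(67, BLASTP_LAMBDA, BLASTP_K)   # "Max Score: 67.0"
print(f"DNA example:     {dna_bits:.4f} bits")
print(f"protein example: {protein_bits:.4f} bits")
```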

===QUESTION 2.4===
''Report the sequence in FASTA format'':

>seq_01
LTNNVNMHWTLPYTVSHVYVNPYSC

(again, your particular sequence will of course differ from this).

===QUESTION 2.5===

*''What is the typical length of the alignment and do they contain gaps?'':
:Typically 15-22. Rarely gaps, but several mismatches.

*''What is the range of E-values?'':
:Typically 100-1000

*''Try to inspect a few of the alignments in details ("+" means similar) - do you find any that look plausible, if we for a moment ignore the length/E-value?''
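:Yes, maybe. See ''e.g.'' the alignment below: it has only 36% identities (8 of 22), but when the similar ("+") positions are counted as well, 73% of the positions match (it is still way too short to be significant, as the E-value tells us).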

****Alignment**** 1
Title: ref|WP_179589105.1| non-ribosomal peptide synthetase [Pigmentiphaga litoralis] >gb|NYE25977.1| amino acid adenylation domain-containing protein [Pigmentiphaga litoralis] >gb|NYE85097.1| amino acid adenylation domain- containing protein [Pigmentiphaga litoralis]
Accession: WP_179589105
Length: 1782
Max Score: 67.0
Bits: 30.4166
Identities: 8
Align_length: 22
Gaps: 0
%Ident: 36.36 %
Query Cover: 88 %
E value: 1.25e+02
LTNNVNMHWTLPYTVSHVYVNP
L ++ HW +P+T+SH++ +P
LAARISQHWCVPFTISHIFDHP

*''If we had used the default E-value cutoff of 10 would any hits have been found?'':
:No (note: the default is actually 0.05 now). Note the difference from the nucleotide database searches (whose E-values were typically in the range 1-50): if we had run BLASTN with an E-value threshold of 1000, we would have had many pages of hits for each query sequence.
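The identity percentage reported by BLAST can be recomputed directly from the two aligned sequences. A minimal Python sketch, using the example alignment above (the function name <tt>percent_identity</tt> is made up for illustration):

```python
def percent_identity(query, subject, gap="-"):
    """Count identical positions in two aligned sequences of equal length."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(1 for q, s in zip(query, subject) if q == s and q != gap)
    return identical, 100.0 * identical / len(query)


# The two rows of the example alignment (query on top, database hit below)
query = "LTNNVNMHWTLPYTVSHVYVNP"
subject = "LAARISQHWCVPFTISHIFDHP"

identical, pct = percent_identity(query, subject)
print(f"{identical}/{len(query)} identical = {pct:.2f} %")
```

Counting the similar ("+") positions as well would require a substitution matrix such as BLOSUM62, which is left out of this sketch.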

===QUESTION 2.6===

*''If we compare the result from BLAST'ing random DNA sequences to random Peptide sequences - which kind of search has the higher risk of returning false positives (results that appear plausible, maybe even significant, but are truly unrelated)?'':
:The risk of getting a false hit (an unrelated sequence with a "decent" E-value) is much larger when working with DNA sequences. Remember that we used 50 as E-value cut-off for BLASTN, while we used 1000 with BLASTP in order to see any hits at all.

==Part 3: using BLAST to transfer functional information by finding homologs==

===QUESTION 3.1===

*''Do we get any significant hits?''
:Yes, there are 20 hits with an E-value of "0.0" (''i.e.'' so small that it is rounded to zero), and the next hits are also extremely significant. The first hit (S48754) furthermore has a query coverage of 100% and an identity of 100% (this is actually the source of our query).

*''What kind of genes (function) do we find?''
:All the high-quality hits are alkaline serine proteases from the genera ''Bacillus'' or ''Alkalihalobacillus'' — except some hits that are whole genome sequences.

===QUESTION 3.2===
Note 1: remember to use the ORF Finder in Virtual Ribosome! Since we are told the sequence is a full-length transcript, we can assume that the START and STOP codons are included and set the ORF finder to "Start codon: Any" (in this case, it would have given the same result to use "Start codon: Strict").

Note 2: you can choose the standard genetic code (Table 1) or alternatively Table 11 (Bacterial and Plant Plastid). The only difference is that Table 11 allows some extra, rarely occurring, start codons.

* ''Report your translated protein sequence in FASTA format.'':

>Unknown_transcript01_rframe2_ORF
MKKPLGKIVASTALLISVAFSSSIASAAEEAKEKYLIGFNEQEAVSEFVEQVEANDEVAI
LSEEEEVEIELLHEFETIPVLSVELSPEDVDALELDPAISYIEEDAEVTTMAQSVPWGIS
RVQAPAAHNRGLTGSGVKVAVLDTGISTHPDLNIRGGASFVPGEPSTQDGNGHGTHVAGT
IAALNNSIGVLGVAPSAELYAVKVLGASGSGSVSSIAQGLEWAGNNGMHVANLSLGSPSP
SATLEQAVNSATSRGVLVVAASGNSGAGSISYPARYANAMAVGATDQNNNRASFSQYGAG
LDIVAPGVNVQSTYPGSTYASLNGTSMATPHVAGAAALVKQKNPSWSNVQIRNHLKNTAT
SLGSTNLYGSGLVNAEAATR

*''Do we find any conserved protein domains?'':
:Yes, there is a "Peptidase S8" domain. You can see it by clicking the Graphic Summary tab.

[[image:Peptidases_S8.png|center|frame|Conserved protein domains found by the NCBI Blast server]]


*''Do we find any significant hits? (E-value?)'':
:Yes, a lot. The top hits all have an E-value of 0.0, and hit #100 is still very significant (3e-98). Note that by default, only the top 100 hits are shown!

*''Are all the best hits the same category of enzymes?'':
:Yes, they are alkaline proteases (except a few that are hypothetical proteins).
:Note that you can click the Accession code for a hit and go directly to the corresponding entry in the database.

*''From what you have seen, what is best for identifying intermediate quality hits - DNA or Protein BLAST?'':
:Protein BLAST (BLASTP). If you have very high quality hits, they can be identified by both methods, but if the evolutionary distance is larger, BLASTP is clearly better.
:Note: Recall from the PyMOL exercises that, between distantly related genes/proteins, information is conserved in the order Structure > Peptide sequence > Nucleotide sequence. So when the evolutionary distance is larger, blastp will generally give better hits than blastn.

===QUESTION 3.3===
'''STEP 1 - cleaning up the sequence: '''

*''Subquestion: convert the sequence to FASTA format (manually, in JEdit) and quote it in your report.''

>CLONE12
AACGGGCACGGGACGCATGTAGCTGGAACAGTGGCAGCCGTAAATAATAATGGTATCGGA
GTTGCCGGGGTTGCAGGAGGAAACGGCTCTACCAATAGTGGAGCAAGGTTAATGTCCACA
CAAATTTTTAATAGTGATGGGGATTATACAAATAGCGAAACTCTTGTGTACAGAGCCATT
GTTTATGGTGCAGATAACGGAGCTGTGATCTCGCAAAATAGCTGGGGTAGTCAGTCTCTG
ACTATTAAGGAGTTGCAGAAAGCTGCGATCGACTATTTCATTGATTATGCAGGAATGGAC
GAAACAGGAGAAATACAGACAGGCCCTATGAGGGGAGGTATATTTATAGCTGCCGCCGGA
AACGATAACGTTTCCACTCCAAATATGCCTTCAGCTTATGAACGGGTTTTAGCTGTGGCC
TCAATGGGACCAGATTTTACTAAGGCAAGCTATAGCACTTTTGGAACATGGACTGATATT
ACTGCTCCTGGCGGAGATATTGACAAATTTGATTTGTCAGAATACGGAGTTCTCAGCACT
TATGCCGATAATTATTATGCTTATGGAGAGGGAACATCCATGGCTTGTCCACATGTCGCC
GGCGCCGCC

'''STEP 2 - thinking about the task: '''

*''Subquestion: Give a summary of your considerations.''
**''Based on the information given: is the sequence protein-coding? ''
::Yes — we know this because the PCR primers used to clone the sequence target '''known enzymes'''. Therefore, it will make sense to try to translate the sequence using Virtual Ribosome.
:*''If it is, can you trust it will contain both a START and STOP codon? ''
::No — the PCR primers used to clone the sequence target '''the middle of the sequence''', in other words we must assume that our sequence is a fragment. Therefore, the ORF finder in Virtual Ribosome should be set to Start codon: None.
:*''Do we know if the sequence is sense or anti-sense? ''
::No — the PCR process amplifies a stretch of double-stranded DNA. Therefore, we should let Virtual Ribosome search in '''all 6 reading frames'''.
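What "searching all 6 reading frames" means can be sketched in Python. This is a simplified illustration using the standard genetic code, not the Virtual Ribosome implementation; the function names are made up, and the test fragment is simply the first 30 bases of CLONE12.

```python
# Standard genetic code, codons enumerated in the order T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become X, leftovers are dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


fragment = "AACGGGCACGGGACGCATGTAGCTGGAACA"   # first 30 bases of CLONE12
frames = six_frames(fragment)
for name, peptide in frames.items():
    print(name, peptide)
```

An ORF finder would then pick the frame with the longest stretch without stop codons ("*").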

'''STEP 3 - Performing the database search''':
We want to use BLAST to search the large databases. Let's therefore try the following:
# BLASTN
# Translate to protein (using Virtual Ribosome).
# BLASTP
Both when doing BLASTN and BLASTP we will use the NR database in order to search as broadly as possible. It would not make sense to use an organism-specific database when we don't know which organism our sequence stems from.

1) BLASTN. When trying BLASTN against NR we get some borderline significant results, but observe how small the query coverage percentages are (check also the Graphic Summary tab!).

[[image:NCBI BlastN_CLONE12 new_version.png]]
[[image:NCBI BlastN_CLONE12 Graphic Summary.png]]

There is simply nothing in the entire NR database that has enough similarity to our whole query sequence. A search on the DNA level is only suited for finding very close hits.

2) Translate using Virtual Ribosome with the settings we chose under Step 2 above.

The result from the ORF finder:

VIRTUAL RIBOSOME
----------------
Translation table: Standard SGC0

>CLONE12_rframe1_ORF
Reading frame: 1

N G H G T H V A G T V A A V N N N G I G V A G V A G G N G S
5' AACGGGCACGGGACGCATGTAGCTGGAACAGTGGCAGCCGTAAATAATAATGGTATCGGAGTTGCCGGGGTTGCAGGAGGAAACGGCTCT 90
..........................................................................................

T N S G A R L M S T Q I F N S D G D Y T N S E T L V Y R A I
5' ACCAATAGTGGAGCAAGGTTAATGTCCACACAAATTTTTAATAGTGATGGGGATTATACAAATAGCGAAACTCTTGTGTACAGAGCCATT 180
.....................>>>..................................................................

V Y G A D N G A V I S Q N S W G S Q S L T I K E L Q K A A I
5' GTTTATGGTGCAGATAACGGAGCTGTGATCTCGCAAAATAGCTGGGGTAGTCAGTCTCTGACTATTAAGGAGTTGCAGAAAGCTGCGATC 270
.........................................................)))............)))...............

D Y F I D Y A G M D E T G E I Q T G P M R G G I F I A A A G
5' GACTATTTCATTGATTATGCAGGAATGGACGAAACAGGAGAAATACAGACAGGCCCTATGAGGGGAGGTATATTTATAGCTGCCGCCGGA 360
........................>>>..............................>>>..............................

N D N V S T P N M P S A Y E R V L A V A S M G P D F T K A S
5' AACGATAACGTTTCCACTCCAAATATGCCTTCAGCTTATGAACGGGTTTTAGCTGTGGCCTCAATGGGACCAGATTTTACTAAGGCAAGC 450
........................>>>....................................>>>........................

Y S T F G T W T D I T A P G G D I D K F D L S E Y G V L S T
5' TATAGCACTTTTGGAACATGGACTGATATTACTGCTCCTGGCGGAGATATTGACAAATTTGATTTGTCAGAATACGGAGTTCTCAGCACT 540
...............................................................)))........................

Y A D N Y Y A Y G E G T S M A C P H V A G A A
5' TATGCCGATAATTATTATGCTTATGGAGAGGGAACATCCATGGCTTGTCCACATGTCGCCGGCGCCGCC 609
.......................................>>>...........................

('''Tip:''' Remember that you can get the sequence in FASTA format via the FASTA link on the result page):

>CLONE12_rframe1_ORF
NGHGTHVAGTVAAVNNNGIGVAGVAGGNGSTNSGARLMSTQIFNSDGDYTNSETLVYRAI
VYGADNGAVISQNSWGSQSLTIKELQKAAIDYFIDYAGMDETGEIQTGPMRGGIFIAAAG
NDNVSTPNMPSAYERVLAVASMGPDFTKASYSTFGTWTDITAPGGDIDKFDLSEYGVLST
YADNYYAYGEGTSMACPHVAGAA

3) BLASTP



We get several very significant hits. When looking at the top hits and disregarding "hypothetical" and "uncharacterized" proteins, we can see that the rest are almost all serine proteases. Some of them are described as belonging to the S8 family.

[[image:NCBI BlastP_CLONE12_rframe1_ORF new version.png]]

Let's take a closer look at the first hit that is not "uncharacterized":
[[image:NCBI_BlastP_CLONE12_best_hit.png]]

Note that although it is not a perfect hit (our query sequence itself is not in the database), it looks reasonable: the alignment covers a large part of the query, with an identity of 54% and a similarity (Positives) of 69%.

Taken together with the fact that almost all the best non-hypothetical hits are serine proteases, we have a very strong indication that our mystery sequence, CLONE12, is a peptidase or protease of the S8 family.

==Part 4: BLAST'ing Genomes==

===QUESTION 4.1===
''What information is given about the relationship between this gene and the gene "HTA1"?''

They are nearly identical ("one of two nearly identical (see also HTA1) subtypes").

Protein sequence:
>YBL003C
MSGGKGGKAGSAAKASQSRSAKAGLTFPVGRVHRLLRRGNYAQRIGSGAPVYLTAVLEYL
AAEILELAGNAARDNKKTRIIPRHLQLAIRNDDELNKLLGNVTIAQGGVLPNIHQNLLPK
KSAKTAKASQEL*

===QUESTION 4.2===
*''How many high-confidence hits do we get?'':
:3 — HTA1, HTA2 and HTZ1.
:Note: If you click on the Gene links for the two top hits, you will see that one is HTA1 and the other is HTA2.

*''Do the hits make sense, from what you have read about HTA2 at the SGD webpage?'':
:Yes; HTA1 and HTA2 are indeed nearly identical (only 2 amino acids differ).

===QUESTION 4.3===

*''How many high-confidence hits (with E-value better than 10<sup>-10</sup>) are found?''
:Answer: 29.

FAQ

2024-03-15T10:52:28Z

WikiSysop: /* Sequence weighting / Clustering */

== Practical information ==

=== Exam ===
* ''How do I find out where and when the exam is held?''
At http://www.eksamensplan.dtu.dk/ .

* ''Which online platform will you use for the exam?''
This year we will be using Digital Exam (the new interface) which is accessed via https://eksamen.dtu.dk/ .

We will ''not'' be using the old interface via http://onlineeksamen.dtu.dk/ .



=== Re-exam ===
* ''When will there be a re-exam?''
For those of you who did not pass, did not hand in, or deregistered from the exam, there will be an oral re-exam during May. The exact date and time are negotiable. Please note that you have to sign up for the re-exam in the study admin system.

* ''How will the re-exam take place?''
You draw a random written question which contains a minor practical task (an alignment, a BLAST search, a phylogeny or similar). Then you have 30 minutes of preparation time to solve the given task using your own computer. You will have access to the net. Leave all relevant browser windows/tabs open, so that you can afterwards show what you have done. The examination will then last approximately 20 minutes and begin with your own presentation of what you have done to solve the task. Depending on how long your presentation takes, we will also ask questions about other parts of the course curriculum. The grade will be given immediately after the exam.

== Bioinformatics in general ==

=== Protein to DNA ===
* ''How can I convert my protein sequence to DNA in FASTA format?''
Generally, you cannot "convert" protein sequence to DNA sequence, there is simply some information missing (the same protein sequence can originate from many different DNA sequences due to the redundancy in the genetic code). But if you have located a protein in UniProt, you can usually find one or more cross-references to the nucleotide sequence databases.
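The amount of missing information can be illustrated by counting how many different DNA sequences encode the same peptide. The Python sketch below uses the standard genetic code; the function name <tt>count_encodings</tt> is made up, and the example peptide is the first ten residues of the human preproinsulin sequence used in the alignment exercise.

```python
from collections import Counter
from math import prod

# Standard genetic code: one letter per codon, codons enumerated in the order T, C, A, G
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS_PER_RESIDUE = Counter(AMINO)   # e.g. 6 codons for L, 1 for M and W


def count_encodings(protein):
    """Number of distinct DNA sequences that translate to the given peptide."""
    return prod(CODONS_PER_RESIDUE[aa] for aa in protein.upper())


print(count_encodings("MW"))          # only one codon each
print(count_encodings("MALW"))
print(count_encodings("MALWMRLLPL"))
```

Even for a peptide of ten residues there are more than a hundred thousand candidate DNA sequences, which is why the real nucleotide sequence has to be looked up through the database cross-references.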

== GenBank ==
=== LOCUS / Accession / Version ===
* ''I'm in doubt about the difference between Locus, Accession and Version in GenBank.''
Each entry in GenBank has one and only one '''Locus''' code, which identifies the entry. Then it has one ''or more'' '''Accession''' codes, of which one is usually identical to the Locus code. Multiple accession codes suggest that the entry is a fusion of several entries from an earlier version of the database. Finally, the '''Version''' is the Locus code followed by a dot and a number which refers to the version of the ''sequence'' in the entry. If the number is higher than 1, it means that the sequence has been updated since the creation of the entry. See example below.
LOCUS AH002844 4969 bp DNA linear PRI 10-JUN-2016
DEFINITION Homo sapiens insulin (INS) gene, complete cds.
ACCESSION AH002844 J00265 J00268
VERSION AH002844.2
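
The same three fields can be pulled out of the header programmatically. The sketch below is only an illustration of the relationship described above; it assumes whitespace-separated header lines exactly as in the example, and the function name <tt>parse_genbank_header</tt> is hypothetical.

```python
def parse_genbank_header(text: str) -> dict:
    """Extract Locus, Accession(s) and Version from GenBank header lines."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        key, values = parts[0], parts[1:]
        if key == "LOCUS":
            fields["locus"] = values[0]
        elif key == "ACCESSION":
            # One or more accession codes; the first is the primary one.
            fields["accessions"] = values
        elif key == "VERSION":
            # The number after the dot is the version of the sequence.
            _, _, number = values[0].partition(".")
            fields["version"] = values[0]
            fields["sequence_version"] = int(number)
    return fields


HEADER = """\
LOCUS AH002844 4969 bp DNA linear PRI 10-JUN-2016
DEFINITION Homo sapiens insulin (INS) gene, complete cds.
ACCESSION AH002844 J00265 J00268
VERSION AH002844.2
"""

if __name__ == "__main__":
    print(parse_genbank_header(HEADER))
```

For the insulin entry, the sequence version is 2, meaning that the sequence has been updated once since the entry was created.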

== UniProt ==

=== Old UniProt questions ===
* ''I'm trying to solve this UniProt question in an old exam set, and I cannot get the number of hits to conform with the answer. What am I doing wrong?''
The answers are not updated every year. You cannot expect the number of hits to stay constant, since the database is growing over time. If your ''search string'' conforms to the answer, it's fine.
* ''But I cannot get the search string to conform with the answer, either?''
This is because of the UniProt 2022 interface change. Unfortunately, they also changed the syntax of the search strings.

=== Transmembrane proteins ===
* ''I'm in doubt about the difference between "<tt>annotation:(type:transmem)</tt>" and "<tt>annotation:(type:location "pass membrane")</tt>". The second one gives many more hits than the first one. Why?''
The difference is that search string #1 refers to a Feature Table (FT line) annotation and search string #2 refers to a comment (CC line) annotation. Thereby, #1 chooses only those proteins that have information about ''where'' in the sequence the transmembrane segments are, while #2 chooses all proteins known to have at least one transmembrane segment.

== Pairwise alignment ==
=== Gaps ===
* ''What are gaps precisely?''
Remember that a pairwise alignment is a hypothesis about two sequences being related through evolution. A gap is then a hypothesis about an insertion or a deletion that has taken place during that evolution.

* ''Why do you say there are only four gaps in the alignment shown here? Below the alignment, it is written that there are seven?''
[[File:gaps-2014-1g.JPG]]

Gaps can have different lengths; a gap can comprise one or several positions. In the example, there are three gaps of length one, and one gap of length four. That gives seven ''positions'' with gaps in total, but still only four gaps.
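
The distinction between gaps and gap positions can be sketched in a few lines of code. The alignment below is a made-up example (it is not the one in the figure) that happens to have the same composition: three gaps of length one and one gap of length four.

```python
import re


def gap_stats(aligned_seq: str) -> tuple[int, int]:
    """Return (number of gaps, number of gap positions) for one aligned sequence."""
    runs = re.findall(r"-+", aligned_seq)  # each run of '-' is one gap
    return len(runs), sum(len(run) for run in runs)


def alignment_gap_stats(seq_a: str, seq_b: str) -> tuple[int, int]:
    """Sum gaps and gap positions over both sequences of a pairwise alignment."""
    gaps_a, positions_a = gap_stats(seq_a)
    gaps_b, positions_b = gap_stats(seq_b)
    return gaps_a + gaps_b, positions_a + positions_b


# Hypothetical pairwise alignment, for illustration only.
SEQ_A = "ACG-TTGACCA----GTA"
SEQ_B = "ACGATT-ACCATGCAG-A"

if __name__ == "__main__":
    gaps, positions = alignment_gap_stats(SEQ_A, SEQ_B)
    print(f"{gaps} gaps covering {positions} positions")
```

A program that reports "Gaps: 7" is counting positions, whereas counting runs of consecutive gap characters gives four.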

== Protein structure, PDB & PyMOL ==


=== Fetch in PyMOL ===
* ''What do I do if the <tt>fetch</tt> command does not work in PyMOL?''
It is perfectly possible to use PyMOL without <tt>fetch</tt>:
# Go to the [https://www.rcsb.org/ PDB homepage] and locate the structure you wanted to fetch;
# Click Download files in the top right corner, choose PDB format, and download the PDB file to your own computer;
# Click File → Open in the PyMOL menu and choose the file you just downloaded.
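
If you prefer to script the download step, something like the following sketch may work. It assumes that the RCSB file server offers entries at <tt>https://files.rcsb.org/download/ID.pdb</tt>; this URL pattern is an assumption of the example and may change, and the function names are hypothetical. The manual procedure above remains the safe fallback.

```python
import os
from urllib.request import urlretrieve


def pdb_download_url(pdb_id: str) -> str:
    """Build the download URL for a PDB entry (URL pattern is an assumption)."""
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def download_pdb(pdb_id: str, directory: str = ".") -> str:
    """Download a PDB file and return the local path (requires network access)."""
    path = os.path.join(directory, f"{pdb_id.lower()}.pdb")
    urlretrieve(pdb_download_url(pdb_id), path)
    return path


# Example usage (1CRN is just an example identifier):
#   path = download_pdb("1crn")
#   ...then open `path` via File -> Open in PyMOL.
```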

=== Background ===
* ''Why have you, in several answers to exam questions, made the background white?''
White background is usually better if you want to print the result (particularly on an inkjet printer!).

== BLAST ==

=== Choice of database ===

* ''I have problems choosing the right database when BLASTing, can you give some guidance?''
Here are some rules of thumb:
* For both '''blastp''' and '''blastn''', you should use nr (called nr/nt in blastn), if you want to search as widely as possible ("everything").
* In '''blastp''', you can use swissprot, if you specifically want to search for a ''reviewed entry'' from UniProt (UniProtKB/SwissProt).
* In '''blastp''', you can use pdb, if you specifically want to search for a ''structure''.
* When using '''PSI-BLAST''', you should always choose nr for ''constructing'' the PSSM, so that there is as much material as possible to work with. Then, you can choose a more narrow database when ''reusing'' the PSSM in a search.
* In '''blastn''', you can use Human genomic + transcript or Mouse genomic + transcript, if you specifically want to search in one of these two organisms.
* In both '''blastp''' and '''blastn''' you can use the Organism field to specify an organism or a taxonomic group.

=== Error: "Query contains no sequence data" ===
* ''Help! BLAST gives me the error message "Message ID#32 Error: Query contains no data: Query contains no sequence data" even though I pasted in a FASTA sequence!''
Occasionally, the input field in BLAST fails to "understand" newlines and regards your input as one long line (containing nothing but a FASTA header). The workaround is to remove the header and only paste your sequence.
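
If you have many sequences to paste, removing the header by hand gets tedious. A minimal sketch of the workaround, assuming a single-entry FASTA text (the function name and the example header are made up):

```python
def sequence_only(fasta_text: str) -> str:
    """Drop FASTA header lines and join the sequence lines into one string."""
    return "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line.strip() and not line.startswith(">")
    )


if __name__ == "__main__":
    example = ">hypothetical_protein example header\nMKTAYIAK\nQRQISFVK\n"
    print(sequence_only(example))
```

The resulting string contains no header and no newlines, so it can be pasted into the BLAST input field as a bare sequence.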

== Logo plots and weight matrices ==

=== Sequence logos ===

* ''When making a sequence logo, should I choose [http://weblogo.berkeley.edu/ WebLogo] or [https://services.healthtech.dtu.dk/services/Seq2Logo-2.0/ Seq2Logo]?''
For '''amino acid sequences''', you can use both. However, in Seq2Logo you should remember to set the Logo type to Shannon (where Kullback-Leibler is the default). In addition, you should set Clustering method to None and weight on prior to 0 (zero), if you want results that are comparable to those of WebLogo.
For '''nucleotide sequences''', you should use WebLogo.

=== WebLogo ===

* ''WebLogo is giving me the error message "Error: Invalid input format does not conform to FASTA, CLUSTAL, or Flat", but I know my file is a valid FASTA file!?''
Yes, WebLogo sometimes gives this error without reason when you try to upload a file. The workaround is to paste the contents of your file into the window instead of uploading the file.

* ''There is a 100% conserved position in my data. Shouldn't the information content then be 2 bits (nucleotides) or 4.3 bits (amino acids)? Why is it lower in the WebLogo output?''
That's because of the limited size of your data set. WebLogo by default applies a "[http://weblogo.berkeley.edu/info.html#ssc Small Sample Correction]" that shows only how much information is ''significantly'' above random. The smaller the sample, the lower the significant information. You can deselect this, if you want.
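
The effect can be reproduced with a small calculation. The sketch below computes the information content of one alignment column as log2(s) minus the Shannon entropy, and optionally subtracts the commonly used approximate small sample correction e(n) = (s - 1) / (2 ln 2 · n), where s is the alphabet size and n the number of sequences. That the web server uses exactly this approximation is an assumption of the example.

```python
from collections import Counter
from math import log, log2


def information_content(column: str, alphabet_size: int,
                        small_sample_correction: bool = True) -> float:
    """Information content (bits) of one alignment column."""
    n = len(column)
    counts = Counter(column)
    # Shannon entropy of the observed frequencies.
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    # Approximate correction for small samples (assumed form, see text).
    correction = 0.0
    if small_sample_correction:
        correction = (alphabet_size - 1) / (2 * log(2) * n)
    return log2(alphabet_size) - (entropy + correction)


if __name__ == "__main__":
    column = "A" * 10  # a 100% conserved nucleotide position, 10 sequences
    print(information_content(column, 4, small_sample_correction=False))
    print(information_content(column, 4, small_sample_correction=True))
```

Without the correction, a fully conserved nucleotide column gives the full 2 bits; with it, the value is lower and approaches 2 bits only as the number of sequences grows.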

=== Pseudocounts ===

* ''What is actually the purpose of pseudocounts?''
The purpose of pseudocounts is to compensate for the fact that we have a limited amount of data. From the amino acids we have observed at a certain position, we try to estimate what the probabilities of the amino acids would have been if we had access to an infinitely large amount of sequences. The smaller our dataset is, the larger the importance of pseudocounts will be.

* ''When you are looking up e.g. q(G|S) in the [http://www.cbs.dtu.dk/dtucourse/27611spring2016/BLOSUM62-probabilities.txt table], do you go horizontally first, or vertically?''
The amino acid before "|" determines the column, while the amino acid after "|" determines the row. To look up q(G|S), use column G and row S, yielding 0.07.

* ''How do I do the calculations, if I'm asked to ignore pseudocounts?''
Ignoring pseudocounts means setting weight on pseudo counts (weight on prior, β) equal to 0. Then, pa = fa, and there is no reason to calculate ga.

* ''When do you choose β = 10,000?''
In one of the exercise questions, we set β to 10,000 (an arbitrary but very high value) in order to simplify calculations. When β is very high, α becomes negligible in comparison, and you can set pa = ga. In practical use, you would never use such a high value.
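
Both limiting cases can be checked numerically. The sketch below assumes the usual weighted-average form pa = (α·fa + β·ga) / (α + β); the two-letter alphabet and the values of fa, ga and α are invented purely for illustration.

```python
def pseudocount_estimate(f: dict, g: dict, alpha: float, beta: float) -> dict:
    """Combine observed frequencies f and pseudo frequencies g.

    Assumed formula: p_a = (alpha * f_a + beta * g_a) / (alpha + beta).
    """
    return {a: (alpha * f[a] + beta * g[a]) / (alpha + beta) for a in f}


# Made-up numbers for a toy alphabet with only two amino acids.
f = {"G": 0.75, "S": 0.25}  # observed frequencies
g = {"G": 0.40, "S": 0.60}  # pseudo frequencies
alpha = 9                   # weight on the observed data

if __name__ == "__main__":
    for beta in (0, 50, 10_000):
        p = pseudocount_estimate(f, g, alpha, beta)
        print(beta, {a: round(value, 3) for a, value in p.items()})
```

With β = 0 the estimate equals fa, with β = 10,000 it is practically equal to ga, and for intermediate values it lies between the two.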

=== Sequence weighting / Clustering ===
* ''Where in the [https://teaching.healthtech.dtu.dk/material/22111/Estimationofpseudocounts_new+examples.pdf equations] do I find how to do sequence weighting?''
You don't. Sequence weighting is not something you will be asked to do manually.

* ''When would you choose clustering rather than no clustering?''
Briefly, sequence weighting / clustering means that sequences that are very similar are grouped together and weighted down so that they count as one sequence in the calculation of the observed amino acid frequencies (fa). This can make a big difference if a group of sequences are very similar, see the example with the small training set in the weight matrix exercise.
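
The following sketch shows the idea on a tiny made-up data set. It uses a simple greedy clustering on sequence identity, in which each cluster gets a total weight of one; this is an assumption chosen for illustration and not necessarily the algorithm used by the course tools. The peptides and the 80% threshold are also invented.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equally long sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_weights(seqs: list, threshold: float = 0.8) -> list:
    """Greedy clustering; each sequence gets weight 1 / (size of its cluster)."""
    clusters = []  # each cluster is a list of sequence indices
    for i, seq in enumerate(seqs):
        for cluster in clusters:
            # Compare with the first member (the cluster representative).
            if identity(seq, seqs[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    weights = [0.0] * len(seqs)
    for cluster in clusters:
        for i in cluster:
            weights[i] = 1.0 / len(cluster)
    return weights


def weighted_frequencies(seqs: list, weights: list, position: int) -> dict:
    """Observed amino acid frequencies (fa) at one position, using weights."""
    total = sum(weights)
    freqs = {}
    for seq, weight in zip(seqs, weights):
        freqs[seq[position]] = freqs.get(seq[position], 0.0) + weight / total
    return freqs


# Three nearly identical peptides and one unrelated peptide (made up).
PEPTIDES = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "GMNERPILT"]

if __name__ == "__main__":
    no_clustering = [1.0] * len(PEPTIDES)
    print(weighted_frequencies(PEPTIDES, no_clustering, 0))
    print(weighted_frequencies(PEPTIDES, cluster_weights(PEPTIDES), 0))
```

Without clustering, the three similar peptides dominate the first position; with clustering, they count as one sequence and the unrelated peptide gets equal influence.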

=== EasyPred ===
* ''Help! I cannot open the para.dat file ("Parameters for prediction method") from EasyPred.''
When looking at "Parameters for prediction method", it is important that you ''right-click'' the link and choose Save link as.... If you just click the link, your browser may think that it is a video file, because some video files have the extension ".dat". If the right-click approach does not work either, then try another browser. After you have downloaded the para.dat file, you can open it in jEdit (or another plain text editor).

* ''Help! I pasted in the full sequence of a protein under "evaluation examples", but get an error such as: "Error reading eval data: Peptide number 1: "MEINVS..." are not 20 long"''
The problem is that you did not paste the sequence in FASTA format, i.e. including the header. EasyPred looks for the header (with the ">" character) in order to determine whether the input is in FASTA format (a full sequence that should be scanned) or in peptide format (where each peptide must have the same length as the model).
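
The decision described above can be sketched as follows. This is not EasyPred's actual code, only an illustration of the logic under the assumption of a single FASTA entry; the function name and the error wording are hypothetical.

```python
def read_evaluation_input(text: str, peptide_length: int) -> list:
    """Return the peptides to score, depending on the input format."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if any(line.startswith(">") for line in lines):
        # FASTA format: scan the full sequence with overlapping windows.
        sequence = "".join(line for line in lines if not line.startswith(">"))
        return [
            sequence[i:i + peptide_length]
            for i in range(len(sequence) - peptide_length + 1)
        ]
    # Peptide format: every line must match the length of the model.
    for number, peptide in enumerate(lines, start=1):
        if len(peptide) != peptide_length:
            raise ValueError(
                f"Peptide number {number}: {peptide[:6]}... "
                f"is not {peptide_length} long"
            )
    return lines


if __name__ == "__main__":
    print(read_evaluation_input(">demo\nABCDEFG", 5))
```

Without the header line, the same sequence would be treated as a single peptide of the wrong length, which corresponds to the error message in the question.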

== PSI-BLAST ==

* ''I'm in doubt about when to use PSI-BLAST instead of normal BLASTP.''
PSI-BLAST should be used, if a normal BLASTP doesn't find what you are looking for.

If you e.g. are asked to find a matching structure, but normal BLASTP doesn't give you any PDB hits, it could be a good idea to run PSI-BLAST (running against '''nr''' first so that there are some hits to build a profile from, and then after 2-3 rounds download the PSSM and use it to search in PDB, see details in the PSI-BLAST exercise).

A similar situation could be that you are asked to find a homolog to a query protein in a specific organism or taxonomic group, but a normal BLASTP doesn't give you any hits in that group. Then, you can use the same procedure (again, see the PSI-BLAST exercise for details).

Yet another situation could be that you are asked to find a plausible function for a protein, but a normal BLASTP only gives you hypothetical hits; here, PSI-BLAST could also be a possibility.

== Phylogeny ==

* ''What is the difference between Taxonomy and Phylogeny?''
Taxonomy is a broader concept than Phylogeny; it implies classification of organisms by any method (including Linnaeus' classification). Phylogeny implies evolutionary relationships. Modern taxonomy is, to a very high degree, phylogeny-based. Note that phylogeny is not necessarily molecular, it can be based on morphological characters such as organs or bones (the latter, of course, being tremendously important in classification of fossils).

=== 2017 exam, question 3b ===
* ''The tree I got from the Common Tree builder does not look like the one in the answer?''
You need to mark the include unranked (phylogenetic) taxa checkbox in the Common tree page to see the Terrabacteria group that links the two Gram-positive phyla together.

FAQ

2024-03-15T10:51:49Z

WikiSysop: /* Sequence logos */

== Practical information ==

=== Exam ===
* ''How do I find out where and when the exam is held?''
At http://www.eksamensplan.dtu.dk/ .

* ''Which online platform will you use for the exam?''
This year we will be using Digital Exam (the new interface) which is accessed via https://eksamen.dtu.dk/ .

We will ''not'' be using the old interface via http://onlineeksamen.dtu.dk/ .



=== Re-exam ===
* ''When will there be a re-exam?''
For those of you who either do not pass, or do not hand in, or signed off the exam, there will be an oral re-exam during May. The exact date and time is negotiable. Please note that you have to sign up for the re-exam in the study admin system.

* ''How will the re-exam take place?''
You draw a random written question which contains a minor practical task (an alignment, a BLAST search, a phylogeny or similar). Then you have 30 minutes of preparation time to solve the given task using your own computer. You will have access to the net. Leave all relevant browser windows/tabs open, so that you can afterwards show what you have done. The examination will then last approximately 20 minutes and begin with your own presentation of what you have done to solve the task. Depending on how long your presentation takes, we will also ask questions about other parts of the course curriculum. The grade will be given immediately after the exam.

== Bioinformatics in general ==

=== Protein to DNA ===
* ''How can I convert my protein sequence to DNA in FASTA format?''
Generally, you cannot "convert" a protein sequence to a DNA sequence; some information is simply missing (the same protein sequence can originate from many different DNA sequences due to the redundancy in the genetic code). But if you have located a protein in UniProt, you can usually find one or more cross-references to the nucleotide sequence databases.

== GenBank ==
=== LOCUS / Accession / Version ===
* ''I'm in doubt about the difference between Locus, Accession and Version in GenBank.''
Each entry in GenBank has one and only one '''Locus''' code, which identifies the entry. Then it has one ''or more'' '''Accession''' codes, of which one is usually identical to the Locus code. Multiple accession codes suggest that the entry is a fusion of several entries from an earlier version of the database. Finally, the '''Version''' is the Locus code followed by a dot and a number which refers to the version of the ''sequence'' in the entry. If the number is higher than 1, it means that the sequence has been updated since the creation of the entry. See example below.
LOCUS AH002844 4969 bp DNA linear PRI 10-JUN-2016
DEFINITION Homo sapiens insulin (INS) gene, complete cds.
ACCESSION AH002844 J00265 J00268
VERSION AH002844.2

== UniProt ==

=== Old UniProt questions ===
* ''I'm trying to solve this UniProt question in an old exam set, and I cannot get the number of hits to conform with the answer. What am I doing wrong?''
The answers are not updated every year. You cannot expect the number of hits to stay constant, since the database is growing over time. If your ''search string'' conforms to the answer, it's fine.
* ''But I cannot get the search string to conform with the answer, either?''
This is because of the UniProt 2022 interface change. Unfortunately, they also changed the syntax of the search strings.

=== Transmembrane proteins ===
* ''I'm in doubt about the difference between "<tt>annotation:(type:transmem)</tt>" and "<tt>annotation:(type:location "pass membrane")</tt>". The second one gives many more hits than the first one. Why?''
The difference is that search string #1 refers to a Feature Table (FT line) annotation and search string #2 refers to a comment (CC line) annotation. Thereby, #1 chooses only those proteins that have information about ''where'' in the sequence the transmembrane segments are, while #2 chooses all proteins known to have at least one transmembrane segment.

== Pairwise alignment ==
=== Gaps ===
* ''What are gaps precisely?''
Remember that a pairwise alignment is a hypothesis about two sequences being related through evolution. A gap is then a hypothesis about an insertion or a deletion that has taken place during that evolution.

* ''Why do you say there are only four gaps in the alignment shown here? Below the alignment, it is written that there are seven?''
[[File:gaps-2014-1g.JPG]]

Gaps can have different lengths; a gap can comprise one or several positions. In the example, there are three gaps of length one, and one gap of length four. That gives seven ''positions'' with gaps in total, but still only four gaps.

== Protein structure, PDB & PyMOL ==


=== Fetch in PyMOL ===
* ''What do I do if the <tt>fetch</tt> command does not work in PyMOL?''
It is perfectly possible to use PyMOL without <tt>fetch</tt>:
# Go to the [https://www.rcsb.org/ PDB homepage] and locate the structure you wanted to fetch;
# Click Download files in the top right corner, choose PDB format, and download the PDB file to your own computer;
# Click File → Open in the PyMOL menu and choose the file you just downloaded.

=== Background ===
* ''Why have you, in several answers to exam questions, made the background white?''
White background is usually better if you want to print the result (particularly on an inkjet printer!).

== BLAST ==

=== Choice of database ===

* ''I have problems choosing the right database when BLASTing, can you give some guidance?''
Here are some rules of thumb:
* For both '''blastp''' and '''blastn''', you should use nr (called nr/nt in blastn), if you want to search as widely as possible ("everything").
* In '''blastp''', you can use swissprot, if you specifically want to search for a ''reviewed entry'' from UniProt (UniProtKB/SwissProt).
* In '''blastp''', you can use pdb, if you specifically want to search for a ''structure''.
* When using '''PSI-BLAST''', you should always choose nr for ''constructing'' the PSSM, so that there is as much material as possible to work with. Then, you can choose a more narrow database when ''reusing'' the PSSM in a search.
* In '''blastn''', you can use Human genomic + transcript or Mouse genomic + transcript, if you specifically want to search in one of these two organisms.
* In both '''blastp''' and '''blastn''' you can use the Organism field to specify an organism or a taxonomic group.

=== Error: "Query contains no sequence data" ===
* ''Help! BLAST gives me the error message "Message ID#32 Error: Query contains no data: Query contains no sequence data" even though I pasted in a FASTA sequence!''
Occasionally, the input field in BLAST fails to "understand" newlines and regards your input as one long line (containing nothing but a FASTA header). The workaround is to remove the header and only paste your sequence.

== Logo plots and weight matrices ==

=== Sequence logos ===

* ''When making a sequence logo, should I choose [http://weblogo.berkeley.edu/ WebLogo] or [https://services.healthtech.dtu.dk/services/Seq2Logo-2.0/ Seq2Logo]?''
For '''amino acid sequences''', you can use both. However, in Seq2Logo you should remember to set the Logo type to Shannon (where Kullback-Leibler is the default). In addition, you should set Clustering method to None and weight on prior to 0 (zero), if you want results that are comparable to those of WebLogo.
For '''nucleotide sequences''', you should use WebLogo.

=== WebLogo ===

* ''WebLogo is giving me the error message "Error: Invalid input format does not conform to FASTA, CLUSTAL, or Flat", but I know my file is a valid FASTA file!?''
Yes, WebLogo sometimes gives this error without reason when you try to upload a file. The workaround is to paste the contents of your file into the window instead of uploading the file.

* ''There is a 100% conserved position in my data. Shouldn't the information content then be 2 bits (nucleotides) or 4.3 bits (amino acids)? Why is it lower in the WebLogo output?''
That's because of the limited size of your data set. WebLogo by default applies a "[http://weblogo.berkeley.edu/info.html#ssc Small Sample Correction]" that shows only how much information is ''significantly'' above random. The smaller the sample, the lower the significant information. You can deselect this, if you want.

=== Pseudocounts ===

* ''What is actually the purpose of pseudocounts?''
The purpose of pseudocounts is to compensate for the fact that we have a limited amount of data. From the amino acids we have observed at a certain position, we try to estimate what the probabilities of the amino acids would have been if we had access to an infinitely large amount of sequences. The smaller our dataset is, the larger the importance of pseudocounts will be.

* ''When you are looking up e.g. q(G|S) in the [http://www.cbs.dtu.dk/dtucourse/27611spring2016/BLOSUM62-probabilities.txt table], do you go horizontally first, or vertically?''
The amino acid before "|" determines the column, while the amino acid after "|" determines the row. To look up q(G|S), use column G and row S, yielding 0.07.

* ''How do I do the calculations, if I'm asked to ignore pseudocounts?''
Ignoring pseudocounts means setting weight on pseudo counts (weight on prior, β) equal to 0. Then, pa = fa, and there is no reason to calculate ga.

* ''When do you choose β = 10,000?''
In one of the exercise questions, we set β to 10,000 (an arbitrary but very high value) in order to simplify calculations. When β is very high, α becomes negligible in comparison, and you can set pa = ga. In practical use, you would never use such a high value.

=== Sequence weighting / Clustering ===
* ''Where in the [[Media:Estimationofpseudocounts_new+examples.pdf|equations]] do I find how to do sequence weighting?''
You don't. Sequence weighting is not something you will be asked to do manually.

* ''When would you choose clustering rather than no clustering?''
Briefly, sequence weighting / clustering means that sequences that are very similar are grouped together and weighted down so that they count as one sequence in the calculation of the observed amino acid frequencies (fa). This can make a big difference if a group of sequences are very similar, see the example with the small training set in the weight matrix exercise.

=== EasyPred ===
* ''Help! I cannot open the para.dat file ("Parameters for prediction method") from EasyPred.''
When looking at "Parameters for prediction method", it is important that you ''right-click'' the link and choose Save link as.... If you just click the link, your browser may think that it is a video file, because some video files have the extension ".dat". If the right-click approach does not work either, then try another browser. After you have downloaded the para.dat file, you can open it in jEdit (or another plain text editor).

* ''Help! I pasted in the full sequence of a protein under "evaluation examples", but get an error such as: "Error reading eval data: Peptide number 1: "MEINVS..." are not 20 long"''
The problem is that you did not paste the sequence in FASTA format, i.e. including the header. EasyPred looks for the header (with the ">" character) in order to determine whether the input is in FASTA format (a full sequence that should be scanned) or in peptide format (where each peptide must have the same length as the model).

== PSI-BLAST ==

* ''I'm in doubt about when to use PSI-BLAST instead of normal BLASTP.''
PSI-BLAST should be used, if a normal BLASTP doesn't find what you are looking for.

If you e.g. are asked to find a matching structure, but normal BLASTP doesn't give you any PDB hits, it could be a good idea to run PSI-BLAST (running against '''nr''' first so that there are some hits to build a profile from, and then after 2-3 rounds download the PSSM and use it to search in PDB, see details in the PSI-BLAST exercise).

A similar situation could be that you are asked to find a homolog to a query protein in a specific organism or taxonomic group, but a normal BLASTP doesn't give you any hits in that group. Then, you can use the same procedure (again, see the PSI-BLAST exercise for details).

Yet another situation could be that you are asked to find a plausible function for a protein, but a normal BLASTP only gives you hypothetical hits; here, PSI-BLAST could also be a possibility.

== Phylogeny ==

* ''What is the difference between Taxonomy and Phylogeny?''
Taxonomy is a broader concept than Phylogeny; it implies classification of organisms by any method (including Linnaeus' classification). Phylogeny implies evolutionary relationships. Modern taxonomy is, to a very high degree, phylogeny-based. Note that phylogeny is not necessarily molecular, it can be based on morphological characters such as organs or bones (the latter, of course, being tremendously important in classification of fossils).

=== 2017 exam, question 3b ===
* ''The tree I got from the Common Tree builder does not look like the one in the answer?''
You need to mark the include unranked (phylogenetic) taxa checkbox in the Common tree page to see the Terrabacteria group that links the two Gram-positive phyla together.

File:Gaps-2014-1g.JPG

2024-03-15T10:51:22Z

WikiSysop:

FAQ

2024-03-15T10:50:32Z

WikiSysop: Created page with "== Practical information == === Exam === * ''How do I find out where and when the exam is held?'' At http://www.eksamensplan.dtu.dk/ . * ''Which online platform will you use for the exam?'' This year we will be using Digital Exam (the new interface) which is accessed via https://eksamen.dtu.dk/ . We will ''not'' be using the old interface via http://onlineeksamen.dtu.dk/ . <!-- === COVID-19 information 2021 === This year, the written exam will be a '''home exam'''...."

== Practical information ==

=== Exam ===
* ''How do I find out where and when the exam is held?''
At http://www.eksamensplan.dtu.dk/ .

* ''Which online platform will you use for the exam?''
This year we will be using Digital Exam (the new interface) which is accessed via https://eksamen.dtu.dk/ .

We will ''not'' be using the old interface via http://onlineeksamen.dtu.dk/ .



=== Re-exam ===
* ''When will there be a re-exam?''
For those of you who either do not pass, or do not hand in, or signed off the exam, there will be an oral re-exam during May. The exact date and time is negotiable. Please note that you have to sign up for the re-exam in the study admin system.

* ''How will the re-exam take place?''
You draw a random written question which contains a minor practical task (an alignment, a BLAST search, a phylogeny or similar). Then you have 30 minutes of preparation time to solve the given task using your own computer. You will have access to the net. Leave all relevant browser windows/tabs open, so that you can afterwards show what you have done. The examination will then last approximately 20 minutes and begin with your own presentation of what you have done to solve the task. Depending on how long your presentation takes, we will also ask questions about other parts of the course curriculum. The grade will be given immediately after the exam.

== Bioinformatics in general ==

=== Protein to DNA ===
* ''How can I convert my protein sequence to DNA in FASTA format?''
Generally, you cannot "convert" a protein sequence to a DNA sequence; some information is simply missing (the same protein sequence can originate from many different DNA sequences due to the redundancy in the genetic code). But if you have located a protein in UniProt, you can usually find one or more cross-references to the nucleotide sequence databases.

== GenBank ==
=== LOCUS / Accession / Version ===
* ''I'm in doubt about the difference between Locus, Accession and Version in GenBank.''
Each entry in GenBank has one and only one '''Locus''' code, which identifies the entry. Then it has one ''or more'' '''Accession''' codes, of which one is usually identical to the Locus code. Multiple accession codes suggest that the entry is a fusion of several entries from an earlier version of the database. Finally, the '''Version''' is the Locus code followed by a dot and a number which refers to the version of the ''sequence'' in the entry. If the number is higher than 1, it means that the sequence has been updated since the creation of the entry. See example below.
LOCUS AH002844 4969 bp DNA linear PRI 10-JUN-2016
DEFINITION Homo sapiens insulin (INS) gene, complete cds.
ACCESSION AH002844 J00265 J00268
VERSION AH002844.2

== UniProt ==

=== Old UniProt questions ===
* ''I'm trying to solve this UniProt question in an old exam set, and I cannot get the number of hits to conform with the answer. What am I doing wrong?''
The answers are not updated every year. You cannot expect the number of hits to stay constant, since the database is growing over time. If your ''search string'' conforms to the answer, it's fine.
* ''But I cannot get the search string to conform with the answer, either?''
This is because of the UniProt 2022 interface change. Unfortunately, they also changed the syntax of the search strings.

=== Transmembrane proteins ===
* ''I'm in doubt about the difference between "<tt>annotation:(type:transmem)</tt>" and "<tt>annotation:(type:location "pass membrane")</tt>". The second one gives many more hits than the first one. Why?''
The difference is that search string #1 refers to a Feature Table (FT line) annotation and search string #2 refers to a comment (CC line) annotation. Thereby, #1 chooses only those proteins that have information about ''where'' in the sequence the transmembrane segments are, while #2 chooses all proteins known to have at least one transmembrane segment.

== Pairwise alignment ==
=== Gaps ===
* ''What are gaps precisely?''
Remember that a pairwise alignment is a hypothesis about two sequences being related through evolution. A gap is then a hypothesis about an insertion or a deletion that has taken place during that evolution.

* ''Why do you say there are only four gaps in the alignment shown here? Below the alignment, it is written that there are seven?''
[[File:gaps-2014-1g.JPG]]

Gaps can have different lengths; a gap can comprise one or several positions. In the example, there are three gaps of length one, and one gap of length four. That gives seven ''positions'' with gaps in total, but still only four gaps.

== Protein structure, PDB & PyMOL ==


=== Fetch in PyMOL ===
* ''What do I do if the <tt>fetch</tt> command does not work in PyMOL?''
It is perfectly possible to use PyMOL without <tt>fetch</tt>:
# Go to the [https://www.rcsb.org/ PDB homepage] and locate the structure you wanted to fetch;
# Click Download files in the top right corner, choose PDB format, and download the PDB file to your own computer;
# Click File → Open in the PyMOL menu and choose the file you just downloaded.

=== Background ===
* ''Why have you, in several answers to exam questions, made the background white?''
White background is usually better if you want to print the result (particularly on an inkjet printer!).

== BLAST ==

=== Choice of database ===

* ''I have problems choosing the right database when BLASTing, can you give some guidance?''
Here are some rules of thumb:
* For both '''blastp''' and '''blastn''', you should use nr (called nr/nt in blastn), if you want to search as widely as possible ("everything").
* In '''blastp''', you can use swissprot, if you specifically want to search for a ''reviewed entry'' from UniProt (UniProtKB/SwissProt).
* In '''blastp''', you can use pdb, if you specifically want to search for a ''structure''.
* When using '''PSI-BLAST''', you should always choose nr for ''constructing'' the PSSM, so that there is as much material as possible to work with. Then, you can choose a more narrow database when ''reusing'' the PSSM in a search.
* In '''blastn''', you can use Human genomic + transcript or Mouse genomic + transcript, if you specifically want to search in one of these two organisms.
* In both '''blastp''' and '''blastn''' you can use the Organism field to specify an organism or a taxonomic group.

=== Error: "Query contains no sequence data" ===
* ''Help! BLAST gives me the error message "Message ID#32 Error: Query contains no data: Query contains no sequence data" even though I pasted in a FASTA sequence!''
Occasionally, the input field in BLAST fails to "understand" newlines and regards your input as one long line (containing nothing but a FASTA header). The workaround is to remove the header and only paste your sequence.

== Logo plots and weight matrices ==

=== Sequence logos ===

* ''When making a sequence logo, should I choose [http://weblogo.berkeley.edu/ WebLogo] or [https://services.healthtech.dtu.dk/service.php?Seq2Logo-2.0 Seq2Logo]?''
For '''amino acid sequences''', you can use both. However, in Seq2Logo you should remember to set the Logo type to Shannon (where Kullback-Leibler is the default). In addition, you should set Clustering method to None and weight on prior to 0 (zero), if you want results that are comparable to those of WebLogo.
For '''nucleotide sequences''', you should use WebLogo.

=== WebLogo ===

* ''WebLogo is giving me the error message "Error: Invalid input format does not conform to FASTA, CLUSTAL, or Flat", but I know my file is a valid FASTA file!?''
Yes, WebLogo sometimes gives this error without reason when you try to upload a file. The workaround is to paste the contents of your file into the window instead of uploading the file.

* ''There is a 100% conserved position in my data. Shouldn't the information content then be 2 bits (nucleotides) or 4.3 bits (amino acids)? Why is it lower in the WebLogo output?''
That's because of the limited size of your data set. By default, WebLogo applies a "[http://weblogo.berkeley.edu/info.html#ssc Small Sample Correction]" that shows only how much information is ''significantly'' above random. The smaller the sample, the lower the significant information. You can deselect this option if you want.
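
As a rough illustration of why a fully conserved column shows less than the maximum, the sketch below computes the information content of a single column with and without a small-sample correction. It assumes the commonly cited approximation e(n) = (s - 1) / (2 ln 2 · n), where s is the alphabet size and n the number of sequences. The function name is hypothetical, clamping at zero is a choice made for this sketch, and WebLogo's own implementation may differ in detail.

```python
import math
from collections import Counter


def column_information(column, alphabet_size, small_sample_correction=True):
    """Information content (in bits) of one alignment column.

    Assumes the approximate correction e(n) = (s - 1) / (2 * ln(2) * n).
    This is an illustration, not WebLogo's actual code.
    """
    n = len(column)
    counts = Counter(column)
    # Shannon entropy of the observed letter frequencies.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    correction = 0.0
    if small_sample_correction:
        correction = (alphabet_size - 1) / (2 * math.log(2) * n)
    # Clamp at zero so that tiny samples never give negative information.
    return max(0.0, math.log2(alphabet_size) - entropy - correction)


conserved_dna = "A" * 10  # ten sequences, all with A at this position
uncorrected = column_information(conserved_dna, 4, small_sample_correction=False)
corrected = column_information(conserved_dna, 4)
print(uncorrected, corrected)
```

With only ten sequences the corrected value stays below the 2-bit maximum; it approaches 2 bits as the number of sequences grows.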

=== Pseudocounts ===

* ''What is actually the purpose of pseudocounts?''
The purpose of pseudocounts is to compensate for the fact that we have a limited amount of data. From the amino acids we have observed at a certain position, we try to estimate what the probabilities of the amino acids would have been if we had access to an infinitely large number of sequences. The smaller our dataset is, the more important the pseudocounts become.

* ''When you are looking up e.g. q(G|S) in the [http://www.cbs.dtu.dk/dtucourse/27611spring2016/BLOSUM62-probabilities.txt table], do you go horizontally first, or vertically?''
The amino acid before "|" determines the column, while the amino acid after "|" determines the row. To look up q(G|S), use column G and row S, yielding 0.07.

* ''How do I do the calculations, if I'm asked to ignore pseudocounts?''
Ignoring pseudocounts means setting the weight on pseudocounts (weight on prior, β) to 0. Then pa = fa, and there is no need to calculate ga.

* ''When do you choose β = 10,000?''
In one of the exercise questions, we set β to 10,000 (an arbitrary but very high value) in order to simplify calculations. When β is very high, α becomes negligible in comparison, and you can set pa = ga. In practical use, you would never use such a high value.
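
The two limiting cases above (β = 0 and a very large β) can be checked numerically. The sketch below assumes the usual weighted-average form pa = (α·fa + β·ga) / (α + β) with ga = Σ_b fb·q(a|b); the function names, the two-letter alphabet, and the q values are invented for illustration and are not taken from the BLOSUM62 table.

```python
def pseudocount_frequencies(f, q):
    """g_a = sum over b of f_b * q(a|b).

    f: observed frequency per letter.
    q: nested dict with q[a][b] = q(a|b) (toy values below).
    """
    return {a: sum(f[b] * q[a][b] for b in f) for a in f}


def effective_probabilities(f, g, alpha, beta):
    """p_a = (alpha * f_a + beta * g_a) / (alpha + beta)."""
    return {a: (alpha * f[a] + beta * g[a]) / (alpha + beta) for a in f}


# Toy two-letter alphabet; all numbers are invented for illustration.
f = {"X": 1.0, "Y": 0.0}
q = {"X": {"X": 0.8, "Y": 0.3},
     "Y": {"X": 0.2, "Y": 0.7}}
g = pseudocount_frequencies(f, q)

p_no_pseudo = effective_probabilities(f, g, alpha=4, beta=0)       # equals f
p_huge_beta = effective_probabilities(f, g, alpha=4, beta=10_000)  # close to g
print(p_no_pseudo, p_huge_beta, g)
```

With β = 0 the result reproduces the observed frequencies, and with β = 10,000 it is practically indistinguishable from the pseudocount frequencies.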

=== Sequence weighting / Clustering ===
* ''Where in the [[Media:Estimationofpseudocounts_new+examples.pdf|equations]] do I find how to do sequence weighting?''
You don't. Sequence weighting is not something you will be asked to do manually.

* ''When would you choose clustering rather than no clustering?''
Briefly, sequence weighting / clustering means that sequences that are very similar are grouped together and weighted down, so that each group counts as one sequence in the calculation of the observed amino acid frequencies (fa). Choose clustering when your data set contains groups of very similar sequences, since it can then make a big difference; see the example with the small training set in the weight matrix exercise.
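
To make the idea concrete, here is a deliberately simplified sketch: equal-length peptides are grouped greedily by fraction of identical positions, and each group shares a total weight of 1. The threshold, the greedy strategy, and all function names are assumptions for illustration; the clustering methods offered by Seq2Logo and EasyPred are not reproduced here.

```python
from collections import Counter


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_weights(seqs, threshold=0.8):
    """Greedy clustering: a sequence joins the first cluster whose
    representative it matches at >= threshold identity; each cluster
    then shares a total weight of 1."""
    representatives = []
    membership = []
    for s in seqs:
        for i, rep in enumerate(representatives):
            if identity(s, rep) >= threshold:
                membership.append(i)
                break
        else:
            # No similar representative found: start a new cluster.
            representatives.append(s)
            membership.append(len(representatives) - 1)
    sizes = Counter(membership)
    return [1 / sizes[m] for m in membership]


def weighted_frequencies(seqs, weights, position):
    """Observed frequencies f_a at one position, using sequence weights."""
    total = sum(weights)
    freqs = Counter()
    for s, w in zip(seqs, weights):
        freqs[s[position]] += w / total
    return dict(freqs)


# Three near-identical peptides and one unrelated peptide.
peptides = ["ACDEFGHIKL", "ACDEFGHIKM", "ACDEFGHIKV", "WYTSRQPNML"]
weights = cluster_weights(peptides)
f_last = weighted_frequencies(peptides, weights, position=9)
print(weights, f_last)
```

Without weighting, L would have a frequency of 2/4 at the last position; with the three similar peptides counted as one, the unrelated peptide contributes half of the total weight and the frequency of L rises to 2/3.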

=== EasyPred ===
* ''Help! I cannot open the para.dat file ("Parameters for prediction method") from EasyPred.''
When looking at "Parameters for prediction method", it is important that you ''right-click'' the link and choose Save link as.... If you just click the link, your browser may think that it is a video file, because some video files have the extension ".dat". If the right-click approach does not work either, try another browser. After you have downloaded the para.dat file, you can open it in jEdit (or another plain text editor).

* ''Help! I pasted in the full sequence of a protein under "evaluation examples", but get an error such as: "Error reading eval data: Peptide number 1: "MEINVS..." are not 20 long"''
The problem is that you did not paste the sequence in FASTA format, i.e. including the header. EasyPred looks for the header (with the ">" character) in order to determine whether the input is in FASTA format (a full sequence that should be scanned) or in peptide format (where each peptide must have the same length as the model).
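
The distinction between the two input formats can be sketched as follows. This is a hypothetical helper that mirrors the behaviour described above, not EasyPred's actual code; in particular, taking the first whitespace-separated token of each line as the peptide is an assumption.

```python
def classify_evaluation_input(text, model_length):
    """Return "fasta" or "peptides" for pasted evaluation data.

    Hypothetical helper for illustration; it is not EasyPred's code.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no input data")
    # A ">" header marks the input as FASTA (a full sequence to be scanned).
    if any(line.startswith(">") for line in lines):
        return "fasta"
    # Otherwise every line is treated as one peptide of fixed length.
    for number, line in enumerate(lines, start=1):
        peptide = line.split()[0]
        if len(peptide) != model_length:
            raise ValueError(
                f"Peptide number {number}: {peptide[:6]}... is not {model_length} long"
            )
    return "peptides"


print(classify_evaluation_input(">my_protein\nMEINVSAAAKLLQ", 9))
print(classify_evaluation_input("ACDEFGHIK\nLMNPQRSTV", 9))
```

A full-length sequence pasted without its header falls into the peptide branch and fails the length check, which corresponds to the error message quoted in the question.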

== PSI-BLAST ==

* ''I'm in doubt about when to use PSI-BLAST instead of normal BLASTP.''
PSI-BLAST should be used if a normal BLASTP doesn't find what you are looking for.

If, for example, you are asked to find a matching structure, but normal BLASTP doesn't give you any PDB hits, it could be a good idea to run PSI-BLAST (running against '''nr''' first so that there are some hits to build a profile from, and then after 2-3 rounds download the PSSM and use it to search in PDB; see details in the PSI-BLAST exercise).

A similar situation could be that you are asked to find a homolog to a query protein in a specific organism or taxonomic group, but a normal BLASTP doesn't give you any hits in that group. Then, you can use the same procedure (again, see the PSI-BLAST exercise for details).

Yet another situation could be that you are asked to find a plausible function for a protein, but a normal BLASTP only gives you hypothetical hits; here, PSI-BLAST could also be a possibility.

== Phylogeny ==

* ''What is the difference between Taxonomy and Phylogeny?''
Taxonomy is a broader concept than Phylogeny; it implies classification of organisms by any method (including Linnaeus' classification). Phylogeny implies evolutionary relationships. Modern taxonomy is, to a very high degree, phylogeny-based. Note that phylogeny is not necessarily molecular, it can be based on morphological characters such as organs or bones (the latter, of course, being tremendously important in classification of fossils).

=== 2017 exam, question 3b ===
* ''The tree I got from the Common Tree builder does not look like the one in the answer?''
You need to tick the "include unranked (phylogenetic) taxa" checkbox on the Common Tree page to see the Terrabacteria group that links the two Gram-positive phyla together.

Link collection

2024-03-15T10:49:49Z

WikiSysop: /* Weight matrices and sequence logos */

== Taxonomy ==

;Tree of Life http://www.tolweb.org/
:(Good descriptive Taxonomy database — limited range of organisms).

;NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/
:(Somewhat "technical" but very exhaustive taxonomical database. TaxIDs are also used in GenBank and UniProt).
: The "Common Tree" function can be used to investigate how closely related two or more organisms are: http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi
: NCBI search with "Token set" can be used if you do not know the Latin name: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

== DNA Databases ==
;GenBank

:Search page: https://www.ncbi.nlm.nih.gov/nucleotide

;SGD (Saccharomyces Genome Database) http://www.yeastgenome.org
:(The Baker's yeast genome)

;Gene https://www.ncbi.nlm.nih.gov/gene/
:Database of genes in completely sequenced genomes and their phenotypes.

== Translation ==
;Virtual Ribosome
:https://services.healthtech.dtu.dk/services/VirtualRibosome-2.0/

;"The Genetic Codes" (NCBI) https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
:Information about translation codes

== Protein databases ==
=== Protein sequence and annotations ===
;UniProt https://www.uniprot.org

=== Protein 3D structure ===
;PDB (Protein Data Bank) http://www.rcsb.org/

=== Protein domains ===
;InterPro https://www.ebi.ac.uk/interpro/

== Alignment ==
=== Pairwise alignment ===
;Pairwise alignment (global and local) http://www.ebi.ac.uk/emboss/
:Use "Needle" for global alignment and "Water" for local alignment.
;Shuffle a sequence in random order (to get a null model):
:Protein: http://www.bioinformatics.org/sms2/shuffle_protein.html
:DNA: http://www.bioinformatics.org/sms2/shuffle_dna.html
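
Shuffling can also be done locally. The following is a minimal sketch using Python's standard library; the function name and the example sequence are made up, and the optional seed is there only to make a run reproducible.

```python
import random


def shuffle_sequence(seq, seed=None):
    """Return the letters of seq in random order (same composition)."""
    rng = random.Random(seed)
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


original = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
shuffled = shuffle_sequence(original, seed=1)
print(shuffled)
```

The shuffled sequence has the same length and composition as the original, so aligning it against the same partner gives an impression of the scores to expect by chance.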


=== Multiple alignment ===
The multiple alignment programs MUSCLE and Clustal Omega are built into Seaview, which should be installed on your computer.
;Other multiple alignment methods on EBI's server:
* T-Coffee https://www.ebi.ac.uk/Tools/msa/tcoffee/
* MAFFT https://www.ebi.ac.uk/Tools/msa/mafft/
* Kalign https://www.ebi.ac.uk/Tools/msa/kalign/

;RevTrans
: Special method for aligning ''coding'' DNA. https://services.healthtech.dtu.dk/services/RevTrans-2.0/

== Phylogenetic trees ==

Seaview can draw simple trees, but if you need more options and annotations, go to:
;interactive Tree Of Life (iTOL)
: https://itol.embl.de/

== BLAST ==
Note: Most sequence databases, including UniProt and RCSB PDB, offer an option for doing BLAST searches. In the course we have used NCBI's BLAST, since NCBI has the largest selection of databases and is the home of GenBank.

;NCBI BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi
* BLASTN: Choose "nucleotide blast" and "blastn" on the next page.
: NB: We do not use "megablast" in this course (it is constructed for finding sequences that are very similar).
* BLASTP: Choose "protein blast" and "blastp" on the next page.
: Note the information about conserved protein domains near the top of the results page. Click the domain to see further information.
For BLASTN and BLASTP, remember to choose a relevant database (use NR/NT to get the grand overview, but use PDB for structures, or specify an organism or taxonomic group under Organism if it makes sense for your task).

;PSI-BLAST
: Go to NCBI BLAST (see above) and choose "Protein blast" — on the next page you can then choose PSI-BLAST.

== Weight matrices and sequence logos ==
;WebLogo http://weblogo.berkeley.edu/
:A good general-purpose logo generator for BOTH DNA and peptide sequences.
:Alternate link to version 3 (lacks some options): http://weblogo.threeplusone.com/
;Seq2Logo
:A more advanced method for working with peptide sequences. https://services.healthtech.dtu.dk/services/Seq2Logo-2.0/
;EasyPred
:Make a logo AND train a weight matrix using clustering and pseudocounts. https://services.healthtech.dtu.dk/services/EasyPred-1.0/

Link collection

2024-03-15T10:49:18Z

WikiSysop: /* Multiple alignment */

== Taxonomy ==

;Tree of Life http://www.tolweb.org/
:(Good descriptive Taxonomy database — limited range of organisms).

;NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/
:(Somewhat "technical" but very exhaustive taxonomical database. TaxIDs are also used in GenBank and UniProt).
: The "Common Tree" function can be used to investigate how closely related two or more organisms are: http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi
: NCBI search with "Token set" can be used if you do not know the Latin name: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

== DNA Databases ==
;GenBank

:Search page: https://www.ncbi.nlm.nih.gov/nucleotide

;SGD (Saccharomyces Genome Database) http://www.yeastgenome.org
:(The Baker's yeast genome)

;Gene https://www.ncbi.nlm.nih.gov/gene/
:Database of genes in completely sequenced genomes and their phenotypes.

== Translation ==
;Virtual Ribosome
:https://services.healthtech.dtu.dk/services/VirtualRibosome-2.0/

;"The Genetic Codes" (NCBI) https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
:Information about translation codes

== Protein databases ==
=== Protein sequence and annotations ===
;UniProt https://www.uniprot.org

=== Protein 3D structure ===
;PDB (Protein Data Bank) http://www.rcsb.org/

=== Protein domains ===
;InterPro https://www.ebi.ac.uk/interpro/

== Alignment ==
=== Pairwise alignment ===
;Pairwise alignment (global and local) http://www.ebi.ac.uk/emboss/
:Use "Needle" for global alignment and "Water" for local alignment.
;Shuffle a sequence in random order (to get a null model):
:Protein: http://www.bioinformatics.org/sms2/shuffle_protein.html
:DNA: http://www.bioinformatics.org/sms2/shuffle_dna.html


=== Multiple alignment ===
The multiple alignment programs MUSCLE and Clustal Omega are built into Seaview, which should be installed on your computer.
;Other multiple alignment methods on EBI's server:
* T-Coffee https://www.ebi.ac.uk/Tools/msa/tcoffee/
* MAFFT https://www.ebi.ac.uk/Tools/msa/mafft/
* Kalign https://www.ebi.ac.uk/Tools/msa/kalign/

;RevTrans
: Special method for aligning ''coding'' DNA. https://services.healthtech.dtu.dk/services/RevTrans-2.0/

== Phylogenetic trees ==

Seaview can draw simple trees, but if you need more options and annotations, go to:
;interactive Tree Of Life (iTOL)
: https://itol.embl.de/

== BLAST ==
Note: Most sequence databases, including UniProt and RCSB PDB, offer an option for doing BLAST searches. In the course we have used NCBI's BLAST, since NCBI has the largest selection of databases and is the home of GenBank.

;NCBI BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi
* BLASTN: Choose "nucleotide blast" and "blastn" on the next page.
: NB: We do not use "megablast" in this course (it is constructed for finding sequences that are very similar).
* BLASTP: Choose "protein blast" and "blastp" on the next page.
: Note the information about conserved protein domains near the top of the results page. Click the domain to see further information.
For BLASTN and BLASTP, remember to choose a relevant database (use NR/NT to get the grand overview, but use PDB for structures, or specify an organism or taxonomic group under Organism if it makes sense for your task).

;PSI-BLAST
: Go to NCBI BLAST (see above) and choose "Protein blast" — on the next page you can then choose PSI-BLAST.

== Weight matrices and sequence logos ==
;WebLogo http://weblogo.berkeley.edu/
:A good general-purpose logo generator for BOTH DNA and peptide sequences.
:Alternate link to version 3 (lacks some options): http://weblogo.threeplusone.com/
;Seq2Logo
:A more advanced method for working with peptide sequences. https://services.healthtech.dtu.dk/service.php?Seq2Logo-2.0
;EasyPred
:Make a logo AND train a weight matrix using clustering and pseudocounts. https://services.healthtech.dtu.dk/service.php?EasyPred-1.0

Link collection

2024-03-15T10:48:57Z

WikiSysop: /* Translation */

== Taxonomy ==

;Tree of Life http://www.tolweb.org/
:(Good descriptive Taxonomy database — limited range of organisms).

;NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/
:(Somewhat "technical" but very exhaustive taxonomical database. TaxIDs are also used in GenBank and UniProt).
: The "Common Tree" function can be used to investigate how closely related two or more organisms are: http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi
: NCBI search with "Token set" can be used if you do not know the Latin name: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

== DNA Databases ==
;GenBank

:Search page: https://www.ncbi.nlm.nih.gov/nucleotide

;SGD (Saccharomyces Genome Database) http://www.yeastgenome.org
:(The Baker's yeast genome)

;Gene https://www.ncbi.nlm.nih.gov/gene/
:Database of genes in completely sequenced genomes and their phenotypes.

== Translation ==
;Virtual Ribosome
:https://services.healthtech.dtu.dk/services/VirtualRibosome-2.0/

;"The Genetic Codes" (NCBI) https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
:Information about translation codes

== Protein databases ==
=== Protein sequence and annotations ===
;UniProt https://www.uniprot.org

=== Protein 3D structure ===
;PDB (Protein Data Bank) http://www.rcsb.org/

=== Protein domains ===
;InterPro https://www.ebi.ac.uk/interpro/

== Alignment ==
=== Pairwise alignment ===
;Pairwise alignment (global and local) http://www.ebi.ac.uk/emboss/
:Use "Needle" for global alignment and "Water" for local alignment.
;Shuffle a sequence in random order (to get a null model):
:Protein: http://www.bioinformatics.org/sms2/shuffle_protein.html
:DNA: http://www.bioinformatics.org/sms2/shuffle_dna.html


=== Multiple alignment ===
The multiple alignment programs MUSCLE and Clustal Omega are built into Seaview, which should be installed on your computer.
;Other multiple alignment methods on EBI's server:
* T-Coffee https://www.ebi.ac.uk/Tools/msa/tcoffee/
* MAFFT https://www.ebi.ac.uk/Tools/msa/mafft/
* Kalign https://www.ebi.ac.uk/Tools/msa/kalign/

;RevTrans
: Special method for aligning ''coding'' DNA. https://services.healthtech.dtu.dk/service.php?RevTrans-2.0

== Phylogenetic trees ==

Seaview can draw simple trees, but if you need more options and annotations, go to:
;interactive Tree Of Life (iTOL)
: https://itol.embl.de/

== BLAST ==
Note: Most sequence databases, including UniProt and RCSB PDB, offer an option for doing BLAST searches. In the course we have used NCBI's BLAST, since NCBI has the largest selection of databases and is the home of GenBank.

;NCBI BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi
* BLASTN: Choose "nucleotide blast" and "blastn" on the next page.
: NB: We do not use "megablast" in this course (it is constructed for finding sequences that are very similar).
* BLASTP: Choose "protein blast" and "blastp" on the next page.
: Note the information about conserved protein domains near the top of the results page. Click the domain to see further information.
For BLASTN and BLASTP, remember to choose a relevant database (use NR/NT to get the grand overview, but use PDB for structures, or specify an organism or taxonomic group under Organism if it makes sense for your task).

;PSI-BLAST
: Go to NCBI BLAST (see above) and choose "Protein blast" — on the next page you can then choose PSI-BLAST.

== Weight matrices and sequence logos ==
;WebLogo http://weblogo.berkeley.edu/
:A good general-purpose logo generator for BOTH DNA and peptide sequences.
:Alternate link to version 3 (lacks some options): http://weblogo.threeplusone.com/
;Seq2Logo
:A more advanced method for working with peptide sequences. https://services.healthtech.dtu.dk/service.php?Seq2Logo-2.0
;EasyPred
:Make a logo AND train a weight matrix using clustering and pseudocounts. https://services.healthtech.dtu.dk/service.php?EasyPred-1.0

Link collection

2024-03-15T10:48:04Z

WikiSysop: Created page with "== Taxonomy == ;Tree of Life http://www.tolweb.org/ :(Good descriptive Taxonomy database — limited range of organisms). ;NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/ :(Somewhat "technical" but very exhaustive taxonomical database. TaxIDs are also used in GenBank and UniProt). : The "Common Tree" function can be used to investigate how closely related two or more organisms are: http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi : NCBI search with "Toke..."

== Taxonomy ==

;Tree of Life http://www.tolweb.org/
:(Good descriptive Taxonomy database — limited range of organisms).

;NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/
:(Somewhat "technical" but very exhaustive taxonomical database. TaxIDs are also used in GenBank and UniProt).
: The "Common Tree" function can be used to investigate how closely related two or more organisms are: http://www.ncbi.nlm.nih.gov/Taxonomy/CommonTree/wwwcmt.cgi
: NCBI search with "Token set" can be used if you do not know the Latin name: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi

== DNA Databases ==
;GenBank

:Search page: https://www.ncbi.nlm.nih.gov/nucleotide

;SGD (Saccharomyces Genome Database) http://www.yeastgenome.org
:(The Baker's yeast genome)

;Gene https://www.ncbi.nlm.nih.gov/gene/
:Database of genes in completely sequenced genomes and their phenotypes.

== Translation ==
;Virtual Ribosome
:https://services.healthtech.dtu.dk/service.php?VirtualRibosome-2.0

;"The Genetic Codes" (NCBI) https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
:Information about translation codes

== Protein databases ==
=== Protein sequence and annotations ===
;UniProt https://www.uniprot.org

=== Protein 3D structure ===
;PDB (Protein Data Bank) http://www.rcsb.org/

=== Protein domains ===
;InterPro https://www.ebi.ac.uk/interpro/

== Alignment ==
=== Pairwise alignment ===
;Pairwise alignment (global and local) http://www.ebi.ac.uk/emboss/
:Use "Needle" for global alignment and "Water" for local alignment.
;Shuffle a sequence in random order (to get a null model):
:Protein: http://www.bioinformatics.org/sms2/shuffle_protein.html
:DNA: http://www.bioinformatics.org/sms2/shuffle_dna.html


=== Multiple alignment ===
The multiple alignment programs MUSCLE and Clustal Omega are built into Seaview, which should be installed on your computer.
;Other multiple alignment methods on EBI's server:
* T-Coffee https://www.ebi.ac.uk/Tools/msa/tcoffee/
* MAFFT https://www.ebi.ac.uk/Tools/msa/mafft/
* Kalign https://www.ebi.ac.uk/Tools/msa/kalign/

;RevTrans
: Special method for aligning ''coding'' DNA. https://services.healthtech.dtu.dk/service.php?RevTrans-2.0

== Phylogenetic trees ==

Seaview can draw simple trees, but if you need more options and annotations, go to:
;interactive Tree Of Life (iTOL)
: https://itol.embl.de/

== BLAST ==
Note: Most sequence databases, including UniProt and RCSB PDB, offer an option for doing BLAST searches. In the course we have used NCBI's BLAST, since NCBI has the largest selection of databases and is the home of GenBank.

;NCBI BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi
* BLASTN: Choose "nucleotide blast" and "blastn" on the next page.
: NB: We do not use "megablast" in this course (it is constructed for finding sequences that are very similar).
* BLASTP: Choose "protein blast" and "blastp" on the next page.
: Note the information about conserved protein domains near the top of the results page. Click the domain to see further information.
For BLASTN and BLASTP, remember to choose a relevant database (use NR/NT to get the grand overview, but use PDB for structures, or specify an organism or taxonomic group under Organism if it makes sense for your task).

;PSI-BLAST
: Go to NCBI BLAST (see above) and choose "Protein blast" — on the next page you can then choose PSI-BLAST.

== Weight matrices and sequence logos ==
;WebLogo http://weblogo.berkeley.edu/
:A good general-purpose logo generator for BOTH DNA and peptide sequences.
:Alternate link to version 3 (lacks some options): http://weblogo.threeplusone.com/
;Seq2Logo
:A more advanced method for working with peptide sequences. https://services.healthtech.dtu.dk/service.php?Seq2Logo-2.0
;EasyPred
:Make a logo AND train a weight matrix using clustering and pseudocounts. https://services.healthtech.dtu.dk/service.php?EasyPred-1.0

Checklist for computers

2024-03-15T10:46:53Z

WikiSysop: /* Software */

At the exam in Introduction to Bioinformatics, you are going to use the same resources (web-servers and programs) that you used for the exercises — if your computer has worked fine for all the exercises, it will also work fine for the exam.

== Hardware ==
; A laptop.
: It makes no difference whether you use Windows, Mac, or Linux, as long as you have the listed software. However, an iPad, an Android tablet, or a Chromebook will NOT be enough for the exam.
; A mouse.
: A mouse is important in order to use PyMOL (see below) optimally. The mouse should have two buttons plus a scroll-wheel in the middle.

== Internet connection ==


Your computer must be able to connect to DTU wi-fi. '''Use the standard DTU net which will be open during your exam; NOT the special exam net.'''

== Software ==
; A modern Internet browser ([http://www.mozilla.com/ Firefox], Edge, Safari (Mac only), [http://www.google.com/chrome Google Chrome], [http://www.opera.com/ Opera]).
: NB: You ''must'' have Firefox, Chrome, or Opera on your machine, so that you are able to switch browsers in case Edge (Windows) or Safari (Mac) has trouble with a certain website.


; Geany (or another GOOD plain text editor)
: Download and install from http://geany.org/, see the [[Plain text files and Geany]] exercise.


; PyMOL
: Download and install from https://pymol.org/2/, see the [https://teaching.healthtech.dtu.dk/material/22111/PyMol_tutorial2017_v4.pdf PyMOL tutorial] and the exercises in [[Protein Structure and Visualization|PDB & PyMOL]] and [[Exercise:Malaria Vaccine|Malaria vaccine]].
: A license file is found on Learn → Content → Week 06.

; Seaview
: Download and install from http://doua.prabi.fr/software/seaview, see [[Exercise: Multiple Alignments (Seaview version)]].


; Text processing software
: For writing your answers. You can use, for example:
:* Microsoft Word
:* [http://www.openoffice.org/ OpenOffice] / [http://www.neooffice.org/ NeoOffice] / [http://www.libreoffice.org/ LibreOffice]
:* Pages (for Mac).
:* [https://docs.google.com/ Google Docs]

; Tool for making PDF files
: included in Windows 10/11 and Mac.

; Tool for taking screenshots
: included in Microsoft Word, Windows 10/11, and Mac.
: For Windows users we recommend the free program [http://getgreenshot.org/ Greenshot] which can not only take screenshots and copy them to the clipboard, but also make simple edits and annotations in the screenshots.