FAQ: Difference between revisions
Revision as of 11:52, 15 March 2024
Practical information
Exam
- How do I find out where and when the exam is held?
At http://www.eksamensplan.dtu.dk/ .
- Which online platform will you use for the exam?
This year we will be using Digital Exam (the new interface) which is accessed via https://eksamen.dtu.dk/ .
We will not be using the old interface via http://onlineeksamen.dtu.dk/ .
Re-exam
- When will there be a re-exam?
For those of you who do not pass, do not hand in, or have withdrawn from the exam, there will be an oral re-exam during May. The exact date and time are negotiable. Please note that you have to sign up for the re-exam in the study admin system.
- How will the re-exam take place?
You draw a random written question which contains a minor practical task (an alignment, a BLAST search, a phylogeny or similar). You then have 30 minutes of preparation time to solve the given task using your own computer. You will have access to the net. Leave all relevant browser windows/tabs open, so that you can show afterwards what you have done. The examination will then last approximately 20 minutes and begin with your own presentation of what you have done to solve the task. Depending on how long your presentation takes, we will also ask questions on other parts of the course curriculum. The grade will be given immediately after the exam.
Bioinformatics in general
Protein to DNA
- How can I convert my protein sequence to DNA in FASTA format?
Generally, you cannot "convert" a protein sequence to a DNA sequence, because some information is simply missing: the same protein sequence can originate from many different DNA sequences due to the redundancy in the genetic code. But if you have located a protein in UniProt, you can usually find one or more cross-references to the nucleotide sequence databases.
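To make the redundancy concrete, here is a minimal sketch that counts how many different coding DNA sequences could give rise to a given protein under the standard genetic code. The function name and the example peptide are made up for illustration, and stop codons are ignored.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (61 sense codons in total; stop codons are not counted).
CODONS_PER_AA = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4, "H": 2, "I": 3,
    "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6, "T": 4, "W": 1, "Y": 2, "V": 4,
}

def count_encoding_dna_sequences(protein: str) -> int:
    """Return the number of distinct coding DNA sequences that translate to `protein`."""
    return prod(CODONS_PER_AA[aa] for aa in protein.upper())

# Hypothetical four-residue peptide: 1 * 4 * 6 * 6 possibilities.
print(count_encoding_dna_sequences("MGLS"))  # 144
```

Even for this tiny peptide there are 144 candidate DNA sequences, which is why a database cross-reference is the only reliable way to get the actual nucleotide sequence.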
GenBank
LOCUS / Accession / Version
- I'm in doubt about the difference between Locus, Accession and Version in GenBank.
Each entry in GenBank has one and only one Locus code, which identifies the entry. It also has one or more Accession codes, of which one is usually identical to the Locus code. Multiple accession codes suggest that the entry is a fusion of several entries from an earlier version of the database. Finally, the Version is the primary Accession code (here identical to the Locus code) followed by a dot and a number that refers to the version of the sequence in the entry. If the number is higher than 1, the sequence has been updated since the entry was created. See the example below.
LOCUS       AH002844    4969 bp    DNA    linear    PRI    10-JUN-2016
DEFINITION  Homo sapiens insulin (INS) gene, complete cds.
ACCESSION   AH002844 J00265 J00268
VERSION     AH002844.2
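As an illustration of how the three fields relate, the following sketch pulls them out of the header shown above. The function `parse_genbank_header` is a hypothetical helper written for this FAQ, not part of any library; for real work a dedicated GenBank parser would be the safer choice.

```python
def parse_genbank_header(record: str) -> dict:
    """Extract the LOCUS name, ACCESSION codes and VERSION from a GenBank header."""
    info = {}
    for line in record.splitlines():
        fields = line.split()
        if not fields:
            continue
        keyword, values = fields[0], fields[1:]
        if keyword == "LOCUS":
            info["locus"] = values[0]
        elif keyword == "ACCESSION":
            info["accessions"] = values
        elif keyword == "VERSION":
            info["version"] = values[0]
            # The number after the dot is the version of the sequence.
            info["sequence_version"] = int(values[0].rsplit(".", 1)[1])
    return info

record = """\
LOCUS       AH002844    4969 bp    DNA    linear    PRI    10-JUN-2016
DEFINITION  Homo sapiens insulin (INS) gene, complete cds.
ACCESSION   AH002844 J00265 J00268
VERSION     AH002844.2
"""
header = parse_genbank_header(record)
print(header["locus"], header["accessions"], header["sequence_version"])
```

Here the entry has three accession codes (suggesting a merge of earlier entries) and a sequence version of 2 (the sequence has been updated once).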
UniProt
Old UniProt questions
- I'm trying to solve this UniProt question in an old exam set, and I cannot get the number of hits to conform with the answer. What am I doing wrong?
The answers are not updated every year. You cannot expect the number of hits to stay constant, since the database is growing over time. If your search string conforms to the answer, it's fine.
- But I cannot get the search string to conform with the answer, either?
This is because of the UniProt 2022 interface change. Unfortunately, they also changed the syntax of the search strings.
Transmembrane proteins
- I'm in doubt about the difference between "annotation:(type:transmem)" and "annotation:(type:location "pass membrane")". The second one gives many more hits than the first one. Why?
The difference is that search string #1 refers to a Feature Table (FT line) annotation and search string #2 refers to a comment (CC line) annotation. Thereby, #1 chooses only those proteins that have information about where in the sequence the transmembrane segments are, while #2 chooses all proteins known to have at least one transmembrane segment.
Pairwise alignment
Gaps
- What are gaps precisely?
Remember that a pairwise alignment is a hypothesis about two sequences being related through evolution. A gap is then a hypothesis about an insertion or a deletion that has taken place during that evolution.
- Why do you say there are only four gaps in the alignment shown here? Below the alignment, it is written that there are seven?
Gaps can have different lengths; a gap can comprise one or several positions. In the example, there are three gaps of length one, and one gap of length four. That gives seven positions with gaps in total, but still only four gaps.
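The distinction between gaps and gap positions can be sketched in a few lines. The aligned sequence below is invented so that it reproduces the numbers in the answer; in a real pairwise alignment you would apply the function to both rows and add the results.

```python
import re

def gap_statistics(aligned_seq: str) -> tuple[int, int]:
    """Return (number of gaps, number of gap positions) for one aligned sequence."""
    gap_runs = re.findall(r"-+", aligned_seq)  # each run of '-' is one gap
    return len(gap_runs), sum(len(run) for run in gap_runs)

# Hypothetical aligned sequence: three gaps of length one and one gap of length four.
aligned = "MK-LV-AGT----PLS-WQ"
n_gaps, n_positions = gap_statistics(aligned)
print(n_gaps, n_positions)  # 4 7
```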
Protein structure, PDB & PyMOL
Fetch in PyMOL
- What do I do if the fetch command does not work in PyMOL?
It is perfectly possible to use PyMOL without fetch:
- Go to the PDB homepage and locate the structure you wanted to fetch;
- Click Download files in the top right corner, choose PDB format, and download the PDB file to your own computer;
- Click File → Open in the PyMOL menu and choose the file you just downloaded.
Background
- Why have you, in several answers to exam questions, made the background white?
A white background is usually better if you want to print the result (particularly on an inkjet printer!).
BLAST
Choice of database
- I have problems choosing the right database when BLASTing, can you give some guidance?
Here are some rules of thumb:
- For both blastp and blastn, you should use nr (called nr/nt in blastn), if you want to search as widely as possible ("everything").
- In blastp, you can use swissprot, if you specifically want to search for a reviewed entry from UniProt (UniProtKB/SwissProt).
- In blastp, you can use pdb, if you specifically want to search for a structure.
- When using PSI-BLAST, you should always choose nr for constructing the PSSM, so that there is as much material as possible to work with. Then, you can choose a narrower database when reusing the PSSM in a search.
- In blastn, you can use Human genomic + transcript or Mouse genomic + transcript, if you specifically want to search in one of these two organisms.
- In both blastp and blastn, you can use the Organism field to specify an organism or a taxonomic group.
Error: "Query contains no sequence data"
- Help! BLAST gives me the error message "Message ID#32 Error: Query contains no data: Query contains no sequence data" even though I pasted in a FASTA sequence!
Occasionally, the input field in BLAST fails to "understand" newlines and regards your input as one long line (containing nothing but a FASTA header). The workaround is to remove the header and only paste your sequence.
Logo plots and weight matrices
Sequence logos
For amino acid sequences, you can use either WebLogo or Seq2Logo. However, in Seq2Logo you should remember to set the Logo type to Shannon (the default is Kullback-Leibler). In addition, set Clustering method to None and weight on prior to 0 (zero) if you want results that are comparable to those of WebLogo. For nucleotide sequences, you should use WebLogo.
WebLogo
- WebLogo is giving me the error message "Error: Invalid input format does not conform to FASTA, CLUSTAL, or Flat", but I know my file is a valid FASTA file!?
Yes, WebLogo sometimes gives this error without reason when you try to upload a file. The workaround is to paste the contents of your file into the window instead of uploading the file.
- There is a 100% conserved position in my data. Shouldn't the information content then be 2 bits (nucleotides) or 4.3 bits (amino acids)? Why is it lower in the WebLogo output?
That's because of the limited size of your data set. WebLogo by default applies a "Small Sample Correction" that shows only how much information is significantly above random. The smaller the sample, the lower the significant information. You can deselect this, if you want.
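The effect can be sketched numerically. The code below computes the Shannon information content of one alignment column and optionally subtracts the approximate small-sample term e(n) = (s - 1) / (2 ln 2 · n), where s is the alphabet size and n the number of sequences. That this approximation matches what WebLogo does in every version is an assumption, and the function name is made up for illustration.

```python
from collections import Counter
from math import log, log2

def information_content(column: str, alphabet_size: int,
                        small_sample_correction: bool = True) -> float:
    """Information content (in bits) of one alignment column."""
    n = len(column)
    counts = Counter(column)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    info = log2(alphabet_size) - entropy
    if small_sample_correction:
        # Assumed approximate correction: e(n) = (s - 1) / (2 * ln(2) * n)
        info -= (alphabet_size - 1) / (2 * log(2) * n)
    return info

column = "A" * 10  # a 100% conserved nucleotide position in 10 sequences
print(round(information_content(column, 4, small_sample_correction=False), 2))  # 2.0
print(round(information_content(column, 4), 2))  # 1.78
```

With only 10 sequences, the corrected value falls visibly below 2 bits; with more sequences the correction term shrinks towards zero.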
Pseudocounts
- What is actually the purpose of pseudocounts?
The purpose of pseudocounts is to compensate for the fact that we have a limited amount of data. From the amino acids we have observed at a certain position, we try to estimate what the probabilities of the amino acids would have been if we had access to an infinitely large amount of sequences. The smaller our dataset is, the larger the importance of pseudocounts will be.
- When you are looking up e.g. q(G|S) in the table, do you go horizontally first, or vertically?
The amino acid before "|" determines the column, while the amino acid after "|" determines the row. To look up q(G|S), use column G and row S, yielding 0.07.
- How do I do the calculations, if I'm asked to ignore pseudocounts?
Ignoring pseudocounts means setting the weight on pseudocounts (weight on prior, β) equal to 0. Then, pa = fa, and there is no reason to calculate ga.
- When do you choose β = 10,000?
In one of the exercise questions, we set β to 10,000 (an arbitrary but very high value) in order to simplify calculations. When β is very high, α becomes effectively zero, and you can set pa = ga. In practical use, you would never use such a high value.
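The two extreme cases (β = 0 and a very high β) can be checked with a small sketch. It assumes the usual form of the equations, g_a = Σ_b f_b · q(a|b) and p_a = (α · f_a + β · g_a) / (α + β), with α taken as the number of sequences (or clusters) minus one. The three-letter alphabet and the q table are invented for illustration and are not the course's table.

```python
def pseudo_frequencies(f: dict, q: dict) -> dict:
    """g_a = sum over b of f_b * q(a|b)."""
    return {a: sum(f[b] * q[b][a] for b in f) for a in f}

def effective_frequencies(f: dict, q: dict, alpha: float, beta: float) -> dict:
    """p_a = (alpha * f_a + beta * g_a) / (alpha + beta)."""
    g = pseudo_frequencies(f, q)
    return {a: (alpha * f[a] + beta * g[a]) / (alpha + beta) for a in f}

# Hypothetical table: q[b][a] holds q(a|b), i.e. row b, column a; each row sums to 1.
q = {
    "A": {"A": 0.8, "B": 0.1, "C": 0.1},
    "B": {"A": 0.2, "B": 0.6, "C": 0.2},
    "C": {"A": 0.1, "B": 0.3, "C": 0.6},
}
f = {"A": 0.75, "B": 0.25, "C": 0.0}  # observed frequencies at one position
alpha = 4 - 1                          # assuming 4 sequences (or clusters)

print(effective_frequencies(f, q, alpha, beta=0))       # same as f
print(effective_frequencies(f, q, alpha, beta=10_000))  # very close to g
print(pseudo_frequencies(f, q))                         # g itself, for comparison
```

Note that letter C, which was never observed (f = 0), still gets a non-zero probability as soon as β is greater than zero; that is exactly the compensation for limited data that pseudocounts are meant to provide.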
Sequence weighting / Clustering
- Where in the equations do I find how to do sequence weighting?
You don't. Sequence weighting is not something you will be asked to do manually.
- When would you choose clustering rather than no clustering?
Briefly, sequence weighting / clustering means that sequences that are very similar are grouped together and weighted down so that they count as one sequence in the calculation of the observed amino acid frequencies (fa). This can make a big difference if a group of sequences is very similar; see the example with the small training set in the weight matrix exercise.
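The idea can be sketched as follows. This is a simple greedy clustering by sequence identity, used here as a stand-in; it is not necessarily the method implemented in the course tools, and the peptides, the 0.8 threshold and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster_weights(seqs: list[str], threshold: float = 0.8) -> list[float]:
    """Greedy clustering: a sequence joins the first cluster whose representative
    it matches at >= threshold identity; each member gets weight 1 / cluster size."""
    representatives: list[str] = []
    members: list[list[int]] = []
    for i, seq in enumerate(seqs):
        for c, rep in enumerate(representatives):
            if identity(seq, rep) >= threshold:
                members[c].append(i)
                break
        else:  # no existing cluster was similar enough: start a new one
            representatives.append(seq)
            members.append([i])
    weights = [0.0] * len(seqs)
    for cluster in members:
        for i in cluster:
            weights[i] = 1.0 / len(cluster)
    return weights

def weighted_frequencies(seqs: list[str], weights: list[float], position: int) -> dict:
    """Observed frequencies f_a at one column, using the sequence weights."""
    totals = defaultdict(float)
    for seq, w in zip(seqs, weights):
        totals[seq[position]] += w
    total_weight = sum(weights)
    return {aa: t / total_weight for aa, t in totals.items()}

# Hypothetical training set: three near-identical peptides and one different one.
peptides = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "GMNERPILT"]
weights = cluster_weights(peptides)
print(weights)
print(weighted_frequencies(peptides, weights, position=0))
```

Without clustering, A would have a frequency of 0.75 and G of 0.25 at the first position. With clustering, the three similar peptides count as one sequence, so A and G each end up at about 0.5.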
EasyPred
- Help! I cannot open the para.dat file ("Parameters for prediction method") from EasyPred.
When looking at "Parameters for prediction method", it is important that you right-click the link and choose Save link as.... If you just click the link, your browser may think that it is a video file, because some video files have the extension ".dat". If the right-click approach does not work either, then try another browser. After you have downloaded the para.dat file, you can open it in jEdit (or another plain text editor).
- Help! I pasted in the full sequence of a protein under "evaluation examples", but get an error such as: "Error reading eval data: Peptide number 1: "MEINVS..." are not 20 long"
The problem is that you did not paste the sequence in FASTA format, i.e. including the header. EasyPred looks for the header (with the ">" character) in order to determine whether the input is in FASTA format (a full sequence that should be scanned) or in peptide format (where each peptide must have the same length as the model).
PSI-BLAST
- I'm in doubt about when to use PSI-BLAST instead of normal BLASTP.
PSI-BLAST should be used if a normal BLASTP doesn't find what you are looking for.
If, for example, you are asked to find a matching structure, but normal BLASTP doesn't give you any PDB hits, it could be a good idea to run PSI-BLAST (running against nr first so that there are some hits to build a profile from, and then after 2-3 rounds download the PSSM and use it to search in PDB, see details in the PSI-BLAST exercise).
A similar situation could be that you are asked to find a homolog to a query protein in a specific organism or taxonomic group, but a normal BLASTP doesn't give you any hits in that group. Then, you can use the same procedure (again, see the PSI-BLAST exercise for details).
Yet another situation could be that you are asked to find a plausible function for a protein, but a normal BLASTP only gives you hypothetical hits; here, PSI-BLAST could also be a possibility.
Phylogeny
- What is the difference between Taxonomy and Phylogeny?
Taxonomy is a broader concept than Phylogeny; it implies classification of organisms by any method (including Linnaeus' classification). Phylogeny implies evolutionary relationships. Modern taxonomy is, to a very high degree, phylogeny-based. Note that phylogeny is not necessarily molecular, it can be based on morphological characters such as organs or bones (the latter, of course, being tremendously important in classification of fossils).
2017 exam, question 3b
- The tree I got from the Common Tree builder does not look like the one in the answer?
You need to mark the include unranked (phylogenetic) taxa checkbox in the Common tree page to see the Terrabacteria group that links the two Gram-positive phyla together.