ExYeastCellCycleTranscriptomics R
Yeast cell cycle / transcriptomics exercise #1
Exercise written by: Rasmus Wernersson and Lars Rønn Olsen
Learning objectives:
- Introduction to practical work on array CASE/CONTROL studies.
- Introduction to ArrayExpress as a way to download DNA microarray experimental data.
- Introduction to basic expression data transformation:
- Calculation of Log2 expression values
- Basic evaluation of the comparability of two arrays.
- Calculation of FOLD CHANGE and Log(Fold Change)
- Gene Ontology overrepresentation of RANKED DATA.
Data set: Alpha Factor arrest
As mentioned in greater detail in the previous lecture, alpha-factor arrest works by inducing yeast's natural mating behavior: The HAPLOID cells in the vegetative state prepare to undergo cell- and nucleus fusion with HAPLOID cells of the opposite mating type (A vs. ALPHA). This is followed by MEIOSIS and spore formation. Therefore, it is important that the cells fusing are not in different stages of the MITOTIC cell cycle, as this would have disastrous effect (spend a moment thinking about why).
Yeast evolution solved this problem by triggering an ARREST in the cell cycle at the G1/S boundary when the mating hormone of the opposite mating type is detected (a-factor, or alpha-factor). Experimentally this can be used to halt virtually all of the cells in a growing culture at the same stage in cell cycle.
Downloading data from Gene Expression Omnibus (GEO)
NOTICE: we will cut some corners and provide partly "pre-cooked" data in this part of the exercise, as performing the full array analysis is out of scope for this course. However, we still need to get at least a basic understanding of what the data looks like, where it comes from, and SOME of the steps we need to go through to make it useful for our analysis.
GEO: https://www.ncbi.nlm.nih.gov/geo/
- Find experiment GSE11412 at GEO
- Answering the questions below will require a bit of exploring / clicking around on links on your own.
REPORT QUESTION #1:
- How much RNA ("total RNA") was used in the experiment?
- Which type/brand of array was used?
- How many samples are in the study?
- IMPORTANT: note down which sample_ids is used for CONTROL (asynchronous; "mock treated") and which sample_ids is use for CASE (arrested cells) - we'll need this information later.
Load data
There are many ways you can load the data into R. You can download raw data (that needs to be analyzed from scratch) or processed data (that have been quality checked, normalized and summarized to probe sets) can be downloaded and imported into R. You can also use packages such as "GEOquery" or "geneExpressionFromGEO" to extract data directly from GEO. We have cheated a bit and prepared a data frame for you to work with:
load("/home/projects/22140/exercise6.Rdata")
Check point: there are 5 columns of data in your data frame: Systematic gene name, Popular gene name, ProbeSet, Case (arrayname), Control (arrayname)
Getting the data ready for analysis
Before we can use the gene expression data for our analysis, we need to go through a few steps to prepare it to use.
Verify the data
First we'll perform a bit of sanity-checking on the data. Never blindly trust that the data is OK. As a rule of thumb it is assumed that only a minority of genes are differentially regulated between the case and control samples. If the processed data we download has been normalized correctly, we should expect a scatter plot of the case vs. control to be on a diagonal.
TASK/REPORT QUESTION #2: create a scatter plot for case vs. control
- Paste the plot into your report.
- Do the two arrays indeed appear to be comparable, or did we goof up the normalization and/or experiments?
- On what order of magnitude is the largest expression value?
As you could (hopefully) see from the plot, the majority of the expression values are in the low end of the scale, and this makes it a bit difficult to see finer details in this part of the plot. Since values of expression data often span several orders of magnitude, a common "trick" is to log-transform the values to bring them on a more comparable scale. In bioinformatics the tradition is to log2 transform all data - we'll get back to why in a bit.
TASK/REPORT QUESTION #3: create a scatter plot for log2(case) vs. log2(control)
- Include the plot in your report.
- Do the data appear to be comparable in the low end of the scale now?
Fold Change
Now we get into the real meat of a case-control study: the ability to quantify the difference between the two conditions.
In this case we only have a single measurement of gene expression for each condition, so more advanced statistical methods are not applicable. However, we can still learn a lot about the trends of the total data set by calculating the change in expression.
In essence this is done calculating the ratio between the two values (e.g. case/control) for each gene. However, there are a few things we need to take into consideration first:
REPORT QUESTION #4:
- In what interval will the ratio be if case is smaller than control (aka the largest and the smallest value the fold-change can theoretically be (not what is observed in your data))?
- In what interval will the ratio be if case is larger than control?
- Do you foresee any problems comparing the ratios?
Once again log-transformation will come to our rescue - but let's investigate why.
REPORT QUESTION #5: - if we log2 transform the ratios what will be the result interval in the following cases:
- CASE < CONTROL
- CASE > CONTROL
- Are the values comparable now?
- What will happen if you accidentally swap the case and control numbers? (Calculate an example e.g. case = 7.1 , control = 2.3) - can you see any (dis)advantages with this system?
TASK/REPORT QUESTION #6: Calculate ratio and log2(ratio)
- Calculate both ratio (case/control) and log2(ratio) for all the expression data. Name the columns fc and log2fc (see the box below for why).
- If you use Excel this can be done by only TWO operations per column (ask for help if needed).
Check point: you should have 7 columns of data in your table by now: Systematic gene name, Popular gene name, ProbeSet, Case (arrayname), Control (arrayname), fc, log2fc
Gene expression lingo: fold change
The ratio we have just calculated between the two conditions is, within the world of transcriptomics, referred to as the fold change. The rationale is that the fold change (FC) immediately tells you something about the magnitude of regulation (e.g. 10-fold up-regulation). As we have seen it very useful to work with log2-transformed values, and one of the reasons for using log2 compared to log10, is that log2 is perfectly suited to count doublings. A log2(FC) value of "2" is the same as two times doubled = 4; A log2(FC) value of 3 -> 8 etc.
Gene Ontology Gene Set Enrichment Analysis
For this exercise, you will once again need the background of all the yeast genes measured with this particular microarray platform. You can load this from the week 5 exercise data.
Control array
TASK/REPORT QUESTION #7: top 100 CONTROL genes
- Make a list of the top 100 most expressed genes in the control study.
- Test for GO over-representation compared to the background list (as we did in the week 5 exercise) for "Biological Process"
- Paste the top 10 most significantly overrepresented biological processes into your report, and comment on the trends you see - is this expected/surprising?
The "fgsea" package we use for our GO over-representation analysis (ORA) can also do a functional class scoring (FCS), instead of the subset vs. background analysis we have been using so far.
For the functional class scoring analysis, you supply a single list of ranked genes (in most cases the entire genome), which has been sorted by some experimental criteria, e.g. gene expression values under a certain condition.
Here, it's important to realize that the list has two purposes:
- To provide a ranking of the genes.
- To provide the background distribution (it will also serve as a list of "all genes")
The over-representation analysis will then provide insight about the types of genes (and thus GO categories) that are "pushed" towards the ends of the list.
TASK/REPORT QUESTION #8: GO Functional Class Scoring
- Sort the gene list according to the expression in the control array.
- Perform a functional class scoring of GO terms (for Biological Process) using only the ranked list using the "fgsea" function from the "fgsea" package. For this purpose, you need to make a named list with the ranking metric (in this case, the expression values) and the gene names. See "?names".
- Include the top 10 most significantly over represented biological processes in your report (Use NES scores for filtering upregulated).
- Do the results make biological sense?
As a check on whether the gene set enrichment algorithm does a decent job with the FSC analysis, we'll investigate what happens if we feed it a randomly shuffled list of genes.
TASK/REPORT QUESTION #9: random "ranked" list
- Make a new named vector of the expression values, but this time randomize the names.
- What result do you expect to get from analyzing a randomly ordered gene list?
- Perform a functional class scoring of GO terms (Biological Process) - do the results match what you expected?
CASE array
TASK: rank based analysis of the case array
- Rank the gene list according to expression from the case array, and perform a functional class scoring of GO terms (Biological Process).
REPORT QUESTION #10:
- Can you explain the results? (Would you have expected otherwise since this is from the cells arrested in the G1 phase?).
Differential Expression - Fold change
TASK: GO over-rep analysis of Fold Change
- Rank the gene list on Log2(Fold Change), and perform a functional class scoring of GO terms.
REPORT QUESTION #11:
- Do the results make biological sense?
- Think about and explain why these results may differ from performing the same analysis on the CASE array alone.