ExYeastCellCycle answers
Question 1
Q: How much RNA ("total RNA") was used in the experiment? A: 50 ug of total RNA (found on the Protocols page).
Q: Which type/brand of array was used? A: Affymetrix GeneChip Yeast Genome 2.0 Array (found on the main page).
Q: How many individual arrays were used in the study? A: 2 arrays (found on both the main page and the samples page).
Q: IMPORTANT: note down which arrays were used for CONTROL (asynchronous; "mock treated") and which were used for CASE (arrested cells). A: (answers found on the samples page):
CONTROL = GSM287991: mock-treated / asynchronous
CASE = GSM287992: alpha factor arrested cells
Question 2
library(ggplot2)
ggplot(expr, aes(x = GSM287992, y = GSM287991)) + geom_point()
Notice that the dots fall on the diagonal, and that the scales on X and Y are similar. The base assumption is that MOST of the genes do not vary between the two different conditions, and this is what we see here.
It is also this underlying assumption that makes it possible to normalize the data in the first place (to account for technical noise, e.g. slightly different amounts of cDNA on each array, slight array-to-array variation etc.). In conclusion, the data look comparable.
The expression values span 4 orders of magnitude, the largest value being close to 40,000. As most of the values are in the lower range (as indicated on the plot), it is difficult to see much detail in a plot with these dimensions.
Question 3
Plot after Log2 transformation of all the expression data:
ggplot(expr, aes(x = log2(GSM287992), y = log2(GSM287991))) + geom_point()
Question 4
(Theoretical question)
RATIO = CASE/CONTROL (we only consider values >0 for both CASE and CONTROL)
If CASE < CONTROL the RATIO will fall in the interval ]0;1[
If CASE > CONTROL the RATIO will fall in the interval ]1;inf[
The problem here is that these intervals are very far from being comparable. Up- and down-regulation of a given gene will be on very different scales.
Question 5
The trick is simply that we can use the Log2 function to transform the RATIOs.
Notice:
Log2(1) = 0
Log2(x) -> -inf as x -> 0
Log2(x) -> +inf as x -> inf
Compared to the question above, we will now have the following intervals after transformation:
CASE < CONTROL: Log2(RATIO) will fall in the interval ]-inf;0[
CASE > CONTROL: Log2(RATIO) will fall in the interval ]0;inf[
Examples:
raw ratio: 7.1/2.3 = 3.09, log2 ratio: log2(7.1/2.3) = 1.63
raw ratio: 2.3/7.1 = 0.32, log2 ratio: log2(2.3/7.1) = -1.63
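The symmetry can be checked numerically. The following is a minimal R sketch using the two values from the examples above; the variable names are chosen for illustration only.

```r
# The two illustrative expression values from the examples above
case    <- 7.1
control <- 2.3

# Raw ratios for an up- and a down-regulation of the same magnitude
up   <- case / control   # greater than 1
down <- control / case   # between 0 and 1

# On the raw scale the distances from "no change" (ratio = 1) differ
dist_raw <- c(up = up - 1, down = 1 - down)

# On the log2 scale the distances from "no change" (log2 ratio = 0) are equal
dist_log <- c(up = log2(up), down = -log2(down))

print(round(dist_raw, 2))
print(round(dist_log, 2))
```

The log2 values differ only in sign, which is what puts up- and down-regulation on the same scale.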
Question 6
expr$fc <- expr$GSM287992/expr$GSM287991
expr$log2fc <- log2(expr$GSM287992/expr$GSM287991)
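As noted in Question 4, the ratio is only meaningful when both values are greater than 0. A minimal sketch of that filtering step on a toy data frame is shown here; the gene names and values are made up, and only the two array column names are taken from the exercise.

```r
# Toy data frame with made-up values; the real 'expr' comes from the exercise data
toy <- data.frame(
  SysName   = c("geneA", "geneB", "geneC", "geneD"),  # hypothetical gene names
  GSM287991 = c(100, 50, 0, 200),                     # CONTROL
  GSM287992 = c(400, 50, 30, 25)                      # CASE
)

# Keep only rows where both values are strictly positive
toy <- toy[toy$GSM287991 > 0 & toy$GSM287992 > 0, ]

# Fold change (CASE / CONTROL) and its log2 transform
toy$fc     <- toy$GSM287992 / toy$GSM287991
toy$log2fc <- log2(toy$fc)

print(toy)
```

Without the filter, a zero in the CONTROL column would give an infinite ratio, and a zero in the CASE column would give a log2 ratio of -Inf.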
Question 7
GO overrepresentation analysis of top 100 CONTROL genes.
library(msigdbr)
library(fgsea)
load("home/projects/22140/exercise5.Rdata")
# GO Biological Process gene sets for yeast
BP_df <- msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
BP_list <- split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
# The 100 most highly expressed genes in the CONTROL sample
top100 <- expr[order(expr$GSM287991, decreasing = TRUE),]$SysName[1:100]
fora_q7 <- fora(pathways = BP_list, genes = top100, universe = background)
Observation: Lots of basic cell “housekeeping” (e.g. rRNA synthesis, metabolic processes). Since this is the CONTROL sample of normally growing cells, it is expected that we should see all the basic processes needed for cell growth.
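To give an idea of what an overrepresentation test measures, the following sketch computes a hypergeometric p-value for a single gene set in base R. All counts are made up for illustration, and this is a hand-rolled calculation, not the fora() call used above.

```r
# Made-up counts, for illustration only
universe_size <- 6000  # genes in the background
set_size      <- 150   # genes annotated to one GO term
list_size     <- 100   # genes in the query list (e.g. the top 100)
overlap       <- 12    # query genes that carry the GO term

# Overlap expected if the query list were drawn at random from the background
expected <- list_size * set_size / universe_size

# P(overlap >= observed) under the hypergeometric distribution
p_value <- phyper(overlap - 1, set_size, universe_size - set_size,
                  list_size, lower.tail = FALSE)

print(expected)
print(p_value)
```

An observed overlap well above the expected one gives a small p-value, i.e. the GO term is overrepresented in the query list.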
Question 8
The same analysis as above, but now with the entire list of expression values as input.
stats <- expr$GSM287991
names(stats) <- expr$SysName
gsea_q8 <- fgsea(BP_list, stats)
Same observation as above (but based on much more data, hence the greater depth of the analysis): lots of basic metabolism.
Question 9
What happens if we randomize the order of the genes in the list?
stats <- expr$GSM287991
names(stats) <- sample(expr$SysName)
gsea_q9 <- fgsea(BP_list, stats)
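Answer: we lose the signal. After shuffling, the gene names no longer match their expression values, so the members of a gene set are scattered randomly across the ranking instead of clustering at one end.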
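The effect of shuffling the labels can be illustrated without fgsea. The sketch below uses a made-up ranking statistic and a hypothetical gene set; it only shows how the set's position in the ranking changes, not an enrichment score.

```r
set.seed(1)  # make the shuffle reproducible

# Made-up ranking statistic for 100 hypothetical genes
stats <- 1:100
names(stats) <- paste0("gene", 1:100)

# A hypothetical gene set consisting of the ten highest-scoring genes
gene_set <- paste0("gene", 91:100)

# With correct labels the whole set sits at the top of the ranking
mean_original <- mean(stats[gene_set])

# After shuffling the labels, each set member gets an arbitrary value
shuffled <- stats
names(shuffled) <- sample(names(stats))
mean_shuffled <- mean(shuffled[gene_set])

print(mean_original)
print(mean_shuffled)
```

With the correct labels the set has the highest possible mean score; after shuffling its mean is expected to fall towards the overall mean of the ranking.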
Question 10
Repeat the functional class scoring analysis for the CASE array
stats <- expr$GSM287992
names(stats) <- expr$SysName
gsea_q10 <- fgsea(BP_list, stats)
Answer: it’s very difficult to tell it apart from the CONTROL sample. It’s important to understand why: even if the cells are arrested in G1, a lot of other (normal) processes are still going on, e.g. the cells still need basic metabolism.
Question 11
Functional class scoring of the genes' Log2 fold change
stats <- expr$log2fc
names(stats) <- expr$SysName
gsea_q11 <- fgsea(BP_list, stats)
Observe the following: by sorting on the fold change we focus the analysis on what is different between CASE and CONTROL. In more practical terms this means that all the normal metabolic processes that were dominating the analysis will disappear, and we can now see what is going on in relation to the G1 arrest.
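A small toy example may make this concrete. The gene names and expression values below are made up; the sketch only shows how the two ranking criteria order the same genes differently.

```r
# Made-up expression values for four hypothetical genes
toy <- data.frame(
  gene    = c("housekeeping1", "housekeeping2", "mating1", "mating2"),
  control = c(20000, 15000, 50, 80),
  case    = c(21000, 14500, 800, 960),
  stringsAsFactors = FALSE
)
toy$log2fc <- log2(toy$case / toy$control)

# Ranking by raw CASE expression puts the highly expressed genes on top
by_expr <- toy$gene[order(toy$case, decreasing = TRUE)]

# Ranking by log2 fold change puts the induced genes on top
by_fc <- toy$gene[order(toy$log2fc, decreasing = TRUE)]

print(by_expr)
print(by_fc)
```

The highly expressed genes barely change between the two conditions, so their log2 fold change is close to 0 and they move to the bottom half of this toy ranking.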
Answer: the karyogamy GO terms are overrepresented. This is in good agreement with the fact that the cells were arrested with alpha-factor, which induces the mating response. Karyogamy is the process in which the nuclei of the a- and alpha-cells fuse.