22140 - User contributions [en]

Heart Disease and Virtual Pulldown

2025-09-26T10:19:58Z

Krivi: /* Visualizing networks */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using R to extract, analyze, interpret and visualize (sub)networks

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have used this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is the one shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (about 200 million interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.
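The theoretical maximum quoted above is simply the number of unordered protein pairs. As a quick check, assuming exactly 20,000 proteins and ignoring self-interactions:

<pre>
n_proteins <- 20000

# Number of distinct unordered pairs: n * (n - 1) / 2
max_interactions <- choose(n_proteins, 2)
max_interactions
</pre>

This gives 199,990,000, i.e. roughly 200 million possible interactions.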

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. This is one of many interaction networks. Note that it has a confidence score (cs-score) telling us how certain we are about each interaction.

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(ggraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/exercise4.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph from the data (via the `graph_from_data_frame()` function). No plotting needed now.
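A minimal sketch of this step on a toy edge list. The name of the object loaded from the .Rdata file and its column names are not given here, so the data frame `toy_ppi` and its columns `protein_a`, `protein_b` and `score` are hypothetical stand-ins, and `directed = FALSE` is an assumption (protein interactions have no direction); adapt both to the actual data.

<pre>
library(igraph)

# Hypothetical stand-in for the InWeb table (real object/column names may differ)
toy_ppi <- data.frame(
  protein_a = c("GATA4", "GATA4", "NKX2-5", "TBX1", "BMP2"),
  protein_b = c("NKX2-5", "ZFPM2", "TBX2", "HAS2", "RXRA"),
  score     = c(0.90, 0.15, 0.40, 0.20, 0.05)
)

# Keep only interactions with a confidence score >= 0.2
toy_filtered <- toy_ppi[toy_ppi$score >= 0.2, ]

# The first two columns are read as the edge list;
# any remaining columns become edge attributes
toy_graph <- graph_from_data_frame(toy_filtered, directed = FALSE)

vcount(toy_graph)  # number of proteins (nodes)
ecount(toy_graph)  # number of interactions (edges)
</pre>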

3: Next we will do the virtual pulldown as a two-step process. Step 1: Extract the sub-graph (aka the virtual pulldown). Step 2: Filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The `unlist()` function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.
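The two hints above can be combined roughly as follows. This is only a sketch on a made-up toy network; `toy_graph`, `toy_seeds` and the protein names P1 to P5 are hypothetical, and in the exercise you would use the graph from step 2 and the real `seeds` vector.

<pre>
library(igraph)

# Toy network; in the exercise this would be the graph built in step 2
toy_edges <- data.frame(
  from = c("GATA4", "GATA4", "NKX2-5", "TBX1", "P1", "P2"),
  to   = c("NKX2-5", "P1", "P2", "P3", "P4", "P5")
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

toy_seeds <- c("GATA4", "NKX2-5", "TBX1", "ZFPM2")  # ZFPM2 is absent from the toy graph

# ego() fails on unknown vertex names, so keep only seeds present in the graph
seeds_in_graph <- intersect(toy_seeds, V(toy_graph)$name)

# One vertex set per seed: the seed itself plus its direct (order = 1) neighbours
neighbourhoods <- ego(toy_graph, order = 1, nodes = seeds_in_graph, mode = "all")

# Flatten the list into a single vector of vertex ids and drop duplicates
pulldown_ids <- unique(unlist(neighbourhoods))

# Sub-graph containing only these vertices and the edges between them
toy_pulldown <- induced_subgraph(toy_graph, vids = pulldown_ids)
sort(V(toy_pulldown)$name)
</pre>

In this toy case P4 and P5 are left out, because they are two steps away from any seed.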

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
  # require vertex attribute "name" in both graphs
  if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
    stop("Both graphs must have vertex 'name' attributes.")

  mode <- "all"

  vn <- V(subGraph)$name

  deg_full <- degree(parentGraph, v = vn, mode = mode)
  deg_internal <- degree(subGraph, mode = mode)

  res <- data.frame(
    node = vn,
    deg_internal = deg_internal,
    deg_full = deg_full,
    frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
    stringsAsFactors = FALSE
  )

  nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

  filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

  V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

  filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the comment on the first line of the function body). You should annotate what each section of the function does. ''Hint'': If needed, you can run parts of the code to see what the intermediate results are.

5: Run `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered networks to get an impression of how the network is narrowed down as the cutoff is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.
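One way to compare the cutoffs is to loop over them and tabulate node and edge counts. The sketch below runs on a made-up toy network, and `frac_filter()` is a hypothetical, condensed stand-in used only to keep the example self-contained; in the exercise you would call `filter_virtual_pulldown()` on your own pulldown and parent graph instead.

<pre>
library(igraph)

# Toy parent network: seed S has two partners (A and H); H also has many other partners
toy_edges <- data.frame(
  from = c("S", "S", "A", "H", "H", "H", "H"),
  to   = c("A", "H", "H", "X1", "X2", "X3", "X4")
)
parent <- graph_from_data_frame(toy_edges, directed = FALSE)

# Virtual pulldown around the seed S
pulldown <- induced_subgraph(parent, vids = unlist(ego(parent, order = 1, nodes = "S")))

# Hypothetical condensed stand-in for filter_virtual_pulldown()
frac_filter <- function(sub, full, cutoff) {
  frac <- degree(sub) / degree(full, v = V(sub)$name)
  induced_subgraph(sub, vids = V(sub)$name[frac >= cutoff])
}

# One row per cutoff with the resulting network size
cutoffs <- c(0, 0.2, 0.5, 1)
size_table <- t(sapply(cutoffs, function(co) {
  f <- frac_filter(pulldown, parent, co)
  c(cutoff = co, nodes = vcount(f), edges = ecount(f))
}))
size_table
</pre>

`vcount()` and `ecount()` give the number of nodes and edges of any igraph object, so the same loop can be used on the real data.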

=== Visualizing networks ===

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Use ggraph to plot the virtual pulldown filtered with a cutoff of 0.2. Indicate which nodes are seed nodes and which are neighbours.
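A possible starting point for the plot, shown on a made-up toy graph. The vertex attribute name `node_type`, the layout and the styling are illustrative choices only; in the exercise you would use your filtered graph and the real `seeds` vector.

<pre>
library(igraph)
library(ggraph)
library(ggplot2)

# Toy pulldown; replace with the filtered graph from the exercise
toy_graph <- graph_from_data_frame(
  data.frame(from = c("GATA4", "GATA4", "NKX2-5", "P1"),
             to   = c("NKX2-5", "P1", "P2", "P2")),
  directed = FALSE
)
toy_seeds <- c("GATA4", "NKX2-5")

# Store the node type as a vertex attribute so ggraph can map it to colour
V(toy_graph)$node_type <- ifelse(V(toy_graph)$name %in% toy_seeds, "seed", "neighbour")

p <- ggraph(toy_graph, layout = "fr") +
  geom_edge_link(colour = "grey60") +
  geom_node_point(aes(colour = node_type), size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
p
</pre>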

=== Functional classification ===
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora(). What background do you need to use? (You already have all the information needed to decide on and use this background.)

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>
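The clustering and overrepresentation steps could be chained roughly as sketched below. Everything here is toy data: the gene names, `toy_pathways` and `toy_universe` are made up, and the universe is a placeholder only, since choosing the right background for the real analysis is part of the exercise.

<pre>
library(igraph)
library(fgsea)

# Toy network: two tight groups joined by a single edge
toy_edges <- data.frame(
  from = c("G1", "G1", "G1", "G2", "G2", "G3", "G5", "G5", "G5", "G6", "G6", "G7", "G4"),
  to   = c("G2", "G3", "G4", "G3", "G4", "G4", "G6", "G7", "G8", "G7", "G8", "G8", "G5")
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Louvain clustering; a lower resolution favours fewer, larger clusters
cl <- cluster_louvain(toy_graph, resolution = 0.4)
cluster_sizes <- sort(table(membership(cl)), decreasing = TRUE)

# Genes in the largest cluster
largest_id <- names(cluster_sizes)[1]
cluster_genes <- V(toy_graph)$name[membership(cl) == as.integer(largest_id)]

# fora() expects a named list of gene sets; with msigdbr output this could be
# built as split(BP_df$gene_symbol, BP_df$gs_name)
toy_pathways <- list(SET_A = paste0("G", 1:4), SET_B = paste0("G", 5:12))

# Placeholder background for the toy data only
toy_universe <- paste0("G", 1:40)

res <- fora(pathways = toy_pathways, genes = cluster_genes, universe = toy_universe)
res
</pre>

The result has one row per gene set, with the overlap and a p-value for each.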

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #5:'''
* Which gene is the most central (as defined by betweenness centrality) in the largest cluster? Does that gene make sense in the disease context?
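Betweenness centrality can be computed with igraph's `betweenness()`. The sketch below uses a made-up chain of five nodes; in the exercise, `toy_cluster` would be replaced by the sub-graph of the largest cluster (for example extracted with `induced_subgraph()`).

<pre>
library(igraph)

# Toy cluster: a simple chain A - B - C - D - E
toy_cluster <- graph_from_data_frame(
  data.frame(from = c("A", "B", "C", "D"), to = c("B", "C", "D", "E")),
  directed = FALSE
)

# Betweenness counts the shortest paths between other node pairs
# that pass through each node
bc <- betweenness(toy_cluster, directed = FALSE)
sort(bc, decreasing = TRUE)

# Name of the node with the highest betweenness
most_central <- names(which.max(bc))
most_central  # "C" in this toy chain
</pre>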

Heart Disease and Virtual Pulldown

2025-09-26T10:19:39Z

Krivi: /* Exercise on "Virtual Pulldowns" */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using R to extract, analyze, interpret and visualize (sub)networks

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins of which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partner for a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the most simple case, all known disease related genes, might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

What is more commonly the case is the situation shown in the figure to the right: the known disease related genes/protein ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20.000+ proteins interacting with each other (200 million interactions), but even the fraction we have experimental evidence for, is in the order of several hundreds of thousands of interactions between 12.000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology based clustering on large input networks, followed by a search for cluster enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. This is one of many interaction networks. Note that it has a confidence score (cs-score) telling us how certain we about each interaction.

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect|Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/exercise4.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the (data via the graph_from_dataframe() function). No plotting needed now.

3: Next we will do the virtual pulldown as a two step process. Step 1: Extract sub-graph (aka virtual pulldown), Step 2: Filter the virtual pulldown (as discussed in lecture). For step 1:

* You can use the `ego()` function to get neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector wile presevering all entries
* You can use `induced_subgraph()` to extact a subnetwork based on a set of node ids.

4: For step2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
# require vertex attribute "names" in both graphs
if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
stop("Both graphs must have vertex 'name' attributes.")

mode <- "all"

vn <- V(subGraph)$name

deg_full <- degree(parentGraph, v = vn, mode = mode)
deg_internal <- degree(subGraph, mode = mode)

res <- data.frame(
node = vn,
deg_internal = deg_internal,
deg_full = deg_full,
frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
stringsAsFactors = FALSE
)

nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the "# require matching vertex names in both graphs" comment at the start of the function). You should annotate what each section of the function does. ''Hint'': If needed you can run parts of the code to see what the intermediary results are.

5: Run the `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain difference.

=== Visualizing networks ===

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2":
* Plot the graph created by filtering with confidence score 0.2. Indicate which nodes are seed notes and which are neighbours.

=== Functional classification ===
Use loivant clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters do gene-set overrepressentation analysis with fora(). What background do you need to use? (you already have all the information needed to decide and use this background)

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #5:'''
* Which gene is the most central (as defined by betweenness centrality) in the largest cluster? Does that gene make sense in the disease context?

Heart Disease and Virtual Pulldown

2025-09-20T16:26:42Z

Krivi: /* Exercise on "Virtual Pulldowns" */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using R to extract, analyze, interpret and visualize (sub)networks

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins of which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partner for a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the most simple case, all known disease related genes, might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

What is more commonly the case is the situation shown in the figure to the right: the known disease related genes/protein ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20.000+ proteins interacting with each other (200 million interactions), but even the fraction we have experimental evidence for, is in the order of several hundreds of thousands of interactions between 12.000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology based clustering on large input networks, followed by a search for cluster enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. This is one of many interaction networks. Note that it has a confidence score (cs-score) telling us how certain we about each interaction.

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect|Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/exercise4.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two step process. Step 1: Extract sub-graph (aka virtual pulldown), Step 2: Filter the virtual pulldown (as discussed in lecture). For step 1:

* You can use the `ego()` function to get neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector wile presevering all entries
* You can use `induced_subgraph()` to extact a subnetwork based on a set of node ids.

4: For step2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
# require vertex attribute "names" in both graphs
if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
stop("Both graphs must have vertex 'name' attributes.")

mode <- "all"

vn <- V(subGraph)$name

deg_full <- degree(parentGraph, v = vn, mode = mode)
deg_internal <- degree(subGraph, mode = mode)

res <- data.frame(
node = vn,
deg_internal = deg_internal,
deg_full = deg_full,
frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
stringsAsFactors = FALSE
)

nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the "# require matching vertex names in both graphs" comment at the start of the function). You should annotate what each section of the function does. ''Hint'': If needed you can run parts of the code to see what the intermediary results are.

5: Run the `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain difference.

=== Visualizing networks ===

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2":
* Plot the graph created by filtering with confidence score 0.2. Indicate which nodes are seed notes and which are neighbours.

=== Functional classification ===
Use loivant clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters do gene-set overrepressentation analysis with fora(). What background do you need to use? (you already have all the information needed to decide and use this background)

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #5:'''
* Which gene is the most central (as defined by betweenness centrality) in the largest cluster? Does that gene make sense in the disease context?

Heart Disease and Virtual Pulldown

2025-09-19T07:12:08Z

Krivi: /* Functional classification */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using R to extract, analyze, interpret and visualize (sub)networks

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins of which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partner for a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the most simple case, all known disease related genes, might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

What is more commonly the case is the situation shown in the figure to the right: the known disease related genes/protein ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20.000+ proteins interacting with each other (200 million interactions), but even the fraction we have experimental evidence for, is in the order of several hundreds of thousands of interactions between 12.000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology based clustering on large input networks, followed by a search for cluster enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. This is one of many interaction networks. Note that it has a confidence score (cs-score) telling us how certain we about each interaction.

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect|Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two step process. Step 1: Extract sub-graph (aka virtual pulldown), Step 2: Filter the virtual pulldown (as discussed in lecture). For step 1:

* You can use the `ego()` function to get neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector wile presevering all entries
* You can use `induced_subgraph()` to extact a subnetwork based on a set of node ids.

4: For step2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
# require vertex attribute "names" in both graphs
if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
stop("Both graphs must have vertex 'name' attributes.")

mode <- "all"

vn <- V(subGraph)$name

deg_full <- degree(parentGraph, v = vn, mode = mode)
deg_internal <- degree(subGraph, mode = mode)

res <- data.frame(
node = vn,
deg_internal = deg_internal,
deg_full = deg_full,
frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
stringsAsFactors = FALSE
)

nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the "# require matching vertex names in both graphs" comment at the start of the function). You should annotate what each section of the function does. ''Hint'': If needed you can run parts of the code to see what the intermediary results are.

5: Run the `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain difference.

=== Visualizing networks ===

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2":
* Plot the graph created by filtering with confidence score 0.2. Indicate which nodes are seed notes and which are neighbours.

=== Functional classification ===
Use loivant clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters do gene-set overrepressentation analysis with fora(). What background do you need to use? (you already have all the information needed to decide and use this background)

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #5:'''
* Which gene is the most central (as defined by betweenness centrality) in the largest cluster? Does that gene make sense in the disease context?

Heart Disease and Virtual Pulldown

2025-09-19T07:11:33Z

Krivi: /* Functional classification */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using R to extract, analyze, interpret and visualize (sub)networks

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins of which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partner for a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is as shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20.000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12.000+ proteins.
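The "200 million" figure can be checked with a line of R. This is only a back-of-the-envelope sketch: it assumes exactly 20,000 proteins and counts each unordered pair once, ignoring self-interactions.

```r
# Number of distinct protein pairs among n proteins: n * (n - 1) / 2
n_proteins <- 20000
max_interactions <- choose(n_proteins, 2)

# Roughly 2e8, i.e. about 200 million possible pairwise interactions
max_interactions
```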

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. This is one of many interaction networks. Note that it has a confidence score (cs-score) telling us how certain we are about each interaction.

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data
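As a hint for the mechanics of this step, here is a minimal sketch on a made-up edge table. The object name `toy_edges` and the column names `gene1`, `gene2` and `cs` are assumptions for illustration only; inspect the object loaded from the InWeb file to see what it is actually called and how its columns are named.

```r
library(igraph)

# Toy edge table; the column names gene1, gene2 and cs are hypothetical
toy_edges <- data.frame(
  gene1 = c("A", "A", "B", "C", "D"),
  gene2 = c("B", "C", "C", "D", "E"),
  cs    = c(0.90, 0.10, 0.50, 0.20, 0.05)
)

# Keep only interactions at or above the confidence cutoff
toy_confident <- toy_edges[toy_edges$cs >= 0.2, ]

# Build an undirected graph; the first two columns are read as the edge list
# and any remaining columns (here cs) become edge attributes
toy_graph <- graph_from_data_frame(toy_confident, directed = FALSE)

vcount(toy_graph)  # number of nodes
ecount(toy_graph)  # number of edges
```

Note that node E disappears here because its only interaction falls below the cutoff.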

3: Next we will do the virtual pulldown as a two-step process. Step 1: extract the sub-graph (aka virtual pulldown). Step 2: filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.
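The two functions mentioned above can be combined roughly as follows. This is a sketch on a toy graph with hypothetical node names (`S1` and `S2` play the role of seeds), not the solution for the InWeb data.

```r
library(igraph)

# Toy undirected network; S1 and S2 act as seed proteins
toy_graph <- graph_from_literal(S1 - A, S1 - B, A - B, S2 - C, C - D, D - E)
toy_seeds <- c("S1", "S2")

# ego() returns one vertex sequence per seed: the seed plus its 1st order neighbours
neighbourhoods <- ego(toy_graph, order = 1, nodes = toy_seeds, mode = "all")

# Flatten the list into one vector of node names and drop duplicates
pulldown_nodes <- unique(unlist(lapply(neighbourhoods, as_ids)))

# Sub-network containing the seeds, their partners and all edges among them
toy_pulldown <- induced_subgraph(toy_graph, vids = pulldown_nodes)

V(toy_pulldown)$name
ecount(toy_pulldown)
```

Nodes D and E are left out because they are more than one step away from any seed.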

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
  # require vertex attribute "names" in both graphs
  if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
    stop("Both graphs must have vertex 'name' attributes.")

  mode <- "all"

  vn <- V(subGraph)$name

  deg_full <- degree(parentGraph, v = vn, mode = mode)
  deg_internal <- degree(subGraph, mode = mode)

  res <- data.frame(
    node = vn,
    deg_internal = deg_internal,
    deg_full = deg_full,
    frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
    stringsAsFactors = FALSE
  )

  nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

  filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

  V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

  filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the comment on the first line of the function body). You should annotate what each section of the function does. ''Hint'': If needed, you can run parts of the code to see what the intermediate results are.

5: Run `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the cutoff is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.
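To get a feeling for what the cutoff does before running it on the real data, the sketch below recomputes the same kind of score by hand on a toy network. The node names are hypothetical, and `HUB` is constructed to behave like a "sticky" protein with most of its partners outside the pulldown.

```r
library(igraph)

# Toy parent network: HUB has four partners (X1-X4) outside the pulldown
parent <- graph_from_literal(S1 - A, S1 - HUB, A - S2, S2 - HUB,
                             HUB - X1, HUB - X2, HUB - X3, HUB - X4)
pulldown <- induced_subgraph(parent, vids = c("S1", "S2", "A", "HUB"))

# Fraction of each node's interactions that fall inside the pulldown
nodes <- V(pulldown)$name
frac_internal <- degree(pulldown, v = nodes) / degree(parent, v = nodes)

# Number of nodes that survive each cutoff
cutoffs <- c(0, 0.2, 0.5, 1)
n_kept <- sapply(cutoffs, function(cutoff) sum(frac_internal >= cutoff))
names(n_kept) <- cutoffs
n_kept
```

In this toy case `HUB` has only 2 of its 6 interactions inside the pulldown, so it is dropped once the cutoff exceeds one third.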

=== Visualizing networks ===

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Plot the graph created by filtering with confidence score 0.2. Indicate which nodes are seed nodes and which are neighbours.
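One possible way to mark seeds in an igraph plot is to store a colour as a vertex attribute. The sketch below uses a toy graph with hypothetical node names; the colours and the `is_seed` attribute name are arbitrary choices, not part of the exercise data.

```r
library(igraph)

# Toy network; S1 and S2 act as seed proteins
toy_graph <- graph_from_literal(S1 - A, S1 - B, A - B, S2 - B)
toy_seeds <- c("S1", "S2")

# Flag seeds and translate the flag into a vertex colour
V(toy_graph)$is_seed <- V(toy_graph)$name %in% toy_seeds
V(toy_graph)$color <- ifelse(V(toy_graph)$is_seed, "gold", "lightblue")

# plot() picks up the 'color' vertex attribute automatically
plot(toy_graph, vertex.label.cex = 0.8)
legend("topleft", legend = c("Seed", "Neighbour"),
       pch = 21, pt.bg = c("gold", "lightblue"))
```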

=== Functional classification ===
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora(). What background do you need to use? (You already have all the information needed to decide this; nothing is missing.)

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>
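The mechanics of clustering followed by an overrepresentation test can be sketched on toy data as shown here. All gene names and gene sets are made up, and the `universe` is simply all toy nodes so that the call runs; which background is appropriate for the real analysis is the question posed above.

```r
library(igraph)
library(fgsea)

# Toy network: two triangles joined by one bridge edge (hypothetical gene names)
toy_graph <- graph_from_literal(G1 - G2, G2 - G3, G1 - G3, G3 - G4,
                                G4 - G5, G5 - G6, G4 - G6)

set.seed(1)  # Louvain involves randomness, so fix the seed for reproducibility
toy_clusters <- cluster_louvain(toy_graph, resolution = 0.4)

# Members of the largest cluster
membership_vec <- toy_clusters$membership
largest_id <- as.integer(names(which.max(table(membership_vec))))
largest_genes <- V(toy_graph)$name[membership_vec == largest_id]

# fora() expects gene sets as a named list of character vectors
toy_sets <- list(set_one = c("G1", "G2", "G3"),
                 set_two = c("G4", "G5", "G6"))
toy_universe <- V(toy_graph)$name  # placeholder background

ora <- fora(pathways = toy_sets, genes = largest_genes,
            universe = toy_universe, minSize = 1)
ora[order(ora$pval), ]
```

For the real data, a data frame like BP_df can be turned into such a list with something like `split(BP_df$gene_symbol, BP_df$gs_name)`, assuming it has the `gs_name` and `gene_symbol` columns that msigdbr output normally contains.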

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #5:'''
* Which gene is the most central (as defined by betweenness centrality) in the largest cluster? Does that gene make sense in the disease context?
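Betweenness centrality counts how many shortest paths between other nodes pass through a given node. A minimal sketch on a toy cluster with hypothetical node names:

```r
library(igraph)

# Toy cluster: C connects the A-B branch to the leaves D and E
toy_cluster <- graph_from_literal(A - B, B - C, C - D, C - E)

# Betweenness for every node, returned as a named numeric vector
bc <- betweenness(toy_cluster, directed = FALSE)
sort(bc, decreasing = TRUE)

# Name of the most central node
most_central <- names(which.max(bc))
most_central
```

Here `C` comes out on top because every path between the two sides of the toy cluster runs through it.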

Heart Disease and Virtual Pulldown

2025-09-19T07:10:11Z

Krivi: /* Human diseases / virtual pulldown exercise */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using R to extract, analyze, interpret and visualize (sub)networks

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have used this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is as shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20.000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12.000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. This is one of many interaction networks. Note that it has a confidence score (cs-score) telling us how certain we are about each interaction.

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two-step process. Step 1: extract the sub-graph (aka virtual pulldown). Step 2: filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
  # require vertex attribute "names" in both graphs
  if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
    stop("Both graphs must have vertex 'name' attributes.")

  mode <- "all"

  vn <- V(subGraph)$name

  deg_full <- degree(parentGraph, v = vn, mode = mode)
  deg_internal <- degree(subGraph, mode = mode)

  res <- data.frame(
    node = vn,
    deg_internal = deg_internal,
    deg_full = deg_full,
    frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
    stringsAsFactors = FALSE
  )

  nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

  filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

  V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

  filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the comment on the first line of the function body). You should annotate what each section of the function does. ''Hint'': If needed, you can run parts of the code to see what the intermediate results are.

5: Run `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the cutoff is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Plot the graph created by filtering with confidence score 0.2. Indicate which nodes are seed nodes and which are neighbours.

=== Functional classification ===
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora().

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #5:'''
* Which gene is the most central (as defined by betweenness centrality) in the largest cluster? Does that gene make sense in the disease context?

Heart Disease and Virtual Pulldown

2025-09-19T07:09:30Z

Krivi: /* Functional classification */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using R to extract, analyze and visualize (sub)networks

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have used this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is as shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20.000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12.000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. This is one of many interaction networks. Note that it has a confidence score (cs-score) telling us how certain we are about each interaction.

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two-step process. Step 1: extract the sub-graph (aka virtual pulldown). Step 2: filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
  # require vertex attribute "names" in both graphs
  if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
    stop("Both graphs must have vertex 'name' attributes.")

  mode <- "all"

  vn <- V(subGraph)$name

  deg_full <- degree(parentGraph, v = vn, mode = mode)
  deg_internal <- degree(subGraph, mode = mode)

  res <- data.frame(
    node = vn,
    deg_internal = deg_internal,
    deg_full = deg_full,
    frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
    stringsAsFactors = FALSE
  )

  nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

  filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

  V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

  filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the comment on the first line of the function body). You should annotate what each section of the function does. ''Hint'': If needed, you can run parts of the code to see what the intermediate results are.

5: Run `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the cutoff is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Plot the graph created by filtering with confidence score 0.2. Indicate which nodes are seed nodes and which are neighbours.

=== Functional classification ===
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora().

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #5:'''
* Which gene is the most central (as defined by betweenness centrality) in the largest cluster? Does that gene make sense in the disease context?

Heart Disease and Virtual Pulldown

2025-09-19T07:09:18Z

Krivi: /* Functional classification */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using R to extract, analyze and visualize (sub)networks

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have used this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is as shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20.000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12.000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. This is one of many interaction networks. Note that it has a confidence score (cs-score) telling us how certain we are about each interaction.

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two-step process. Step 1: extract the sub-graph (aka virtual pulldown). Step 2: filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
  # require vertex attribute "names" in both graphs
  if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
    stop("Both graphs must have vertex 'name' attributes.")

  mode <- "all"

  vn <- V(subGraph)$name

  deg_full <- degree(parentGraph, v = vn, mode = mode)
  deg_internal <- degree(subGraph, mode = mode)

  res <- data.frame(
    node = vn,
    deg_internal = deg_internal,
    deg_full = deg_full,
    frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
    stringsAsFactors = FALSE
  )

  nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

  filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

  V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

  filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the comment on the first line of the function body). You should annotate what each section of the function does. ''Hint'': If needed, you can run parts of the code to see what the intermediate results are.

5: Run `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the cutoff is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Plot the graph created by filtering with confidence score 0.2. Indicate which nodes are seed nodes and which are neighbours.

=== Functional classification ===
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora().

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #5:'''
* Which gene is the most central (as defined by betweenness centrality) in the largest cluster? Does that gene make sense in the disease context?

Heart Disease and Virtual Pulldown

2025-09-17T11:39:53Z

Krivi: /* Exercise on "Virtual Pulldowns" */

Heart Disease and Virtual Pulldown

2025-09-17T11:38:19Z

Krivi: /* Human diseases / virtual pulldown exercise */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using R to extract, analyze and visualize (sub)networks

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have used this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is as shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20.000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12.000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biological process or condition (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data
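
A minimal sketch of this step is shown here on a toy table. The object name <code>ppi</code> and the column names <code>protein_a</code>, <code>protein_b</code> and <code>score</code> are hypothetical; check what the loaded InWeb data actually contains (for example with ls() and head()) and adapt accordingly.

<pre>
library(igraph)

# Hypothetical stand-in for the InWeb table; the real object and column names may differ
ppi <- data.frame(
  protein_a = c("GATA4", "GATA4", "NKX2-5", "TBX1", "BMP2"),
  protein_b = c("NKX2-5", "ZFPM2", "TBX2", "PITX2", "HAS2"),
  score     = c(0.90, 0.15, 0.40, 0.20, 0.05),
  stringsAsFactors = FALSE
)

# Keep only interactions at or above the confidence threshold
ppi_conf <- ppi[ppi$score >= 0.2, ]

# The first two columns are used as edge endpoints; other columns become edge attributes
g <- graph_from_data_frame(ppi_conf, directed = FALSE)

vcount(g)   # number of proteins left in the toy graph
ecount(g)   # number of interactions left in the toy graph
</pre>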

3: Next we will do the virtual pulldown as a two-step process. Step 1: Extract the sub-graph (aka virtual pulldown), Step 2: Filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.
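
The hints above can be combined roughly as follows. This is a sketch on a toy graph with made-up edges, not the InWeb network; the names <code>seeds_in_graph</code>, <code>pulldown_ids</code> and <code>pulldown</code> are chosen here for illustration only.

<pre>
library(igraph)

# Toy network standing in for the thresholded InWeb graph (hypothetical edges)
g <- graph_from_data_frame(data.frame(
  from = c("GATA4", "GATA4", "NKX2-5", "P1", "P2", "P3"),
  to   = c("NKX2-5", "P1", "P2", "P2", "P3", "P4")
), directed = FALSE)

seeds <- c("GATA4", "NKX2-5", "TBX1")

# Only seeds that are present in the graph can be queried
seeds_in_graph <- intersect(seeds, V(g)$name)

# ego() returns one vertex sequence per seed: the seed plus its 1st order neighbours
neighbourhoods <- ego(g, order = 1, nodes = seeds_in_graph, mode = "all")

# Flatten to a single vector of vertex ids and drop repeated entries
pulldown_ids <- unique(unlist(neighbourhoods))

# Keep these nodes and every edge between them
pulldown <- induced_subgraph(g, vids = pulldown_ids)
sort(V(pulldown)$name)
</pre>

Seeds that are absent from the network (here TBX1) are dropped before calling ego(), since asking for an unknown vertex name raises an error.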

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
  # require vertex attribute "names" in both graphs
  if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
    stop("Both graphs must have vertex 'name' attributes.")

  mode <- "all"

  vn <- V(subGraph)$name

  deg_full <- degree(parentGraph, v = vn, mode = mode)
  deg_internal <- degree(subGraph, mode = mode)

  res <- data.frame(
    node = vn,
    deg_internal = deg_internal,
    deg_full = deg_full,
    frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
    stringsAsFactors = FALSE
  )

  nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

  filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

  V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

  filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the "# require vertex attribute "names" in both graphs" comment at the start of the function). You should annotate what each section of the function does. ''Hint'': If needed you can run parts of the code to see what the intermediate results are.

5: Run `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the cutoff is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.
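
One way to compare network sizes is to collect the results in a named list and tabulate node and edge counts. The sketch below uses toy graphs as placeholders; in the exercise each list element would instead be the graph returned by `filter_virtual_pulldown()` for that cutoff. The names <code>filtered</code> and <code>sizes</code> are illustrative only.

<pre>
library(igraph)

# Toy graphs standing in for the filtered pulldowns (placeholders, not real results)
filtered <- list(
  "0"   = make_full_graph(6),
  "0.2" = make_ring(5),
  "0.5" = make_star(4, mode = "undirected"),
  "1"   = make_empty_graph(2, directed = FALSE)
)

# One row per cutoff with the number of nodes and edges
sizes <- data.frame(
  cutoff = names(filtered),
  nodes  = sapply(filtered, vcount),
  edges  = sapply(filtered, ecount),
  row.names = NULL
)
sizes
</pre>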

=== Visualizing networks ===

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Plot the graph created by filtering with a cutoff of 0.2. Indicate which nodes are seed nodes and which are neighbours.
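
A possible way to mark seeds in a plot is to set a vertex colour before calling plot(). This is a sketch on a toy graph using base igraph plotting; the colours and the legend layout are arbitrary choices, and other plotting packages would work as well.

<pre>
library(igraph)

# Toy network standing in for the filtered pulldown (hypothetical edges)
g <- graph_from_data_frame(data.frame(
  from = c("GATA4", "GATA4", "NKX2-5", "P1"),
  to   = c("NKX2-5", "P1", "P2", "P2")
), directed = FALSE)

seeds <- c("GATA4", "NKX2-5")

# Colour seeds differently from their neighbours
is_seed <- V(g)$name %in% seeds
V(g)$color <- ifelse(is_seed, "gold", "lightblue")

plot(g, vertex.size = 20, vertex.label.cex = 0.8)
legend("topleft", legend = c("Seed", "Neighbour"),
       pt.bg = c("gold", "lightblue"), pch = 21, bty = "n")
</pre>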

=== Functional classification ===
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora().

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>
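
The clustering part could look roughly like the sketch below, which runs on a toy graph rather than the filtered pulldown. It assumes an igraph version in which cluster_louvain() accepts a <code>resolution</code> argument; the names <code>cl</code>, <code>cluster_sizes</code> and <code>genes_in_largest</code> are illustrative.

<pre>
library(igraph)

# Toy network: two 5-node cliques joined by a single edge (hypothetical)
g <- make_full_graph(5) %du% make_full_graph(5)
g <- add_edges(g, c(1, 6))
V(g)$name <- paste0("P", seq_len(vcount(g)))

# Louvain clustering is randomised, so fix the seed for reproducibility
set.seed(1)
cl <- cluster_louvain(g, resolution = 0.4)

# Cluster sizes, largest first
cluster_sizes <- sort(sizes(cl), decreasing = TRUE)
cluster_sizes

# Names of the proteins in the largest cluster
largest_id <- as.integer(names(cluster_sizes)[1])
genes_in_largest <- V(g)$name[membership(cl) == largest_id]
genes_in_largest
</pre>

The protein names of each large cluster can then be used as the <code>genes</code> argument of fora(), with the gene sets built from BP_df as <code>pathways</code> and a suitable background as <code>universe</code>.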

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T11:27:50Z

Krivi: /* Visualizing networks */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins of which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is the one shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biological process or condition (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two-step process. Step 1: Extract the sub-graph (aka virtual pulldown), Step 2: Filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
# require vertex attribute "names" in both graphs
if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
stop("Both graphs must have vertex 'name' attributes.")

mode <- "all"

vn <- V(subGraph)$name

deg_full <- degree(parentGraph, v = vn, mode = mode)
deg_internal <- degree(subGraph, mode = mode)

res <- data.frame(
node = vn,
deg_internal = deg_internal,
deg_full = deg_full,
frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
stringsAsFactors = FALSE
)

nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the "# require vertex attribute "names" in both graphs" comment at the start of the function). You should annotate what each section of the function does. ''Hint'': If needed you can run parts of the code to see what the intermediate results are.

5: Run `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the cutoff is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Plot the graph created by filtering with a cutoff of 0.2. Indicate which nodes are seed nodes and which are neighbours.

=== Functional classification ===
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora().

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T11:26:42Z

Krivi: /* Heart disease proteins */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins of which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is the one shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biological process or condition (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two-step process. Step 1: Extract the sub-graph (aka virtual pulldown), Step 2: Filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
# require vertex attribute "names" in both graphs
if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
stop("Both graphs must have vertex 'name' attributes.")

mode <- "all"

vn <- V(subGraph)$name

deg_full <- degree(parentGraph, v = vn, mode = mode)
deg_internal <- degree(subGraph, mode = mode)

res <- data.frame(
node = vn,
deg_internal = deg_internal,
deg_full = deg_full,
frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
stringsAsFactors = FALSE
)

nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the "# require vertex attribute "names" in both graphs" comment at the start of the function). You should annotate what each section of the function does. ''Hint'': If needed you can run parts of the code to see what the intermediate results are.

5: Run `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the cutoff is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Functional classification ===
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora().

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T11:25:51Z

Krivi: /* Protein complex detection */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins of which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is the one shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biological process or condition (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two-step process. Step 1: Extract the sub-graph (aka virtual pulldown), Step 2: Filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
# require vertex attribute "names" in both graphs
if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
stop("Both graphs must have vertex 'name' attributes.")

mode <- "all"

vn <- V(subGraph)$name

deg_full <- degree(parentGraph, v = vn, mode = mode)
deg_internal <- degree(subGraph, mode = mode)

res <- data.frame(
node = vn,
deg_internal = deg_internal,
deg_full = deg_full,
frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
stringsAsFactors = FALSE
)

nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the "# require vertex attribute "names" in both graphs" comment at the start of the function). You should annotate what each section of the function does. ''Hint'': If needed you can run parts of the code to see what the intermediate results are.

5: Run `filter_virtual_pulldown()` with these cutoffs: 0, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the cutoff is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Functional classification ===
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora().

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T11:25:38Z

Krivi: /* Functional classification */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins of which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is the one shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biological process (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data
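
As a rough sketch of this step on toy data: the edge table and its column names (<code>protein1</code>, <code>protein2</code>, <code>score</code>) are hypothetical and will likely differ from the object stored in the InWeb file, so inspect the loaded data first and adapt the names.

<pre>
library(igraph)

# Toy edge table; column names are assumptions, not the real InWeb layout
toy_edges <- data.frame(
  protein1 = c("A", "A", "B", "C"),
  protein2 = c("B", "C", "C", "D"),
  score    = c(0.90, 0.10, 0.50, 0.25),
  stringsAsFactors = FALSE
)

# Keep only interactions at or above the confidence cutoff
high_conf <- toy_edges[toy_edges$score >= 0.2, ]

# Build an undirected graph; the first two columns define the edges,
# remaining columns (here: score) are stored as edge attributes
toy_graph <- graph_from_data_frame(high_conf, directed = FALSE)

vcount(toy_graph)  # number of proteins (nodes)
ecount(toy_graph)  # number of interactions (edges)
</pre>

In this toy table the A-C interaction (score 0.10) is dropped, leaving three edges between four proteins.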

3: Next we will do the virtual pulldown as a two-step process. Step 1: Extract a sub-graph (aka virtual pulldown). Step 2: Filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.
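
The two hints can be combined as in the following minimal sketch. It uses a made-up toy network in which <code>S1</code> and <code>S2</code> stand in for seed proteins; all object names are illustrative only.

<pre>
library(igraph)

# Toy network: S1 and S2 play the role of seed proteins
toy_graph <- graph_from_data_frame(
  data.frame(
    from = c("S1", "S1", "S2", "X", "Y"),
    to   = c("A",  "B",  "B",  "Y", "Z")
  ),
  directed = FALSE
)
toy_seeds <- c("S1", "S2")

# ego() returns one vertex sequence per seed: the seed plus its neighbours
neighbourhoods <- ego(toy_graph, order = 1, nodes = toy_seeds, mode = "all")

# Flatten to one vector of vertex ids and drop duplicates
# (B is a neighbour of both seeds)
node_ids <- unique(unlist(neighbourhoods))

# Keep those vertices and every edge that runs between them
toy_pulldown <- induced_subgraph(toy_graph, vids = node_ids)
V(toy_pulldown)$name
</pre>

The proteins X, Y and Z are not direct partners of any seed, so they do not appear in the pulldown.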

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
  # require vertex attribute "name" in both graphs
  if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
    stop("Both graphs must have vertex 'name' attributes.")

  mode <- "all"

  vn <- V(subGraph)$name

  deg_full <- degree(parentGraph, v = vn, mode = mode)
  deg_internal <- degree(subGraph, mode = mode)

  res <- data.frame(
    node = vn,
    deg_internal = deg_internal,
    deg_full = deg_full,
    frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
    stringsAsFactors = FALSE
  )

  nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

  filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

  V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

  filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the comment on the first line of the function body). You should annotate what each section of the function does. ''Hint'': If needed, you can run parts of the code to see what the intermediate results are.
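
To get a feel for the quantity the filter is based on, here is a small sketch on a made-up network. The protein names are hypothetical; <code>HUB</code> mimics a "sticky" protein with many partners outside the pulldown.

<pre>
library(igraph)

# Toy parent network: HUB interacts with many proteins, P only with the seed
parent <- graph_from_data_frame(
  data.frame(
    from = c("SEED", "SEED", "HUB", "HUB", "HUB"),
    to   = c("P",    "HUB",  "Q1",  "Q2",  "Q3")
  ),
  directed = FALSE
)

# Virtual pulldown: the seed and its direct partners
pulldown <- induced_subgraph(parent, vids = c("SEED", "P", "HUB"))

vn <- V(pulldown)$name
deg_internal <- degree(pulldown, mode = "all")        # edges inside the pulldown
deg_full     <- degree(parent, v = vn, mode = "all")  # edges in the whole network

# Fraction of each protein's interactions that stay inside the pulldown
frac_internal <- deg_internal / deg_full
frac_internal
</pre>

Here SEED and P keep all of their interactions inside the pulldown (fraction 1), whereas HUB keeps only 1 of 4 (fraction 0.25) and would be removed by any cutoff above 0.25.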

5: Run the `filter_virtual_pulldown()` with these cutoffs: 0, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.
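
A minimal ggraph sketch on a toy graph is shown here as a starting point. The layout, colours and the <code>is_seed</code> vertex attribute are illustrative choices, not requirements of the exercise.

<pre>
library(igraph)
library(ggraph)  # also attaches ggplot2

# Toy graph: a ring of six proteins, two of them marked as seeds
toy_graph <- make_ring(6)
V(toy_graph)$name <- LETTERS[1:6]
V(toy_graph)$is_seed <- V(toy_graph)$name %in% c("A", "D")

set.seed(1)  # force-directed layouts are randomised
p <- ggraph(toy_graph, layout = "fr") +
  geom_edge_link(colour = "grey60") +
  geom_node_point(aes(colour = is_seed), size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
p
</pre>

Mapping a vertex attribute to colour in this way is one option for highlighting the seed proteins, similar to the yellow seeds in the figure in the introduction.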

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes using a 0.0 relevance filtering cutoff. This can be done with the "community_detection" function of DiscoNet:

<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>

'''REPORT QUESTION #3:'''
Examine the resulting communities. Which ones do you think may be molecular complexes, and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.

=== Functional classification ===
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora().

<pre>
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T11:16:45Z

Krivi: /* Heart disease proteins */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have used this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is the one shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum, with all 20,000+ proteins interacting with each other, is roughly 200 million pairwise interactions. Even the fraction for which we have experimental evidence is on the order of several hundred thousand interactions between 12,000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biological process (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two-step process. Step 1: Extract a sub-graph (aka virtual pulldown). Step 2: Filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
# require vertex attribute "name" in both graphs
if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
stop("Both graphs must have vertex 'name' attributes.")

mode <- "all"

vn <- V(subGraph)$name

deg_full <- degree(parentGraph, v = vn, mode = mode)
deg_internal <- degree(subGraph, mode = mode)

res <- data.frame(
node = vn,
deg_internal = deg_internal,
deg_full = deg_full,
frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
stringsAsFactors = FALSE
)

nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the comment on the first line of the function body). You should annotate what each section of the function does. ''Hint'': If needed, you can run parts of the code to see what the intermediate results are.

5: Run the `filter_virtual_pulldown()` with these cutoffs: 0, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes using a 0.0 relevance filtering cutoff. This can be done with the "community_detection" function of DiscoNet:

<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>

'''REPORT QUESTION #3:'''
Examine the resulting communities. Which ones do you think may be molecular complexes, and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.

=== Functional classification ===
For the next part, we'll try to identify the function of the proteins we have found by performing Gene Ontology over-representation analysis of sub-clusters within the network.

This can be done with the fgsea package.

Start by loading the background gene list:

<pre>
load("/home/projects/22140/exercise9.Rdata")
</pre>

Run fora on all potential protein complexes:

<pre>
library(fgsea)
library(msigdbr)
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)

fora(pathways = BP_list, genes = V(communities$communities[['COMMUNITY NUMBER']])$name, universe = all_gene_ids)
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Discuss the interpretation of the most significant results for each of the communities that could be protein complexes. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T11:15:46Z

Krivi: /* Heart disease proteins */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have used this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is the one shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum, with all 20,000+ proteins interacting with each other, is roughly 200 million pairwise interactions. Even the fraction for which we have experimental evidence is on the order of several hundred thousand interactions between 12,000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biological process (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two-step process. Step 1: Extract a sub-graph (aka virtual pulldown). Step 2: Filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
# require matching vertex names in both graphs
if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
stop("Both graphs must have vertex 'name' attributes.")

mode <- "all"

vn <- V(subGraph)$name

deg_full <- degree(parentGraph, v = vn, mode = mode)
deg_internal <- degree(subGraph, mode = mode)

res <- data.frame(
node = vn,
deg_internal = deg_internal,
deg_full = deg_full,
frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
stringsAsFactors = FALSE
)

nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

filteredSubGraph
}

</pre>

Start by annotating the function with comments (like the "# require matching vertex names in both graphs" comment at the start of the function). You should annotate what each section of the function does. ''Hint'': If needed you can run parts of the code to see what the intermediary results are.

5: Run the `filter_virtual_pulldown()` with these cutoffs: 0, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes using a 0.0 relevance filtering cutoff. This can be done with the "community_detection" function of DiscoNet:

<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>

'''REPORT QUESTION #3:'''
Examine the resulting communities. Which ones do you think may be molecular complexes, and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.

=== Functional classification ===
For the next part, we'll try to identify the function of the proteins we have found by performing Gene Ontology over-representation analysis of sub-clusters within the network.

This can be done with the fgsea package.

Start by loading the background gene list:

<pre>
load("/home/projects/22140/exercise9.Rdata")
</pre>

Run fora on all potential protein complexes:

<pre>
library(fgsea)
library(msigdbr)
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)

fora(pathways = BP_list, genes = V(communities$communities[['COMMUNITY NUMBER']])$name, universe = all_gene_ids)
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Discuss the interpretation of the most significant results for each of the communities that could be protein complexes. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T11:13:56Z

Krivi: /* Heart disease proteins */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have used this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is the one shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum, with all 20,000+ proteins interacting with each other, is roughly 200 million pairwise interactions. Even the fraction for which we have experimental evidence is on the order of several hundred thousand interactions between 12,000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biological process (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

1: First you need to load the packages and data:

<pre>
# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

3: Next we will do the virtual pulldown as a two-step process. Step 1: Extract a sub-graph (aka virtual pulldown). Step 2: Filter the virtual pulldown (as discussed in the lecture). For step 1:

* You can use the `ego()` function to get neighbors of all seed genes. Here you need to specify ''mode = "all"''.
** Hint: The unlist() function turns a list into a vector while preserving all entries
* You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.

4: For step 2 you can use this function:

<pre>
filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
# require matching vertex names in both graphs
if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
stop("Both graphs must have vertex 'name' attributes.")

mode <- "all"

vn <- V(subGraph)$name

deg_full <- degree(parentGraph, v = vn, mode = mode)
deg_internal <- degree(subGraph, mode = mode)

res <- data.frame(
node = vn,
deg_internal = deg_internal,
deg_full = deg_full,
frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
stringsAsFactors = FALSE
)

nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]

filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])

V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]

filteredSubGraph
}

</pre>

Start by annotating the code with comments (like the "# require matching vertex names in both graphs" comment). You should annotate what each part of the code does. If needed, you can run parts of the code to see what the intermediate results are.

5: Try running with these cutoffs: 0, 0.5, 1

* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
* How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes using a 0.0 relevance filtering cutoff. This can be done with the "community_detection" function of DiscoNet:

<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>

'''REPORT QUESTION #3:'''
Examine the resulting communities. Which ones do you think may be molecular complexes, and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.

=== Functional classification ===
For the next part, we'll try to identify the function of the proteins we have found by performing Gene Ontology over-representation analysis of sub-clusters within the network.

This can be done with the fgsea package.

Start by loading the background gene list:

<pre>
load("/home/projects/22140/exercise9.Rdata")
</pre>

Run fora on all potential protein complexes:

<pre>
library(fgsea)
library(msigdbr)
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)

fora(pathways = BP_list, genes = V(communities$communities[['COMMUNITY NUMBER']])$name, universe = all_gene_ids)
</pre>
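
Over-representation analysis of this kind is based on a hypergeometric test. To make the reported p-value less of a black box, here is a small sketch of that calculation for a single gene set in base R. All numbers are made up for illustration.

<pre>
# Hypothetical toy numbers
universe_size <- 1000  # genes in the background list
pathway_size  <- 50    # background genes annotated to the GO term
query_size    <- 20    # genes in the community being tested
overlap       <- 5     # community genes annotated to the GO term

# Probability of seeing an overlap of 5 or more when drawing
# 20 genes at random (without replacement) from the background
p_value <- phyper(overlap - 1,
                  pathway_size,
                  universe_size - pathway_size,
                  query_size,
                  lower.tail = FALSE)
p_value
</pre>

By chance we would expect about 20 * 50 / 1000 = 1 annotated gene in the community, so an overlap of 5 gives a small p-value.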

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Discuss the interpretation of the most significant results for each of the communities that could be protein complexes. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T10:50:11Z

Krivi: /* Heart disease proteins */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is as shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (about 200 million interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.
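
The figure of roughly 200 million comes from counting every unordered pair of proteins once. As a quick check, assuming exactly 20,000 proteins and ignoring self-interactions:

<pre>
n_proteins <- 20000

# Number of distinct unordered protein pairs: n * (n - 1) / 2
max_interactions <- choose(n_proteins, 2)
max_interactions  # 199990000, i.e. about 200 million
</pre>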

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As we covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biology (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

# First you need to load the packages and data:

<pre>
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

# Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

# Next we will do the virtual pulldown as a two-step process. Step 1: extract the subgraph. Step 2: filter the subgraph. Let's start with step 1. You can use the <code>ego()</code> function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.
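
The sketch below illustrates step 1 on a hypothetical toy graph. The node names and the <code>toy_seeds</code> vector are made up, and this is just one possible way to combine <code>ego()</code> with <code>induced_subgraph()</code>, not necessarily the intended solution.

<pre>
library(igraph)

# Hypothetical toy network with two "seed" nodes, S1 and S2
toy <- graph_from_literal(S1-A, S1-B, S2-B, S2-C, C-D, D-E)
toy_seeds <- c("S1", "S2")

# ego() returns one vertex sequence per seed: the seed plus its 1st order neighbors
neighborhoods <- ego(toy, order = 1, nodes = toy_seeds, mode = "all")

# Pool the vertex names and extract the subgraph spanned by them
keep <- unique(unlist(lapply(neighborhoods, as_ids)))
pulldown <- induced_subgraph(toy, vids = keep)

V(pulldown)$name   # S1, S2 and their direct partners A, B and C
ecount(pulldown)   # 4 edges; the C-D and D-E edges are left out
</pre>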

# Convert the network into igraph objects with the following relevance score cutoffs: 0, 0.5, 1

<pre>
g <- graph_from_data_frame(interactions, directed = FALSE, vertices = node_attributes)
g1 <- relevance_filtering(g, 0)
g2 <- relevance_filtering(g, 0.5)
g3 <- relevance_filtering(g, 1)
</pre>

# Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
# How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes using a 0.0 relevance filtering cutoff. This can be done with the "community_detection" function of DiscoNet:

<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>

'''REPORT QUESTION #3:'''
Examine the resulting communities. Which one do you think may be molecular complexes and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.

=== Functional classification ===
For the next part, we'll try to identify the function of the proteins we have found by performing Gene Ontology over-representation analysis of sub-clusters within the network.

This can be done with the fgsea package.

Start by loading the background gene list:

<pre>
load("/home/projects/22140/exercise9.Rdata")
</pre>

Run fora on all potential protein complexes:

<pre>
library(fgsea)
library(msigdbr)
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)

fora(pathways = BP_list, genes = V(communities$communities[['COMMUNITY NUMBER']])$name, universe = all_gene_ids)
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Discuss the interpretation of the most significant results for each of the communities that could be protein complexes. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T10:48:16Z

Krivi: /* Heart disease proteins */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is as shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (about 200 million interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As we covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biology (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''

# First you need to load the packages and data:

<pre>
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

# Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data

# Next we will do the virtual pulldown as a two-step process. Step 1: extract the subgraph. Step 2: filter the subgraph. Let's start with step 1. You can use the <code>ego()</code> function to get the neighbors of all seed genes. Here you need to specify ''mode = "all"''.

# Convert the network into igraph objects with the following relevance score cutoffs: 0, 0.5, 1

<pre>
g <- graph_from_data_frame(interactions, directed = FALSE, vertices = node_attributes)
g1 <- relevance_filtering(g, 0)
g2 <- relevance_filtering(g, 0.5)
g3 <- relevance_filtering(g, 1)
</pre>

# Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
# How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes using a 0.0 relevance filtering cutoff. This can be done with the "community_detection" function of DiscoNet:

<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>

'''REPORT QUESTION #3:'''
Examine the resulting communities. Which one do you think may be molecular complexes and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.

=== Functional classification ===
For the next part, we'll try to identify the function of the proteins we have found by performing Gene Ontology over-representation analysis of sub-clusters within the network.

This can be done with the fgsea package.

Start by loading the background gene list:

<pre>
load("/home/projects/22140/exercise9.Rdata")
</pre>

Run fora on all potential protein complexes:

<pre>
library(fgsea)
library(msigdbr)
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)

fora(pathways = BP_list, genes = V(communities$communities[['COMMUNITY NUMBER']])$name, universe = all_gene_ids)
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Discuss the interpretation of the most significant results for each of the communities that could be protein complexes. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T10:29:01Z

Krivi: /* Heart disease proteins */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is as shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (about 200 million interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As we covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biology (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''
# Load the packages
<pre>
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

# Run DiscoNet with this list of proteins with the following parameters:

<pre>
network_ex2 <- virtual_pulldown(seed_nodes = seeds, database = db, id_type = "hgnc", zs_confidence_score = 0.156)
interactions <- data.frame(network_ex2$network)
node_attributes <- data.frame(network_ex2$node_attributes)
</pre>

# Convert the network into igraph objects with the following relevance score cutoffs: 0, 0.5, 1

<pre>
g <- graph_from_data_frame(interactions, directed = FALSE, vertices = node_attributes)
g1 <- relevance_filtering(g, 0)
g2 <- relevance_filtering(g, 0.5)
g3 <- relevance_filtering(g, 1)
</pre>

# Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
# How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes using a 0.0 relevance filtering cutoff. This can be done with the "community_detection" function of DiscoNet:

<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>

'''REPORT QUESTION #3:'''
Examine the resulting communities. Which one do you think may be molecular complexes and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.

=== Functional classification ===
For the next part, we'll try to identify the function of the proteins we have found by performing Gene Ontology over-representation analysis of sub-clusters within the network.

This can be done with the fgsea package.

Start by loading the background gene list:

<pre>
load("/home/projects/22140/exercise9.Rdata")
</pre>

Run fora on all potential protein complexes:

<pre>
library(fgsea)
library(msigdbr)
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)

fora(pathways = BP_list, genes = V(communities$communities[['COMMUNITY NUMBER']])$name, universe = all_gene_ids)
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Discuss the interpretation of the most significant results for each of the communities that could be protein complexes. Do they make biological sense in the context of heart disease?

Heart Disease and Virtual Pulldown

2025-09-17T10:27:43Z

Krivi: /* Heart disease proteins */

= Human diseases / virtual pulldown exercise =
'''Exercise written by:''' Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

'''Learning objectives:'''
* Overall objective: learn how to extract meaningful networks from human PPI data
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R

== Introduction ==
[[Image:Heart+networks.png|600px|right|thumb|'''Click to zoom''' - example of the type of network characterization we aim at doing in this exercise. Seed proteins are colored yellow. (From Lage ''et al'', 2010)]]

=== The network neighborhood as an indication of function ===
As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have been using this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a '''group''' of potentially associated proteins.
=== Disease gene/protein networks ===
The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

More commonly, the situation is as shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several '''sub-networks''' describing different components of the disease.

There are several benefits from this type of analysis - most importantly:
* Identification of novel disease-related genes/proteins
* Generating hypotheses about the molecular biology behind a disease

=== Knowing where to look ===
The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (about 200 million interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the '''close neighborhood''' of the proteins.

In this exercise we'll be working with two different approaches to this:
# Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
# Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins
 

== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As we covered in much more detail in the lecture, InWeb was built by:
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network

For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:

# Input: a list of proteins believed to be involved in a particular biology (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented

=== Heart disease proteins ===
We'll start out with a set of proteins known to be involved in '''[https://en.wikipedia.org/wiki/Atrioventricular_septal_defect Abnormal atrioventricular canal morphology]''' ("AACM"):

<pre>
seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")
</pre>

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #1:'''
# Load the packages
<pre>
library(DiscoNet)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
</pre>

# Run DiscoNet with this list of proteins with the following parameters:

<pre>
network_ex2 <- virtual_pulldown(seed_nodes = seeds, database = db, id_type = "hgnc", zs_confidence_score = 0.156)
interactions <- data.frame(network_ex2$network)
node_attributes <- data.frame(network_ex2$node_attributes)
</pre>

# Convert the network into igraph objects with the following relevance score cutoffs: 0, 0.5, 1

<pre>
g <- graph_from_data_frame(interactions, directed = FALSE, vertices = node_attributes)
g1 <- relevance_filtering(g, 0)
g2 <- relevance_filtering(g, 0.5)
g3 <- relevance_filtering(g, 1)
</pre>

# Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
# How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain difference.

=== Visualizing networks ===

'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report

=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes using a 0.0 relevance filtering cutoff. This can be done with the "community_detection" function of DiscoNet:

<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>

'''REPORT QUESTION #3:'''
Examine the resulting communities. Which one do you think may be molecular complexes and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.

=== Functional classification ===
For the next part, we'll try to identify the function of the proteins we have found by performing Gene Ontology over-representation analysis of sub-clusters within the network.

This can be done with the fgsea package.

Start by loading the background gene list:

<pre>
load("/home/projects/22140/exercise9.Rdata")
</pre>

Run fora on all potential protein complexes:

<pre>
library(fgsea)
library(msigdbr)
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)

fora(pathways = BP_list, genes = V(communities$communities[['COMMUNITY NUMBER']])$name, universe = all_gene_ids)
</pre>
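
To run the analysis on several communities in one go, the call can be wrapped in <code>lapply()</code>. The sketch below uses made-up gene sets, a made-up background and communities represented as plain character vectors, so every identifier in it is hypothetical.

<pre>
library(fgsea)

# Toy background and gene sets (all identifiers are made up)
universe <- paste0("G", 1:200)
gene_sets <- list(
  heart_development = paste0("G", 1:20),
  ion_transport     = paste0("G", 101:130)
)

# Toy communities given as character vectors of member names
toy_members <- list(
  community_1 = paste0("G", c(1:6, 150)),
  community_2 = paste0("G", c(101:105, 160, 161))
)

# Run fora() once per community and keep the most significant gene set
top_hits <- lapply(toy_members, function(genes) {
  res <- fora(pathways = gene_sets, genes = genes, universe = universe)
  as.data.frame(res)[1, c("pathway", "pval", "padj", "overlap", "size")]
})
print(top_hits)
</pre>

On the exercise objects, the member names of community <code>i</code> would come from <code>V(communities$communities[[i]])$name</code>, as in the call above.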

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
* Discuss the interpretation of the most significant results for each of the communities that could be protein complexes. Do they make biological sense in the context of heart disease?

DiscoNet

2025-09-17T10:26:50Z

Krivi: Krivi moved page DiscoNet to Heart Disease and Virtual Pulldown: renaming

#REDIRECT [[Heart Disease and Virtual Pulldown]]

ExGeneOntology R

2025-09-17T10:16:10Z

Krivi: /* Enrichment analysis - reexamining cluster #1 */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, much in the same way as gene IDs in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division"; GO:0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.
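
The idea of inherited '''IS A''' relationships can be sketched as a small directed graph in R. The toy hierarchy below contains only the simplified relationships mentioned in this exercise (the real ontology has many more intermediate terms), and the <code>ancestors()</code> helper is a hypothetical name introduced for this illustration.

<pre>
library(igraph)

# Simplified IS A relationships, pointing from child term to parent term
is_a <- data.frame(
  child  = c("DNA replication", "biosynthetic process", "nucleus", "organelle"),
  parent = c("biosynthetic process", "biological process", "organelle", "cellular component")
)
go_toy <- graph_from_data_frame(is_a, directed = TRUE)

# All terms reachable by following IS A edges upwards from a term
ancestors <- function(term) {
  setdiff(as_ids(subcomponent(go_toy, term, mode = "out")), term)
}

ancestors("DNA replication")     # inherits from both terms above it
ancestors("biological process")  # a root term has no ancestors
</pre>

Anything annotated with a child term is implicitly annotated with all of its ancestors, but not the other way around.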

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically, crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN (accession: P28340).
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similarly sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression, all genes that had the possibility of ending up in the final list (e.g. all genes tested for statistical significance). We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n)
# Compare this to the observed number in the study group
# The '''enrichment''' is then calculated as the ratio observed/expected
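
With made-up numbers (deliberately not those of this exercise), the calculation could be sketched in R as follows. All variable names are chosen for this illustration only.

<pre>
# Hypothetical numbers, NOT the ones from this exercise
population_size <- 1000  # Y: all genes in the background
population_with <- 50    # X: background genes annotated with the term
study_size      <- 20    # n: genes in the study group
study_with      <- 5     # observed: study group genes annotated with the term

fx         <- population_with / population_size  # frequency in the population
expected   <- fx * study_size                    # expected count in the study group
enrichment <- study_with / expected              # observed / expected

c(frequency = fx, expected = expected, enrichment = enrichment)
</pre>

Here the term is found five times more often in the study group than expected by chance. Whether that is statistically significant is a separate question.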

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins annotated with each of the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7:''' Assuming a total of '''5500''' annotated genes (background for this exercise) calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
# What will happen to the p-value and odds ratio (calculated by <code>fisher.test()</code>) if the background were 500 genes or 100,000 genes? Calculate the results and comment on them. What does this mean for the choice of background?
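
As a hint for setting up <code>fisher.test()</code>, the sketch below builds the 2x2 table for a single GO term from made-up counts (again, not the numbers of this exercise). The one-sided alternative is one reasonable choice when testing for over-representation specifically.

<pre>
# Hypothetical numbers, NOT the ones from this exercise
population_size <- 1000  # all genes in the background
population_with <- 50    # background genes annotated with the term
study_size      <- 20    # genes in the study group
study_with      <- 5     # study group genes annotated with the term

# Rows: annotated with the term or not; columns: in the study group or not
tab <- matrix(
  c(study_with,              population_with - study_with,
    study_size - study_with, population_size - population_with - (study_size - study_with)),
  nrow = 2, byrow = TRUE,
  dimnames = list(term  = c("with", "without"),
                  group = c("study", "rest"))
)
print(tab)

ft <- fisher.test(tab, alternative = "greater")
ft$p.value   # probability of seeing at least this many annotated genes by chance
ft$estimate  # odds ratio
</pre>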



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an over-representation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.
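
To get a feeling for what multiple testing correction does, the sketch below applies the Benjamini-Hochberg (BH) method to a few made-up p-values with base R's <code>p.adjust()</code>. To our understanding of the fgsea documentation, the <code>padj</code> column returned by "fora" is BH-adjusted in the same way, but treat that as an assumption to verify with <code>?fora</code>.

<pre>
# Made-up raw p-values, e.g. one per tested gene set
raw_p <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment controls the false discovery rate
adj_p <- p.adjust(raw_p, method = "BH")

data.frame(raw_p = raw_p, adj_p = adj_p, significant = adj_p < 0.05)
</pre>

Note that the fourth test has a raw p-value below 0.05, but is no longer below 0.05 after adjustment.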

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the second .Rdata file loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #8:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
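
What <code>split()</code> does is easier to see on a toy example. The sketch below uses a made-up three-row data frame with the same two column names as above (the gene and gene set names are placeholders, not real identifiers) and turns it into a named list with one character vector of genes per gene set.

<pre style="overflow:auto;">
# Toy stand-in for the msigdbr table (names are invented for illustration)
toy_df <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B"),
  ensembl_gene = c("gene1", "gene2", "gene1"),
  stringsAsFactors = FALSE
)

# One list element per gene set, each holding that set's gene identifiers
toy_list <- split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

str(toy_list)
</pre>

Note that a gene can appear in several gene sets, as "gene1" does here.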

'''Running "fora"'''
Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.
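
As a rough guide to what the call looks like, here is a minimal sketch on made-up data. The gene names, gene sets and object names (<code>toy_sets</code>, <code>toy_target</code>, <code>toy_background</code>) are placeholders; in the exercise you would pass BP_list, your cluster 1 gene identifiers and the background list instead.

<pre style="overflow:auto;">
library(fgsea)

# Made-up universe of 100 genes and two made-up gene sets
toy_background <- paste0("g", 1:100)
toy_sets <- list(
  SET_A = paste0("g", 1:10),
  SET_B = paste0("g", 50:70)
)

# Made-up target list: mostly SET_A genes plus one SET_B gene
toy_target <- c(paste0("g", 1:8), "g60")

# Over representation analysis of every gene set against the background
toy_res <- fora(pathways = toy_sets, genes = toy_target, universe = toy_background)

print(toy_res)
</pre>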

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
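
One possible way to do this is sketched below on a made-up results table. It assumes the result has <code>overlap</code> and <code>size</code> columns, and that <code>n_target</code> and <code>n_background</code> (hypothetical names) hold the number of target and background genes. Enrichment is computed as observed/expected.

<pre style="overflow:auto;">
# Made-up stand-in for a "fora" results table
toy_res <- data.frame(
  pathway = c("SET_A", "SET_B"),
  overlap = c(8, 1),    # target genes found in the gene set (observed)
  size    = c(10, 21)   # genes in the gene set
)

n_target     <- 9    # number of genes in the target list
n_background <- 100  # number of genes in the background list

# Expected overlap if the target genes were drawn at random from the background
toy_res$expected <- toy_res$size / n_background * n_target

# Enrichment = observed / expected
toy_res$enrichment <- toy_res$overlap / toy_res$expected

print(toy_res)
</pre>

When applying this to real results, count only the target genes that are actually present in the background list.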

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the other clusters in the interaction data. Randomly pick another cluster.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #10:''' Perform over-representation analysis for each of the following and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** What is the likely function of this cluster?

ExGeneOntology R

2025-09-11T12:13:27Z

Krivi: /* Enrichment analysis - reexamining cluster #1 */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way as gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe the gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.
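
The way properties are inherited along '''IS A''' links can be sketched in a few lines of R. The example below is a toy model, not a real GO or NCBI query: <code>parent_of</code> is a hand-written lookup built from part of the human lineage above, and <code>ancestors()</code> is a hypothetical helper that walks up the chain.

<pre style="overflow:auto;">
# Toy child -> parent lookup (hand-written from the lineage shown above)
parent_of <- c(
  Homo      = "Hominidae",
  Hominidae = "Primates",
  Primates  = "Mammalia"
)

# Hypothetical helper: follow IS A links upwards until no parent is recorded
ancestors <- function(term) {
  found <- character(0)
  while (term %in% names(parent_of)) {
    term  <- parent_of[[term]]
    found <- c(found, term)
  }
  found
}

ancestors("Homo")      # every Homo is also a member of each of these groups
ancestors("Mammalia")  # nothing above Mammalia in this toy lookup
</pre>

Real GO terms can have several parents, so a full implementation would need to follow every branch rather than a single chain.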

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the sister chromatids can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you, for other reasons, expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes that had the possibility of ending up in the final list (e.g. all genes tested for statistical significance). We can then ask if the frequency of proteins annotated to a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps are simply to:
# Calculate the frequency across the entire '''population group''' (X number of genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.
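
The four steps can be sketched in R as follows. The numbers are made up for illustration and are deliberately not the ones used in the exercise; all variable names are placeholders.

<pre style="overflow:auto;">
# Made-up example numbers
Y <- 1000      # population group size
X <- 50        # genes in the population with the characteristic
n <- 20        # study group size
observed <- 5  # genes in the study group with the characteristic

FX <- X / Y                        # step 1: frequency in the population
expected <- FX * n                 # step 2: expected number in the study group
enrichment <- observed / expected  # steps 3-4: compare observed to expected

cat("frequency:", FX, "expected:", expected, "enrichment:", enrichment, "\n")
</pre>

An enrichment above 1 means the characteristic is seen more often in the study group than expected by chance; whether the difference is statistically significant is a separate question.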

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7:''' Assuming a total of '''5500''' annotated genes (background for this exercise) calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
# What will happen to the P-value and odds ratio (calculated by <code>fisher.test()</code>) if the background were 500 genes, or 100,000 genes? What does this mean for the choice of background?
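
If you have not used <code>fisher.test()</code> before, the sketch below shows one way to set up the 2x2 table it expects. The counts are made up (they are not the exercise numbers), the variable names are placeholders, and the one-sided <code>alternative = "greater"</code> is an assumption that fits a test for over representation.

<pre style="overflow:auto;">
# Made-up example numbers
N <- 1000  # population group size
K <- 50    # genes in the population annotated with the term
n <- 20    # study group size
k <- 5     # genes in the study group annotated with the term

# Rows: in study group / not in study group; columns: with term / without term
tab <- matrix(c(k,     n - k,
                K - k, (N - K) - (n - k)),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("study", "rest"), c("term", "no_term")))

res <- fisher.test(tab, alternative = "greater")
res$p.value    # one-sided p-value for over representation
res$estimate   # odds ratio estimated by fisher.test()

# The same one-sided p-value from the hypergeometric distribution
p_hyper <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)
</pre>

Changing <code>N</code> only changes the bottom-right cell of the table, which is a convenient way to explore different background sizes.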



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll use an automated tool to compare an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an overrepresentation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the second .Rdata file loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #8:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''
Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.
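
Selecting the top terms can be done with ordinary subsetting. The sketch below uses a made-up results table with <code>pathway</code> and <code>padj</code> columns and an assumed significance cutoff of 0.05; adjust both to your own results.

<pre style="overflow:auto;">
# Made-up stand-in for a results table with 15 gene sets
set.seed(1)
toy_res <- data.frame(
  pathway = paste0("SET_", 1:15),
  padj    = c(runif(12, 0, 0.04), 0.2, 0.5, 0.9)
)

# Keep significant gene sets, sort by adjusted p-value, take at most 10
sig   <- toy_res[toy_res$padj < 0.05, ]
top10 <- head(sig[order(sig$padj), ], 10)

print(top10)
</pre>

If fewer than 10 gene sets pass the cutoff, <code>head()</code> simply returns all of them.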

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the other clusters in the interaction data. Randomly pick another cluster.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #10:''' Perform over-representation analysis for each of the following and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** What is the likely function of this cluster?

ExGeneOntology R

2025-09-11T11:52:07Z

Krivi: /* Enrichment analysis - reexamining cluster #1 */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way as gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe the gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the sister chromatids can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes that had the possibility of ending up in the final list (e.g. all genes tested for statistical significance). We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number of genes/proteins with the characteristic in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.
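
The four steps above can be sketched in R as follows. The numbers are hypothetical and chosen only for illustration (they are not the answer to any report question), and the variable names are our own.

<pre style="overflow:auto;">
# Hypothetical numbers for illustration only
Y     = 1000  # population group size (all annotated genes)
X     = 50    # genes in the population annotated with the GO term
n     = 20    # study group size
k_obs = 5     # genes in the study group annotated with the GO term

FX         = X / Y             # step 1: frequency in the population
expected   = FX * n            # step 2: expected number in the study group
enrichment = k_obs / expected  # steps 3-4: observed / expected

FX          # 0.05
expected    # 1
enrichment  # 5
</pre>

An enrichment above 1 means the characteristic is seen more often in the study group than expected by chance; whether the difference is statistically significant is a separate question.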

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7:''' Assuming a total of '''5500''' annotated genes (background for this exercise) calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
# What will happen to the p-value and odds ratio (calculated by <code>fisher.test()</code>) if the background were 1,000 genes? Or 100,000 genes?
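
As a hedged sketch of how <code>fisher.test()</code> can be applied to this kind of question: the counts are arranged in a 2x2 contingency table (study group vs. rest of the population, annotated vs. not annotated). The counts below are hypothetical, the variable names are our own, and the one-sided alternative is our choice for testing over-representation.

<pre style="overflow:auto;">
# Hypothetical counts for illustration only
Y     = 1000  # population group size
X     = 50    # annotated genes in the population
n     = 20    # study group size
k_obs = 5     # annotated genes in the study group

tab = matrix(
  c(k_obs,     n - k_obs,               # study group: annotated, not annotated
    X - k_obs, (Y - n) - (X - k_obs)),  # rest of population: annotated, not annotated
  nrow = 2, byrow = TRUE,
  dimnames = list(group = c("study", "rest"),
                  term  = c("annotated", "not_annotated"))
)

res = fisher.test(tab, alternative = "greater")  # one-sided: over-representation
res$p.value   # p-value of the test
res$estimate  # estimated odds ratio
</pre>

Changing <code>Y</code> while keeping the other counts fixed is a quick way to explore how the background size influences the result.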



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an over-representation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large-scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the <code>exercise3_part2.Rdata</code> file loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #8:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
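
To see what the one-liner does, here is a minimal sketch on a toy data frame. The gene and gene set names are made up; only the column names <code>gs_name</code> and <code>ensembl_gene</code> are taken from the command above.

<pre style="overflow:auto;">
# Toy stand-in for the msigdbr output (made-up names)
toy_df = data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  ensembl_gene = c("gene1", "gene2", "gene2", "gene3", "gene4"),
  stringsAsFactors = FALSE
)

# split() groups the gene identifiers by gene set name
toy_list = split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

names(toy_list)  # "SET_A" "SET_B"
toy_list$SET_B   # "gene2" "gene3" "gene4"
</pre>

The result has one element per gene set, each holding the identifiers of the genes in that set.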

'''Running "fora"'''
Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Run "fora" with the gene sets, the target list and the background list, and take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
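
One possible way to do this, sketched on a toy table: the columns <code>overlap</code> and <code>size</code> mimic the overlap and gene set size described above (check the column names of your own results table), while the numbers and the names <code>n_target</code> and <code>n_background</code> are hypothetical.

<pre style="overflow:auto;">
# Toy stand-in for a "fora" results table (hypothetical numbers)
toy_res = data.frame(
  pathway = c("SET_A", "SET_B"),
  overlap = c(5, 2),     # target genes found in the gene set
  size    = c(50, 200),  # gene set size
  stringsAsFactors = FALSE
)
n_target     = 20    # number of genes in the target list
n_background = 1000  # number of genes in the background list

# enrichment = observed / expected,
# with expected = (size / n_background) * n_target
toy_res$expected   = toy_res$size / n_background * n_target
toy_res$enrichment = toy_res$overlap / toy_res$expected
toy_res
</pre>

In this toy table SET_A is enriched (enrichment 5) while SET_B is depleted (enrichment 0.5).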

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing if the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the other clusters in the interaction data. Randomly pick another cluster.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #10:''' perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** What is the likely function of this cluster?

ExGeneOntology R

2025-09-11T11:50:16Z

Krivi: /* Enrichment analysis - reexamining cluster #1 */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way as genes have IDs in Ensembl (e.g. ENSG00000141510) and proteins have identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many child terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.
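
The idea of inherited properties can be sketched in a few lines of R. This is a toy illustration only: the <code>parent</code> vector holds a shortened version of the lineages above, and the helper <code>ancestors()</code> is our own, not part of any GO package. Note also that a real GO term can have several parents, which this simple sketch does not handle.

<pre style="overflow:auto;">
# Toy IS A hierarchy: each term points to its single parent
parent = c(
  Homo      = "Hominidae",
  Hominidae = "Primates",
  Primates  = "Mammalia",
  Mus       = "Muroidea",
  Muroidea  = "Mammalia"
)

# Walk up the hierarchy and collect all ancestors of a term
ancestors = function(term) {
  out = character(0)
  while (term %in% names(parent)) {
    term = parent[[term]]
    out  = c(out, term)
  }
  out
}

ancestors("Homo")                 # "Hominidae" "Primates" "Mammalia"
"Primates" %in% ancestors("Mus")  # FALSE
</pre>

In the same way, a gene annotated with a specific GO term is implicitly also annotated with all ancestors of that term.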

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the non-sister chromatids of homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "chromosome crossover" as well (technically, crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes that had the possibility of ending up in the final list (e.g. all genes tested for statistical significance). We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number of genes/proteins with the characteristic in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7:''' Assuming a total of '''5500''' annotated genes (background for this exercise) calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
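
As background for the p-value: a one-sided Fisher's exact test for over-representation gives the same p-value as the upper tail of the hypergeometric distribution, i.e. the probability of drawing at least the observed number of annotated genes in a random sample of the same size. A minimal sketch with hypothetical counts (the variable names are our own):

<pre style="overflow:auto;">
# Hypothetical counts for illustration only
Y       = 1000  # population group size
X       = 50    # annotated genes in the population
n_study = 20    # study group size
k_obs   = 5     # annotated genes in the study group

# P(at least k_obs annotated genes in a random draw of n_study genes)
p_hyper = phyper(k_obs - 1, m = X, n = Y - X, k = n_study, lower.tail = FALSE)

# The same question asked with a one-sided Fisher's exact test
tab = matrix(c(k_obs,     n_study - k_obs,
               X - k_obs, (Y - n_study) - (X - k_obs)),
             nrow = 2, byrow = TRUE)
p_fisher = fisher.test(tab, alternative = "greater")$p.value

all.equal(p_hyper, p_fisher)
</pre>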



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an over-representation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large-scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the <code>exercise3_part2.Rdata</code> file loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #8:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''
Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.
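
As a hedged sketch of what a call to <code>fora()</code> looks like, here is a toy example. All identifiers and the object names <code>toy_sets</code>, <code>toy_target</code> and <code>toy_background</code> are made up for illustration; in the exercise you would supply your own gene sets, target list and background list instead.

<pre style="overflow:auto;">
library(fgsea)

# Toy inputs (made-up identifiers)
toy_sets = list(
  SET_A = c("g1", "g2", "g3", "g4"),
  SET_B = c("g5", "g6", "g7", "g8", "g9", "g10")
)
toy_background = paste0("g", 1:100)          # all genes that could have been picked
toy_target     = c("g1", "g2", "g3", "g50")  # genes of interest

# One over-representation test per gene set
toy_res = fora(pathways = toy_sets,
               genes    = toy_target,
               universe = toy_background)
toy_res
</pre>

Note that all three inputs must use the same type of gene identifier, otherwise the overlaps will be empty.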

Run "fora" with the gene sets, the target list and the background list, and take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing if the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the other clusters in the interaction data. Randomly pick another cluster.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #10:''' perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** What is the likely function of this cluster?

ExGeneOntology R

2025-09-11T11:49:13Z

Krivi: /* Part 2: Gene Ontology overrepresentation analysis (ORA) */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way as genes have IDs in Ensembl (e.g. ENSG00000141510) and proteins have identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: from this we can see that all humans are primates, and all primates are mammals, but NOT all mammals are human.
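
The inheritance along '''IS A''' relationships can be sketched in a few lines of R. This is a minimal illustration only: the object <code>parent_of</code> and the function <code>ancestors()</code> are hypothetical names made up for this example, and the toy hierarchy assumes a single parent per term (real GO terms can have several parents).

<pre style="overflow:auto;">
# Toy IS A hierarchy taken from the human lineage above:
# each name is a term, each value is its direct parent (assumed single parent)
parent_of = c(Homo      = "Hominidae",
              Hominidae = "Primates",
              Primates  = "Mammalia",
              Mammalia  = "Vertebrata")

# Hypothetical helper: walk upwards until a term has no recorded parent
ancestors = function(term, parents) {
  found = character(0)
  while (term %in% names(parents)) {
    term  = parents[[term]]
    found = c(found, term)
  }
  found
}

ancestors("Homo", parent_of)
"Mammalia" %in% ancestors("Homo", parent_of)   # TRUE: all humans are mammals
"Homo" %in% ancestors("Mammalia", parent_of)   # FALSE: not all mammals are human
</pre>

The same logic is what makes a gene annotated with a specific GO term count as annotated with all of its ancestor terms.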

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the non-sister chromatids of homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA)=

In this part of the exercise, we focus on performing analysis for over-representation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similarly sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes that had the possibility of ending up in the final list (e.g. all genes tested for statistical significance). We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.
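
The steps above can be sketched in R as follows. All numbers here are hypothetical and chosen only to make the arithmetic easy to follow (they are not the values from the exercise), and the variable names are our own.

<pre style="overflow:auto;">
# Hypothetical numbers, for illustration only (not the exercise data)
population_size = 1000   # Y: genes in the population group
term_in_pop     = 50     # X: population genes with the characteristic
study_size      = 20     # n: genes in the study group
observed        = 5      # study-group genes with the characteristic

freq_pop   = term_in_pop / population_size   # step 1: FX = X/Y
expected   = freq_pop * study_size           # step 2: exp = FX * n
enrichment = observed / expected             # step 4: observed/expected

c(frequency = freq_pop, expected = expected, enrichment = enrichment)
</pre>

An enrichment above 1 means the characteristic is seen more often in the study group than a random draw of the same size would suggest; whether that difference is statistically meaningful is a separate question.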

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins annotated with each of the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7:''' Assuming a total of '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
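
As a hint for the last point, the sketch below shows one way to lay out the 2x2 table that <code>fisher.test()</code> expects. The numbers are hypothetical (not the exercise data), and the choice of a one-sided test (<code>alternative = "greater"</code>) is our assumption about what "over-representation" should mean here.

<pre style="overflow:auto;">
# Hypothetical numbers, for illustration only (not the exercise data)
population_size = 1000   # genes in the population group
term_in_pop     = 50     # population genes annotated with the term
study_size      = 20     # genes in the study group
observed        = 5      # study-group genes annotated with the term

# Rows: inside / outside the study group. Columns: with / without the term.
contingency = matrix(
  c(observed,               study_size - observed,
    term_in_pop - observed, population_size - term_in_pop - (study_size - observed)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("study group", "rest of population"),
                  c("has term", "lacks term"))
)
contingency

# One-sided test: is the term seen MORE often than expected in the study group?
result = fisher.test(contingency, alternative = "greater")
result$p.value
</pre>

Note that the four cells must add up to the population group size, so each gene is counted exactly once.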



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an over-representation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will re-visit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise3_part2.Rdata file loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above?

'''TASK/REPORT QUESTION #8:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
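
To see what this one-liner does, it can help to try <code>split()</code> on a tiny table first. The sketch below uses a made-up data frame, <code>toy_df</code>, that only mimics the two columns used above; the set names and gene names are invented.

<pre style="overflow:auto;">
# Made-up long-format table: one row per (gene set, gene) pair
toy_df = data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  ensembl_gene = c("gene1", "gene2", "gene2", "gene3", "gene4"),
  stringsAsFactors = FALSE
)

# Same pattern as above: group the gene column by the gene set name
toy_list = split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

class(toy_df)     # a table
class(toy_list)   # a named list, one character vector per gene set
toy_list
</pre>

Each element of the resulting list is named after a gene set and holds the genes belonging to it. Note that a gene (here "gene2") can appear in more than one set.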

'''Running "fora"'''

Take a look at the background list and any one of the gene sets in BP_list. Then, extract the cluster 1 genes from the node annotation table as your target list, taking care to use the gene identifier that matches the identifier used in the gene sets and background. Finally, run <code>fora()</code> with the gene sets, the target list and the background list as input.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing if the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the other clusters in the interaction data. Randomly pick another cluster.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #10:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** What is the likely function of this cluster?

ExGeneOntology R

2025-09-10T08:36:36Z

Krivi: /* Repeat analysis on selected clusters */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way as gene IDs in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe the gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: from this we can see that all humans are primates, and all primates are mammals, but NOT all mammals are human.

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the non-sister chromatids of homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA)=

In this part of the exercise, we focus on performing analysis for over-representation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similarly sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins annotated with each of the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7:''' Assuming a total of '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
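
For intuition about where such a p-value comes from: a one-sided Fisher's exact test asks the same question as a hypergeometric tail probability, namely "how likely is it to see at least this many annotated genes when drawing the study group at random?". The sketch below compares the two on hypothetical numbers (not the exercise data). The variable names are our own, and using the one-sided alternative is an assumption on our part.

<pre style="overflow:auto;">
# Hypothetical numbers, for illustration only (not the exercise data)
population_size = 1000   # genes in the population group
term_in_pop     = 50     # population genes annotated with the term
study_size      = 20     # genes in the study group
observed        = 5      # study-group genes annotated with the term

# P(X >= observed) when study_size genes are drawn at random
p_hyper = phyper(q = observed - 1,
                 m = term_in_pop,
                 n = population_size - term_in_pop,
                 k = study_size,
                 lower.tail = FALSE)

# The same numbers arranged as a 2x2 table for a one-sided Fisher test
tab = matrix(c(observed,               study_size - observed,
               term_in_pop - observed, population_size - term_in_pop - (study_size - observed)),
             nrow = 2, byrow = TRUE)
p_fisher = fisher.test(tab, alternative = "greater")$p.value

all.equal(p_hyper, p_fisher)
</pre>

The <code>q = observed - 1</code> is needed because <code>phyper()</code> with <code>lower.tail = FALSE</code> returns P(X > q), and we want to include the observed count itself.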



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an over-representation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will re-visit in greater detail in a later exercise.
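
To give a feeling for what multiple testing correction does, here is a small sketch using base R's <code>p.adjust()</code> on made-up p-values. We use the Benjamini-Hochberg ("BH") method here. Our understanding is that this is also how the adjusted p-values from "fora" are computed, but treat that as an assumption and check the package documentation.

<pre style="overflow:auto;">
# Made-up raw p-values, e.g. one per tested gene set
raw_p = c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
adj_p = p.adjust(raw_p, method = "BH")

data.frame(raw_p = raw_p, adj_p = adj_p)
</pre>

Adjusted p-values are never smaller than the raw ones, and the more gene sets are tested, the stronger the correction becomes.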

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise3_part2.Rdata file loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above?

'''TASK/REPORT QUESTION #8:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''

Take a look at the background list and any one of the gene sets in BP_list. Then, extract the cluster 1 genes from the node annotation table as your target list, taking care to use the gene identifier that matches the identifier used in the gene sets and background. Finally, run <code>fora()</code> with the gene sets, the target list and the background list as input.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
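
One way to add such a column is sketched below on a made-up results table. The table <code>toy_res</code>, the values in it, and the two group sizes are all hypothetical; the sketch only assumes that your results table has one column with the gene set size and one with the overlap.

```r
# Toy results table with a gene set size and an overlap column;
# all names and values are invented for illustration.
toy_res <- data.frame(
  pathway = c("SET_A", "SET_B"),
  overlap = c(8, 1),    # target genes found in the gene set
  size    = c(100, 50)  # gene set size within the background
)

n_target     <- 20    # hypothetical number of genes in the target list
n_background <- 5000  # hypothetical number of genes in the background

# Expected overlap if the target list were a random draw from the background
toy_res$expected <- toy_res$size / n_background * n_target

# Enrichment = observed overlap / expected overlap
toy_res$enrichment <- toy_res$overlap / toy_res$expected

toy_res
```

With your own results, replace the toy table with the table returned by "fora", and take the two group sizes from the target and background lists you actually passed to the function.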

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the other clusters in the interaction data. Randomly pick another cluster.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #10:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** What is the likely function of this cluster?

ExGeneOntology R

2025-09-10T08:36:27Z

Krivi: /* Preparing input data */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, much in the same way as gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In GO terminology these are '''IS A''' relationships. From this we can see that all humans are primates and all primates are mammals, but not all mammals are human.
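
The idea of inherited properties can be sketched in a few lines of R. The example below stores part of the abbreviated lineages above as a child-to-parent lookup; the function name <code>ancestors</code> is our own, and the sketch is simplified in that each term has exactly one parent, whereas a real GO term can have several.

```r
# Child -> parent lookup built from the abbreviated lineages above.
# Simplification: each term has exactly one parent.
parent_of <- c(
  Homo      = "Hominidae",
  Hominidae = "Primates",
  Primates  = "Mammalia",
  Mus       = "Muroidea",
  Muroidea  = "Eutheria",
  Eutheria  = "Mammalia"
)

# Walk up the hierarchy and collect every ancestor of a term.
ancestors <- function(term, parent_of) {
  found <- character(0)
  while (term %in% names(parent_of)) {
    term  <- parent_of[[term]]
    found <- c(found, term)
  }
  found
}

ancestors("Homo", parent_of)  # "Hominidae" "Primates" "Mammalia"

# The relationship is inherited upwards, but not downwards:
"Mammalia" %in% ancestors("Homo", parent_of)  # TRUE
"Homo" %in% ancestors("Mammalia", parent_of)  # FALSE
```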

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency'''?''

The steps are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.
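
The arithmetic in these steps can be sketched in R as follows. All numbers are hypothetical and chosen only to make the calculation easy to follow; the variable is called <code>exp_n</code> rather than exp to avoid masking R's built-in <code>exp()</code> function.

```r
# Hypothetical numbers, chosen only to illustrate the arithmetic
X   <- 50    # genes with the characteristic in the population group
Y   <- 1000  # size of the population group
n   <- 20    # size of the study group
obs <- 5     # genes with the characteristic observed in the study group

FX         <- X / Y       # frequency in the population group
exp_n      <- FX * n      # expected number in the study group
enrichment <- obs / exp_n # observed / expected

FX          # 0.05
exp_n       # 1
enrichment  # 5
```

In this made-up case we observe five times as many genes with the characteristic as we would expect from a random selection of the same size.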

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7:''' Assuming a total of '''5500''' annotated genes, calculate and report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
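
As a hint for the last point, the sketch below shows one way to set up a 2x2 table for <code>fisher.test()</code>. The counts are hypothetical (a study group of 20 genes, 5 of them annotated, in a population of 1000 genes, 50 of them annotated) and must be replaced with the numbers for each GO term. Using a one-sided test (<code>alternative = "greater"</code>) is our assumption here, since we are asking about over-representation only.

```r
# Hypothetical counts; replace them with the values for each GO term.
# Rows: annotated with the term or not. Columns: in the study group or not.
tab <- matrix(
  c(5, 15,    # study group: 5 with the term, 15 without
    45, 935), # rest of the population: 45 with the term, 935 without
  nrow = 2,
  dimnames = list(
    annotation = c("with term", "without term"),
    group      = c("study", "rest")
  )
)

# One-sided test: is the term more frequent in the study group
# than in the rest of the population?
res <- fisher.test(tab, alternative = "greater")
res$p.value
```

Note that the "rest" column holds the population group minus the study group, so that the four cells add up to the population size.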



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an overrepresentation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.
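
To give a first impression of what multiple testing correction does, the sketch below adjusts five made-up p-values with base R's <code>p.adjust()</code>. It assumes the Benjamini-Hochberg ("BH") method; check the "fora" help page to see which method is used for the adjusted p-values it reports.

```r
# Hypothetical raw p-values for five gene sets
pvals <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
padj <- p.adjust(pvals, method = "BH")

data.frame(pval = pvals, padj = padj)
```

The adjusted p-values are never smaller than the raw ones, so a gene set that looks significant on its own may no longer be significant once all the tests are taken into account.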

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise3_part2.Rdata file loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class "msigdbr" produces with the command above.

'''TASK/REPORT QUESTION #8:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''
Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node attribute table as your target list, taking care to use the gene identifier that matches the identifier used in the gene sets and background.

Run "fora" with the gene sets, the target list and the background list, and take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the other clusters in the interaction data. Randomly pick another cluster.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #10:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** What is the likely function of this cluster?

ExGeneOntology R

2025-09-10T08:35:52Z

Krivi: /* Enrichment analysis - reexamining cluster #1 */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, much in the same way as gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In GO terminology these are '''IS A''' relationships. From this we can see that all humans are primates and all primates are mammals, but not all mammals are human.

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency'''?''

The steps are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in each of the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7:''' Assuming a total of '''5500''' annotated genes calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
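
A minimal sketch of how <code>fisher.test()</code> could be used for a single GO term is shown below. The numbers are hypothetical (not the exercise data), and the one-sided alternative <code>"greater"</code> is our assumption about how to test specifically for over-representation.

<pre style="overflow:auto;">
# Hypothetical numbers, chosen only for illustration (not the exercise data)
Y = 1000        # population group size
X = 50          # genes in the population annotated with the term
n = 20          # study group size
observed = 5    # genes in the study group annotated with the term

# 2x2 contingency table: rows = study group vs. rest, columns = annotated vs. not
tab = matrix(c(observed,     n - observed,
               X - observed, (Y - n) - (X - observed)),
             nrow = 2, byrow = TRUE,
             dimnames = list(group = c("study", "rest"),
                             term  = c("annotated", "not annotated")))

# One-sided test: is the term over-represented in the study group?
res = fisher.test(tab, alternative = "greater")
res$p.value
</pre>

The four cells of the table must add up to the population group size, which is a quick sanity check before running the test.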



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an over-representation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.
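
To give a feel for what multiple testing correction does, here is a small sketch using base R's <code>p.adjust()</code> on made-up p-values. We use the Benjamini-Hochberg ("BH") method because, as far as we can tell from the fgsea documentation, that is what the adjusted p-values reported by "fora" are based on; treat that as an assumption and check the help page.

<pre style="overflow:auto;">
# Hypothetical raw p-values from five gene set tests
pval = c(0.001, 0.012, 0.030, 0.040, 0.200)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
padj = p.adjust(pval, method = "BH")

data.frame(pval, padj)
</pre>

Adjusted p-values are never smaller than the raw ones, so fewer gene sets pass a given significance cut-off after correction.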

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the second of the two .Rdata files loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
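
To see what the one-liner does, here is a sketch on a tiny stand-in table. The gene set names and gene IDs are hypothetical; only the column names <code>gs_name</code> and <code>ensembl_gene</code> are taken from the command above.

<pre style="overflow:auto;">
# Toy stand-in for the msigdbr output (hypothetical gene sets and gene IDs)
toy_df = data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  ensembl_gene = c("gene1", "gene2", "gene2", "gene3", "gene4"),
  stringsAsFactors = FALSE
)

# split() groups the gene IDs by gene set name:
# the result is a named list with one character vector per gene set
toy_list = split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

names(toy_list)   # "SET_A" "SET_B"
toy_list$SET_B    # "gene2" "gene3" "gene4"
</pre>

Note that a gene (here "gene2") can be a member of more than one gene set.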

'''Running "fora"'''
Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
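
One possible way to add such a column is sketched below on a hand-made table that only mimics the <code>overlap</code> and <code>size</code> columns of a "fora" result. All values, and the names <code>toy_res</code>, <code>n_target</code> and <code>n_background</code>, are hypothetical.

<pre style="overflow:auto;">
# Toy table mimicking the 'overlap' and 'size' columns of a fora() result
# (all values are hypothetical)
toy_res = data.frame(
  pathway = c("SET_A", "SET_B"),
  overlap = c(8, 2),       # target genes found in the gene set
  size    = c(100, 400),   # gene set size
  stringsAsFactors = FALSE
)
n_target     = 16     # hypothetical size of the target list
n_background = 5000   # hypothetical size of the background list

# expected overlap = frequency of the gene set in the background * target list size
toy_res$expected   = toy_res$size / n_background * n_target
toy_res$enrichment = toy_res$overlap / toy_res$expected

toy_res
</pre>

This is the same observed/expected calculation as in the manual analysis, applied to every row of the table at once.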

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the other clusters in the interaction data. Randomly pick another cluster.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** What is the likely function of this cluster?

ExGeneOntology R

2025-09-10T08:35:14Z

Krivi: /* UniProt */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, much in the same way as genes have IDs in Ensembl (e.g. ENSG00000141510) and proteins have identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe the gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.
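
The idea of inherited properties can be sketched in a few lines of R. The mini-hierarchy below is hypothetical and, for simplicity, assumes that every term has exactly one parent; real GO terms can have several parents and several relationship types. The helper function <code>ancestors()</code> is our own illustration, not part of any package.

<pre style="overflow:auto;">
# Toy IS A hierarchy: each name is a term, each value is its parent
# (hypothetical mini-ontology, one parent per term)
is_a = c(human   = "primate",
         primate = "mammal",
         mouse   = "rodent",
         rodent  = "mammal",
         mammal  = "vertebrate")

# Walk up the hierarchy and collect every ancestor of a term
ancestors = function(term, parents) {
  out = character(0)
  while (term %in% names(parents)) {
    term = parents[[term]]
    out = c(out, term)
  }
  out
}

ancestors("human", is_a)                 # "primate" "mammal" "vertebrate"
"mammal" %in% ancestors("human", is_a)   # TRUE: all humans are mammals
"human" %in% ancestors("mammal", is_a)   # FALSE: not all mammals are human
</pre>

Anything annotated to a specific term is therefore implicitly annotated to all of its ancestors, but not the other way around.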

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number of genes/proteins with the characteristic in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in each of the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming a total of '''5500''' annotated genes calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an over-representation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the second of the two .Rdata files loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''
Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the other clusters in the interaction data. Randomly pick another cluster.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** What is the likely function of this cluster?

ExGeneOntology R

2025-09-10T08:34:37Z

Krivi: /* Repeat analysis on selected clusters */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, much in the same way as genes have IDs in Ensembl (e.g. ENSG00000141510) and proteins have identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe the gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA)=

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number.
# The '''enrichment''' is then calculated as the ratio observed/expected.
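
The four steps above can be sketched in R as follows. The numbers are made up purely for illustration (a hypothetical characteristic found in 50 of 1000 genes, and a study group of 20 genes with 5 hits); they are not the values from this exercise, and the variable names are our own.

<pre style="overflow:auto;">
# Hypothetical toy numbers - NOT the values used in the report question
X <- 50         # genes with the characteristic in the population group
Y <- 1000       # size of the population group
n <- 20         # size of the study group
observed <- 5   # genes with the characteristic in the study group

FX <- X / Y                        # step 1: frequency in the population
expected <- FX * n                 # step 2: expected count in the study group
enrichment <- observed / expected  # steps 3-4: observed compared to expected

c(FX = FX, expected = expected, enrichment = enrichment)
</pre>

An enrichment above 1 means the characteristic is seen more often in the study group than expected by chance; it does not by itself say whether the difference is statistically significant.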

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming a total of '''5500''' annotated genes calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
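
As a hint for the last item, here is a minimal sketch of how a 2x2 table can be set up for <code>fisher.test()</code>. The counts are hypothetical toy numbers (not the cluster #1 values), and the choice of a one-sided test (<code>alternative = "greater"</code>, i.e. testing for over-representation only) is an assumption on our part.

<pre style="overflow:auto;">
# Hypothetical counts: study group of 20 genes (5 annotated),
# population of 1000 genes (50 annotated in total)
tab <- matrix(
  c(5, 45,     # annotated:     in study group, not in study group
    15, 935),  # not annotated: in study group, not in study group
  nrow = 2,
  dimnames = list(
    c("in_study", "not_in_study"),
    c("annotated", "not_annotated")
  )
)

# One-sided Fisher's exact test for over-representation
res <- fisher.test(tab, alternative = "greater")
res$p.value
</pre>

Note that the four cells must add up to the population group size, so the genes in the study group are subtracted from the genome-wide counts before filling in the second row.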



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an overrepresentation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will re-visit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise3_part2.Rdata file loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
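
To see what <code>split()</code> does here, it can help to try it on a tiny made-up table first. The data frame below is hypothetical (the set and gene names are invented); only the two column names mirror the ones used in the one-liner above.

<pre style="overflow:auto;">
# Hypothetical long-format table: one row per (gene set, gene) pair
toy_df <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B"),
  ensembl_gene = c("gene1", "gene2", "gene3"),
  stringsAsFactors = FALSE
)

# Group the gene identifiers by gene set name
toy_list <- split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

class(toy_df)    # the long table is a data frame
class(toy_list)  # the result is a named list
str(toy_list)    # one character vector of genes per gene set
</pre>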

'''Running "fora"'''
Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Then run <code>fora()</code> with the gene sets, your target list and the background, and take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing if the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the other clusters in the interaction data. Randomly pick another cluster.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** What is the likely function of this cluster?

ExGeneOntology R

2025-09-10T08:32:34Z

Krivi: /* Preparing input data */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way, as we have it with gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (protein). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe the gene/protein functionality: this will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.
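
The idea of inherited properties can be illustrated with a small R sketch that walks up a toy '''IS A''' hierarchy. This is a simplification under stated assumptions: the hierarchy is a hand-typed excerpt of the human lineage above, each term has exactly one parent (real GO terms can have several), and <code>ancestors()</code> is a hypothetical helper written for this illustration.

<pre style="overflow:auto;">
# Toy "IS A" hierarchy: each term points to its parent (NA marks the top)
is_a <- c(
  Homo      = "Hominidae",
  Hominidae = "Primates",
  Primates  = "Mammalia",
  Mammalia  = NA
)

# Hypothetical helper: follow the parent links until the top is reached
ancestors <- function(term, parents) {
  out <- character(0)
  while (!is.na(parents[[term]])) {
    term <- parents[[term]]
    out <- c(out, term)
  }
  out
}

ancestors("Homo", is_a)      # everything a human "IS A"
ancestors("Mammalia", is_a)  # the top term has no ancestors here
</pre>

The relationship only runs one way: "Mammalia" shows up among the ancestors of "Homo", but "Homo" is not among the ancestors of "Mammalia".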

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to add your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA)=

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number.
# The '''enrichment''' is then calculated as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming a total of '''5500''' annotated genes calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an overrepresentation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will re-visit in greater detail in a later exercise.
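
To get a feel for what multiple testing correction does, the base R function <code>p.adjust()</code> can be tried on a handful of made-up p-values. The sketch below uses the Benjamini-Hochberg ("BH") method; that this matches the adjustment applied by "fora" is an assumption you should check in the help page (<code>?fora</code>) on your installation.

<pre style="overflow:auto;">
# Hypothetical raw p-values from five independent tests
pvals <- c(0.001, 0.01, 0.03, 0.04, 0.20)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
padj <- p.adjust(pvals, method = "BH")

data.frame(pval = pvals, padj = padj)
</pre>

The adjusted values are never smaller than the raw ones, so a gene set that looks significant on its raw p-value may no longer pass the threshold after correction.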

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise3_part2.Rdata file loaded below.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

 

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''
Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Then run <code>fora()</code> with the gene sets, your target list and the background, and take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
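
As a starting point, the calculation can be sketched on a small hand-made table. Everything below is hypothetical: the table only imitates the kind of columns described above (the names <code>overlap</code> and <code>size</code> are assumptions, so check the column names of your own results), and the target and background sizes are invented.

<pre style="overflow:auto;">
# Hypothetical results table with one row per gene set
toy_res <- data.frame(
  pathway = c("SET_A", "SET_B"),
  overlap = c(5, 1),     # genes shared by the gene set and the target list
  size    = c(50, 100)   # genes in the gene set
)

n_target     <- 20     # hypothetical size of the target list
n_background <- 1000   # hypothetical size of the background list

# Expected overlap if the target list were drawn at random
toy_res$expected <- toy_res$size / n_background * n_target

# Enrichment = observed overlap / expected overlap
toy_res$enrichment <- toy_res$overlap / toy_res$expected

toy_res
</pre>

This is the same observed/expected ratio as in the manual calculation earlier in the exercise, just applied to every row of the table at once.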

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing if the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:31:38Z

Krivi: /* Preparing input data */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way, as we have it with gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (protein). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe the gene/protein functionality: this will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many child terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.
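
The idea of inherited properties can be sketched in a few lines of R. This is only a toy illustration: the <code>parent_of</code> vector and the <code>ancestors()</code> helper are hypothetical names, the hierarchy is the small "DNA replication" example from the introduction, and each term is assumed to have a single parent (real GO terms can have several).

```r
# Toy IS A hierarchy: each name is a term, each value is its direct parent.
# Assumption: one parent per term, which is a simplification of real GO.
parent_of <- c(
  "DNA replication"      = "biosynthetic process",
  "biosynthetic process" = "biological process"
)

# Follow the IS A links upwards until we reach a term without a parent.
ancestors <- function(term, parent_of) {
  found <- character(0)
  while (term %in% names(parent_of)) {
    term  <- parent_of[[term]]
    found <- c(found, term)
  }
  found
}

dna_rep_ancestors <- ancestors("DNA replication", parent_of)
print(dna_rep_ancestors)
```

Anything annotated with "DNA replication" is thereby implicitly annotated with every term returned by <code>ancestors()</code>, but not the other way around.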

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the number actually observed in the study group.
# The '''enrichment''' is then calculated as the ratio observed/expected.
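
The arithmetic in these steps can be sketched in R as follows. All numbers here are made up for illustration (they are not the values from the exercise), and the variable names are our own.

```r
# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
population_size <- 1000  # Y: all genes in the population group
population_hits <- 50    # X: genes in the population with the characteristic
study_size      <- 20    # n: genes in the study group
study_hits      <- 5     # observed: study group genes with the characteristic

freq_population <- population_hits / population_size  # FX = X / Y
expected_hits   <- freq_population * study_size       # exp = FX * n
enrichment      <- study_hits / expected_hits         # observed / expected

print(c(frequency = freq_population,
        expected  = expected_hits,
        enrichment = enrichment))
```

With these toy numbers we would expect 1 gene by chance and observe 5, i.e. a five-fold enrichment.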

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming a total of '''5500''' annotated genes, calculate and report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
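
As a hint for the last point: <code>fisher.test()</code> expects a 2x2 table of counts. Below is a minimal sketch of how such a table could be set up, using made-up numbers rather than the values from the exercise. The choice of a one-sided test (<code>alternative = "greater"</code>) is our assumption, made because we are asking specifically about over-representation.

```r
# Hypothetical counts (NOT the exercise values).
population_size <- 1000  # all genes
population_hits <- 50    # genes annotated with the GO term
study_size      <- 20    # genes in the study group
study_hits      <- 5     # study group genes annotated with the GO term

# Rows: study group vs. the rest of the population.
# Columns: annotated with the term vs. not annotated.
contingency <- matrix(
  c(study_hits,                   study_size - study_hits,
    population_hits - study_hits, (population_size - study_size) -
                                    (population_hits - study_hits)),
  nrow = 2, byrow = TRUE,
  dimnames = list(group     = c("study", "rest"),
                  annotated = c("yes", "no"))
)
print(contingency)

# One-sided test: is the term seen MORE often in the study group than expected?
result <- fisher.test(contingency, alternative = "greater")
print(result$p.value)
```

Note that the four cells must add up to the population size, so the study group genes are not counted twice.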



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an overrepresentation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as a table with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.
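
To give a feeling for what multiple testing correction does, here is a small sketch using base R's <code>p.adjust()</code> on made-up p-values. We use the Benjamini-Hochberg ("BH") method here on the assumption that it corresponds to the adjusted p-values reported by "fora"; check the "fora" help page to confirm this.

```r
# Made-up raw p-values, as if five gene sets had been tested.
raw_p <- c(0.001, 0.01, 0.03, 0.04, 0.20)

# Benjamini-Hochberg adjustment (controls the false discovery rate).
adj_p <- p.adjust(raw_p, method = "BH")

print(data.frame(raw_p = raw_p, adj_p = adj_p))
```

The adjusted values are never smaller than the raw ones, so a gene set that looks significant before correction may no longer be so afterwards.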

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise3_part2.Rdata file.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise3_part1.Rdata") # existing network data
load("/home/projects/22140/exercise3_part2.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
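
To see what the one-liner does, it can help to try <code>split()</code> on a tiny hand-made table first. The sketch below uses a hypothetical <code>toy_df</code> with the same two column names as above; the gene set names and gene identifiers are placeholders.

```r
# A tiny stand-in for the table returned by msigdbr (placeholder content).
toy_df <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  ensembl_gene = c("GENE1", "GENE2", "GENE2", "GENE3", "GENE4"),
  stringsAsFactors = FALSE
)

# Group the gene identifiers by gene set name.
toy_list <- split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

print(class(toy_df))    # one row per (gene set, gene) pair
print(class(toy_list))  # one named element per gene set
print(toy_list)
```

Each element of the result is a character vector of gene identifiers, named after its gene set.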

'''Running "fora"'''

Take a look at the background list and at any one of the gene sets in BP_list. Then, extract the cluster 1 genes from the node annotation table as your target list, taking care to use the gene identifier that matches the identifier used in the gene sets and background.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
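
One possible way to do this is sketched below on a hand-made table. The table <code>toy_results</code> only mimics the layout of the real results; we assume here that the columns holding the overlap and the gene set size are called <code>overlap</code> and <code>size</code>, so check the column names of your own results table first. All numbers are made up.

```r
# Hand-made stand-in for a results table (placeholder values).
toy_results <- data.frame(
  pathway = c("SET_A", "SET_B"),
  overlap = c(6, 2),     # target genes found in the gene set
  size    = c(30, 200),  # genes in the gene set
  stringsAsFactors = FALSE
)

n_target     <- 16    # hypothetical size of the target list
n_background <- 5000  # hypothetical size of the background list

# Expected overlap by chance, then enrichment = observed / expected.
toy_results$expected   <- toy_results$size / n_background * n_target
toy_results$enrichment <- toy_results$overlap / toy_results$expected

print(toy_results)
```

This is the same observed/expected ratio as in the manual calculation earlier, just computed for all gene sets at once.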

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' Perform the following over-representation analyses and write a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:30:22Z

Krivi: /* Automated analysis using "fgsea" and "msigdbr" */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, much like the gene IDs in Ensembl (e.g. ENSG00000141510) and the protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many child terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the number actually observed in the study group.
# The '''enrichment''' is then calculated as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming a total of '''5500''' annotated genes, calculate and report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an overrepresentation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as a table with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from an existing interaction network. You can find clusters 1-8 in the node attribute table.

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata") # existing network data
load("/home/projects/22140/exercise5.Rdata") # background
</pre>

''Hint'': You can see which objects exist in your current R session through RStudio's "Environment" tab.

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''

Take a look at the background list and any one of the gene sets in the BP_list. Then, extract the cluster 1 genes from the node annotation table as your target list, taking care to use the gene identifier that matches the identifier used in the gene sets and background. Finally, run "fora" with the gene sets, the target list and the background as inputs.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:19:58Z

Krivi: /* Introducing over representation in R */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, much like the gene IDs in Ensembl (e.g. ENSG00000141510) and the protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but NOT all mammals are human.

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to add your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the number actually observed in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.
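
The steps above can be sketched in R as follows. The numbers are invented for illustration (they are '''not''' the values for cluster #1), and the variable names are our own choice:

<pre style="overflow:auto;">
# Toy numbers, NOT the values for cluster #1
population_size = 5000   # Y: all annotated genes
term_genes      = 100    # X: genes annotated with the GO term
study_size      = 20     # n: genes in the study group
observed        = 5      # study group genes annotated with the term

freq_population = term_genes / population_size   # FX = X / Y
expected        = freq_population * study_size   # exp = FX * n
enrichment      = observed / expected            # observed / expected

freq_population   # 0.02
expected          # 0.4
enrichment        # 12.5
</pre>

An enrichment above 1 means the term is seen more often in the study group than expected by chance; whether the difference is statistically significant is a separate question.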

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming a total of '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
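
As a hint for the last item, here is a minimal sketch of how a 2x2 table can be set up for <code>fisher.test()</code>. The counts are invented (they are '''not''' the values for cluster #1), and the one-sided alternative is our assumption about what "over-represented" should mean here:

<pre style="overflow:auto;">
# Toy numbers, NOT the values for cluster #1
population_size = 5000   # all annotated genes
term_genes      = 100    # genes annotated with the GO term
study_size      = 20     # genes in the study group
observed        = 5      # study group genes annotated with the term

# Rows: annotated with the term / not annotated
# Columns: in the study group / rest of the population
counts = matrix(
  c(observed,              term_genes - observed,
    study_size - observed, population_size - term_genes - (study_size - observed)),
  nrow = 2, byrow = TRUE,
  dimnames = list(term  = c("annotated", "not annotated"),
                  group = c("study", "rest"))
)

# One-sided test: is the term over-represented in the study group?
result = fisher.test(counts, alternative = "greater")
result$p.value
</pre>

Note that the four cells must add up to the population size, so each gene is counted exactly once.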



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparing an '''input gene list''' (target list) against a background distribution consisting of all annotated genes (background list). The <code>fora()</code> function from the '''fgsea''' package can be used to do an overrepresentation analysis (incl. p-values) for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercises as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
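
To see what the one-liner does, it can help to try <code>split()</code> on a tiny table first. The sketch below uses an invented data frame with the same two column names; the gene set names and gene IDs are made up:

<pre style="overflow:auto;">
# Toy stand-in for the msigdbr output: one row per (gene set, gene) pair.
# Gene set names and gene IDs are invented.
toy_df = data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  ensembl_gene = c("gene1", "gene2", "gene2", "gene3", "gene4"),
  stringsAsFactors = FALSE
)

# split() groups the gene IDs by gene set name and returns a named list
toy_list = split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

names(toy_list)   # "SET_A" "SET_B"
toy_list$SET_B    # "gene2" "gene3" "gene4"
</pre>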

'''Running "fora"'''

Take a look at the background list and any one of the gene sets in the BP_list. Then, extract the cluster 1 genes from the node annotation table as your target list, taking care to use the gene identifier that matches the identifier used in the gene sets and background. Finally, run "fora" with the gene sets, the target list and the background as inputs.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
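
If you get stuck, the sketch below shows the general pattern on toy data. All names starting with <code>toy_</code> are invented and should be replaced with your own gene sets, target list and background; the result columns <code>overlap</code> and <code>size</code> are used as we understand the "fora" output, so check them against your own results table.

<pre style="overflow:auto;">
library(fgsea)

# Toy inputs with invented names (illustration only)
toy_universe = paste0("gene", 1:100)          # background list
toy_sets = list(                              # gene sets as a named list
  SET_A = paste0("gene", 1:10),
  SET_B = paste0("gene", 41:60)
)
toy_target = paste0("gene", c(1:5, 50))       # target list

res = fora(pathways = toy_sets, genes = toy_target, universe = toy_universe)

# Enrichment = observed overlap / overlap expected by chance.
# All toy target genes are part of the universe here.
expected = res$size * length(toy_target) / length(toy_universe)
res$enrichment = res$overlap / expected
res
</pre>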

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:17:58Z

Krivi: /* Enrichment analysis - reexamining cluster #1 */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, much like the gene IDs in Ensembl (e.g. ENSG00000141510) and the protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but NOT all mammals are human.
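
The one-directional nature of '''IS A''' can be illustrated with a small R sketch. The hierarchy below is a simplified toy version of the lineage above in which every term has a single parent (real GO terms can have several), and the function name <code>ancestors</code> is our own:

<pre style="overflow:auto;">
# Toy IS A hierarchy: each term points to its (single) parent
parent_of = c(
  Homo      = "Hominidae",
  Hominidae = "Primates",
  Primates  = "Mammalia",
  Mammalia  = "Vertebrata"
)

# Walk up the hierarchy and collect every ancestor of a term
ancestors = function(term, parents) {
  found = character(0)
  while (term %in% names(parents)) {
    term  = parents[[term]]
    found = c(found, term)
  }
  found
}

ancestors("Homo", parent_of)                   # "Hominidae" "Primates" "Mammalia" "Vertebrata"
"Mammalia" %in% ancestors("Homo", parent_of)   # TRUE: Homo IS A mammal
"Homo" %in% ancestors("Mammalia", parent_of)   # FALSE: the reverse does not hold
</pre>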

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to add your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA) =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNAseq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare to the observed frequency.
# The '''enrichment''' is then calculated as the ratio observed/expected.
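
The steps above can be sketched in R as follows. This is only an illustration: all numbers are made up and are NOT the values from this exercise, and the variable names are our own choice.

<pre style="overflow:auto;">
# Made-up numbers for illustration only
Y   = 1000  # population group size
X   = 50    # genes in the population with the characteristic
n   = 20    # study group size
obs = 5     # genes in the study group with the characteristic

FX         = X / Y        # frequency in the population group
expected   = FX * n       # expected number in the study group
enrichment = obs / expected
enrichment
</pre>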

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming a total of '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term (use <code>fisher.test()</code> in RStudio)
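
As a hedged illustration of how <code>fisher.test()</code> could be called, the sketch below builds a 2x2 contingency table from made-up counts (not the values from the tables above). It assumes a one-sided test for overrepresentation; whether a one- or two-sided test is appropriate is a choice you should justify yourself.

<pre style="overflow:auto;">
# Made-up counts: population of 1000 genes, 50 annotated with the term;
# study group of 20 genes, 5 of them annotated
tab = matrix(c(5, 15,     # study group: annotated, not annotated
               45, 935),  # rest of population: annotated, not annotated
             nrow = 2,
             dimnames = list(c("annotated", "not_annotated"),
                             c("study", "rest")))

res = fisher.test(tab, alternative = "greater")  # one-sided: overrepresentation
res$p.value
</pre>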



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will re-visit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercises as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
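
To see what the one-liner does, here is a small sketch on a toy table. The table and its values are made up; only the two column names mimic the ones used above.

<pre style="overflow:auto;">
# Toy long-format table with made-up gene sets and gene identifiers
toy_df = data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B"),
  ensembl_gene = c("gene1", "gene2", "gene3"),
  stringsAsFactors = FALSE
)

# One element per gene set, each holding the genes of that set
toy_list = split(x = toy_df$ensembl_gene, f = toy_df$gs_name)
str(toy_list)
</pre>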

'''Running "fora"'''

Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
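
A minimal sketch of the call and the enrichment calculation is shown below. It runs on made-up toy data (the object names toy_background, toy_sets and toy_target are hypothetical); you will need to substitute your own target list, background list and gene sets. The enrichment is computed as observed/expected, as defined earlier in this exercise.

<pre style="overflow:auto;">
library(fgsea)

# Made-up identifiers - replace with your own target, background and gene sets
toy_background = paste0("G", 1:100)
toy_sets       = list(SET_A = paste0("G", 1:10),
                      SET_B = paste0("G", 50:70))
toy_target     = c(paste0("G", 1:5), "G60")

res = fora(pathways = toy_sets, genes = toy_target, universe = toy_background)

# Expected overlap = frequency of the gene set in the background * target size
expected       = res$size / length(toy_background) * length(toy_target)
res$enrichment = res$overlap / expected
res
</pre>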

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing if the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:16:50Z

Krivi: /* Enrichment analysis - reexamining cluster #1 */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way, as we have it with gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (protein). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe the gene/protein functionality: this will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: from this we can see that all humans are primates, and all primates are mammals, but NOT all mammals are human.
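
The idea of inherited properties along '''IS A''' relationships can be sketched with a toy hierarchy in R. The child-to-parent map below is a simplification built from the abbreviated human lineage above, and the function name "ancestors" is our own; note that real GO terms can have several parents, which this sketch does not handle.

<pre style="overflow:auto;">
# Toy IS A hierarchy (child -> parent), taken from the abbreviated lineage above
parent = c(Homo = "Hominidae", Hominidae = "Primates", Primates = "Mammalia")

# Walk up the hierarchy and collect every ancestor of a term
ancestors = function(term, parent) {
  out = character(0)
  while (term %in% names(parent)) {
    term = parent[[term]]
    out  = c(out, term)
  }
  out
}

ancestors("Homo", parent)      # "Hominidae" "Primates" "Mammalia"
ancestors("Mammalia", parent)  # character(0): no ancestors in this toy map
</pre>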

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to add your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA)=

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNAseq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare to the observed frequency.
# The '''enrichment''' is then calculated as the ratio observed/expected.
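
The steps above can be sketched in R as follows. This is only an illustration: all numbers are made up and are NOT the values from this exercise, and the variable names are our own choice.

<pre style="overflow:auto;">
# Made-up numbers for illustration only
Y   = 1000  # population group size
X   = 50    # genes in the population with the characteristic
n   = 20    # study group size
obs = 5     # genes in the study group with the characteristic

FX         = X / Y        # frequency in the population group
expected   = FX * n       # expected number in the study group
enrichment = obs / expected
enrichment
</pre>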

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming a total of '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php
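
If you prefer R over an online calculator, Fisher's exact test is also available as the base R function <code>fisher.test()</code>. The sketch below builds a 2x2 contingency table from made-up counts (not the values from the tables above). It assumes a one-sided test for overrepresentation; whether a one- or two-sided test is appropriate is a choice you should justify yourself.

<pre style="overflow:auto;">
# Made-up counts: population of 1000 genes, 50 annotated with the term;
# study group of 20 genes, 5 of them annotated
tab = matrix(c(5, 15,     # study group: annotated, not annotated
               45, 935),  # rest of population: annotated, not annotated
             nrow = 2,
             dimnames = list(c("annotated", "not_annotated"),
                             c("study", "rest")))

res = fisher.test(tab, alternative = "greater")  # one-sided: overrepresentation
res$p.value
</pre>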



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will re-visit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercises as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
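
To see what the one-liner does, here is a small sketch on a toy table. The table and its values are made up; only the two column names mimic the ones used above.

<pre style="overflow:auto;">
# Toy long-format table with made-up gene sets and gene identifiers
toy_df = data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B"),
  ensembl_gene = c("gene1", "gene2", "gene3"),
  stringsAsFactors = FALSE
)

# One element per gene set, each holding the genes of that set
toy_list = split(x = toy_df$ensembl_gene, f = toy_df$gs_name)
str(toy_list)
</pre>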

'''Running "fora"'''

Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
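
A minimal sketch of the call and the enrichment calculation is shown below. It runs on made-up toy data (the object names toy_background, toy_sets and toy_target are hypothetical); you will need to substitute your own target list, background list and gene sets. The enrichment is computed as observed/expected, as defined earlier in this exercise.

<pre style="overflow:auto;">
library(fgsea)

# Made-up identifiers - replace with your own target, background and gene sets
toy_background = paste0("G", 1:100)
toy_sets       = list(SET_A = paste0("G", 1:10),
                      SET_B = paste0("G", 50:70))
toy_target     = c(paste0("G", 1:5), "G60")

res = fora(pathways = toy_sets, genes = toy_target, universe = toy_background)

# Expected overlap = frequency of the gene set in the background * target size
expected       = res$size / length(toy_background) * length(toy_target)
res$enrichment = res$overlap / expected
res
</pre>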

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing if the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:15:59Z

Krivi: /* Enrichment analysis - reexamining cluster #1 */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way, as we have it with gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (protein). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe the gene/protein functionality: this will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many child terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: from this we can see that all humans are primates, and all primates are mammals, but NOT all mammals are human.

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA)=

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number.
# Calculate the '''enrichment''' as the ratio observed/expected.
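
The four steps above can be sketched in a few lines of R. This is only an illustration with made-up numbers (a hypothetical population of 1000 genes, 50 of which have the characteristic, and a study group of 20 genes with 5 hits); the variable names are our own and are not part of the exercise data.

<pre style="overflow:auto;">
# Hypothetical numbers, chosen only to illustrate the calculation
Y = 1000      # population group size
X = 50        # genes in the population with the characteristic
n = 20        # study group size
observed = 5  # genes in the study group with the characteristic

FX = X / Y                        # step 1: frequency in the population
expected = FX * n                 # step 2: expected count in the study group
enrichment = observed / expected  # steps 3-4: observed compared to expected

print(c(frequency = FX, expected = expected, enrichment = enrichment))
</pre>

With these numbers the expected count is 1 and the enrichment is 5, i.e. the characteristic is seen five times more often in the study group than expected by chance.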

We'll go through this with an example: in this case we want to examine whether the proteins below are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| POL12||DNA polymerase alpha subunit B||X||||
|-
| POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| PRI1||DNA primase small subunit||X||||
|-
| PRI2||DNA primase large subunit||X||||
|-
| RFC1||Replication factor C subunit 1||X||X||X
|-
| RFC2||Replication factor C subunit 2||X||X||X
|-
| RFC3||Replication factor C subunit 3||X||X||X
|-
| RFC4||Replication factor C subunit 4||X||X||X
|-
| RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins annotated with each of the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercises as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
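
To see what this one-liner does, it can help to try it on a tiny hand-made table first. The sketch below uses a plain data frame as a stand-in for the "msigdbr" output; the gene set names and gene ids are invented for the illustration, and only the two column names are taken from the command above.

<pre style="overflow:auto;">
# Toy stand-in for the msigdbr output (hypothetical gene sets and gene ids)
toy_df = data.frame(
  gs_name = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  ensembl_gene = c("gene1", "gene2", "gene2", "gene3", "gene4"),
  stringsAsFactors = FALSE
)

# One row per (gene set, gene) pair becomes one character vector per gene set
toy_list = split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

class(toy_df)    # "data.frame"
class(toy_list)  # "list"
str(toy_list)
</pre>

The result is a named list with one element per gene set, each holding the gene identifiers belonging to that set.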

'''Running "fora"'''

Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background. Finally, run "fora" with the gene sets, your target list and the background list as input.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:14:14Z

Krivi: /* Part 2: Gene Ontology overrepresentation analysis */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way as gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe the gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many child terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: from this we can see that all humans are primates, and all primates are mammals, but NOT all mammals are human.
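
The idea of inherited properties can be made concrete with a few lines of R. The sketch below stores a hand-typed, simplified set of IS A links (child to parent) for part of the human lineage shown above, and walks upwards to collect all ancestors. The "parent" vector and the "ancestors" helper are illustrative names of our own, not part of any GO package.

<pre style="overflow:auto;">
# Simplified IS A links: names are children, values are their parents
parent = c(
  Homo = "Hominidae",
  Hominidae = "Primates",
  Primates = "Mammalia",
  Mammalia = "Vertebrata"
)

# Follow the IS A links upwards until a term has no recorded parent
ancestors = function(term, parent) {
  found = character(0)
  while (term %in% names(parent)) {
    term = parent[[term]]
    found = c(found, term)
  }
  found
}

ancestors("Homo", parent)      # Hominidae, Primates, Mammalia, Vertebrata
ancestors("Mammalia", parent)  # Vertebrata only: a mammal is not necessarily a primate
</pre>

Note that a real GO term can have several parents, so the actual structure is a graph rather than a simple chain like this one.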

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis (ORA)=

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from an RNA-seq experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number.
# Calculate the '''enrichment''' as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins annotated with each of the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php
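
As an alternative to the online calculators, the p-value can be computed in R with the base function "fisher.test". The sketch below uses made-up counts (not the cluster #1 data) and a one-sided test for over-representation; the variable names and the layout of the 2 x 2 table are our own choices.

<pre style="overflow:auto;">
# Hypothetical counts: 1000 genes in total, 50 annotated with the term,
# a study group of 20 genes, 5 of which are annotated
Y = 1000; X = 50; n = 20; observed = 5

# 2 x 2 contingency table: rows = annotated / not annotated,
# columns = in the study group / rest of the population
counts = matrix(
  c(observed,     X - observed,
    n - observed, (Y - X) - (n - observed)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("annotated", "not_annotated"),
                  c("study_group", "rest"))
)

# One-sided Fisher's exact test: is the term over-represented in the study group?
result = fisher.test(counts, alternative = "greater")
result$p.value

# The same value as a hypergeometric tail probability, P(hits >= observed)
phyper(observed - 1, X, Y - X, n, lower.tail = FALSE)
</pre>

Be aware that online calculators may report a two-sided p-value, which can differ from the one-sided value computed here.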



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercises as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''

Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background. Finally, run "fora" with the gene sets, your target list and the background list as input.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:09:01Z

Krivi: /* UniProt */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, much like gene IDs in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: a standard set of '''labels''' for describing gene/protein functionality will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.
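
The idea of inherited properties can be sketched in a few lines of R. This is only an illustration: the <code>is_a</code> vector and the <code>ancestors</code> function are hypothetical names made up for this example, and the hierarchy is simplified so that every term has a single parent (real GO terms can have several parents).

<pre style="overflow:auto;">
# Toy IS A hierarchy built from the abbreviated lineages above.
# Each name is a term, each value is its direct parent (illustrative only).
is_a <- c(
  Homo      = "Hominidae",
  Hominidae = "Primates",
  Primates  = "Mammalia",
  Mus       = "Muroidea",
  Muroidea  = "Eutheria",
  Eutheria  = "Mammalia",
  Mammalia  = "Vertebrata"
)

# Walk up the hierarchy and collect every ancestor of a term.
ancestors <- function(term, parent_of) {
  found <- character(0)
  while (term %in% names(parent_of)) {
    term  <- parent_of[[term]]
    found <- c(found, term)
  }
  found
}

ancestors("Homo", is_a)                  # all terms that Homo inherits from
"Primates" %in% ancestors("Homo", is_a)  # TRUE: all humans are primates
"Primates" %in% ancestors("Mus", is_a)   # FALSE: a mouse is a mammal, but not a primate
</pre>

Anything annotated to a term is implicitly annotated to all of its ancestors, which is why the gene counts for a GO term later in this exercise are given "including subgroups".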

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes (via their non-sister chromatids) can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number.
# Calculate the '''enrichment''' as the ratio observed/expected.
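
As a minimal sketch, the steps above can be written out in R. The numbers below are hypothetical and are deliberately NOT the values from this exercise; the variable names simply follow the notation used in the steps.

<pre style="overflow:auto;">
# Hypothetical numbers for illustration only (not the cluster #1 values)
X        <- 100    # genes with the characteristic in the population group
Y        <- 5000   # size of the population group
n        <- 20     # size of the study group
observed <- 8      # genes with the characteristic in the study group

FX         <- X / Y               # step 1: population frequency (0.02)
expected   <- FX * n              # step 2: expected number in the study group (0.4)
enrichment <- observed / expected # steps 3-4: observed/expected (20)

c(frequency = FX, expected = expected, enrichment = enrichment)
</pre>

An enrichment above 1 means the characteristic is seen more often in the study group than expected by chance. Whether the difference is statistically significant is a separate question, which is what the p-value addresses.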

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome'''.

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php
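
If you prefer to stay in R, the same kind of p-value can be sketched with base R functions. The example below uses hypothetical numbers (not the values from this exercise) and assumes a one-sided test for over-representation; note that online calculators often report a two-sided p-value, so the numbers may differ slightly.

<pre style="overflow:auto;">
# Hypothetical numbers for illustration only (not the cluster #1 values)
X        <- 100    # genes with the GO term in the population group
Y        <- 5000   # size of the population group
n        <- 20     # size of the study group
observed <- 8      # genes with the GO term in the study group

# 2x2 contingency table: rows = with/without the term,
# columns = study group / rest of the population
tab <- matrix(
  c(observed,     n - observed,
    X - observed, (Y - X) - (n - observed)),
  nrow = 2,
  dimnames = list(c("with_term", "without_term"), c("study", "rest"))
)

# One-sided Fisher's exact test for over-representation
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The same probability from the hypergeometric distribution:
# P(at least 'observed' annotated genes when drawing n genes at random)
p_hyper <- phyper(observed - 1, X, Y - X, n, lower.tail = FALSE)

c(fisher = p_fisher, hypergeometric = p_hyper)
</pre>

Both calls give the same probability, which illustrates that a one-sided Fisher's exact test and the hypergeometric test answer the same question.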



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercises as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
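
To see what this one-liner does, it can help to try it on a tiny made-up table first. The sketch below assumes only base R; <code>toy_df</code> and the gene set names are hypothetical, while the column names mirror the ones used in the command above.

<pre style="overflow:auto;">
# A tiny, made-up table with one row per (gene set, gene) pair
toy_df <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B"),
  ensembl_gene = c("YAL001C", "YBR087W", "YBR087W"),
  stringsAsFactors = FALSE
)

# split() groups the gene identifiers by gene set name
toy_list <- split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

class(toy_df)    # "data.frame": one long table
class(toy_list)  # "list": one character vector of genes per gene set
toy_list
</pre>

The result is a named list in which each element holds the genes of one gene set, and a gene that belongs to several sets appears in each of them.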

'''Running "fora"'''

Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
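
As a rough sketch of the mechanics, here is a call to "fora" on toy data, followed by the enrichment calculation from earlier in this exercise. All objects prefixed with <code>toy_</code> are hypothetical and use made-up identifiers; they are not the course objects, so you will need to substitute your own target list, background list and gene sets.

<pre style="overflow:auto;">
library(fgsea)

# Hypothetical toy data with made-up identifiers
toy_universe <- paste0("G", 1:100)                # background list
toy_sets     <- list(SET_A = paste0("G", 1:10),   # gene sets (named list)
                     SET_B = paste0("G", 51:70))
toy_target   <- paste0("G", c(1:5, 51, 80:83))    # target list (10 genes)

# Over representation analysis
res <- fora(pathways = toy_sets, genes = toy_target, universe = toy_universe)

# Enrichment = observed frequency in the target list / frequency in the background
res$enrichment <- (res$overlap / length(toy_target)) /
                  (res$size / length(toy_universe))

as.data.frame(res)[, c("pathway", "pval", "padj", "overlap", "size", "enrichment")]
</pre>

In this toy case SET_A has 5 of its 10 genes in the target list (enrichment 5), while SET_B has only 1 of its 20 (enrichment 0.5). The calculation assumes that every gene in the target list is also present in the background list.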

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:08:45Z

Krivi: /* UniProt */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, much like gene IDs in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: a standard set of '''labels''' for describing gene/protein functionality will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to do that yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes (via their non-sister chromatids) can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords''' we learned to use in course '''22111'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
** How does the information here compare to the information in the ''keywords'' section?
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number.
# Calculate the '''enrichment''' as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php
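
As an alternative to the online calculators, the same kind of p-value can be computed in R with the built-in "fisher.test" function. The sketch below assumes a one-sided test for overrepresentation (online calculators often report a two-sided p-value, so the numbers may differ slightly) and uses made-up counts, not the cluster #1 values:

<pre style="overflow:auto;">
# Hypothetical counts, for illustration only
Y = 1000; X = 50; n = 20; observed = 5

# 2x2 contingency table: rows = has term / lacks term,
# columns = in study group / rest of population
tab = matrix(c(observed,     X - observed,
               n - observed, (Y - X) - (n - observed)),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("term", "no_term"), c("study", "rest")))

# One-sided Fisher's exact test: is the term overrepresented in the study group?
res = fisher.test(tab, alternative = "greater")
res$p.value

# The same p-value taken directly from the hypergeometric distribution
p_hyper = phyper(observed - 1, X, Y - X, n, lower.tail = FALSE)
p_hyper
</pre>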



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool to compare an '''input gene list''' (target list) against a background consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will re-visit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercise as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
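
To see what this one-liner does, it can help to try it on a tiny table first. The sketch below uses a made-up data frame as a stand-in for the msigdbr output; the gene set names and the assignment of gene IDs to sets are arbitrary and for illustration only:

<pre style="overflow:auto;">
# Toy stand-in for the msigdbr output (arbitrary gene sets, illustration only)
toy_df = data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  ensembl_gene = c("YAL001C", "YAL002W", "YAL001C", "YAL003W", "YAL005C"),
  stringsAsFactors = FALSE
)

# One character vector of gene IDs per gene set name
toy_list = split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

str(toy_list)    # a named list with one element per gene set
names(toy_list)
</pre>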

'''Running "fora"'''

Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.
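
For reference, a call to "fora" has the general shape sketched below. The example uses small made-up gene sets so that it can run on its own; all object names starting with "toy_" are hypothetical. In the exercise you would instead pass BP_list, your own cluster 1 target list, and the background list from the Rdata object:

<pre style="overflow:auto;">
library(fgsea)

# Made-up inputs, for illustration only
toy_sets = list(SET_A = c("g1", "g2", "g3", "g4"),
                SET_B = c("g5", "g6", "g7"))
toy_background = paste0("g", 1:100)        # "population group"
toy_target = c("g1", "g2", "g3", "g50")    # "study group"

toy_res = fora(pathways = toy_sets,
               genes    = toy_target,
               universe = toy_background)

toy_res   # one row per gene set, including pval, padj, overlap and size
</pre>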

Calculate the enrichment and add a column to your results table.
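
One possible way to do this is sketched below on a made-up results table that only mimics the "overlap" and "size" columns; the numbers and the names "toy_res", "n_target" and "n_background" are hypothetical. The sketch assumes that the enrichment is defined as in the manual calculation earlier (observed/expected), and that the target and background sizes only count genes that are actually present in the background list:

<pre style="overflow:auto;">
# Made-up stand-in for a "fora" result table (only the columns needed here)
toy_res = data.frame(pathway = c("SET_A", "SET_B"),
                     overlap = c(3, 0),
                     size    = c(4, 3))
n_target = 4         # number of genes in the target list
n_background = 100   # number of genes in the background list

# expected overlap = (gene set size / background size) * target list size
toy_res$expected = toy_res$size / n_background * n_target
toy_res$enrichment = toy_res$overlap / toy_res$expected
toy_res
</pre>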

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing if the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:07:31Z

Krivi: /* Part 1b: GO annotations on Genes and Proteins */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way as gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.
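
The idea of '''inherited properties''' can be sketched in R by walking up a chain of IS A links. This is a simplified illustration that uses only the last part of the abbreviated human lineage above and assumes each term has a single parent; real GO terms can have several parents. The function name "ancestors" is our own:

<pre style="overflow:auto;">
# Child -> parent "IS A" links, taken from the abbreviated human lineage above
is_a = c(Homo = "Hominidae", Hominidae = "Primates",
         Primates = "Mammalia", Mammalia = "Vertebrata")

# Walk up the hierarchy and collect every ancestor of a term
ancestors = function(term, parents) {
  out = character(0)
  while (term %in% names(parents)) {
    term = parents[[term]]
    out = c(out, term)
  }
  out
}

ancestors("Homo", is_a)      # "Hominidae" "Primates" "Mammalia" "Vertebrata"
ancestors("Mammalia", is_a)  # "Vertebrata"
</pre>

Everything that holds for an ancestor term also holds for "Homo", but "Homo" does not appear among the ancestors of "Mammalia": the relationship only works in one direction.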

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically, crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords''' we learned to use in course '''22111'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN (accession: P28340).
* Locate the "Function" and "Subcellular Location" sections.
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or, in the case of gene expression, all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following questions:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool to compare an '''input gene list''' (target list) against a background consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will re-visit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercise as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''

Take a look at the background list and any one of the gene sets of the BP_list. Then, extract cluster 1 genes from the node annotation table as your target list, taking care to include the gene identifier that matches the identifier used in the gene sets and background.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing if the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:05:52Z

Krivi: /* Part 1b: GO annotations on Genes and Proteins */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way as gene ids in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.
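
The inheritance idea can be sketched in a few lines of R. This is only an illustration: the toy hierarchy below is typed in by hand to mirror part of the abbreviated human lineage above (it is not read from GO or NCBI Taxonomy), and the helper name "ancestors" is our own.

<pre style="overflow:auto;">
# Toy IS A hierarchy: each name is a term, each value is its parent.
# Hand-typed, hypothetical data mirroring the human lineage above.
is_a = c(Homo = "Hominidae", Hominidae = "Primates",
         Primates = "Mammalia", Mammalia = "Vertebrata")

# Follow the IS A links upwards and collect every ancestor of a term.
ancestors = function(term, parents) {
  found = character(0)
  while (term %in% names(parents)) {
    term = parents[[term]]
    found = c(found, term)
  }
  found
}

ancestors("Homo", is_a)                    # every term that Homo inherits from
"Primates" %in% ancestors("Homo", is_a)    # TRUE: all humans are primates
"Homo" %in% ancestors("Mammalia", is_a)    # FALSE: not all mammals are human
</pre>

The relationship only runs in one direction: a term inherits from everything above it, but never from the terms below it.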

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated when we bring the "'''PART OF'''" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

= Part 1b: GO annotations on Genes and Proteins =
As was mentioned earlier, Gene Ontology was created to provide a '''standardized''' set of "keywords" for annotating the function of genes and proteins. We'll now have a look at how GO is actually used in large sequence databases.

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords''' we learned to use in course '''22111'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression, all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply:
# Calculate the frequency across the entire '''population group''' (X number of genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number in the study group.
# The '''enrichment''' is then calculated as the ratio observed/expected.
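
As a minimal sketch, the four steps can be written out in R as follows. The numbers are made up purely to illustrate the arithmetic (they are not the cluster #1 values), and all variable names are our own.

<pre style="overflow:auto;">
# Hypothetical numbers, chosen only to illustrate the arithmetic
Y = 1000        # population group size
X = 50          # genes in the population with the characteristic
n = 20          # study group size
observed = 5    # genes in the study group with the characteristic

FX = X / Y                          # step 1: frequency in the population
expected = FX * n                   # step 2: expected count in the study group
enrichment = observed / expected    # steps 3-4: observed compared to expected

expected      # 1
enrichment    # 5
</pre>

An enrichment above 1 means the characteristic is seen more often in the study group than expected by chance; whether the difference is statistically significant is a separate question.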

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php
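
If you prefer to stay in R, the same kind of test can be sketched with the base R function "fisher.test". The counts below are hypothetical (not the cluster #1 values), and the choice of a one-sided test is our assumption about how the question is posed.

<pre style="overflow:auto;">
# Hypothetical counts for a 2x2 contingency table (total = 1000 genes)
in_study_with    = 5      # study group, annotated with the term
in_study_without = 15     # study group, not annotated
rest_with        = 45     # rest of the population, annotated
rest_without     = 935    # rest of the population, not annotated

# matrix() fills column by column: first the study group, then the rest
tab = matrix(c(in_study_with, in_study_without,
               rest_with,     rest_without),
             nrow = 2,
             dimnames = list(c("annotated", "not annotated"),
                             c("study group", "rest of population")))

# One-sided test: is the term over-represented in the study group?
res = fisher.test(tab, alternative = "greater")
res$p.value
</pre>

Note that the second column holds the genes ''outside'' the study group, so the four cells add up to the population group size.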



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.
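
To get a feeling for what multiple testing correction does to a list of p-values, here is a small sketch using the base R function "p.adjust". The p-values are made up, and the Benjamini-Hochberg method is used here as an assumption about a typical choice; check the "fora" documentation for the method it actually applies.

<pre style="overflow:auto;">
# Made-up raw p-values for five gene sets
pval = c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment (base R); an adjusted p-value is never
# smaller than the raw p-value it was computed from
padj = p.adjust(pval, method = "BH")
data.frame(pval, padj)
</pre>

The more gene sets that are tested at once, the more the small raw p-values are pushed upwards.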

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercise as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercise).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
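
To see what the one-liner does, it can help to try "split" on a tiny table first. The sketch below uses a hand-made data frame with the same two column names; the gene set names and gene identifiers are made up.

<pre style="overflow:auto;">
# Toy table in the same long format: one row per (gene set, gene) pair.
# The set names and gene identifiers are hypothetical.
toy_df = data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B"),
  ensembl_gene = c("gene1", "gene2", "gene2"),
  stringsAsFactors = FALSE
)

# split() groups the genes by gene set name and returns a named list
toy_list = split(x = toy_df$ensembl_gene, f = toy_df$gs_name)
class(toy_list)    # "list"
names(toy_list)    # "SET_A" "SET_B"
toy_list$SET_A     # "gene1" "gene2"
</pre>

Each element of the list is a character vector holding the genes of one gene set, and the same gene may appear in several sets.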

'''Running "fora"'''

Take a look at the background list and at any one of the gene sets in BP_list. Then extract the cluster 1 genes from the node annotation table as your target list, taking care to use the gene identifier that matches the identifiers used in the gene sets and the background.

Run "fora" with the gene sets, the target list and the background list, and take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
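
The calculation follows the same observed/expected logic as in the manual example. As a hedged sketch, the table below is a hand-made stand-in for a results table: the column names "overlap" and "size" are assumptions based on the description above, and all numbers, including the target and background list lengths, are made up.

<pre style="overflow:auto;">
# Hand-made stand-in for a results table (hypothetical values)
res = data.frame(
  pathway = c("SET_A", "SET_B"),
  overlap = c(8, 1),        # genes shared by the gene set and the target list
  size    = c(100, 200),    # gene set size
  stringsAsFactors = FALSE
)
n_target     = 16      # hypothetical length of the target list
n_background = 5000    # hypothetical length of the background list

# expected overlap = frequency of the gene set in the background * target size
res$expected   = res$size / n_background * n_target
res$enrichment = res$overlap / res$expected
res
</pre>

In your own analysis, replace the two hypothetical lengths with the actual lengths of your target and background lists.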

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:05:43Z

Krivi: /* Part 1b: GO annotations on Genes and Proteins */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way as gene IDs in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many child terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: From this we can see that all humans are primates, and all primates are mammals, but all mammals are NOT (necessarily) human.

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated when we bring the "'''PART OF'''" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis the homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

= Part 1b: GO annotations on Genes and Proteins =
As was mentioned earlier, Gene Ontology was created to provide a '''standardized''' set of "keywords" for annotating the function of genes and proteins. We'll now have a look at how GO is actually used in large sequence databases.

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords''' we learned to use in course '''22111'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression, all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply:
# Calculate the frequency across the entire '''population group''' (X number of genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare this to the observed number in the study group.
# The '''enrichment''' is then calculated as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercise as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercise).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''

Take a look at the background list and at any one of the gene sets in BP_list. Then extract the cluster 1 genes from the node annotation table as your target list, taking care to use the gene identifier that matches the identifiers used in the gene sets and the background.

Run "fora" with the gene sets, the target list and the background list, and take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' Perform the following over-representation analyses and write a short report documenting your findings:
* Biological Process
* Molecular Function
* Cellular Component
* Questions:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:03:14Z

Krivi: /* Cellular Component examples */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, in much the same way as genes have IDs in Ensembl (e.g. ENSG00000141510) and proteins have identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many child terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

This concept is equivalent to species taxonomy - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In GO terminology these are '''IS A''' relationships. From this we can see that all humans are primates and all primates are mammals, but not all mammals are human.
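
To make the idea of inherited properties concrete, here is a minimal R sketch (not part of the original exercise) that walks up the abbreviated lineages shown above. The names "parent" and "ancestors" are made up for this illustration, and treating neighbouring entries of an abbreviated lineage as direct child/parent pairs is a simplification.

<pre style="overflow:auto;">
# Child -> parent links taken from the abbreviated lineages above (simplified)
parent <- c(Homo = "Hominidae", Hominidae = "Primates", Primates = "Mammalia",
            Mus = "Muroidea", Muroidea = "Eutheria", Eutheria = "Mammalia")

# Follow the IS A links upwards and collect every ancestor of a term
ancestors <- function(term, parent) {
  out <- character(0)
  while (term %in% names(parent)) {
    term <- parent[[term]]
    out <- c(out, term)
  }
  out
}

ancestors("Homo", parent)
"Mammalia" %in% ancestors("Homo", parent)   # humans are mammals
"Primates" %in% ancestors("Mus", parent)    # mice are not primates
</pre>

The same kind of upward walk is what lets a gene annotated with a specific GO term also count as annotated with all of that term's ancestors.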

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

= Part 1b: GO annotations on Genes and Proteins =
As was mentioned earlier, Gene Ontology was created to provide a '''standardized''' set of "keywords" for annotating the function of genes and proteins. We'll now have a look at how GO is actually used in large sequence databases.

== Saccharomyces Genome Database ==

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: Look up the entry for POL1 (YNL102W) in [http://www.yeastgenome.org SGD]'''
* Notice that all Saccharomyces Genome Database (SGD) entries have an entire section on Gene Ontology annotations; click on the "Gene Ontology" tab for full details. This also includes a bit of extra information about the '''evidence''' for annotations.
* What is the Molecular Function for POL1?
* Click on the link for this term to see how SGD describes the GO term, and how the evidence is presented.
** How many other yeast genes are ALSO annotated to have "DNA-directed DNA polymerase activity"?

'''IMPORTANT:''' SGD also offers the possibility to jump from their website to the same GO term inside AmiGO. This is very useful for investigating the hierarchy of GO terms "above" - SGD has limited functionality for this.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7: Follow the link to AmiGO'''
* Follow the link to AmiGO for the Molecular Function term found above, and answer the following question:
** Does POL1 have "Transferase Activity"? (Which GO term?)

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords''' we learned to use in course '''22111'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.
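
The four steps above can be sketched in a few lines of R. This is only an illustration: all numbers are hypothetical (they are not the values you need for the exercise), and the variable names are made up.

<pre style="overflow:auto;">
# Hypothetical numbers for illustration only (not the values asked for in the exercise)
population_size <- 1000   # Y: genes in the population group
population_hits <- 50     # X: genes in the population with the characteristic
study_size      <- 20     # n: genes in the study group
study_hits      <- 8      # observed genes in the study group with the characteristic

freq       <- population_hits / population_size   # FX = X/Y
expected   <- freq * study_size                   # exp = FX * n
enrichment <- study_hits / expected               # observed/expected

freq        # 0.05
expected    # 1
enrichment  # 8
</pre>

An enrichment above 1 means the characteristic is seen more often in the study group than expected by chance; whether the difference is statistically significant is a separate question.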

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate and report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php
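
If you prefer to stay in R, the same kind of p-value can be computed with the built-in "fisher.test" function. The sketch below uses hypothetical counts (not the exercise values) and made-up variable names. It assumes a one-sided test for over-representation, and it also shows that this is the same as the upper tail of the hypergeometric distribution. Note that an online calculator may report a two-sided p-value, which can differ from the one-sided value.

<pre style="overflow:auto;">
# Hypothetical counts for illustration only
population_size <- 1000; population_hits <- 50
study_size <- 20; study_hits <- 8

# 2x2 contingency table: rows = annotated / not annotated, columns = study group / rest
tab <- matrix(c(study_hits,
                study_size - study_hits,
                population_hits - study_hits,
                (population_size - study_size) - (population_hits - study_hits)),
              nrow = 2,
              dimnames = list(c("annotated", "not_annotated"), c("study", "rest")))

# One-sided Fisher's exact test: is the term over-represented in the study group?
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The same p-value from the hypergeometric distribution: P(overlap >= study_hits)
p_hyper <- phyper(study_hits - 1, population_hits,
                  population_size - population_hits,
                  study_size, lower.tail = FALSE)

tab
p_fisher
p_hyper
</pre>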



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool for comparison of an '''input gene list''' (target list) against a background distribution consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large scale data analysis, which we will revisit in greater detail in a later exercise.
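
As a small preview of what multiple testing correction does, the sketch below adjusts a handful of hypothetical p-values with base R's "p.adjust". It assumes a Benjamini-Hochberg adjustment; check the "fora" documentation to see which method is used for its adjusted p-values.

<pre style="overflow:auto;">
# Hypothetical raw p-values, one per gene set tested
pvals <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
padj <- p.adjust(pvals, method = "BH")
padj
</pre>

The adjusted p-values are never smaller than the raw ones, so a gene set that looks significant on its own may no longer be significant once many gene sets are tested at the same time.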

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercise as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, check which object class the "fora" function expects for the gene sets, and which object class "msigdbr" produces with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>
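
To see what the one-liner does, it can help to try "split" on a tiny example first. The data frame below is a made-up stand-in with the same two column names as used above; the gene set names and gene IDs are hypothetical.

<pre style="overflow:auto;">
# Toy data frame mimicking two columns of the msigdbr output (made-up gene IDs)
toy_df <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_B", "SET_B", "SET_B"),
  ensembl_gene = c("G1", "G2", "G2", "G3", "G4"),
  stringsAsFactors = FALSE
)

# One list element per gene set, each holding a character vector of gene IDs
toy_list <- split(x = toy_df$ensembl_gene, f = toy_df$gs_name)

class(toy_df)    # "data.frame"
class(toy_list)  # "list"
toy_list$SET_B   # "G2" "G3" "G4"
</pre>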

'''Running "fora"'''

Take a look at the background list and any one of the gene sets in BP_list. Then extract the cluster 1 genes from the node annotation table as your target list, taking care to use the gene identifier that matches the identifier used in the gene sets and background. Finally, run "fora" with the gene sets, the target list and the background list as input.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
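
One possible way to add the enrichment column is sketched below. To keep the example self-contained it uses a small hand-made table with made-up numbers instead of real "fora" output; only the "pathway", "overlap" and "size" columns are mimicked. The object names "target_genes" and "background_genes" in the comment are hypothetical placeholders for your own target and background lists.

<pre style="overflow:auto;">
# Stand-in for a "fora" results table (made-up numbers); in the exercise it would
# come from something like:
#   res <- fora(pathways = BP_list, genes = target_genes, universe = background_genes)
res <- data.frame(
  pathway = c("SET_A", "SET_B"),
  overlap = c(8, 2),       # genes shared by the gene set and the target list
  size    = c(50, 100),    # genes in the gene set
  stringsAsFactors = FALSE
)
n_target     <- 20     # size of the target list (hypothetical)
n_background <- 1000   # size of the background list (hypothetical)

# expected overlap = gene set frequency in the background * target list size
res$expected   <- res$size / n_background * n_target
res$enrichment <- res$overlap / res$expected
res
</pre>

With real data, "n_target" and "n_background" would be the lengths of your target and background lists.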

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' Perform the following over-representation analyses and write a short report documenting your findings:
* Biological Process
* Molecular Function
* Cellular Component
* Questions:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:01:39Z

Krivi: /* Example: "Cell division" */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER, in much the same way as genes have IDs in Ensembl (e.g. ENSG00000141510) and proteins have identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many child terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

It's much the same case as we have previously seen with taxonomy in course 27611 - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In GO terminology these are '''IS A''' relationships. From this we can see that all humans are primates and all primates are mammals, but not all mammals are human.

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

= Part 1b: GO annotations on Genes and Proteins =
As was mentioned earlier, Gene Ontology was created to provide a '''standardized''' set of "keywords" for annotating the function of genes and proteins. We'll now have a look at how GO is actually used in large sequence databases.

== Saccharomyces Genome Database ==

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: Look up the entry for POL1 (YNL102W) in [http://www.yeastgenome.org SGD]'''
* Notice that all Saccharomyces Genome Database (SGD) entries have an entire section on Gene Ontology annotations; click on the "Gene Ontology" tab for full details. This also includes a bit of extra information about the '''evidence''' for annotations.
* What is the Molecular Function for POL1?
* Click on the link for this term to see how SGD describes the GO term, and how the evidence is presented.
** How many other yeast genes are ALSO annotated to have "DNA-directed DNA polymerase activity"?

'''IMPORTANT:''' SGD also offers the possibility to jump from their website to the same GO term inside AmiGO. This is very useful for investigating the hierarchy of GO terms "above" - SGD has limited functionality for this.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7: Follow the link to AmiGO'''
* Follow the link to AmiGO for the Molecular Function term found above, and answer the following question:
** Does POL1 have "Transferase Activity"? (Which GO term?)

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords''' we learned to use in course '''22111'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression, all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool to compare an '''input gene list''' (target list) against a background consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large-scale data analysis, which we will revisit in greater detail in a later exercise.

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercise as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''

Take a look at the background list and any one of the gene sets in BP_list. Then, extract the cluster 1 genes from the node annotation table as your target list, taking care to use the gene identifier that matches the identifier used in the gene sets and the background. Finally, run "fora" with the gene sets, the target list, and the background as input.

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?

ExGeneOntology R

2025-09-10T08:00:44Z

Krivi: /* Example: "Cell division" */

= Gene Ontology - cell cycle examples =
[[Image:QuickGO_Mitochondrion.png|thumb|right|250px|'''Cellular component''' example: the GO term '''mitochondrion''']]
'''Exercise written by:''' Rasmus Wernersson & Kristoffer Vitting-Seerup

'''Purpose of this exercise:'''
* Understand how Gene Ontology terms are defined and organized:
** The relationship between GO terms (IS A, PART OF, etc)
** The three main trunks of GO: BIOLOGICAL PROCESS, MOLECULAR FUNCTION and CELLULAR COMPONENT.
* Learn how to query the Gene Ontology database.
** Using the official online GO query system: AmiGO
** Using links from UniProt.
* Understand the theory behind GO over-representation analysis
* Learn how to perform GO over-representation analysis:
** Using the R package "fgsea"

= Part 1a: using the Gene Ontology terms and tools =
The '''Gene Ontology''' database contains a collection of '''strict definitions''' of '''biological terms''', and information about how the terms relate to each other (for example '''DNA replication''' is a '''biosynthetic process''' which in turn is a '''biological process'''). The Gene Ontology system is divided into three main trunks:

# '''Biological Process''' (e.g. DNA replication)
# '''Molecular Function''' (e.g. DNA binding)
# '''Cellular Component''' (e.g. Nucleus).

Each term has a UNIQUE IDENTIFIER - much in the same way as gene IDs in Ensembl (e.g. ENSG00000141510) and protein identifiers in UniProt (e.g. P53_HUMAN). The Gene Ontology was created to provide a standardized way to characterize the functionality of '''genes''' (hence the name) and '''gene products''' (proteins). The idea is much the same as with UniProt keywords: to have a standard set of '''labels''' that can be used to describe gene/protein functionality. This will both 1) make it much easier to search gene/protein databases, and 2) make it much easier to perform '''large scale''' comparisons of genes/proteins.

'''LINKS:'''

* '''Database look-up:'''
** http://www.geneontology.org - the home of the Gene Ontology project
** http://amigo.geneontology.org - the '''AmiGO''' search system (the official search engine for the project)
** http://www.ebi.ac.uk/QuickGO - An alternative search system for Gene Ontology, created and maintained by the EBI. They provide a nice graphical representation of the GO trees.
* '''Overrepresentation analysis:'''
** The R package [https://bioconductor.org/packages/release/bioc/html/fgsea.html "fgsea"] for over-representation analysis, and the R package [https://cran.r-project.org/web/packages/msigdbr/index.html "msigdbr"] for retrieving gene sets. '''Note''': These packages are already installed on the RStudio server - no need to install them yourself.

== Example: "Cell division" ==
Gene Ontology provides a wealth of information, to the point where it can be a bit intimidating at first (it can be difficult to see the forest for all the trees). Before we start browsing the full online database, we will start out with a simple example, where we'll highlight some of the most important features, and for a moment hide the rest:

{| border="1" cellpadding="5" cellspacing="0"
|+ '''Term: Cell division'''
|'''Accession'''
|GO:0051301
|rowspan="4" | [[Image:QuickGO_CellDivision_cropped.png]]
|-
|'''Ontology'''
|Biological Process
|-
|'''Definition'''
|The process resulting in division and partitioning of components of a cell to form more cells.
|-
|'''Comment'''
|Note that this term differs from 'cytokinesis ; GO:0000910' in that cytokinesis does not include nuclear division.
|}

Example of the definition of a GO term ("Cell division" ; GO id: 0051301) in the '''biological process''' category, and a graph showing its relationship to other GO terms. Note that there are 7 different types of relationships defined. The most common one is the '''IS A''' relationship.

'''TASK: investigate "cell division" using [http://amigo.geneontology.org AmiGO 2]'''
* Select Search -> Ontology from the top menu.
* Search for the term "cell division", and click on the entry for "cell division" at the top of the results list.
* Spend some time getting familiar with the entry page:
** The top part contains the definition(s) related to this particular entry.
** The lower part contains information about how this entry relates to OTHER entries. Try clicking on the various tabs, such as "Graph Views".

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #1:'''
* How many ancestor terms are defined? With how many different types of relationships?
* How many children terms are defined? With how many different types of relationships?

== Cellular Component examples ==
[[Image:Animal_cell_structure_en.png|thumb|300px|Examples of subcellular components from an animal cell. Image Source: [http://en.wikipedia.org/wiki/File:Animal_cell_structure_en.svg Wikipedia].]]
The "Cellular Component" part of Gene Ontology is good for illustrating the concept of '''nested terms''' in more detail, since it's easy to visualize the boxes-in-boxes concept here.

'''For example:''' The '''Nucleolus''' (the organelle for synthesis and maturation of Ribosomal RNA) is located WITHIN the '''nucleus''' which is located WITHIN the '''cell'''. While this seems trivial and evidently true, it's important to realize the concept of '''inherited properties''' within a hierarchical structure.

It's much the same case as we have previously seen with taxonomy in course 27611 - for example, if you look up Human (''Homo sapiens'') and Mouse (''Mus musculus'') in NCBI Taxonomy, the abbreviated lineages look like this:

<pre style="overflow:auto;">
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Primates › Hominidae › Homo
Eukaryota › Metazoa › Chordata › Craniata › Vertebrata › Mammalia › Eutheria › Muroidea › Mus
</pre>

In the GO terminology these are '''IS A''' relationships: from this we can see that all humans are primates, and all primates are mammals, but NOT all mammals are human.
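
The idea of inherited membership can be sketched in a few lines of R. The example below is only an illustration: it uses a shortened version of the human lineage shown above, stored as a simple child-to-parent lookup, and the helper name "get_ancestors" is made up for this sketch. Real GO terms can have several parents, which this toy structure does not model.

<pre style="overflow:auto;">
# Toy IS A hierarchy: each name is a group, each value is its direct parent.
# Only part of the human lineage listed above is included, for illustration.
is_a <- c(
  Homo      = "Hominidae",
  Hominidae = "Primates",
  Primates  = "Mammalia",
  Mammalia  = "Vertebrata"
)

# Hypothetical helper: walk upwards from a term and collect every ancestor.
get_ancestors <- function(term, parent_of) {
  ancestors <- character(0)
  while (term %in% names(parent_of)) {
    term <- parent_of[[term]]
    ancestors <- c(ancestors, term)
  }
  ancestors
}

get_ancestors("Homo", is_a)      # every group that Homo belongs to by inheritance
get_ancestors("Mammalia", is_a)  # "Homo" is not in this list: IS A only points upwards
</pre>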

'''TASK: investigate the nucleus in GO'''
* Look up "nucleus" (GO:0005634) in AmiGO.
**''Hint: if you get a lot of hits that are not what you are looking for, try putting your query inside quotation marks ("").''

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
* Is "nucleus" a membrane-bound, or non membrane-bound organelle? On which linked GO terms do you base this conclusion?
* What types of relationships are found?

'''IS A''' vs. '''PART OF''': So far we have been focusing on the "'''IS A'''" relationship (nucleus IS A organelle which in turn IS A cellular component). However, things get a bit more complicated, when we bring the "PART OF" relationship into the picture, as the next example will show.

== Molecular Function examples ==
Continuing with our focus on cell cycle and DNA replication, investigate the following terms:

# DNA Polymerase Activity
# Helicase Activity

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #3:'''
Answer the following questions:

* Can the activities described be directed towards both DNA and RNA? Remember to include your arguments.
* At which node does the "tree" branch out of the "Molecular Function" ontology and into a different ontology? With what type of relationship? Does this make biological sense?

== (A few more) Biological Process examples ==
[[Image:QuickGO_negative_regulation_example.png|right|thumb|300px|Example of '''negative regulation''']]
Next up we'll investigate how the different phases of the cell cycle have been categorized in GO.

* Start out by looking up the entry for the G1 phase: '''GO:0051318'''

From here you'll need to investigate the "neighborhood" of terms to answer the questions below - ask the instructor for help if you get stuck.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #4:''' '''Mitosis related questions:''' (ignore meiosis for now)
* How many (if any) '''cell cycle''' sub-phases are defined for:
** G1 phase
** S phase
** G2 phase
** M phase
* Which phases are grouped together into the "interphase" term?

[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #5:''' '''Meiosis related question:''' During meiosis, homologous chromosomes can exchange DNA in a process called "homologous chromosome pairing at meiosis" - you may have heard this described as "Chromosome Crossover" as well (technically, crossing over is the '''method''' by which the '''process''' happens).
* In which meiotic cell cycle phase does '''homologous chromosome pairing at meiosis''' happen?

= Part 1b: GO annotations on Genes and Proteins =
As was mentioned earlier, Gene Ontology was created to provide a '''standardized''' set of "keywords" for annotating the function of genes and proteins. We'll now have a look at how GO is actually used in large sequence databases.

== Saccharomyces Genome Database ==

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #6: Look up the entry for POL1 (YNL102W) in [http://www.yeastgenome.org SGD]'''
* Notice that all Saccharomyces Genome Database (SGD) entries have an entire section on Gene Ontology annotations; click on the "Gene Ontology" tab for full details. This tab also includes a bit of extra information about the '''evidence''' for annotations.
* What is the Molecular Function for POL1?
* Click on the link for this term to see how SGD describes the GO term, and how the evidence is presented.
** How many other yeast genes are ALSO annotated to have "DNA-directed DNA polymerase activity"?

'''IMPORTANT:''' SGD also offers the possibility to jump from their website to the same GO term inside AmiGO. This is very useful for investigating the hierarchy of GO terms "above" - SGD has limited functionality for this.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #7: Follow the link to AmiGO'''
* Follow the link to AmiGO for the Molecular Function term found above, and answer the following question:
** Does POL1 have "Transferase Activity"? (Which GO term?)

== UniProt ==
UniProt uses a lot of its own annotation - for example the UniProt '''keywords''' we learned to use in course '''22111'''. However, the protein entries are also annotated with GO terms.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #8: look up human POLD1 in [http://www.uniprot.org UniProt]''':
* Entry name: DPOD1_HUMAN or P28340.
* Locate the "Function" and "Subcellular Location" sections.
* Gene Ontology:
** Select a GO term you think looks interesting and click on it - this will take you to EBI's own interface to GO.
** Click through the tabs, and see what type of information is there.
** Notice that there is NO direct link to AmiGO, but you can always copy+paste the GO identifier into AmiGO, if you want to investigate the term using a more familiar interface.
* Questions (requires a bit of detective work - ask the instructor if you get stuck):
** How many UniProt proteins are annotated with "GO:0045004 DNA replication proofreading"?
** How many of these are human? (TaxID: 9606 - use the filtering function, if you don't want to count them).

= Part 2: Gene Ontology overrepresentation analysis =

In this part of the exercise, we focus on performing analysis for overrepresentation in GO. We are interested in knowing if a chosen subset of genes/proteins has any special characteristics, compared to what would be expected if we picked a similar sized subset '''randomly''' from the entire pool of genes/proteins.

'''Study group:'''
* The subset (study group) could be a set of overexpressed genes from a microarray experiment, or simply a list of genes which you for other reasons expect to be involved in the same biological process.
'''Population group:'''
* The population group would then be defined as the '''background''' to compare to - e.g. the entire list of genes, or in the case of gene expression, all genes represented on the microarray. We can then ask if the frequency of proteins annotated with a GO term is different for the study group compared to the overall population.

== Enrichment analysis - reexamining cluster #1 ==
[[Image:Cluster1_subnetwork.png|thumb|200px|right]]

Before we move on to the more advanced statistical methods, we'll spend a moment answering the following question:
* ''Is the '''observed frequency''' of a given characteristic different from the '''expected frequency''' ''?

The steps to do this are simply to:
# Calculate the frequency across the entire '''population group''' (X genes with the characteristic in a total population of Y: FX = X/Y).
# From this frequency, calculate the expected number of genes/proteins with this characteristic in the '''study group''' (n = size of study group; exp = FX * n).
# Compare the expected number to the observed number in the study group.
# Calculate the '''enrichment''' as the ratio observed/expected.
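
As a quick illustration of the arithmetic before turning to the real data, the steps above can be sketched in R. All numbers below are hypothetical and chosen only to make the calculation easy to follow; they are not taken from the yeast data.

<pre style="overflow:auto;">
# Hypothetical numbers, used only to illustrate the arithmetic
population_size <- 1000  # Y: genes in the population group
population_hits <- 50    # X: genes in the population with the characteristic
study_size      <- 20    # n: genes in the study group
study_hits      <- 5     # observed genes in the study group with the characteristic

frequency  <- population_hits / population_size  # FX = X / Y
expected   <- frequency * study_size             # exp = FX * n
enrichment <- study_hits / expected              # observed / expected

c(frequency = frequency, expected = expected, enrichment = enrichment)
</pre>

With these made-up numbers, one gene with the characteristic would be expected by chance, so observing five corresponds to a five-fold enrichment.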

We'll go through this with an example: in this case we want to examine whether the proteins in Cluster #1 (from [[ExYeastSysBio1|last week's exercise]]) are overrepresented in any of the following three GO terms: "DNA replication", "DNA repair" and "Cell cycle".

'''All proteins in Cluster #1 are listed in the table below with a mark if they are associated with a GO term:'''

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''Systematic name'''
| align="center" style="background:#a0a0f0;"|'''Gene name'''
| align="center" style="background:#a0a0f0;"|'''Description'''
| align="center" style="background:#a0a0f0;"|'''DNA replication (GO:0006260)'''
| align="center" style="background:#a0a0f0;"|'''DNA repair (GO:0006281)'''
| align="center" style="background:#a0a0f0;"|'''Cell cycle (GO:0007049)'''
|-
| YMR078C||CTF18||Chromosome transmission fidelity protein 18||X||X||X
|-
| YPR175W||DPB2||DNA polymerase epsilon subunit B||X||X||X
|-
| YBR278W||DPB3||DNA polymerase epsilon subunit C||X||X||
|-
| YBL035C||POL12||DNA polymerase alpha subunit B||X||||
|-
| YNL102W||POL1||DNA polymerase alpha catalytic subunit A||X||||
|-
| YNL262W||POL2||DNA polymerase epsilon catalytic subunit A||X||X||
|-
| YOR144C||ELG1||Telomere length regulation protein ELG1||X||X||X
|-
| YPR167C||MET16||Phosphoadenosine phosphosulfate reductase||||||
|-
| YIR008C||PRI1||DNA primase small subunit||X||||
|-
| YKL045W||PRI2||DNA primase large subunit||X||||
|-
| YOR217W||RFC1||Replication factor C subunit 1||X||X||X
|-
| YJR068W||RFC2||Replication factor C subunit 2||X||X||X
|-
| YNL290W||RFC3||Replication factor C subunit 3||X||X||X
|-
| YOL094C||RFC4||Replication factor C subunit 4||X||X||X
|-
| YBR087W||RFC5||Replication factor C subunit 5||X||X||X
|-
| YCL042W||YCL042W||Putative uncharacterized protein YCL042W||||||
|}

The table below lists the '''total number''' of proteins involved in the three GO categories "DNA replication", "DNA repair" and "Cell cycle" across the '''entire yeast genome''':

{| border="1" cellpadding="5" cellspacing="0"
| align="center" style="background:#a0a0f0;"|'''GO term'''
| align="center" style="background:#a0a0f0;"|'''# genes (including subgroups)'''
|-
| DNA replication (GO:0006260)||96
|-
| DNA repair (GO:0006281)||259
|-
| Cell cycle (GO:0007049)||313
|}

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #9:''' Assuming yeast has '''5500''' annotated genes, calculate/report the following values:
# Population group size
# Study group size
# Genome wide frequency of each GO term
# Expected number of genes annotated with each term in a '''random''' selection of yeast genes of the same size as cluster #1
# The '''enrichment''' of observed GO terms compared to expected
# The p-value for each GO term
# The biological interpretation of the analysis

The p-values can be calculated using an online calculator such as [http://graphpad.com/quickcalcs/contingency1/ this one].
* '''2021 update:''' We are testing a new online calculator this year: https://www.medcalc.org/calc/fisher.php
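
If you prefer to stay in R, the same kind of p-value can be sketched with base R functions. The numbers below are hypothetical (not the yeast data), and the choice of a one-sided test ("is the term over-represented?") is an assumption made for this sketch; online calculators may report a two-sided p-value instead, so the numbers need not match exactly.

<pre style="overflow:auto;">
# Hypothetical numbers, used only for illustration
population_size <- 1000  # genes in the population group
population_hits <- 50    # genes in the population annotated with the term
study_size      <- 20    # genes in the study group
study_hits      <- 5     # genes in the study group annotated with the term

# 2x2 contingency table: rows = annotated with the term or not,
# columns = inside the study group or not
contingency <- matrix(
  c(study_hits,              population_hits - study_hits,
    study_size - study_hits, (population_size - population_hits) - (study_size - study_hits)),
  nrow = 2, byrow = TRUE,
  dimnames = list(term = c("yes", "no"), study_group = c("yes", "no"))
)

# One-sided Fisher's exact test: is the term over-represented in the study group?
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

# The same probability from the hypergeometric distribution:
# P(at least study_hits annotated genes in a random sample of study_size genes)
p_hyper <- phyper(study_hits - 1, population_hits,
                  population_size - population_hits, study_size,
                  lower.tail = FALSE)

c(fisher = p_fisher, hypergeometric = p_hyper)
</pre>

The two values agree, which shows that the one-sided Fisher's exact test and the hypergeometric tail probability are the same calculation.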



== Automated analysis using "fgsea" and "msigdbr" ==
=== Introducing over representation in R ===

For the final part of the exercise, we'll be using an automated tool to compare an '''input gene list''' (target list) against a background consisting of the entire yeast genome (background list). The "fora" function from "fgsea" can be used to calculate p-values for all gene sets within a list of gene sets. We can use the "msigdbr" package to download gene sets for:

* Biological Processes
* Molecular Functions
* Cellular Components

The results are returned as tables with p-values, gene set size, and the size of the overlap between the gene set and the target list. Finally, it's worth mentioning that the "fora" function also takes care of '''multiple testing correction''', an important problem for large-scale data analysis, which we will revisit in greater detail in a later exercise.
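
To give a feeling for what multiple testing correction does, here is a minimal sketch using the base R function "p.adjust" with the Benjamini-Hochberg ("BH") method. The p-values are hypothetical, and the assumption that "fora" uses this same method for its adjusted p-values should be checked in the "fora" help page.

<pre style="overflow:auto;">
# Hypothetical raw p-values from five tests
p_values <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment (controls the false discovery rate)
p_adjusted <- p.adjust(p_values, method = "BH")

data.frame(p_values, p_adjusted)
</pre>

Note that an adjusted p-value is never smaller than the raw p-value, so fewer gene sets pass a given significance threshold after correction.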

=== Preparing input data ===
First we need to prepare our input data. For an over-representation analysis, we need '''three inputs''': 1) a target gene list of interest, 2) a background gene list, and 3) the gene sets we wish to examine for over-representation.

'''Target gene list'''

We'll use '''Cluster #1''' from last week's exercise as an example. You can find clusters 1-8 in the node attribute table from last week's exercise, and clusters 9-10 in your solutions (also included in the Rdata object for this week's exercises).

'''Background list''' ("population group")

The background here will be the entire yeast genome - a list containing ALL yeast gene names. We have prepared such a list, and included it in the exercise5.Rdata object.

<pre style="overflow:auto;">
load("/home/projects/22140/exercise4.Rdata")
load("/home/projects/22140/exercise5.Rdata")
</pre>

'''Gene ontology gene sets'''

The gene ontology gene sets can be downloaded from the "molecular signatures database" (msigdb) with the R package "msigdbr", using the following command:

<pre style="overflow:auto;">
library(msigdbr)
BP_df = msigdbr(species = "S. cerevisiae", category = "C5", subcategory = "BP")
</pre>

This command retrieves biological process gene sets for yeast. The "C5" category is the msigdb annotation for the three gene ontologies. The subcategory is either "BP" (biological process), "MF" (molecular function), or "CC" (cellular component).

Before we proceed, take a moment to explore the functions "fora" and "msigdbr". In particular, take a look at what object class the "fora" function expects the gene sets to be, and what object class is produced by "msigdbr" with the command above.

'''TASK/REPORT QUESTION #11:''' Can you directly use the gene sets produced by the msigdbr function as input for the "fora" function of "fgsea"? Why/why not?

Run the following one-liner to prepare the biological process gene sets for over representation analysis using "fora".

<pre style="overflow:auto;">
BP_list = split(x = BP_df$ensembl_gene, f = BP_df$gs_name)
</pre>

'''Running "fora"'''

Take a look at the background list and any one of the gene sets in BP_list. Then, extract the cluster 1 genes from the node annotation table as your target list, taking care to use the gene identifier that matches the identifier used in the gene sets and the background. Finally, run "fora" with the gene sets, the target list, and the background as input.
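
The extraction step can be sketched on a toy node table. Everything in this example is hypothetical: the table, its column names ("name" and "cluster"), the cluster assignments and the short background vector are made up, and the real node annotation table may use different column names. Treat it as an illustration of the idea, not as the exact commands to run.

<pre style="overflow:auto;">
# Hypothetical node table with a gene identifier and a cluster number per node
toy_nodes <- data.frame(
  name    = c("YNL102W", "YBL035C", "YPR167C", "YAL001C"),
  cluster = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical background list using the same type of identifier
toy_background <- c("YNL102W", "YBL035C", "YPR167C", "YAL001C", "YOR217W")

# Keep the identifiers of the nodes assigned to cluster 1
toy_target <- toy_nodes$name[which(toy_nodes$cluster == 1)]

# Sanity check: every target identifier should also be found in the background
all(toy_target %in% toy_background)
</pre>

If the sanity check returns FALSE on your real data, the target list and the background are most likely using different identifier types.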

Take a look at the results. As mentioned, "fora" produces a table with p-values, adjusted p-values, gene set size, and the size of the overlap between the gene set and the target list, but ''not'' the enrichment.

Calculate the enrichment and add a column to your results table.
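
A minimal sketch of this step on toy data is shown below. The gene names, gene sets and the helper "add_enrichment" are all made up for illustration, and the sketch assumes that the result table has columns named "overlap" and "size", as described in the "fora" help page. The call to "fora" is only run if "fgsea" is installed.

<pre style="overflow:auto;">
# Hypothetical helper: add an observed/expected ratio to a table that has
# "overlap" (observed) and "size" (gene set size) columns
add_enrichment <- function(result, n_target, n_background) {
  expected <- result$size / n_background * n_target
  result$enrichment <- result$overlap / expected
  result
}

# Toy inputs: all gene names and gene sets are made up
toy_background <- paste0("G", 1:100)
toy_gene_sets  <- list(SET_A = paste0("G", 1:10), SET_B = paste0("G", 50:70))
toy_target     <- paste0("G", c(1:5, 60))

if (requireNamespace("fgsea", quietly = TRUE)) {
  toy_result <- fgsea::fora(pathways = toy_gene_sets,
                            genes    = toy_target,
                            universe = toy_background)
  # Convert to a plain data frame before adding the new column
  toy_result <- add_enrichment(as.data.frame(toy_result),
                               n_target     = length(toy_target),
                               n_background = length(toy_background))
  print(toy_result)
} else {
  message("fgsea is not installed; skipping the toy run")
}
</pre>

The enrichment here follows the same observed/expected logic as in the manual calculation earlier in the exercise, applied to every gene set at once.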

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #12:''' Run the analysis for cluster #1 for all three GO trunks (Biological Process, Molecular Function and Cellular Component), and include the top 10 significantly enriched GO terms for each ontology. Add a section to your report discussing whether the results fit with what we have previously learned about the function of cluster #1.

== Repeat analysis on selected clusters ==
[[Image:Cogs_brain.png|50px]] As the final task of today's exercise we'll be re-visiting the '''10 clusters''' from last week's exercise. Pick the '''3 clusters''' which appear the most interesting to you.

[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #13:''' perform the following over-representation analyses and create a short report documenting your findings:
* Biological Process
* Molecular function
* Cellular component
* Question:
** Do the results make biological sense?
** How do these results compare to the broad categories we observed in the previous exercise?