Heart Disease and Virtual Pulldown: Difference between revisions

 
(13 intermediate revisions by the same user not shown)
Line 6: Line 6:
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
*# Finding tightly connected clusters in larger networks
*# Finding tightly connected clusters in larger networks
*# Using the DiscoNet package in R
*# Using R to extract, analyze, interpret and visualize (sub)networks


== Introduction ==
== Introduction ==
Line 35: Line 35:


== Exercise on "Virtual Pulldowns" ==
== Exercise on "Virtual Pulldowns" ==
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. As we went through in much more detail in the lecture, InWeb was built by:
For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. This is one of many interaction networks. Note that it has a confidence score (cs-score) telling us how certain we are about each interaction.
# Transferring interactions from model organisms to human by orthology (if a pair of interacting proteins have strong orthologs in human, the interaction is transferred)
# '''Scoring''' the reliability of the interactions. This allows for filtering out interactions with little experimental support, thus building a high-confidence network
 
For querying the InWeb we'll use the "DiscoNet" R package, which works as follows:
 
# Input: a list of proteins believed to be involved in a particular biology (here: different heart diseases/developmental stages)
# For each protein, 1st order interaction partners are found
# For all input proteins and all 1st order interaction partners, a combined network is built
# For the combined network a series of '''scored subnetworks''' are built in order to filter away "sticky proteins" (as we talked about in the lecture)
# An overrepresentation analysis is performed for each complex using the fgsea package
# Finally a visual representation of the network is presented


=== Heart disease proteins ===
=== Heart disease proteins ===
Line 67: Line 56:


# The PPI database we will use is InWeb:
# The PPI database we will use is InWeb:
load(file='/home/projects/22140/inweb_reduced.Rdata')
load(file='/home/projects/22140/exercise4.Rdata')
</pre>
</pre>


2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph with the data
2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph from the data (via the graph_from_data_frame() function). No plotting needed now.




Line 116: Line 105:
Start by annotating the function with comments (like the "# require matching vertex names in both graphs" comment at the start of the function). You should annotate what each section of the function does. ''Hint'': If needed you can run parts of the code to see what the intermediary results are.
Start by annotating the function with comments (like the "# require matching vertex names in both graphs" comment at the start of the function). You should annotate what each section of the function does. ''Hint'': If needed you can run parts of the code to see what the intermediary results are.


5: Run the `filter_virtual_pulldown()` with these cutoffs: 0, 0.5, 1
5: Run the `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1


* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
* Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the confidence score cut-off is raised
Line 123: Line 112:
=== Visualizing networks ===
=== Visualizing networks ===


'''TASK:''' Get ready to visualize the three graphs (relevance score cutoffs 0, 0.5, 1) using ggraph.


[[Image:Office-notes-line_drawing.png|30px|left]]
[[Image:Office-notes-line_drawing.png|30px|left]]
'''REPORT QUESTION #2:'''
'''REPORT QUESTION #2:'''
* Include screenshots of the networks in your report
* Use ggraph to plot the graph created by filtering with confidence score 0.2. Indicate which nodes are seed nodes and which are neighbours.
 
=== Protein complex detection ===
Next up, we will use the MCODE algorithm to detect potential protein complexes using a 0.0 relevance filtering cutoff. This can be done with the "community_detection" function of DiscoNet:
 
<pre>
communities <- community_detection(g1, algorithm = "mcode")
</pre>
 
 
'''REPORT QUESTION #3:'''
Examine the resulting communities. Which one do you think may be molecular complexes and why? Paste an example of a community you believe could be a protein complex, and one you don't believe is a protein complex.


=== Functional classification ===
=== Functional classification ===
For the next part, we'll try to identify the function of the proteins we have found by performing Gene Ontology over-representation analysis of sub-clusters with-in the network.
Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora(). What background do you need to use? (You already have all the information needed to decide on and use this background.)
 
This can be done with the fgsea package.
 
Start by loading the background gene list:


<pre>
<pre>
load("/home/projects/22140/exercise9.Rdata")
</pre>
Run fora on all potential protein complexes:
<pre>
library(fgsea)
library(msigdbr)
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)
fora(pathways = BP_list, genes = V(communities$communities[['COMMUNITY NUMBER']])$name, universe = all_gene_ids)
</pre>
</pre>


[[Image:Office-notes-line_drawing.png|30px|left]]
[[Image:Office-notes-line_drawing.png|30px|left]]
'''TASK/REPORT QUESTION #4:'''
'''TASK/REPORT QUESTION #4:'''
* Discuss the interpretation of the most significant results for each of the communities that could be protein complexes. Do they make biological sense in the context of heart disease?
* Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?


<!---
[[Image:Office-notes-line_drawing.png|30px|left]]
=== Analysis of heart developmental disease networks ===
'''TASK/REPORT QUESTION #5:'''
[[Image:phenotype_groups.png|500px|right|thumb|Phenotype groups]]
* Which gene is the most central (as defined by betweenness centrality) in the largest cluster? Does that gene make sense in the disease context?
[[Image:Cogs_brain.png|50px]]
'''Final task/report question''': In the final part of the Virtual Pulldown exercise, your task is to select '''3 different sets of heart disease genes''' from the Lage ''et al'' (2010) study (data in excel file below) and do the following analyses:
# Create and download the virtual pulldown networks
# IMPORTANT: Create a NEW Cytoscape session.
# Import the networks into Cytoscape (either start a new session, or give the networks new names - otherwise Cytoscape gets confused).
#* Advanced: try to import the XML version instead of the SIF version (ask the instructor for help if needed); this can save you some time.
# Include a screenshot of the network in your report.
# Try to identify sub-networks in the network (by "eyeballing" the clusters), and perform a functional analysis of the proteins contained.
#* Report lists of proteins in the selected sub-networks.
#* Report Gene Ontology over-representation analysis for both Biological Process and Molecular Function.
#* Discuss and compare the results from the over-representation analysis.
 
Excel sheet with the heart disease gene lists:
* [https://teaching.healthtech.dtu.dk/27040/exercises/HeartDiseaseGenes.xlsx HeartDiseaseGenes.xlsx]
--->

Latest revision as of 12:19, 26 September 2025

Human diseases / virtual pulldown exercise

Exercise written by: Lars Rønn Olsen, Giorgia Moranzoni, Rasmus Wernersson and Kristoffer Vitting-Seerup

Learning objectives:

  • Overall objective: learn how to extract meaningful networks from human PPI data
    1. Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
    2. Finding tightly connected clusters in larger networks
    3. Using R to extract, analyze, interpret and visualize (sub)networks

Introduction

Figure: example of the type of network characterization we aim to do in this exercise. Seed proteins are colored yellow. (From Lage et al, 2010)

The network neighborhood as an indication of function

As we have seen before, the function of a protein can often be inferred from the function of the proteins interacting with it. Previously we have used this to look at the function of proteins about which little is known. Today we'll take this a step further by investigating what we can learn from the interaction partners of a group of potentially associated proteins.

Disease gene/protein networks

The basic idea is to start out with a list of genes/proteins that are known (or expected) to be involved in the same disease, even if the exact molecular basis of the disease is not understood. The hypothesis is that many of the disease-associated genes are likely to be involved in networks leading to the same phenotype (here: disease).

In the simplest case, all known disease-related genes might be part of a single network with a crystal-clear link to the disease. However, this is rarely the case.

What is more commonly the case is the situation shown in the figure to the right: the known disease-related genes/proteins ("seed proteins") are involved in several sub-networks describing different components of the disease.

There are several benefits from this type of analysis - most importantly:

  • Identification of novel disease-related genes/proteins
  • Generating hypotheses about the molecular biology behind a disease

Knowing where to look

The human interactome is HUGE. The theoretical maximum is all 20,000+ proteins interacting with each other (roughly 200 million pairwise interactions), but even the fraction we have experimental evidence for is on the order of several hundred thousand interactions between 12,000+ proteins.
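
The "200 million" figure follows from counting unordered protein pairs. A quick check in R, assuming exactly 20,000 proteins and ignoring self-interactions (both simplifications made for this sketch):

```r
# Number of distinct unordered pairs among n proteins: n * (n - 1) / 2
n_proteins <- 20000  # assumption: the "20,000+" from the text, taken as exactly 20,000
max_interactions <- choose(n_proteins, 2)
max_interactions     # 199,990,000, i.e. just under 200 million
```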

As we have seen before when we looked at network properties, it will be possible to connect most nodes (here: proteins) in the network with very few steps. This is also the case with the human interactome, which taken as a whole is one giant interconnected hairball. In order to learn anything about the function of the disease genes/proteins in question, it's important to restrict what we're looking at to the close neighborhood of the proteins.

In this exercise we'll be working with two different approaches to this:

  1. Building a network "bottom-up" - sampling the 1st order interaction partners for a list of input proteins ("Virtual pulldowns")
  2. Topology-based clustering on large input networks, followed by a search for clusters enriched in disease-related proteins


Exercise on "Virtual Pulldowns"

For this part of the exercise, we'll be using a resource created here at DTU: the "InWeb" inferred human interactome. This is one of many interaction networks. Note that it has a confidence score (cs-score) telling us how certain we are about each interaction.

Heart disease proteins

We'll start out with a set of proteins known to be involved in atrioventricular canal morphology ("AACM"):

seeds <- c("ALDH1A2", "BMP2", "CXADR", "GATA4", "HAS2", "NF1", "NKX2-5", "PITX2", "PKD2", "RXRA", "TBX1", "TBX2", "ZFPM1", "ZFPM2")

TASK/REPORT QUESTION #1:

1: First you need to load the packages and data:

# Libraries needed
library(igraph)
library(msigdbr)
library(fgsea)

# The PPI database we will use is InWeb:
load(file='/home/projects/22140/exercise4.Rdata')

2: Extract all connections with a confidence score >= 0.2 and use igraph to create a graph from the data (via the graph_from_data_frame() function). No plotting needed now.
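
As a rough illustration of the pattern (not the solution for the real data), the sketch below filters a small made-up edge table and builds an undirected graph from it. The object name `toy_ppi`, its column names and all protein names are hypothetical; the InWeb object loaded from exercise4.Rdata may be organised differently, so inspect it first.

```r
library(igraph)

# Hypothetical stand-in for the InWeb table (made-up proteins and scores);
# the real object may use a different name and different column names.
toy_ppi <- data.frame(
  protein_a = c("P1", "P1", "P2", "P4", "P5"),
  protein_b = c("P2", "P3", "P4", "P5", "P6"),
  cs        = c(0.90, 0.15, 0.45, 0.20, 0.05)
)

# Keep interactions at or above the confidence cutoff
toy_high_conf <- toy_ppi[toy_ppi$cs >= 0.2, ]

# The first two columns are read as the edge list; remaining columns
# (here: cs) become edge attributes. PPIs are treated as undirected.
toy_graph <- graph_from_data_frame(toy_high_conf, directed = FALSE)

vcount(toy_graph)  # number of proteins left in the graph
ecount(toy_graph)  # number of interactions left in the graph
```

Proteins whose interactions all fall below the cutoff (P3 and P6 in the toy table) drop out of the graph entirely, because the graph is built from the remaining edges only.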


3: Next we will do the virtual pulldown as a two-step process. Step 1: Extract a sub-graph (aka virtual pulldown), Step 2: Filter the virtual pulldown (as discussed in the lecture). For step 1:

  • You can use the `ego()` function to get neighbors of all seed genes. Here you need to specify mode = "all".
    • Hint: The unlist() function turns a list into a vector while preserving all entries
  • You can use `induced_subgraph()` to extract a subnetwork based on a set of node ids.
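
The two calls above can be sketched on a toy network as follows. All protein names and edges are made up for illustration (S = seed, N = neighbour, X = further away), and `unique()` is added here only to drop neighbours shared by several seeds.

```r
library(igraph)

# Toy parent network (made-up proteins, not InWeb data)
toy_edges <- data.frame(
  from = c("S1", "S1", "S2", "N1", "N3", "X1"),
  to   = c("N1", "N2", "N2", "N3", "X1", "X2")
)
toy_parent <- graph_from_data_frame(toy_edges, directed = FALSE)
toy_seeds  <- c("S1", "S2")

# ego() returns one vertex sequence per seed (the seed plus its neighbours
# within `order` steps), so the result is a list
neighbourhoods <- ego(toy_parent, order = 1, nodes = toy_seeds, mode = "all")

# Flatten the list into a single vector of vertex ids, dropping repeats
pulldown_ids <- unique(unlist(neighbourhoods))

# Keep only those vertices and the edges running between them
toy_pulldown <- induced_subgraph(toy_parent, vids = pulldown_ids)
sort(V(toy_pulldown)$name)
```

Note that N3 is not part of the pulldown: it is a neighbour of N1, not of a seed, so it lies outside the 1st order neighbourhood.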


4: For step 2 you can use this function:

filter_virtual_pulldown <- function(subGraph, parentGraph, cutoff) {
  # require vertex attribute "names" in both graphs
  if (is.null(V(subGraph)$name) || is.null(V(parentGraph)$name))
    stop("Both graphs must have vertex 'name' attributes.")
  
  mode <- "all"
  
  vn <- V(subGraph)$name
  
  deg_full     <- degree(parentGraph, v = vn, mode = mode)
  deg_internal <- degree(subGraph,            mode = mode)
  
  res <- data.frame(
    node          = vn,
    deg_internal  = deg_internal,
    deg_full      = deg_full,
    frac_internal = ifelse(deg_full > 0, deg_internal / deg_full, NA_real_),
    stringsAsFactors = FALSE
  )
  
  nodes_to_keep <- res$node[!is.na(res$frac_internal) & res$frac_internal >= cutoff]
  
  filteredSubGraph <- induced_subgraph(subGraph, vids = V(subGraph)[name %in% nodes_to_keep])
  
  V(filteredSubGraph)$frac_internal <- res$frac_internal[match(V(filteredSubGraph)$name, res$node)]
  
  filteredSubGraph
}

Start by annotating the function with comments (like the `# require vertex attribute "names" in both graphs` comment at the start of the function). You should annotate what each section of the function does. Hint: If needed you can run parts of the code to see what the intermediate results are.

5: Run the `filter_virtual_pulldown()` with these cutoffs: 0, 0.2, 0.5, 1

  • Look at the size of the filtered/scored networks to get an impression of how the network is narrowed down as the cutoff (the minimum frac_internal a node must have to be kept) is raised
  • How many proteins (nodes) and how many interactions (edges) are reported when a 0.2 threshold is applied? How does that compare to the full network (no cutoff)? Explain the difference.
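
To get a feel for what the cutoff does, the sketch below recomputes the same ratio that `filter_virtual_pulldown()` stores as frac_internal on a tiny made-up network in which H is a "sticky" hub. The network and all names are hypothetical; only the calculation mirrors the function above.

```r
library(igraph)

# Toy parent network: H has one partner inside the pulldown and five outside
toy_parent <- graph_from_data_frame(data.frame(
  from = c("S1", "S1", "S2", "S1", "H",  "H",  "H",  "H",  "H"),
  to   = c("N1", "S2", "N1", "H",  "X1", "X2", "X3", "X4", "X5")
), directed = FALSE)

# Virtual pulldown around the seeds S1 and S2 (their 1st order neighbourhood)
toy_pulldown <- induced_subgraph(toy_parent, vids = c("S1", "S2", "N1", "H"))

# Fraction of each node's interactions that stay inside the pulldown
vn <- V(toy_pulldown)$name
frac_internal <- degree(toy_pulldown) / degree(toy_parent, v = vn)
round(frac_internal, 2)

# Number of nodes surviving each cutoff
cutoffs <- c(0, 0.2, 0.5, 1)
n_kept <- sapply(cutoffs, function(cutoff) sum(frac_internal >= cutoff))
setNames(n_kept, cutoffs)
```

In this toy case only the hub H (1 of its 6 interactions is internal) is removed once the cutoff exceeds 1/6; the seeds and N1 have all their interactions inside the pulldown and survive every cutoff.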

Visualizing networks

REPORT QUESTION #2:

  • Use ggraph to plot the graph created by filtering with confidence score 0.2. Indicate which nodes are seed nodes and which are neighbours.
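
One possible way to mark seeds in a ggraph plot is to store the node type as a vertex attribute and map it to colour. The sketch below does this on a made-up graph; the attribute name `node_type`, the layout and the styling are arbitrary choices for illustration, not requirements of the exercise.

```r
library(igraph)
library(ggraph)  # depends on ggplot2, which is attached along with it

# Toy pulldown (made-up proteins); S1 and S2 play the role of seeds
toy_pulldown <- graph_from_data_frame(data.frame(
  from = c("S1", "S1", "S2", "N1"),
  to   = c("N1", "N2", "N2", "N2")
), directed = FALSE)
toy_seeds <- c("S1", "S2")

# Store the node type as a vertex attribute so ggraph can map it to colour
V(toy_pulldown)$node_type <- ifelse(V(toy_pulldown)$name %in% toy_seeds,
                                    "seed", "neighbour")

p <- ggraph(toy_pulldown, layout = "fr") +
  geom_edge_link(colour = "grey60") +
  geom_node_point(aes(colour = node_type), size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()
p
```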

Functional classification

Use Louvain clustering with resolution = 0.4 to find sub-clusters in the network. For the 2-3 largest clusters, do gene-set overrepresentation analysis with fora(). What background do you need to use? (You already have all the information needed to decide on and use this background.)
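
A minimal sketch of the clustering step on a made-up network, using igraph's `cluster_louvain()`. The toy graph and its node names are hypothetical, and the seed is fixed only because the algorithm processes nodes in random order.

```r
library(igraph)

# Toy network: two triangles joined by a single bridging edge
toy_graph <- graph_from_data_frame(data.frame(
  from = c("A1", "A1", "A2", "B1", "B1", "B2", "A3"),
  to   = c("A2", "A3", "A3", "B2", "B3", "B3", "B1")
), directed = FALSE)

set.seed(1)  # fix the random node order for repeatability
toy_clusters <- cluster_louvain(toy_graph, resolution = 0.4)

# Cluster sizes, largest first
sort(sizes(toy_clusters), decreasing = TRUE)

# Member names of the largest cluster (the first one if sizes are tied)
largest_id <- as.integer(names(which.max(sizes(toy_clusters))))
largest_members <- V(toy_graph)$name[membership(toy_clusters) == largest_id]
largest_members
```

Lower values of `resolution` favour fewer, larger clusters; higher values favour more, smaller ones.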

BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
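
The sketch below shows the shape of a `fora()` call on made-up gene sets. For the real data, one way to turn BP_df into the named list that `fora()` expects is `split(x = BP_df$gene_symbol, f = BP_df$gs_name)`. The toy sets, gene names and 20-gene universe are placeholders; which background is appropriate for the real analysis is the question posed above and is not settled by this example.

```r
library(fgsea)

# Toy gene sets standing in for the list built from BP_df
# (made-up sets, not real GO terms)
toy_pathways <- list(
  set_heart = c("G1", "G2", "G3", "G4"),
  set_other = c("G7", "G8", "G9", "G10")
)

# Placeholder background of 20 genes
toy_universe <- paste0("G", 1:20)

# Genes in one cluster
toy_cluster_genes <- c("G1", "G2", "G3", "G11")

# Hypergeometric over-representation test of the cluster against each set
toy_res <- fora(pathways = toy_pathways,
                genes    = toy_cluster_genes,
                universe = toy_universe,
                minSize  = 1)
as.data.frame(toy_res)[, c("pathway", "pval", "padj", "overlap", "size")]
```

The `overlap` column counts how many cluster genes fall in each set, and `padj` is the multiple-testing adjusted p-value to use when ranking terms.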

TASK/REPORT QUESTION #4:

  • Suggest the main biological function for each cluster. Do they make biological sense in the context of heart disease?

TASK/REPORT QUESTION #5:

  • Which gene is the most central (as defined by betweenness centrality) in the largest cluster? Does that gene make sense in the disease context?
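
Betweenness centrality can be computed with igraph's `betweenness()`. The sketch below uses a made-up five-node path so the answer is easy to verify by hand; for the real data the same calls would be applied to the sub-graph of the largest cluster (for example one extracted with `induced_subgraph()`).

```r
library(igraph)

# Toy cluster: a simple path P1 - P2 - P3 - P4 - P5 (made-up proteins)
toy_cluster <- graph_from_data_frame(data.frame(
  from = c("P1", "P2", "P3", "P4"),
  to   = c("P2", "P3", "P4", "P5")
), directed = FALSE)

# Betweenness: number of shortest paths between other node pairs
# that pass through each node
bt <- betweenness(toy_cluster, directed = FALSE)
sort(bt, decreasing = TRUE)

# Name of the most central node
most_central <- names(which.max(bt))
most_central  # "P3", the middle of the path
```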