Multiomics data integration exercise

Exercise written by: Lars Rønn Olsen

Learning objectives:

Overall objective: learn how to extract meaningful networks from human PPI data
1. Virtual pulldowns: sampling 1st order networks and filtering "sticky proteins"
2. Finding tightly connected clusters in larger networks
3. Using the DiscoNet package in R

Introduction

Today’s exercise requires the use of many of the methods you learned throughout this course. The objective is to identify aberrant, cancer-related protein complexes in follicular thyroid cancer, by analyzing gene expression and somatic mutation data from 10 patients. You will do this using the following workflow of skills you have learned in this course:

Calculating of log2FC from healthy to cancer tissue
Performing a virtual pulldown with the DiscoNet package
Performing protein complex detection using MCODE
Performing over representation analysis of GO terms using the fgsea package

Processing data and extracting dysregulated or mutated genes

First, you need to restart your R session! "Session" -> "Restart R"!

First, load the data for today's exercise:

load("/home/projects/22140/exercise10.Rdata")

Then, randomly pick a patient number. Let's use R to make sure it is random:

sample(1:10, 1)

For that patient calculate the log2 fold change of gene expression in cancer vs normal log2(cancer expression / normal expression). Consider genes with a log2 fold change either smaller than -1 (down regulated) or larger than 1 (up regulated) dysregulated. We also consider mutated genes (with a 1 in the mutation column) as aberrated (aka nonsynonymous), so you should keep these as well. Save a list of dysregulated and/or aberrated genes.

TASK/REPORT QUESTION #1:

What patient did you pick?
How many genes are up or down regulated?
Is the number of dysregulated genes what you would expect from a disease like cancer?
How many genes harbor nonsynonymous mutations (Note: we only report nonsynonymous mutations in this data. Synonymous mutations have already been removed)?
Theoretical question: Discuss the biological connection, or lack thereof, between nonsynonymous mutations and de-regulated of a) mutation and expression of the same gene and b) mutations in one gene and the expression of other genes. Use a maximum of 75 words.

For the patient sample you picked, make a node attribute table containing the following columns: “Gene”, “log2FC”, “somatic_mutations” (either 0 or 1) for all genes.

Virtual pulldown

Load the DiscoNet package and prepare the inweb database:

library(DiscoNet)

### Load translated database
load(file='/home/projects/22140/inweb_reduced.Rdata') # db

Then, use the virtual_pulldown() function that you used last week, to perform a virtual pulldown with all dysregulated and/or mutated genes.

TASK/REPORT QUESTION #2:

How does a virtual pulldown work?
What experiment does it simulate?
What database does it query?
What does the confidence score mean?
Is the default cutoff of 0.156 reasonable? Why/why not?

We changed the package based on your excellent feedback last week. Now, the virtual_pulldown() function produces an interaction table (in the $network object), with confidence scores as edge attributes and a node attribute table (in the $node_attributes object) with a seed indicator and a topological score for each node.

Use the merge() function to add log2FC and mutation status to the node attribute table. Here's a hint so you don't spend too much time on this:

node_attributes <- merge(x = node_attributes, y = pt, by.x = "nodes", by.y = "gene", all.x = TRUE)

Make sure you understand what goes on in the function above.

Use the function graph_from_data_frame() to make an igraph object. Remember to make it undirected and add node attributes.

Use the relevance_filtering() function to make 3 different versions of the pulldown, with cutoffs "0", "0.5", "0.8".

TASK/REPORT QUESTION #3:

How many nodes/edges are in your three pulldown networks?
Plot the 0.8 filtered pulldown network, colored by log2FC and mutated genes highlighted with a shape, and seed nodes highlighted in some other way. Make sure the plots added to the report are easy to read.
Does the networks look as you would expect in terms of which genes are up or down regulated or mutated? I.e. is the mutation status generally related to expression status? What would you expect?

For the following exercises use the network with a cutoff of 0.8

Community detection / protein complex inference

Run the community_detection() function with the MCODE algorithm with parameters D = 0.05, haircut = TRUE, fluff = FALSE. The resulting communities may represent protein complexes that could be causative or indicative of the disease.

TASK/REPORT QUESTION #4:

How many nodes/edges are in your top 25 scoring communities (note: you many not have this many, in which case just report nodes/edges for all of them)
How is the community score from MCODE calculate and what does it mean?
Have a quick look [here|https://en.wikipedia.org/wiki/Protein_complex]. What do you expect a protein complex to look like? How many proteins? What clustering coefficient (how densely clustered)? Why?
Based on node/edge counts alone, which of your communities could be complexes?

Gene ontology over-representation analysis

Now use the the fora() function from fgsea to perform an over-representation analysis on all the detected communities. Use the full list of genes from the exercise data as your background, and the biological process ontology.

library(fgsea)
library(msigdbr)
# First, fetch gene sets
BP_df = msigdbr(species = "human", category = "C5", subcategory = "BP")
BP_list = split(x = BP_df$gene_symbol, f = BP_df$gs_name)
# repeat the analysis below with the communities you deem to be potential protein complexes
fora(pathways = BP_list, genes = V(communities$communities[[1]])$name, universe = rownames(pt))

Note: If you get a timeout error just run the function again.

TASK/REPORT QUESTION #5:

Why are we asking you to use the "full list of genes" from the microarray as the background, and not all human genes? (Note that microarrays, while comprehensive, do not have probes for every single human gene).
What does an over-representation test do?
What is “over-represented” and where?

Interpreting the results of the over-representation analysis

Relate functions of the significantly over-represented GO terms in each of the top 25 protein complexes to the hallmarks of cancer. Keep in mind that neither all complexes, nor all GO terms are necessarily related to cancer.

TASK/REPORT QUESTION #6:

6.1) Do any of your complexes relate to one or more of the hallmarks of cancer? How? (You will probably need to search Google here – try [GO term]+cancer or even [GO term]+”follicular thyroid cancer”).
6.2) If you cannot reasonably relate some of the GO terms to a cancer hallmark (and it is not unlikely that you cannot!), why do you think these non-cancer related ontologies are over-represented?
6.3) Plot the three communities you find most interesting in the context of follicular thyroid cancer. Highlight log2FC, mutation status, and seed nodes
6.4) How do you think these dysregulated protein complexes may drive cancer pathogenesis?
6.5) For the largets of your three chosen protein complexes, pick the protein best suited as a drug target and explain why based on what you have learned about network topology.

DiscoNet2

Contents

Multiomics data integration exercise

Introduction

Processing data and extracting dysregulated or mutated genes

Virtual pulldown

Community detection / protein complex inference

Gene ontology over-representation analysis

Interpreting the results of the over-representation analysis

Navigation menu

DiscoNet2

Multiomics data integration exercise

Introduction

Processing data and extracting dysregulated or mutated genes

Virtual pulldown

Community detection / protein complex inference

Gene ontology over-representation analysis

Interpreting the results of the over-representation analysis

Navigation menu

Search